ALB 502 Bad Gateway: Why Healthy Targets Still Fail and How to Fix It
An ALB returning 502 errors while the target group dashboard shows all instances as Healthy is one of the most disorienting mismatches in AWS networking — the health check passes, but real traffic fails. This post dissects every verified root cause at the application and protocol layer, with CLI-first diagnostics.
TL;DR — Root Causes & Fixes
| Root Cause | Signal | Fix |
|---|---|---|
| App crashes on real request (not health check path) | 502 only on specific routes | Fix application error on that endpoint |
| Invalid HTTP response (missing status line / malformed headers) | 502 immediately, all routes | Ensure app returns valid HTTP/1.1 response |
| Keep-alive mismatch / connection reset by target | Intermittent 502 under load | Set keep-alive timeout on app > ALB idle timeout |
| Response header too large (>16 KB) | 502 on responses with large cookies/JWTs | Reduce header size; strip unnecessary cookies |
| Target closes connection before sending full response | 502 with partial body | Increase app write timeout; check OOM kills |
| HTTP/2 protocol error from backend | 502 when target group protocol is HTTP/2 | Verify backend supports HTTP/2 or switch to HTTP/1.1 |
| gRPC trailer missing | 502 on gRPC target groups | Ensure gRPC server sends trailing metadata |
Why the Health Check Lies to You
The ALB health check is a synthetic probe — it hits a single, lightweight path (e.g., /health) on a fixed interval. It only validates TCP connectivity and an HTTP status code from that one endpoint. It tells you nothing about:
- Whether the application handles other routes correctly.
- Whether the HTTP response format is valid for real payloads.
- Whether the connection lifecycle (keep-alive, timeouts) is correct under concurrent load.
Analogy: Think of the health check like a fire alarm test — pressing the test button confirms the alarm circuit works, but it doesn't tell you whether the sprinkler pipes are blocked. The ALB health check confirms the app can answer a knock on the door; it doesn't verify the app can hold a full conversation.
The ALB Request Lifecycle: Where 502 Is Generated
A 502 is generated by the ALB itself when it receives an invalid or no response from the target. Understanding the exact phase where the failure occurs is the key to diagnosis.
Deep-Dive: Every Verified Root Cause
Phase 1 — Connection Layer
1. Keep-Alive Timeout Race Condition (Most Common Intermittent 502)
The ALB reuses persistent TCP connections to targets for efficiency. If the target's keep-alive timeout is shorter than the ALB's idle timeout, the target closes the connection just as the ALB is reusing it — the ALB sends a request on a dead socket and gets a TCP RST back, producing a 502.
- ALB default idle timeout: 60 seconds.
- Fix: Set your application's keep-alive timeout to at least 65 seconds (5s buffer above ALB idle timeout).
Nginx example:
# nginx.conf
keepalive_timeout 65s;
Node.js (http/https server) example:
const server = app.listen(3000);
server.keepAliveTimeout = 65000; // ms
server.headersTimeout = 66000; // must be > keepAliveTimeout
Check the current ALB idle timeout and modify it via CLI:
# Get current idle timeout
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn <ALB_ARN> \
--query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'
# Set idle timeout to 60s (default)
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn <ALB_ARN> \
--attributes Key=idle_timeout.timeout_seconds,Value=60
2. Target Closes Connection Mid-Response
The application starts sending a response but closes the socket before the body is complete — caused by OOM kills, application panics, or write timeouts shorter than the response generation time.
- Check for OOM kills:
dmesg | grep -i 'killed process'
- Check application logs for panics or unhandled exceptions during the request lifecycle.
Phase 2 — HTTP Protocol Layer
3. Malformed HTTP Response
The ALB strictly validates the HTTP response from the target. Any of the following will produce a 502:
- Missing or invalid HTTP status line (e.g., the response starts with body bytes instead of HTTP/1.1 200 OK\r\n).
- Headers not terminated with \r\n\r\n.
- A Content-Length that doesn't match the actual body size.
- Chunked encoding with a malformed chunk size.
4. Response Header Size Exceeds 16 KB
AWS ALB has a hard limit of 16 KB for response headers. Responses with large Set-Cookie headers, oversized JWTs in headers, or many custom headers will be rejected with a 502.
- Diagnose by capturing the raw response from the target directly (bypassing the ALB):
curl -v http://<instance-private-ip>:<port>/<path>
- Measure total header size:
curl -sI http://<target> | wc -c
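Total size tells you that you're over the limit; ranking individual headers tells you which one is responsible. A small sketch that sorts a captured header dump by line length — the sample file here stands in for real curl -sI output, and its values are made up:

```shell
# Stand-in for real output captured with: curl -sI http://<target>/ -o headers.txt
cat > headers.txt <<'EOF'
HTTP/1.1 200 OK
Content-Type: application/json
Set-Cookie: session=eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjM0NTY3ODkwIn0.dozjgNryP4J3jVmNHl0w5N_XgL0n3I9PlFUP0THsR8U
EOF

# Rank headers by byte size; the largest (likely offender) prints first.
awk '{ printf "%6d  %s\n", length($0), $0 }' headers.txt | sort -rn | head -5
```

On real traffic, oversized Set-Cookie or Authorization headers almost always top this list.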
5. HTTP/2 Protocol Errors on Backend
If the target group protocol is set to HTTP/2, the backend must fully implement HTTP/2 including proper SETTINGS frames and GOAWAY handling. A backend that speaks HTTP/1.1 only will cause immediate 502s.
# Check target group protocol version
aws elbv2 describe-target-groups \
--target-group-arns <TG_ARN> \
--query 'TargetGroups[*].{Protocol:Protocol,ProtocolVersion:ProtocolVersion}'
# The protocol version of an existing target group cannot be changed.
# If the backend only speaks HTTP/1.1, create a replacement target group
# with --protocol-version HTTP1, register the targets, then repoint the listener.
aws elbv2 create-target-group \
--name my-tg-http1 \
--protocol HTTP \
--protocol-version HTTP1 \
--port <PORT> \
--vpc-id <VPC_ID> \
--target-type instance
Phase 3 — Application Logic Layer
6. Application Crashes on Non-Health-Check Routes
The health check hits /health and returns 200. A real request to /api/v1/data triggers a code path that throws an unhandled exception, and the process either crashes or returns garbage. The ALB sees no valid response and emits 502.
- Test the failing route directly on the instance to isolate it from the ALB:
curl -v http://<private-ip>:<port>/api/v1/data
- Tail application logs during a failing request:
journalctl -u myapp -f
Diagnostic Flow: Finding the Exact Cause
Enabling ALB Access Logs (Critical for Diagnosis)
The error_reason field in ALB access logs is the single most valuable diagnostic signal. Without it, you are guessing. Enable it immediately.
# Enable access logs to S3
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn <ALB_ARN> \
--attributes \
Key=access_logs.s3.enabled,Value=true \
Key=access_logs.s3.bucket,Value=my-alb-logs-bucket \
Key=access_logs.s3.prefix,Value=my-alb
Once logs are flowing, filter for 502s and extract the error_reason field (field 25 in the log format):
# Download and parse logs (adjust date/path)
aws s3 cp s3://my-alb-logs-bucket/my-alb/AWSLogs/<account>/elasticloadbalancing/<region>/<date>/ . \
--recursive
# Extract 502 error reasons. The quoted fields (request, user agent, etc.)
# contain spaces, so a naive whitespace split miscounts fields. Split on
# double quotes instead: the ELB status code is the 9th space-separated
# token before the first quote, and error_reason is the 8th quoted field.
zcat *.log.gz | awk -F'"' '{ split($1, f, " "); if (f[9] == "502") print $16 }' | sort | uniq -c | sort -rn
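A sanity check helps confirm the field arithmetic. The snippet below runs a quote-splitting awk filter over one synthetic 502 log entry; every address, ARN, and ID in it is a made-up placeholder:

```shell
# One synthetic ALB access log entry (29-field format, abbreviated values).
line='http 2024-05-01T00:00:00.000000Z app/my-alb/abc123 10.0.0.1:54321 10.0.1.12:3000 0.001 -1 -1 502 - 120 400 "GET https://example.com:443/api/v1/data HTTP/1.1" "curl/8.0.1" - - arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/my-tg/abc123 "Root=1-abc" "example.com" "-" 0 2024-05-01T00:00:00.000000Z "forward" "-" "Target.ConnectionError" "10.0.1.12:3000" "-" "-" "-"'

# Split on double quotes: the status code is token 9 of the unquoted
# prefix ($1), and error_reason is the 8th quoted field, i.e. $16.
printf '%s\n' "$line" | awk -F'"' '{ split($1, f, " "); if (f[9] == "502") print $16 }'
# → Target.ConnectionError
```

If this prints nothing against your real logs, check whether your logs use a newer field layout before trusting any field index.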
Key error_reason values and their meaning:
| error_reason | Meaning |
|---|---|
| Target.ResponseCodeMismatch | Target returned a non-2xx/3xx that the ALB couldn't forward |
| Target.Timeout | Target didn't respond within the timeout window |
| Target.ConnectionError | TCP connection to the target failed or was reset |
| Target.InvalidResponse | Malformed HTTP response from the target |
| Target.FailedHealthChecks | Target became unhealthy during request processing |
IAM Permissions Required
Apply least-privilege. The minimum IAM permissions needed for the diagnostic and fix operations above:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeLoadBalancerAttributes",
"elasticloadbalancing:ModifyLoadBalancerAttributes",
"elasticloadbalancing:DescribeTargetGroups",
"elasticloadbalancing:ModifyTargetGroupAttributes",
"elasticloadbalancing:DescribeTargetHealth"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-alb-logs-bucket",
"arn:aws:s3:::my-alb-logs-bucket/*"
]
}
]
}
Also ensure the S3 bucket policy grants s3:PutObject for log delivery. The required grantee depends on the Region: older Regions use the regional Elastic Load Balancing account ID as the principal, while Regions launched after August 2022 use the logdelivery.elasticloadbalancing.amazonaws.com service principal.
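For the newer Regions, a minimal bucket policy looks like the sketch below (bucket name and prefix match the examples above; older Regions swap the Service principal for the regional ELB account ID — check the AWS docs for your Region before applying):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowALBLogDelivery",
      "Effect": "Allow",
      "Principal": { "Service": "logdelivery.elasticloadbalancing.amazonaws.com" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-alb-logs-bucket/my-alb/AWSLogs/*"
    }
  ]
}
```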
Cost Impact
- ALB Access Logs: Stored in S3 — you pay standard S3 PUT and storage costs. For a high-traffic ALB, logs can reach several GB/day. Use S3 Lifecycle rules to expire logs after 7–30 days to control cost.
- No additional ALB cost for enabling logs — the feature itself is free; only S3 storage is billed.
- Reducing 502s also trims LCU (Load Balancer Capacity Unit) consumption: failed requests, and the client retries they trigger, still count toward the request and new-connection LCU dimensions, so a high error rate inflates the usage-based portion of your ALB bill (the fixed hourly charge is unaffected).
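The lifecycle rule mentioned above can be sketched as a JSON document, applied with aws s3api put-bucket-lifecycle-configuration --bucket my-alb-logs-bucket --lifecycle-configuration file://lifecycle.json (the 30-day window and prefix are examples; tune both to your retention needs):

```json
{
  "Rules": [
    {
      "ID": "expire-alb-access-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "my-alb/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```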
Glossary
- 502 Bad Gateway: An HTTP status code emitted by a proxy (the ALB) when it receives an invalid, incomplete, or no response from the upstream server (the target).
- Keep-Alive Timeout: The duration a server holds an idle TCP connection open for reuse; must exceed the ALB idle timeout to prevent race-condition 502s.
- ALB Idle Timeout: The maximum time the ALB waits for data on a connection before closing it; defaults to 60 seconds and is configurable.
- error_reason: A field in ALB access logs that provides the machine-readable cause of a 4xx/5xx response generated by the load balancer itself.
- Protocol Version: The HTTP protocol variant (HTTP/1.1, HTTP/2, gRPC) negotiated between the ALB and the target group; a mismatch is a direct cause of 502 errors.