ALB Returning 502 Bad Gateway: Diagnosing Application Response and Protocol Failures
An ALB returning 502 Bad Gateway while target group instances show 'Healthy' is one of the most disorienting mismatches in AWS load balancer operations.
TL;DR: ALB 502 Bad Gateway Diagnostic Summary
| Step | What to Do | Root Cause Addressed | Layer |
|---|---|---|---|
| 1 | Inspect ALB access logs for error_reason field | Identifies exact 502 sub-cause from ALB perspective | ALB |
| 2 | Verify target protocol and port match application listener | Protocol mismatch between ALB and target | Target Group |
| 3 | Check keep-alive and idle timeout alignment | Premature connection closure by application | Application |
| 4 | Validate HTTP response format from application | Malformed response line or headers | Application |
| 5 | Review Security Group rules on target instances | ALB health check passes but request traffic blocked | Network |
Why ALB 502 Bad Gateway Occurs Even When Targets Are Healthy
A 502 from an ALB means the load balancer successfully routed the request to a registered target but received an invalid or no response. The health check and the actual request path are evaluated differently — a target can pass health checks (typically a lightweight HTTP GET) while failing to handle real application traffic correctly. The fix lives at the application response layer, not the health check configuration.
Understanding the ALB 502 Failure Path
Before diagnosing, it helps to understand exactly where in the request lifecycle the ALB generates a 502. The ALB establishes a TCP connection to the target, sends the HTTP request, and then waits for a valid HTTP response. If the target closes the connection prematurely, returns a malformed HTTP response, or violates the expected protocol, the ALB emits a 502 to the client — regardless of what the health check reported.
- Client → ALB: Client sends an HTTP/HTTPS request. ALB terminates TLS if configured.
- ALB → Target: ALB opens a connection to the target on the configured protocol and port.
- Target response evaluated: ALB parses the HTTP response. Any parse failure, premature close, or protocol violation triggers a 502.
- Health check path (separate): Health checks run independently on a configured interval. A passing health check only confirms the target accepted a TCP connection and returned the expected HTTP status on the health check path — not that it handles all request types correctly.
Think of the health check as a smoke detector and the actual request as a fire inspection. The detector confirms the building is standing; the inspection reveals whether the wiring is safe under load.
Step 1: Read the ALB Access Log error_reason Field
The error_reason field in ALB access logs is the single most important diagnostic signal for 502 errors. It is populated only for error responses and contains a machine-readable string identifying the failure cause from the ALB's perspective.
Why this step: The 502 symptom alone is ambiguous across several distinct root causes; the error_reason field can narrow the failure to a specific layer before any deeper investigation is warranted.
First, confirm access logging is enabled on your ALB and identify the S3 bucket:
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef \
--query 'Attributes[?Key==`access_logs.s3.enabled` || Key==`access_logs.s3.bucket`]'
Once you have the bucket name, retrieve recent logs and search for 502 entries:
aws s3 cp s3://my-alb-logs-bucket/AWSLogs/123456789012/elasticloadbalancing/us-east-1/2024/01/15/ . \
--recursive --exclude "*" --include "*.log.gz"
zcat *.log.gz | awk '$9 == 502 {print $0}' | head -50
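The error_reason field is the 25th field in the ALB access log format. Counting must respect quoting: the request and user_agent fields are double-quoted and contain spaces, so splitting on spaces alone (for example, awk's $25) lands on the wrong value. The $9 filter above still works because the first nine fields are unquoted. Values that map to 502 causes include RESPONSE_INVALID, HANDSHAKE_TIMEOUT, and CONNECTION_ERROR; consult the ALB access log documentation for the full list of documented error_reason values. If the field is "-" for a 502 entry, check target_status_code (field 10): a "-" there means the target sent no response the ALB could record.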
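ALB access log entries mix bare fields with double-quoted ones, and the quoted fields (such as the request line and user agent) can contain spaces. The following is a minimal sketch of a quote-aware field extractor, not an official parser. The helper name alb_field is hypothetical, and the sketch assumes quoted fields contain no embedded double quotes.

```shell
# alb_field N: read ALB access log lines on stdin and print logical field N,
# treating each double-quoted run as a single field.
# Assumption: quoted fields contain no embedded double quotes.
alb_field() {
  awk -v n="$1" '{
    count = 0; i = 1; len = length($0)
    while (i <= len) {
      c = substr($0, i, 1)
      if (c == " ") { i++; continue }          # skip field separators
      if (c == "\"") {                         # quoted field: read to the closing quote
        j = index(substr($0, i + 1), "\"")
        if (j == 0) { tok = substr($0, i + 1); i = len + 1 }
        else        { tok = substr($0, i + 1, j - 1); i = i + j + 1 }
      } else {                                 # bare field: read to the next space
        j = index(substr($0, i), " ")
        if (j == 0) { tok = substr($0, i); i = len + 1 }
        else        { tok = substr($0, i, j - 1); i = i + j }
      }
      count++
      if (count == n) { print tok; break }
    }
  }'
}
```

With this helper defined, a pipeline such as `zcat *.log.gz | awk '$9 == 502' | alb_field 25 | sort | uniq -c | sort -rn` would tally the error_reason values across all 502 entries, and `alb_field 10` would show the matching target_status_code values.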
Step 2: Verify Target Group Protocol and Port Configuration
A protocol mismatch is a frequent source of 502s that health checks mask entirely. Health checks often use HTTP on port 80, while the application may have been reconfigured to serve HTTPS on 443 — or vice versa. The ALB connects using the target group's registered protocol and port, not the listener protocol.
Why this step: The health check protocol and port are configured separately from the protocol and port used for forwarding. A target group set to HTTP silently sends plaintext to an HTTPS-only application, producing a connection-level failure that a health check on a different protocol or port never exercises.
In practice, teams often configure an HTTPS listener on the ALB and assume the ALB will automatically negotiate TLS with the backend. When the target group protocol is set to HTTP, the ALB sends plaintext to the target. If the application only accepts TLS, it closes the connection immediately — producing a 502 with no application-level error log on the instance.
aws elbv2 describe-target-groups \
--target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcdef1234567890 \
--query 'TargetGroups[*].{Protocol:Protocol,Port:Port,HealthCheckProtocol:HealthCheckProtocol,HealthCheckPort:HealthCheckPort}'
Confirm the Protocol and Port values match what the application process is actually listening on. Verify on the instance directly:
ss -tlnp | grep LISTEN
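To make the comparison mechanical, the four values returned by describe-target-groups can be checked against each other. This is a minimal sketch under stated assumptions: tg_probe_mismatch is a hypothetical helper name, and it only compares configuration values; it does not confirm what the application process actually listens on.

```shell
# tg_probe_mismatch FWD_PROTOCOL FWD_PORT HC_PROTOCOL HC_PORT
# Flags a health check that probes a different protocol or port than the one
# used for forwarded requests.
tg_probe_mismatch() {
  fwd_proto=$1; fwd_port=$2; hc_proto=$3; hc_port=$4
  # "traffic-port" means the health check uses the forwarding port.
  if [ "$hc_port" = "traffic-port" ]; then
    hc_port=$fwd_port
  fi
  if [ "$fwd_proto" != "$hc_proto" ] || [ "$fwd_port" != "$hc_port" ]; then
    echo "MISMATCH: forwards ${fwd_proto}:${fwd_port}, health check probes ${hc_proto}:${hc_port}"
    return 1
  fi
  echo "OK: health check and forwarding both use ${fwd_proto}:${fwd_port}"
}
```

The arguments can come from the command above by changing the query to `'TargetGroups[0].[Protocol,Port,HealthCheckProtocol,HealthCheckPort]'` and adding `--output text`, which prints the four values on one line. A MISMATCH result is not necessarily a misconfiguration, but it does mean a passing health check says little about the forwarding path.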
Step 3: Check Keep-Alive Timeout Alignment
This is the most operationally subtle 502 cause and the one most likely to appear intermittently under load. The ALB has a configurable idle timeout (default 60 seconds). If the application's keep-alive timeout is set lower than the ALB's idle timeout, the application closes the connection while the ALB still considers it valid and attempts to reuse it for a new request. The ALB receives a RST or empty response and returns a 502.
Why this step: Intermittent 502s of this kind usually leave no application-level error log, because the application closed an idle connection normally and never saw the failed request. Connection-layer timing is then the main variable left to check.
- ALB idle timeout: The ALB keeps a connection to the target open for up to the configured idle timeout period (default 60 seconds).
- App keep-alive expires first: If the application's keep-alive timeout is shorter, it closes the connection before the ALB's timer expires.
- ALB reuses stale connection: On the next request, the ALB attempts to send over the now-closed connection.
- 502 emitted: The target returns RST or nothing; the ALB returns 502 to the client.
Check the ALB's current idle timeout:
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/1234567890abcdef \
--query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'
The application's keep-alive timeout must be set higher than the ALB's idle timeout. For example, if the ALB idle timeout is 60 seconds, configure the application keep-alive to at least 65 seconds. For nginx, this is the keepalive_timeout directive. For Node.js HTTP servers, this is server.keepAliveTimeout, which is specified in milliseconds (65000 for 65 seconds). Consult your application server documentation for the exact parameter.
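The alignment rule reduces to a single comparison, which is easy to script into a deployment check. The sketch below assumes both values are already expressed in whole seconds; keepalive_margin is a hypothetical helper name, not part of any AWS tooling.

```shell
# keepalive_margin ALB_IDLE_SECONDS APP_KEEPALIVE_SECONDS
# Succeeds only when the application holds idle connections open longer than
# the ALB does, so the ALB is always the side that closes them.
keepalive_margin() {
  alb_idle=$1; app_keepalive=$2
  if [ "$app_keepalive" -gt "$alb_idle" ]; then
    echo "OK: app keep-alive ${app_keepalive}s exceeds ALB idle timeout ${alb_idle}s"
  else
    echo "RISK: app keep-alive ${app_keepalive}s does not exceed ALB idle timeout ${alb_idle}s"
    return 1
  fi
}
```

For example, `keepalive_margin 60 65` reports OK, while `keepalive_margin 60 5` reports RISK. Equal values are treated as a risk because the two timers can expire at nearly the same moment.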
Step 4: Validate HTTP Response Format from the Application
The ALB strictly parses HTTP/1.1 responses. If the application emits a response with a malformed status line, missing required headers, or an invalid HTTP version string, the ALB cannot parse it and returns a 502 with error_reason set to RESPONSE_INVALID.
Why this step: Application frameworks occasionally emit non-standard responses under error conditions (for example, a custom error handler that writes raw text before the HTTP status line). The ALB rejects these even though the application considers the response successful.
Reproduce the exact request the ALB sends to the target and inspect the raw response. From within the VPC (or directly on the instance), use curl with verbose output to capture the full response including headers:
curl -v --http1.1 http://10.0.1.25:8080/api/endpoint \
-H 'Host: www.example.com'
Look for:
- The status line must begin with HTTP/1.1 followed by a 3-digit status code and reason phrase.
- Headers must be separated from the body by a blank line (\r\n\r\n).
- If Transfer-Encoding: chunked is present, the body must be properly chunked.
- The Content-Length header, if present, must match the actual body length.
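The status line check can be automated with a pattern match. The following is a rough sketch rather than a full HTTP validator: valid_status_line is a hypothetical helper name, the pattern accepts HTTP/1.0 as well as HTTP/1.1, and it does not inspect headers or body framing.

```shell
# valid_status_line LINE
# Succeeds when LINE looks like "HTTP/1.x <3-digit code> [reason phrase]".
valid_status_line() {
  # Drop the trailing carriage return that HTTP line endings carry, then match.
  printf '%s\n' "$1" | tr -d '\r' | grep -Eq '^HTTP/1\.[01] [1-5][0-9]{2}( .*)?$'
}
```

One way to use it is `valid_status_line "$(curl -s -i --http1.1 http://10.0.1.25:8080/api/endpoint -H 'Host: www.example.com' | head -n 1)"`. Note that curl may itself refuse a badly malformed response, in which case the first line is empty and the check fails, which is still the signal being looked for.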
Step 5: Review Security Group Rules on Target Instances
Health checks and forwarded requests originate from the same ALB nodes, but they may use different ports. If the Security Group on the target instance allows the health check port but not the application port, health checks pass while real requests never reach the application. From the ALB's perspective the target is unresponsive; depending on how the connection fails, this can surface as a 502 or as a 504 timeout.
Why this step: Security Group rules are evaluated per port and protocol; a rule allowing TCP 80 for health checks does not implicitly allow TCP 8080 for application traffic. Blocked packets are dropped silently, so the ALB sees no response at all rather than an explicit TCP rejection.
aws ec2 describe-security-groups \
--group-ids sg-0123456789abcdef0 \
--query 'SecurityGroups[*].IpPermissions'
Confirm that the inbound rules on the target's Security Group allow traffic from the ALB's Security Group on the application port (the port registered in the target group), not only the health check port. The source should reference the ALB's Security Group ID, not a CIDR range, to follow least-privilege principles:
aws ec2 authorize-security-group-ingress \
--group-id sg-0123456789abcdef0 \
--protocol tcp \
--port 8080 \
--source-group sg-alb0123456789abcdef
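Whether the application port is covered by any inbound rule can be checked by scanning the rule list. The sketch below is illustrative only: port_allowed is a hypothetical helper name, it expects one rule per line as "protocol from_port to_port", and it checks port coverage only. It does not verify that the rule's source is the ALB's Security Group.

```shell
# port_allowed PORT: read inbound rules on stdin, one per line, in the form
# "protocol from_port to_port". Succeeds if any TCP rule, or an all-protocols
# rule (protocol "-1"), covers PORT.
port_allowed() {
  awk -v port="$1" '
    $1 == "-1" { found = 1 }    # all protocols and ports
    $1 == "tcp" && $2 + 0 <= port + 0 && port + 0 <= $3 + 0 { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}
```

The input can be produced from the describe command above by using the query `'SecurityGroups[].IpPermissions[].[IpProtocol,FromPort,ToPort]'` with `--output text` and piping the result into `port_allowed 8080`. A failure for the application port alongside a success for the health check port matches the scenario described in this step.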
IAM Policy for ALB Log Access
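The read-only diagnostic commands above require reading ALB access logs from S3, describing load balancer and target group attributes, and describing Security Groups. Attach the following least-privilege policy to the IAM principal performing the investigation. The authorize-security-group-ingress change in Step 5 is a write operation and needs ec2:AuthorizeSecurityGroupIngress, which this policy intentionally omits: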
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "ALBDescribeAttributes",
"Effect": "Allow",
"Action": [
"elasticloadbalancing:DescribeLoadBalancerAttributes",
"elasticloadbalancing:DescribeTargetGroups",
"elasticloadbalancing:DescribeTargetHealth"
],
"Resource": "*"
},
{
"Sid": "ALBLogReadAccess",
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-alb-logs-bucket",
"arn:aws:s3:::my-alb-logs-bucket/*"
]
},
{
"Sid": "EC2SGDescribe",
"Effect": "Allow",
"Action": [
"ec2:DescribeSecurityGroups"
],
"Resource": "*"
}
]
}
Note: the elasticloadbalancing Describe actions used here (DescribeLoadBalancerAttributes, DescribeTargetGroups, DescribeTargetHealth) and ec2:DescribeSecurityGroups do not support resource-level restrictions and require "Resource": "*" per the AWS Service Authorization Reference.
Glossary
- 502 Bad Gateway: An HTTP status code returned by the ALB when it receives an invalid response (or no response) from a registered target after successfully establishing a connection.
- error_reason: A field in ALB access logs that provides a machine-readable string identifying the specific cause of an ALB-generated error response. Only populated for error responses.
- Keep-Alive Timeout: The duration a persistent HTTP connection is held open waiting for additional requests. Must be configured higher on the application than the ALB's idle timeout to prevent premature connection closure.
- ALB Idle Timeout: A configurable ALB attribute (default 60 seconds) defining how long the ALB keeps a connection to a target open when no data is being transferred.
- Target Group Protocol: The protocol (HTTP or HTTPS) the ALB uses when forwarding requests to registered targets. Distinct from the listener protocol and the health check protocol.
- Health Check: A periodic probe the ALB sends to registered targets to determine their availability. A passing health check confirms the target accepted a connection and returned the expected status on the configured path; it does not confirm that all request types are handled correctly.
Diagnosing ALB 502 Bad Gateway: Wrap-Up
The core insight is that ALB 502 errors and target health status operate on independent evaluation paths. A healthy target confirms TCP reachability and a valid response on one specific path — it says nothing about protocol correctness, keep-alive alignment, or response format under real traffic patterns. Start with the error_reason field in access logs, then work down through protocol configuration, connection timing, response format, and Security Group rules in that order. Each layer is independently verifiable with the CLI commands above, and each maps to a distinct, documented ALB failure mode.
For related operational context, review AWS documentation on ALB troubleshooting and the ALB access log reference for the complete list of documented error_reason values.