ASG Health Checks: Why Your Instances Are Being Terminated and When to Switch from EC2 to ELB
Your Auto Scaling Group is terminating instances that appear perfectly healthy from an OS perspective — the VM is running, SSH works, but ASG keeps replacing them. This is one of the most common and costly misconfigurations in production AWS environments, and the root cause almost always comes down to a single decision: which health check type your ASG is using.
TL;DR
| Dimension | EC2 Health Check | ELB Health Check |
|---|---|---|
| What it checks | Hypervisor-level instance state (running, stopped, terminated) and system status checks | Application-level HTTP/TCP response from the load balancer's perspective |
| Default in ASG | Yes | No — must be explicitly enabled |
| Marks unhealthy when | Instance state ≠ running, or system/instance status check fails | Target Group health check returns unhealthy (e.g., response code outside the configured matcher, timeout) |
| Blind spot | App crashes, port not listening, 500 errors — instance still looks "healthy" | Requires an ALB/NLB Target Group to be attached to the ASG |
| Use when | No load balancer, batch workloads, or non-HTTP services | Web/API workloads behind ALB/NLB where app responsiveness matters |
| Grace period applies | Yes | Yes — critical to configure correctly |
Understanding the Root Cause: Two Different Definitions of "Healthy"
AWS ASG health checks operate at two fundamentally different layers. Confusing them is the source of most unexpected termination events.
EC2 Health Check (Default)
The EC2 health check asks one question: Is the virtual machine itself alive? It evaluates two signals from the EC2 service:
- Instance State: Must be running. If the instance is stopping, stopped, shutting-down, or terminated, ASG marks it unhealthy.
- EC2 Status Checks: The hypervisor-level system status check and the instance-level status check. These detect hardware failures, network misconfiguration at the host level, and kernel panics.
Critically, EC2 health checks have zero visibility into your application. A Node.js process that crashed, a Java app stuck in an infinite GC loop returning 503s, or a misconfigured Nginx — all invisible to EC2 health checks. The instance is "running," so ASG considers it healthy.
ELB Health Check
When you configure ELB health checks on an ASG, the ASG defers to the health status reported by the Target Group associated with your ALB or NLB. The load balancer actively probes each registered target (your EC2 instance) on a configured path, port, and protocol at a defined interval. If the target fails the configured threshold of consecutive checks, the Target Group marks it unhealthy — and ASG acts on that signal to terminate and replace the instance.
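To make the interval and threshold settings concrete, here is a rough estimate of how long a failing target can go unnoticed. It is a sketch under a simplifying assumption: detection takes about one interval per required failure, plus one timeout for the last probe. The function name is hypothetical, and the real timing depends on when the failure starts relative to the probe schedule.

```shell
#!/bin/sh
# Approximate upper bound, in seconds, before a Target Group marks a
# failing target unhealthy.
# Assumption: interval * unhealthy_threshold + timeout (a simplification).
# Usage: detection_window_seconds INTERVAL UNHEALTHY_THRESHOLD TIMEOUT
detection_window_seconds() {
    echo $(( $1 * $2 + $3 ))
}

# Values from the Target Group examples in this article:
# 30s interval, 3 consecutive failures, 5s timeout
detection_window_seconds 30 3 5   # prints 95
```

With those settings, a crashed application could keep receiving traffic for roughly a minute and a half before the ASG is told to replace the instance.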
Analogy: EC2 health checks are like a hospital checking if a patient has a heartbeat. ELB health checks are like checking if the patient can actually hold a conversation and respond to questions. A heartbeat alone doesn't mean the patient is functional.
The Health Check Decision Flow
- Instance Launch: ASG launches a new EC2 instance and starts the health check grace period timer.
- Grace Period Active: During this window, ASG ignores all health check failures. This is your application's boot time budget — if set too low, ASG will terminate instances before they finish starting.
- EC2 Check (Always Active): After the grace period, ASG always evaluates the EC2-level status. A terminated or stopped instance is always unhealthy regardless of other settings.
- ELB Check (Conditional): If ELB health check type is enabled on the ASG AND a Target Group is attached, ASG also evaluates the Target Group's health status for the instance.
- Unhealthy Decision: If either check marks the instance unhealthy (after grace period), ASG terminates it and launches a replacement.
- Healthy Path: Instance serves traffic normally.
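The flow above can be sketched as a small shell function. This is a toy model for reasoning about outcomes, not how the ASG service is implemented; the function name and its arguments are hypothetical.

```shell
#!/bin/sh
# Toy model of the decision flow (illustrative only).
#   $1 age_seconds    seconds since the instance launched
#   $2 grace_seconds  HealthCheckGracePeriod
#   $3 check_type     EC2 or ELB
#   $4 ec2_ok         yes/no: instance running and status checks passing
#   $5 elb_ok         yes/no: Target Group reports the target healthy
asg_verdict() {
    if [ "$1" -lt "$2" ]; then
        echo "grace"      # failures are ignored during the grace period
    elif [ "$4" != "yes" ]; then
        echo "replace"    # EC2-level failure counts for both check types
    elif [ "$3" = "ELB" ] && [ "$5" != "yes" ]; then
        echo "replace"    # Target Group failure counts only with type ELB
    else
        echo "healthy"
    fi
}

asg_verdict 45  300 ELB yes no   # prints: grace
asg_verdict 400 300 EC2 yes no   # prints: healthy (the app failure is invisible)
asg_verdict 400 300 ELB yes no   # prints: replace
```

The second and third calls differ only in health check type, which is the whole difference between a broken instance staying in service and being replaced.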
Why Your Instances Are Being Terminated: Common Scenarios
Scenario 1: Grace Period Too Short
A grace period that is too short is one of the most common reasons instances that would have become healthy get terminated. Your application takes 3 minutes to start (JVM warmup, database connection pool initialization, cache priming), but your HealthCheckGracePeriod is set to 60 seconds. ASG starts checking at 60s, the app isn't ready yet, ELB marks it unhealthy, and ASG terminates it, creating a termination loop.
Fix: Set the grace period to comfortably exceed your application's worst-case startup time. Measure it: time from instance launch to first successful health check response.
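One way to take that measurement is to poll a probe command until it first succeeds and print the elapsed time. The sketch below assumes a POSIX shell; the helper name `wait_until_ok`, the URL, and the port are hypothetical. It times from the moment the script starts, so running it early in user data and adding the OS boot time gives a closer approximation of launch-to-ready.

```shell
#!/bin/sh
# Time how long a probe command takes to succeed for the first time.
# Usage: wait_until_ok MAX_SECONDS COMMAND [ARGS...]
# Prints elapsed whole seconds and returns 0 on success; returns 1 on timeout.
wait_until_ok() {
    max_seconds=$1
    shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        if [ $(( $(date +%s) - start )) -ge "$max_seconds" ]; then
            return 1
        fi
        sleep 1
    done
}

# Hypothetical usage on the instance (URL and port are assumptions):
#   wait_until_ok 600 curl -fsS http://localhost:8080/health
```

Repeating the measurement across several launches, including cold ones, gives a worst case rather than a lucky case.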
Scenario 2: Using EC2 Health Check but App is Broken
Your app is returning 500s or the process has crashed, but the EC2 instance is still in running state. EC2 health checks pass. Traffic hits the instance, users get errors. ASG never terminates it because from EC2's perspective, everything is fine. This is the opposite problem — you want ELB health checks here.
Scenario 3: ELB Health Check Misconfigured
You've switched to ELB health checks, but the health check path (/health) returns a 404, or the security group on the instance blocks the load balancer's health check probes. Every instance gets marked unhealthy immediately, causing a termination storm.
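Before switching, it can help to reproduce the probe by hand and compare the status code against the Target Group matcher. The helper below is a sketch: `code_matches` is a hypothetical name, and it handles only the common matcher forms (a single code, a comma list, or a range).

```shell
#!/bin/sh
# Does an HTTP status code satisfy a Target Group matcher?
# Handles "200", "200,301", and "200-299" style matchers.
# Usage: code_matches CODE MATCHER   (returns 0 on match, 1 otherwise)
code_matches() {
    code=$1
    old_ifs=$IFS
    IFS=,
    for part in $2; do
        IFS=$old_ifs
        case $part in
            *-*)
                low=${part%-*}
                high=${part#*-}
                if [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]; then
                    return 0
                fi
                ;;
            *)
                if [ "$code" = "$part" ]; then
                    return 0
                fi
                ;;
        esac
    done
    IFS=$old_ifs
    return 1
}

# Hypothetical probe from a host inside the VPC (the address is a placeholder):
#   status=$(curl -s -o /dev/null -m 5 -w '%{http_code}' http://10.0.1.23:8080/health)
#   code_matches "$status" "200" || echo "probe would fail with HTTP $status"
```

A 404 from this probe points at the path, while a curl timeout (reported as status 000) is more consistent with a blocked port or security group.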
Architecture: EC2 vs ELB Health Check Signal Flow
- EC2 Health Signal Path (top): The EC2 service directly reports instance state and status check results to the ASG. No application layer involvement.
- ELB Health Signal Path (bottom): The ALB/NLB Target Group actively sends HTTP/TCP probes to the instance on the configured health check port and path. The result (healthy/unhealthy) is reported to the ASG.
- ASG Decision Engine: Aggregates both signals. Either signal reporting unhealthy (post grace period) triggers the replace workflow.
- Replace Workflow: ASG terminates the unhealthy instance and launches a new one, respecting the configured termination policy.
How to Enable ELB Health Checks on an Existing ASG
Prerequisites
- An ALB or NLB with a Target Group already configured.
- The Target Group must be attached to the ASG (either at creation or via attach-load-balancer-target-groups).
- A valid health check endpoint on your application (e.g., /health returning HTTP 200).
- Security group on instances must allow inbound traffic from the load balancer's security group on the health check port.
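The attachment prerequisite can be checked from the CLI before the health check type is changed. This is a sketch: the function names are hypothetical, and the ASG name and Target Group ARN you pass in are placeholders for your own values.

```shell
#!/bin/sh
# Confirm a Target Group is attached to an ASG, attaching it if missing.
# Usage: tg_is_attached ASG_NAME TARGET_GROUP_ARN
tg_is_attached() {
    aws autoscaling describe-load-balancer-target-groups \
        --auto-scaling-group-name "$1" \
        --query 'LoadBalancerTargetGroups[].LoadBalancerTargetGroupARN' \
        --output text | tr '\t' '\n' | grep -qxF "$2"
}

# Usage: attach_tg_if_missing ASG_NAME TARGET_GROUP_ARN
attach_tg_if_missing() {
    if tg_is_attached "$1" "$2"; then
        echo "already attached"
    else
        aws autoscaling attach-load-balancer-target-groups \
            --auto-scaling-group-name "$1" \
            --target-group-arns "$2" && echo "attached"
    fi
}

# Hypothetical usage:
#   attach_tg_if_missing my-production-asg "$TG_ARN"
```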
Option 1: AWS CLI
# Update health check type and grace period on existing ASG
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-production-asg \
    --health-check-type ELB \
    --health-check-grace-period 300

# Verify the change
aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names my-production-asg \
    --query 'AutoScalingGroups[0].{HealthCheckType:HealthCheckType,GracePeriod:HealthCheckGracePeriod}'
Option 2: CloudFormation
CloudFormation ASG resource with ELB health check:
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: my-production-asg
      MinSize: '2'
      MaxSize: '10'
      DesiredCapacity: '2'
      # Attach to Target Group
      TargetGroupARNs:
        - !Ref MyTargetGroup
      # Use ELB health check type
      HealthCheckType: ELB
      # Grace period in seconds; adjust to your app startup time
      HealthCheckGracePeriod: 300
      LaunchTemplate:
        LaunchTemplateId: !Ref MyLaunchTemplate
        Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2

  MyTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: my-app-tg
      Port: 8080
      Protocol: HTTP
      VpcId: !Ref MyVPC
      TargetType: instance
      # Health check configuration
      HealthCheckEnabled: true
      HealthCheckPath: /health
      HealthCheckProtocol: HTTP
      HealthCheckPort: '8080'
      HealthCheckIntervalSeconds: 30
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 3
      Matcher:
        HttpCode: '200'
Option 3: Terraform
Terraform ASG with ELB health check:
resource "aws_autoscaling_group" "main" {
  name                = "my-production-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  # Attach to Target Group
  target_group_arns = [aws_lb_target_group.main.arn]

  # Switch to ELB health check
  health_check_type = "ELB"

  # Adjust to your application's startup time
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.main.id
    version = "$Latest"
  }
}

resource "aws_lb_target_group" "main" {
  name        = "my-app-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "instance"

  health_check {
    enabled             = true
    path                = "/health"
    protocol            = "HTTP"
    port                = "8080"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}
Diagnosing Unexpected Terminations
Before changing health check type, diagnose why instances are being marked unhealthy. Use this checklist:
Step 1: Check ASG Activity History
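# The reason for a health-driven replacement is recorded in the Cause field
# (e.g., "...taken out of service in response to an ELB system health check failure"),
# so filter on Cause rather than Description.
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name my-production-asg \
    --max-items 20 \
    --query 'Activities[?StatusCode==`Failed` || contains(Cause, `health check`)].[ActivityId,Description,Cause,StatusMessage,StartTime]' \
    --output table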
Step 2: Check Target Group Health for Specific Instances
# Get Target Group ARN first
TG_ARN=$(aws elbv2 describe-target-groups \
    --names my-app-tg \
    --query 'TargetGroups[0].TargetGroupArn' \
    --output text)

# Check health of all targets
aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --query 'TargetHealthDescriptions[*].{Instance:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Description:TargetHealth.Description}' \
    --output table
Step 3: Check EC2 Status Checks
aws ec2 describe-instance-status \
    --instance-ids i-0123456789abcdef0 \
    --query 'InstanceStatuses[0].{InstanceState:InstanceState.Name,SystemStatus:SystemStatus.Status,InstanceStatus:InstanceStatus.Status}'
Health Check Grace Period: Calculating the Right Value
The grace period must cover the total time from instance launch to application ready. Add a 20-30% buffer for variance. Components to measure:
- EC2 instance boot time: Time for OS to boot and reach running state.
- User Data script execution: Package installs, config pulls from S3/Parameter Store.
- Application startup time: JVM initialization, Spring context load, DB connection pool warmup.
- First successful health check response: Time until /health returns 200.
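The arithmetic can be sketched as a small helper that sums the measured components and applies the buffer, rounding up to a whole second. The function name is hypothetical and the numbers in the example are illustrative, not measurements.

```shell
#!/bin/sh
# Sum startup components (seconds) and add a percentage buffer, rounded up.
# Usage: grace_period BUFFER_PERCENT COMPONENT_SECONDS...
grace_period() {
    buffer_percent=$1
    shift
    total=0
    for seconds in "$@"; do
        total=$(( total + seconds ))
    done
    # Integer ceiling of total * (100 + buffer_percent) / 100
    echo $(( (total * (100 + buffer_percent) + 99) / 100 ))
}

# Illustrative values: boot 40s, user data 60s, app start 90s,
# first 200 response 10s later, with a 25% buffer
grace_period 25 40 60 90 10   # prints 250
```

The result is the value to pass as HealthCheckGracePeriod; rounding it up further to a convenient number such as 300 does no harm beyond slightly delaying detection on fresh instances.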
Security Considerations
When enabling ELB health checks, ensure your security group configuration follows least privilege:
# The instance security group must allow inbound from the ALB security group
# on the health check port, NOT from 0.0.0.0/0
# Example: Allow ALB SG to reach instances on port 8080
# (a rule description has to be supplied through --ip-permissions)
aws ec2 authorize-security-group-ingress \
    --group-id sg-instance-security-group-id \
    --ip-permissions 'IpProtocol=tcp,FromPort=8080,ToPort=8080,UserIdGroupPairs=[{GroupId=sg-alb-security-group-id,Description="Allow ALB health checks and traffic"}]'
Do not open the health check port to 0.0.0.0/0. Reference the ALB's security group ID as the source. This ensures only your load balancer can reach the application port directly.
Decision Guide: Which Health Check Type Should You Use?
Wrap-up & Next Steps
Switching from EC2 to ELB health checks is the right move for any web or API workload behind a load balancer — but it's only effective when paired with a correctly configured grace period, a reliable health check endpoint, and proper security group rules. Blindly switching without these prerequisites will cause a different failure mode: a termination storm where every instance is immediately marked unhealthy.
Action items:
- Audit your current ASG health check type: aws autoscaling describe-auto-scaling-groups --query 'AutoScalingGroups[*].{Name:AutoScalingGroupName,HealthCheckType:HealthCheckType,GracePeriod:HealthCheckGracePeriod}'
- Measure your application's actual startup time before setting the grace period.
- Implement a dedicated /health endpoint that validates critical dependencies (DB connectivity, cache availability) instead of always returning HTTP 200.
- Review AWS documentation: Health checks for Auto Scaling instances.
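As a rough illustration of a dependency-aware health endpoint, the sketch below maps a set of probe commands to the status code a /health handler might return. It shows only the aggregation logic, not an HTTP server; `health_status` and the hostnames in the usage comment are hypothetical.

```shell
#!/bin/sh
# Map dependency probe results to the status code a /health handler
# could return: 200 only if every probe command succeeds, else 503.
# Usage: health_status "COMMAND STRING" ["COMMAND STRING" ...]
health_status() {
    for check in "$@"; do
        if ! sh -c "$check" >/dev/null 2>&1; then
            echo 503
            return 0
        fi
    done
    echo 200
}

# Hypothetical probes; substitute the real clients for your database and cache:
#   health_status "pg_isready -h db.internal -t 2" "redis-cli -h cache.internal ping"
```

One trade-off to weigh: if a shared dependency fails, every instance reports 503 at the same time, so keep the probes limited to dependencies the instance cannot serve traffic without.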
Glossary
| Term | Definition |
|---|---|
| Health Check Grace Period | A configurable time window (in seconds) after an instance launches during which ASG ignores health check failures. Prevents premature termination during application startup. |
| EC2 Status Checks | Automated checks performed by AWS. The system status check evaluates the underlying host (hardware, power, host-level networking); the instance status check evaluates the instance itself (OS-level network configuration, kernel problems). |
| Target Group Health Check | Active probes sent by an ALB or NLB to registered targets to determine if they can receive and respond to traffic. Configured per Target Group with path, protocol, thresholds, and interval. |
| Unhealthy Threshold | The number of consecutive failed health checks required before a Target Group marks an instance as unhealthy. Prevents flapping from transient failures. |
| Termination Policy | The strategy ASG uses to select which instance to terminate during scale-in events (e.g., OldestInstance, Default). Separate from health-check-driven terminations. |