ASG Health Checks: Why Your Instances Are Being Terminated and When to Switch from EC2 to ELB

Your Auto Scaling Group is terminating instances that appear perfectly healthy from an OS perspective — the VM is running, SSH works, but ASG keeps replacing them. This is one of the most common and costly misconfigurations in production AWS environments, and the root cause almost always comes down to a single decision: which health check type your ASG is using.

TL;DR

Dimension	EC2 Health Check	ELB Health Check
What it checks	Hypervisor-level instance state (running, stopped, terminated) and system status checks	Application-level HTTP/TCP response from the load balancer's perspective
Default in ASG	Yes	No — must be explicitly enabled
Marks unhealthy when	Instance state ≠ `running`, or system/instance status check fails	Target Group health check returns unhealthy (e.g., HTTP response code outside the configured matcher, timeout)
Blind spot	App crashes, port not listening, 500 errors — instance still looks "healthy"	Requires an ALB/NLB Target Group to be attached to the ASG
Use when	No load balancer, batch workloads, or non-HTTP services	Web/API workloads behind ALB/NLB where app responsiveness matters
Grace period applies	Yes	Yes — critical to configure correctly

Understanding the Root Cause: Two Different Definitions of "Healthy"

AWS ASG health checks operate at two fundamentally different layers. Confusing them is the source of most unexpected termination events.

EC2 Health Check (Default)

The EC2 health check asks one question: Is the virtual machine itself alive? It evaluates two signals from the EC2 service:

Instance State: Must be running. If the instance is stopping, stopped, shutting-down, or terminated, ASG marks it unhealthy.
EC2 Status Checks: The hypervisor-level system status check and the instance-level status check. These detect hardware failures, network misconfiguration at the host level, and kernel panics.

Critically, EC2 health checks have zero visibility into your application. A Node.js process that crashed, a Java app stuck in an infinite GC loop returning 503s, or a misconfigured Nginx — all invisible to EC2 health checks. The instance is "running," so ASG considers it healthy.

ELB Health Check

When you configure ELB health checks on an ASG, the ASG defers to the health status reported by the Target Group associated with your ALB or NLB. The load balancer actively probes each registered target (your EC2 instance) on a configured path, port, and protocol at a defined interval. If the target fails the configured threshold of consecutive checks, the Target Group marks it unhealthy — and ASG acts on that signal to terminate and replace the instance.
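As a rough worked example of the threshold mechanism, the time a broken target keeps its healthy status is approximately the probe interval multiplied by the unhealthy threshold. The sketch below assumes the 30-second interval and threshold of 3 used in this article's Target Group examples; `estimate_unhealthy_after` is a hypothetical helper, not an AWS command, and it ignores probe timeouts and the additional delay before ASG acts.

```shell
#!/bin/sh
# Rough estimate only: seconds of consecutive failed probes before the
# Target Group flips a target to unhealthy.
# estimate_unhealthy_after is a hypothetical helper, not part of the AWS CLI.
estimate_unhealthy_after() {
  interval_seconds=$1
  unhealthy_threshold=$2
  echo $(( interval_seconds * unhealthy_threshold ))
}

# 30 s interval, 3 consecutive failures
estimate_unhealthy_after 30 3   # prints 90
```

With these values, a crashed application can keep receiving traffic for about a minute and a half before replacement starts, which is the trade-off behind choosing a higher threshold to avoid flapping.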

Analogy: EC2 health checks are like a hospital checking if a patient has a heartbeat. ELB health checks are like checking if the patient can actually hold a conversation and respond to questions. A heartbeat alone doesn't mean the patient is functional.

The Health Check Decision Flow

flowchart TD
    A["Instance Launched"] --> B["Grace Period Active (HealthCheckGracePeriod)"]
    B --> C{"Grace Period Expired?"}
    C -- "No" --> B
    C -- "Yes" --> D["EC2 Health Check (Always Active)"]
    D --> E{"EC2 Status Healthy?"}
    E -- "No" --> I["Mark Unhealthy"]
    E -- "Yes" --> F{"ELB Health Check Enabled on ASG?"}
    F -- "No" --> H["Mark Healthy ✅"]
    F -- "Yes" --> G{"Target Group Reports Healthy?"}
    G -- "Yes" --> H
    G -- "No" --> I
    I --> J["ASG Terminates Instance"]
    J --> K["ASG Launches Replacement"]
    H --> L["Instance Serves Traffic"]
    style A fill:#2d6a4f,color:#fff
    style H fill:#2d6a4f,color:#fff
    style I fill:#c1121f,color:#fff
    style J fill:#c1121f,color:#fff
    style K fill:#e9c46a,color:#000

Instance Launch: ASG launches a new EC2 instance. The health check grace period timer starts once the instance enters the InService state, after any launch lifecycle hooks complete.
Grace Period Active: During this window, ASG ignores all health check failures. This is your application's boot time budget — if set too low, ASG will terminate instances before they finish starting.
EC2 Check (Always Active): After the grace period, ASG always evaluates the EC2-level status. A terminated or stopped instance is always unhealthy regardless of other settings.
ELB Check (Conditional): If ELB health check type is enabled on the ASG AND a Target Group is attached, ASG also evaluates the Target Group's health status for the instance.
Unhealthy Decision: If either check marks the instance unhealthy (after grace period), ASG terminates it and launches a replacement.
Healthy Path: Instance serves traffic normally.
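The steps above can be sketched as a small decision function. This is an illustrative model of the flow as this article describes it, not how the Auto Scaling service is implemented; `asg_health_verdict` and its argument values are hypothetical names chosen for the example.

```shell
#!/bin/sh
# asg_health_verdict is a hypothetical helper that mirrors the flow above.
# Arguments:
#   $1 grace_expired   yes|no
#   $2 ec2_status      healthy|unhealthy
#   $3 health_type     EC2|ELB   (the ASG HealthCheckType)
#   $4 target_status   healthy|unhealthy (ignored when the type is EC2)
asg_health_verdict() {
  grace_expired=$1
  ec2_status=$2
  health_type=$3
  target_status=$4

  # During the grace period, failures are not acted on.
  if [ "$grace_expired" != "yes" ]; then
    echo "in-grace-period"
    return 0
  fi
  # The EC2 check is always evaluated.
  if [ "$ec2_status" != "healthy" ]; then
    echo "unhealthy"
    return 0
  fi
  # The Target Group result only counts when the type is ELB.
  if [ "$health_type" = "ELB" ] && [ "$target_status" != "healthy" ]; then
    echo "unhealthy"
    return 0
  fi
  echo "healthy"
}

# Same failing application, two different health check types:
asg_health_verdict yes healthy EC2 unhealthy   # prints: healthy
asg_health_verdict yes healthy ELB unhealthy   # prints: unhealthy
```

The two calls at the end show the core point of this section: with the EC2 type, an application-level failure never reaches the ASG's decision.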

Why Your Instances Are Being Terminated: Common Scenarios

Scenario 1: Grace Period Too Short

This is the #1 cause of "healthy instances being terminated." Your application takes 3 minutes to start (JVM warmup, database connection pool initialization, cache priming), but your HealthCheckGracePeriod is set to 60 seconds. ASG starts checking at 60s, the app isn't ready yet, ELB marks it unhealthy, and ASG terminates it — creating a termination loop.

Fix: Set the grace period to comfortably exceed your application's worst-case startup time. Measure it: time from instance launch to first successful health check response.

Scenario 2: Using EC2 Health Check but App is Broken

Your app is returning 500s or the process has crashed, but the EC2 instance is still in running state. EC2 health checks pass. Traffic hits the instance, users get errors. ASG never terminates it because from EC2's perspective, everything is fine. This is the opposite problem — you want ELB health checks here.

Scenario 3: ELB Health Check Misconfigured

You've switched to ELB health checks, but the health check path (/health) returns a 404, or the security group on the instance blocks the load balancer's health check probes. Every instance gets marked unhealthy immediately, causing a termination storm.
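To see why a 404 on the health check path fails every target, it can help to model how a success-code matcher is evaluated. The sketch below assumes the matcher forms that ALB Target Groups accept (a single code, a comma-separated list, or a range); `code_matches` is a hypothetical helper written for illustration, not part of any AWS tool.

```shell
#!/bin/sh
# code_matches is a hypothetical helper: returns 0 when an HTTP status code
# satisfies a matcher written as a single code ("200"), a comma-separated
# list ("200,204"), or a range ("200-299"). Returns 1 otherwise.
code_matches() {
  code=$1
  matcher=$2
  old_ifs=$IFS
  IFS=,
  # Unquoted on purpose so the matcher is split on commas.
  set -- $matcher
  IFS=$old_ifs
  for part in "$@"; do
    case $part in
      *-*)
        low=${part%-*}
        high=${part#*-}
        if [ "$code" -ge "$low" ] && [ "$code" -le "$high" ]; then
          return 0
        fi
        ;;
      *)
        if [ "$code" -eq "$part" ]; then
          return 0
        fi
        ;;
    esac
  done
  return 1
}

# A /health path that returns 404 against a matcher of "200":
if code_matches 404 "200"; then echo "healthy"; else echo "unhealthy"; fi   # prints: unhealthy
```

Before switching the health check type, requesting the health check path from inside the VPC and comparing the returned code with the Target Group's matcher catches this scenario early.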

Architecture: EC2 vs ELB Health Check Signal Flow

graph LR
    subgraph EC2_Path ["EC2 Health Check Signal Path"]
        direction LR
        HV["AWS Hypervisor / EC2 Service"] -->|"Instance State + Status Checks"| ASG_EC2["ASG Health Evaluator"]
    end
    subgraph ELB_Path ["ELB Health Check Signal Path"]
        direction LR
        ALB["ALB / NLB"] -->|"HTTP/TCP Probe (e.g., GET /health)"| INST["EC2 Instance :8080"]
        INST -->|"HTTP 200 OK or Failure"| TG["Target Group Health Status"]
        TG -->|"Healthy / Unhealthy"| ASG_ELB["ASG Health Evaluator"]
    end
    ASG_EC2 --> DECISION{"Healthy?"}
    ASG_ELB --> DECISION
    DECISION -->|"Yes (both healthy)"| SERVE["Serve Traffic ✅"]
    DECISION -->|"No (either unhealthy)"| REPLACE["Terminate & Replace Instance 🔄"]
    style EC2_Path fill:#e8f4f8,stroke:#0077b6
    style ELB_Path fill:#fff3e0,stroke:#e07b00
    style REPLACE fill:#c1121f,color:#fff
    style SERVE fill:#2d6a4f,color:#fff

EC2 Health Signal Path (top): The EC2 service directly reports instance state and status check results to the ASG. No application layer involvement.
ELB Health Signal Path (bottom): The ALB/NLB Target Group actively sends HTTP/TCP probes to the instance on the configured health check port and path. The result (healthy/unhealthy) is reported to the ASG.
ASG Decision Engine: Aggregates both signals. Either signal reporting unhealthy (post grace period) triggers the replace workflow.
Replace Workflow: ASG terminates the unhealthy instance and launches a new one. The termination policy plays no role here; it only selects instances during scale-in events.

How to Enable ELB Health Checks on an Existing ASG

Prerequisites

An ALB or NLB with a Target Group already configured.
The Target Group must be attached to the ASG (either at creation or via attach-load-balancer-target-groups).
A valid health check endpoint on your application (e.g., /health returning HTTP 200).
Security group on instances must allow inbound traffic from the load balancer's security group on the health check port.

Option 1: AWS CLI

# Update health check type and grace period on existing ASG
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-production-asg \
  --health-check-type ELB \
  --health-check-grace-period 300

# Verify the change
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-production-asg \
  --query 'AutoScalingGroups[0].{HealthCheckType:HealthCheckType,GracePeriod:HealthCheckGracePeriod}'

Option 2: CloudFormation

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: my-production-asg
      MinSize: '2'
      MaxSize: '10'
      DesiredCapacity: '2'
      # Attach to Target Group
      TargetGroupARNs:
        - !Ref MyTargetGroup
      # Use ELB health check type
      HealthCheckType: ELB
      # Grace period in seconds — adjust to your app startup time
      HealthCheckGracePeriod: 300
      LaunchTemplate:
        LaunchTemplateId: !Ref MyLaunchTemplate
        Version: !GetAtt MyLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2

  MyTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Name: my-app-tg
      Port: 8080
      Protocol: HTTP
      VpcId: !Ref MyVPC
      TargetType: instance
      # Health check configuration
      HealthCheckEnabled: true
      HealthCheckPath: /health
      HealthCheckProtocol: HTTP
      HealthCheckPort: '8080'
      HealthCheckIntervalSeconds: 30
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      UnhealthyThresholdCount: 3
      Matcher:
        HttpCode: '200'

Option 3: Terraform

resource "aws_autoscaling_group" "main" {
  name                      = "my-production-asg"
  min_size                  = 2
  max_size                  = 10
  desired_capacity          = 2
  vpc_zone_identifier       = [aws_subnet.private_1.id, aws_subnet.private_2.id]

  # Attach to Target Group
  target_group_arns = [aws_lb_target_group.main.arn]

  # Switch to ELB health check
  health_check_type         = "ELB"
  # Adjust to your application's startup time
  health_check_grace_period = 300

  launch_template {
    id      = aws_launch_template.main.id
    version = "$Latest"
  }
}

resource "aws_lb_target_group" "main" {
  name        = "my-app-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = aws_vpc.main.id
  target_type = "instance"

  health_check {
    enabled             = true
    path                = "/health"
    protocol            = "HTTP"
    port                = "8080"
    interval            = 30
    timeout             = 5
    healthy_threshold   = 2
    unhealthy_threshold = 3
    matcher             = "200"
  }
}

Diagnosing Unexpected Terminations

Before changing health check type, diagnose why instances are being marked unhealthy. Use this checklist:

Step 1: Check ASG Activity History

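# List recent activities that failed or were triggered by a health check.
# Health-driven replacements are explained in the Cause field, which typically
# mentions the failed health check, so filter on Cause rather than Description.
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-production-asg \
  --max-items 20 \
  --query 'Activities[?StatusCode==`Failed` || contains(Cause, `health check`)].[ActivityId,Description,Cause,StatusMessage,StartTime]' \
  --output table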

Step 2: Check Target Group Health for Specific Instances

# Get Target Group ARN first
TG_ARN=$(aws elbv2 describe-target-groups \
  --names my-app-tg \
  --query 'TargetGroups[0].TargetGroupArn' \
  --output text)

# Check health of all targets
aws elbv2 describe-target-health \
  --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[*].{Instance:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Description:TargetHealth.Description}' \
  --output table
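
The Reason column from the command above is usually the fastest pointer to which scenario you are in. The following sketch maps a few `TargetHealth.Reason` codes to the scenarios discussed in this article; `explain_target_reason` is a hypothetical helper, and the hints are this article's interpretation rather than official AWS guidance.

```shell
#!/bin/sh
# explain_target_reason is a hypothetical helper. The reason codes are values
# reported in TargetHealth.Reason; the hints are illustrative only.
explain_target_reason() {
  case $1 in
    Elb.InitialHealthChecking|Elb.RegistrationInProgress)
      echo "still registering or in initial checks; compare with the grace period" ;;
    Target.ResponseCodeMismatch)
      echo "health check path returned a code outside the matcher; check the path" ;;
    Target.Timeout)
      echo "probe timed out; check security groups and application latency" ;;
    Target.FailedHealthChecks)
      echo "probe could not connect or got a malformed response; check the process and port" ;;
    *)
      echo "no hint for this reason code; see TargetHealth.Description" ;;
  esac
}

explain_target_reason Target.ResponseCodeMismatch
```

A mismatch or timeout on every target at once points to configuration (Scenario 3), while the same reason on a single instance more likely points to that instance's application.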

Step 3: Check EC2 Status Checks

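# --include-all-instances also returns instances that are not in the running
# state; without it, a stopped instance produces an empty result.
aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --include-all-instances \
  --query 'InstanceStatuses[0].{InstanceState:InstanceState.Name,SystemStatus:SystemStatus.Status,InstanceStatus:InstanceStatus.Status}'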

Health Check Grace Period: Calculating the Right Value

gantt
    title Health Check Grace Period Components
    dateFormat s
    axisFormat %Ss
    section Instance Lifecycle
    EC2 Boot (OS Start) : a1, 0, 30s
    User Data Script Execution : a2, after a1, 60s
    Application Process Start : a3, after a2, 90s
    App Ready (First 200 OK) : milestone, after a3, 0s
    section Grace Period
    Recommended Grace Period (300s buffer) : crit, 0, 300s

The grace period must cover the total time from instance launch to application ready. Add a 20-30% buffer for variance. Components to measure:

EC2 instance boot time: Time for the instance to reach the running state and for the OS to finish booting.
User Data script execution: Package installs, config pulls from S3/Parameter Store.
Application startup time: JVM initialization, Spring context load, DB connection pool warmup.
First successful health check response: Time until /health returns 200.
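
The calculation can be sketched as simple arithmetic: sum the measured phases and add the buffer. The example below uses the phase durations from the chart above (30 s, 60 s, 90 s) with a 30% buffer; `recommend_grace_period` is a hypothetical helper, and real values should come from your own measurements.

```shell
#!/bin/sh
# recommend_grace_period is a hypothetical helper: sums the measured startup
# phases (in seconds) and adds a percentage buffer, rounding up.
# Usage: recommend_grace_period BUFFER_PERCENT PHASE_SECONDS...
recommend_grace_period() {
  buffer_percent=$1
  shift
  total=0
  for phase_seconds in "$@"; do
    total=$(( total + phase_seconds ))
  done
  # Integer ceiling of total * (100 + buffer) / 100
  echo $(( (total * (100 + buffer_percent) + 99) / 100 ))
}

# 30 s boot + 60 s user data + 90 s application start, 30% buffer
recommend_grace_period 30 30 60 90   # prints 234
```

The 300-second value used in this article's configuration examples sits comfortably above that result, which leaves room for slower launches.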

Security Considerations

When enabling ELB health checks, ensure your security group configuration follows least privilege:

# The instance security group must allow inbound from the ALB security group
# on the health check port — NOT from 0.0.0.0/0

# Example: Allow ALB SG to reach instances on port 8080
aws ec2 authorize-security-group-ingress \
  --group-id sg-instance-security-group-id \
  --protocol tcp \
  --port 8080 \
  --source-group sg-alb-security-group-id \
  --description "Allow ALB health checks and traffic"

Do not open the health check port to 0.0.0.0/0. Reference the ALB's security group ID as the source. This ensures only your load balancer can reach the application port directly.

Decision Guide: Which Health Check Type Should You Use?

flowchart TD
    START(["What type of workload is your ASG running?"]) --> Q1{"Is there an ALB or NLB Target Group attached?"}
    Q1 -- "No" --> EC2_ONLY["Use EC2 Health Check (Default)"]
    Q1 -- "Yes" --> Q2{"Does your app serve HTTP/HTTPS traffic?"}
    Q2 -- "No (TCP/UDP service)" --> Q3{"Does NLB Target Group have health checks configured?"}
    Q3 -- "Yes" --> ELB_CHOICE["Use ELB Health Check"]
    Q3 -- "No" --> EC2_ONLY
    Q2 -- "Yes" --> Q4{"Do you have a reliable /health endpoint?"}
    Q4 -- "No" --> ACTION1["⚠️ Build /health endpoint first, then enable ELB health check"]
    Q4 -- "Yes" --> Q5{"Is grace period set to ≥ app startup time?"}
    Q5 -- "No" --> ACTION2["⚠️ Measure startup time, set grace period, then enable"]
    Q5 -- "Yes" --> ELB_CHOICE
    style ELB_CHOICE fill:#2d6a4f,color:#fff
    style EC2_ONLY fill:#0077b6,color:#fff
    style ACTION1 fill:#e9c46a,color:#000
    style ACTION2 fill:#e9c46a,color:#000

Wrap-up & Next Steps

Switching from EC2 to ELB health checks is the right move for any web or API workload behind a load balancer — but it's only effective when paired with a correctly configured grace period, a reliable health check endpoint, and proper security group rules. Blindly switching without these prerequisites will cause a different failure mode: a termination storm where every instance is immediately marked unhealthy.

Action items:

Audit your current ASG health check type: aws autoscaling describe-auto-scaling-groups --query 'AutoScalingGroups[*].{Name:AutoScalingGroupName,HealthCheckType:HealthCheckType,GracePeriod:HealthCheckGracePeriod}'
Measure your application's actual startup time before setting the grace period.
Implement a dedicated /health endpoint that validates critical dependencies (DB connectivity, cache availability) — not just HTTP 200.
Review AWS documentation: Health checks for Auto Scaling instances.

Glossary

Term	Definition
Health Check Grace Period	A configurable time window (in seconds) after an instance launches during which ASG ignores health check failures. Prevents premature termination during application startup.
EC2 Status Checks	Automated hypervisor-level checks performed by AWS that evaluate the reachability and functionality of the underlying hardware and the EC2 instance OS kernel.
Target Group Health Check	Active probes sent by an ALB or NLB to registered targets to determine if they can receive and respond to traffic. Configured per Target Group with path, protocol, thresholds, and interval.
Unhealthy Threshold	The number of consecutive failed health checks required before a Target Group marks an instance as unhealthy. Prevents flapping from transient failures.
Termination Policy	The strategy ASG uses to select which instance to terminate during scale-in events (e.g., `OldestInstance`, `Default`). Separate from health-check-driven terminations.
