EC2 System Status Check Failure: What It Means and How to Fix It

You log into the EC2 console and see "1/2 checks failed" — specifically the System Status Check. Before you panic or blindly reboot, you need to understand what this check actually monitors and why the fix is almost always a Stop → Start, not a reboot.

TL;DR

Check Type	What It Monitors	Who Owns the Problem	Fix
System Status Check	AWS underlying physical host (hardware, network, power)	AWS	Stop & Start (migrates to new host)
Instance Status Check	Your instance OS, kernel, network config, file system	You	Reboot, OS-level troubleshooting

The Two-Layer Health Model

AWS EC2 health checks operate at two distinct layers. Think of it as a two-story building: the System Status Check inspects the foundation and plumbing (AWS infrastructure), while the Instance Status Check inspects what's happening inside your apartment (your OS and application stack).

Analogy: A System Status Check failure is like your apartment building losing power from the utility grid — it's not your fault, and moving to a different unit (host) is the only real fix. An Instance Status Check failure is like a blown fuse inside your own apartment — you need to go in and fix it yourself.

AWS Infrastructure Layer: The physical host running your EC2 instance is continuously monitored by AWS for hardware faults, loss of network connectivity, loss of system power, and software issues on the host.
System Status Check: Probes the AWS-managed hypervisor and physical host. A failure here means AWS has detected a problem with the underlying infrastructure — not your instance.
Instance Status Check: Probes the virtual machine itself — checks that the OS is accepting network traffic, the kernel hasn't panicked, and the file system is healthy.
Your Application: Sits on top of the OS layer. Both checks must pass for your instance to be considered fully healthy.

Why a Reboot Does NOT Fix a System Status Check Failure

This is the most critical point. A reboot restarts the OS on the same physical host. If the underlying hardware is degraded, you're restarting on broken infrastructure — the system status check will fail again immediately.

A Stop → Start is fundamentally different: AWS terminates the instance on the current host and provisions it on a new, healthy physical host. This is the correct remediation for a System Status Check failure.

graph LR subgraph Wrong_Fix [Reboot - Wrong Fix] R1[Degraded Host] -->|Reboot| R2[Same Degraded Host] R2 --> R3[System Check Fails Again] end subgraph Correct_Fix [Stop and Start - Correct Fix] S1[Degraded Host] -->|Stop| S2[Instance De-provisioned] S2 -->|Start| S3[New Healthy Host] S3 --> S4[System Check Passes] end

Reboot path (wrong fix): The instance restarts but remains on the same degraded physical host. The System Status Check fails again.
Stop → Start path (correct fix): AWS de-provisions the instance from the faulty host and schedules it on a new, healthy host. The System Status Check passes.
Critical caveat: Stop → Start only works for EBS-backed instances. The root volume persists on EBS and reattaches to the new host. Instance store-backed instances cannot be stopped — they must be terminated and relaunched.

Step-by-Step: How to Diagnose and Resolve

Step 1 — Confirm Which Check Is Failing

In the EC2 Console, select your instance → Status checks tab. Identify whether it's the System Status Check, Instance Status Check, or both.

Step 2 — Check for AWS Scheduled Events

AWS often detects host degradation before you do and schedules a maintenance event. Check the Events column in the EC2 console or use the CLI:

aws ec2 describe-instance-status \
  --instance-ids i-0abcdef1234567890 \
  --region us-east-1 \
  --query 'InstanceStatuses[*].Events'

If AWS has already scheduled a host retirement or maintenance event, they may migrate the instance automatically. You can also proactively stop and start to migrate immediately.

Step 3 — Stop and Start the Instance (EBS-Backed)

Via AWS CLI:

# Stop the instance
aws ec2 stop-instances \
  --instance-ids i-0abcdef1234567890 \
  --region us-east-1

# Wait until stopped
aws ec2 wait instance-stopped \
  --instance-ids i-0abcdef1234567890 \
  --region us-east-1

# Start the instance on a new host
aws ec2 start-instances \
  --instance-ids i-0abcdef1234567890 \
  --region us-east-1

⚠️ Important: If your instance has a public IPv4 address (not an Elastic IP), it will change after Stop → Start. Assign an Elastic IP to avoid this.

Step 4 — Automate Recovery with CloudWatch Alarms

Instead of manually monitoring, configure a CloudWatch alarm to automatically recover the instance when a System Status Check fails. The EC2 Instance Recovery action migrates the instance to a new host while preserving the instance ID, Elastic IP, and EBS volumes.

🔽 [Click to expand] CloudWatch Alarm — Auto-Recovery JSON (CloudFormation)

{
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmName": "EC2-SystemStatusCheck-AutoRecover",
    "AlarmDescription": "Trigger EC2 recovery on system status check failure",
    "Namespace": "AWS/EC2",
    "MetricName": "StatusCheckFailed_System",
    "Dimensions": [
      {
        "Name": "InstanceId",
        "Value": "i-0abcdef1234567890"
      }
    ],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 2,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": [
      "arn:aws:automate:us-east-1:ec2:recover"
    ]
  }
}

Note: The EC2 Auto-Recovery action (arn:aws:automate:<region>:ec2:recover) is supported only for instances that use EBS volumes for all storage and are not instance store-backed. Check the official documentation for supported instance types.

What Causes System Status Check Failures?

According to AWS documentation, System Status Check failures are caused by problems with the underlying AWS infrastructure, including:

Loss of network connectivity to the physical host
Loss of system power on the physical host
Software issues on the underlying host (hypervisor-level)
Hardware issues affecting network reachability of the physical host

These are entirely within AWS's responsibility boundary. You cannot SSH into the host to fix them — the only lever you have is migrating your instance to a new host via Stop → Start or the Recovery action.

Prevention: Architect for Host-Level Failures

A single EC2 instance is a single point of failure. System Status Check failures are a reminder that physical hardware degrades. Design accordingly:

graph LR ALB[Application Load Balancer] --> AZ1 ALB --> AZ2 subgraph AZ1 [Availability Zone A] ASG1[EC2 Instance - ASG] end subgraph AZ2 [Availability Zone B] ASG2[EC2 Instance - ASG] end CW[CloudWatch Alarm - StatusCheckFailed] -->|Auto-Recover or Replace| ASG1 CW -->|Auto-Recover or Replace| ASG2

Auto Scaling Group (ASG) with min=1: If the instance fails, ASG automatically launches a replacement. Pair with a CloudWatch alarm on StatusCheckFailed_System.
Multi-AZ deployment: Distribute instances across Availability Zones. A host failure in one AZ does not impact instances in another.
Elastic Load Balancer (ELB): Routes traffic only to healthy instances. Combined with ASG, this provides seamless failover.
CloudWatch Auto-Recovery: For single-instance workloads (e.g., a dev server), the recovery alarm is a lightweight safety net that preserves the instance identity.

Glossary

Term	Definition
System Status Check	An AWS-managed health probe that monitors the physical host and hypervisor infrastructure underlying your EC2 instance.
Instance Status Check	A health probe that monitors the virtual machine's OS, kernel, and network configuration — within your responsibility boundary.
EC2 Instance Recovery	A CloudWatch alarm action that migrates an EC2 instance to a new healthy host while preserving its instance ID, Elastic IPs, and EBS volumes.
EBS-Backed Instance	An EC2 instance whose root volume is an EBS volume, allowing it to be stopped and restarted (and thus migrated to a new host).
Elastic IP	A static public IPv4 address that remains associated with your AWS account and can be remapped to a new instance, surviving Stop → Start cycles.

Search This Blog

SW BBANG