Why Is My EC2 Instance Failing the System Status Check (And How to Fix It)

Seeing 1/2 checks failed on an EC2 instance — specifically a System Status Check failure — is one of those moments that stops a deployment cold. Unlike an instance status check failure (which points to your OS or application), a system status check failure signals a problem at the AWS infrastructure layer beneath your instance. Understanding exactly what that means, and what actions are actually available to you, prevents wasted time and unnecessary risk.

TL;DR: EC2 System Status Check Failure at a Glance

AttributeSystem Status CheckInstance Status Check
What it monitorsAWS infrastructure (host hardware, network, power)Your instance OS and software
Who owns the problemAWSYou
Primary fixStop & Start (migrates to new host)Reboot, OS repair, or AMI rebuild
Reboot sufficient?No — reboot keeps the same hostSometimes
Data risk on fixInstance store data lost on Stop/StartDepends on action taken

How EC2 Status Checks Actually Work

AWS runs two independent health probes against every running instance every minute. The system status check tests the physical host: network reachability to the instance, power delivery, and the hypervisor layer. The instance status check tests the guest: whether the OS kernel is responding, whether the network interface inside the instance is functional, and whether the system clock is synchronized. When you see 1/2 checks failed, the number tells you which layer failed — not how severe the failure is.

graph TD A["AWS Physical Host
(hardware, power, network)"] -->|"System Status Check
monitors this layer"| B["System Status Check"] C["EC2 Guest Instance
(OS, kernel, NIC)"] -->|"Instance Status Check
monitors this layer"| D["Instance Status Check"] B --> E{"Status?"} D --> F{"Status?"} E -->|"impaired"| G["Host-level fault
Fix: Stop/Start"] E -->|"ok"| H["Infrastructure healthy"] F -->|"impaired"| I["OS/software fault
Fix: Reboot or OS repair"] F -->|"ok"| J["Guest OS healthy"] G --> K["Instance migrates
to new physical host"]
  1. AWS Infrastructure Layer — physical host hardware, hypervisor, and network fabric. System status check monitors this layer.
  2. EC2 Instance Layer — your guest OS, kernel, and network interface. Instance status check monitors this layer.
  3. A failure at Layer 1 does not necessarily mean your OS is unhealthy — the guest may be perfectly fine, simply unreachable due to a host-level fault.
  4. A Stop/Start migrates the instance to a different physical host. A Reboot does not — it restarts the guest on the same host, leaving the underlying hardware fault unresolved.

Why a System Status Check Fails: Root Causes

AWS does not expose the specific hardware fault to customers, but the documented causes fall into a small set of categories. Knowing these shapes your response — particularly whether to wait for AWS to remediate automatically or to act immediately.

  • Host hardware degradation — a physical component (disk, NIC, memory) on the underlying host is failing or has failed.
  • Host network connectivity loss — the physical host has lost connectivity to the AWS network fabric.
  • Power issues — the physical host has lost power or is experiencing power instability.
  • Software issues on the host — hypervisor-level software faults that AWS manages.
Think of the physical host as the rack server in a data center you don't own. When that server's NIC fails, your VM is unreachable — not because your OS crashed, but because the physical network path is broken. You can't fix it from inside the VM.

Diagnosing a System Status Check Failure Step by Step

Step 1: Confirm Which Check Is Failing

Before taking any action, confirm you are dealing with a system status check failure and not an instance status check failure — the remediation paths diverge completely. The console label can be ambiguous under load; the CLI gives you an unambiguous answer. This also captures the exact failure state for any support case you may need to open.

aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1 \
  --output table

Look for SystemStatusStatus: impaired and InstanceStatusStatus: ok. If both show impaired, you have a compound problem and the instance-level issue needs separate investigation after the host is resolved.

Step 2: Check for an Active AWS Scheduled Event

AWS sometimes schedects maintenance events — including host retirements — that appear as system status check failures before the event window opens. If a scheduled event exists, AWS may migrate the instance automatically within the event window, and acting before that window can disrupt the managed process. This step tells you whether you are ahead of AWS or behind it.

aws ec2 describe-instance-status \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1 \
  --query 'InstanceStatuses[*].Events'

If the output contains an event with Code: instance-retirement or Code: system-reboot, read the NotBefore and NotAfter timestamps. For a retirement event, a Stop/Start before the deadline migrates you cleanly. For a reboot event, AWS will reboot the instance on your behalf — you may choose to act first to control the timing.

Step 3: Verify the Instance Root Volume Is EBS-Backed

The correct fix for a system status check failure is a Stop/Start — but this only applies to EBS-backed instances. Instance store-backed instances cannot be stopped; they can only be rebooted or terminated. More critically, a Stop/Start on an instance with instance store volumes will destroy all data on those volumes. Confirming the root device type before acting prevents an irreversible data loss event.

aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1 \
  --query 'Reservations[*].Instances[*].RootDeviceType'

If the result is ebs, proceed to the Stop/Start. If the result is instance-store, your options are limited to waiting for AWS to remediate the host or terminating and relaunching from an AMI.

Step 4: Check for Instance Store Volumes Before Stopping

Even on an EBS-backed instance, additional instance store volumes may be attached. Data on those volumes is lost permanently when the instance stops — it does not survive the host migration. This step is not about whether you can stop the instance; it is about whether you have data you cannot afford to lose before you do.

aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1 \
  --query 'Reservations[*].Instances[*].BlockDeviceMappings'

Cross-reference the output against your instance type's documented instance store configuration. If ephemeral data exists that must be preserved, copy it to EBS or S3 before proceeding — if the instance is still reachable.

Step 5: Stop and Start the Instance (Not Reboot)

With the above checks complete, execute the Stop/Start. The stop deallocates the instance from the degraded host. The subsequent start allocates it to a healthy host in the same Availability Zone. This is the documented AWS-recommended action for system status check failures on EBS-backed instances.

aws ec2 stop-instances \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1

Wait for the instance to reach the stopped state before starting:

aws ec2 wait instance-stopped \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1

aws ec2 start-instances \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1

After the instance reaches running, re-run the describe-instance-status command from Step 1 to confirm both checks return ok.

Step 6: Confirm Status After Migration

The instance status checks run every minute, but it can take several minutes after start for both checks to return ok — the system check needs the new host to be validated, and the instance check needs the OS to fully boot. Polling too early produces misleading results. Use the CLI waiter to block until the instance passes both checks rather than reading intermediate states.

aws ec2 wait instance-status-ok \
  --instance-ids i-0123456789abcdef0 \
  --region us-east-1

If the system status check remains impaired after the instance is running on the new host, open an AWS Support case. At that point the failure is no longer a host migration issue — it may indicate a broader infrastructure event in the Availability Zone.

The Misdiagnosis That Costs Hours

A common failure pattern: an engineer sees 1/2 checks failed, SSHes into the instance, finds the application running normally, and concludes the status check is a false alarm. The instance gets left running on the degraded host. Thirty minutes later, the host degrades further, the instance becomes completely unreachable, and now there is no clean window to copy data off instance store volumes.

The non-obvious behavior here is that a system status check failure does not immediately impair the guest OS. The hypervisor can continue running the VM even when the physical host's network path is partially degraded — until it isn't. The status check is an early warning, not a post-mortem. Acting on it while the instance is still reachable is always cheaper than acting after it goes dark.

The correct mental model: a system status check failure is AWS telling you the floor is cracking. Your furniture (the OS, the application) is fine right now. Move it before the floor gives way.

Automating Recovery with CloudWatch and EC2 Auto Recovery

For production instances where manual intervention introduces unacceptable recovery time, EC2 Auto Recovery can automatically recover an instance when a system status check fails. Auto Recovery migrates the instance to a new host, preserving the instance ID, Elastic IP addresses, EBS volumes, and instance metadata. It is not available for all instance types — verify support for your specific instance type in the AWS documentation before relying on it.

The CloudWatch alarm that triggers auto recovery targets the StatusCheckFailed_System metric:

🔽 Click to expand: CloudWatch Auto Recovery Alarm (CLI)
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-System-Status-AutoRecover" \
  --namespace "AWS/EC2" \
  --metric-name "StatusCheckFailed_System" \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions "arn:aws:automate:us-east-1:ec2:recover" \
  --region us-east-1

Two consecutive one-minute periods of StatusCheckFailed_System = 1 triggers the recovery action. The arn:aws:automate:us-east-1:ec2:recover ARN is the documented recover action ARN — substitute your target region. Note that auto recovery is subject to the same instance type and configuration constraints as a manual Stop/Start; instance store data is not preserved.

graph LR A["System Status Check
Fails"] --> B["CloudWatch Metric
StatusCheckFailed_System = 1"] B --> C["CloudWatch Alarm
Evaluates 2 periods"] C -->|"Threshold breached"| D["Alarm: ALARM state"] D --> E["ec2:recover action
triggered"] E --> F["Instance migrated
to healthy host"] F --> G["EBS, Instance ID,
Elastic IP preserved"] C -->|"No auto recovery
configured"| H["Manual intervention
required"] H --> I["Stop/Start
via Console or CLI"]
  1. System Status Check Fails — AWS detects a host-level fault and marks the system check as impaired.
  2. CloudWatch Alarm Triggers — after the configured evaluation periods, the alarm transitions to ALARM state and invokes the recover action.
  3. Auto Recovery Executes — AWS migrates the instance to a healthy host, preserving EBS volumes, instance ID, and Elastic IPs.
  4. Manual Path — if auto recovery is not configured, the operator must execute the Stop/Start manually after confirming the failure type and instance store data status.

IAM Permissions Required

The diagnostic and remediation steps above require the following minimum IAM permissions. Apply these to the role or user executing the commands, scoped to the specific instance where possible.

🔽 Click to expand: Minimum IAM Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeInstanceStatus",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    },
    {
      "Sid": "StopStartInstance",
      "Effect": "Allow",
      "Action": [
        "ec2:StopInstances",
        "ec2:StartInstances"
      ],
      "Resource": "arn:aws:ec2:us-east-1:123456789012:instance/i-0123456789abcdef0"
    },
    {
      "Sid": "PutCloudWatchAlarm",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricAlarm"
      ],
      "Resource": "*"
    }
  ]
}

ec2:DescribeInstanceStatus and ec2:DescribeInstances are read actions that require "Resource": "*" — they do not support resource-level restrictions in IAM. The stop and start actions do support instance-level ARN restrictions and should be scoped accordingly.

Wrap-Up: Resolving EC2 System Status Check Failures

A system status check failure is an AWS infrastructure problem, not an OS problem — and the fix is a Stop/Start, not a reboot. The diagnostic sequence matters: confirm which check is failing, check for scheduled events, verify your root device type, account for instance store data, then execute the migration. For production workloads, a CloudWatch auto recovery alarm eliminates the manual response window entirely. If the system status check persists after migration to a new host, escalate to AWS Support — the issue has moved beyond a single host fault.

For related reading, see the AWS documentation on EC2 Status Checks for Your Instances and Recover Your Instance.

Glossary

TermDefinition
System Status CheckAn AWS-managed health probe that monitors the physical host, hypervisor, and network infrastructure underlying an EC2 instance.
Instance Status CheckAn AWS-managed health probe that monitors the guest OS, kernel responsiveness, and in-instance network interface.
Stop/StartThe EC2 lifecycle action that deallocates an instance from its current host and reallocates it to a new host on the next start. Distinct from a reboot, which keeps the same host.
EC2 Auto RecoveryA CloudWatch alarm action (ec2:recover) that automatically migrates an impaired instance to a healthy host, preserving EBS-backed configuration.
Instance StoreEphemeral block storage physically attached to the host. Data on instance store volumes is lost when an instance stops or is terminated.

Related Posts

Comments

Popular posts from this blog

EC2 No Internet Access in Custom VPC: Fix Internet Gateway and Route Table

EC2 SSH Connection Timeout: Which Security Group Rules to Check

Difference Between IAM User and IAM Role: Which One Should Your EC2 Use?