EC2 System Status Check Failure: What It Means and How to Fix It
You log into the EC2 console and see "1/2 checks failed" — specifically the System Status Check. Before you panic or blindly reboot, you need to understand what this check actually monitors and why the fix is almost always a Stop → Start, not a reboot.
TL;DR
| Check Type | What It Monitors | Who Owns the Problem | Fix |
|---|---|---|---|
| System Status Check | AWS underlying physical host (hardware, network, power) | AWS | Stop & Start (migrates to new host) |
| Instance Status Check | Your instance OS, kernel, network config, file system | You | Reboot, OS-level troubleshooting |
The Two-Layer Health Model
AWS EC2 health checks operate at two distinct layers. Think of it as a two-story building: the System Status Check inspects the foundation and plumbing (AWS infrastructure), while the Instance Status Check inspects what's happening inside your apartment (your OS and application stack).
Analogy: A System Status Check failure is like your apartment building losing power from the utility grid — it's not your fault, and moving to a different unit (host) is the only real fix. An Instance Status Check failure is like a blown fuse inside your own apartment — you need to go in and fix it yourself.
- AWS Infrastructure Layer: The physical host running your EC2 instance is continuously monitored by AWS for hardware faults, loss of network connectivity, loss of system power, and software issues on the host.
- System Status Check: Probes the AWS-managed hypervisor and physical host. A failure here means AWS has detected a problem with the underlying infrastructure — not your instance.
- Instance Status Check: Probes the virtual machine itself — checks that the OS is accepting network traffic, the kernel hasn't panicked, and the file system is healthy.
- Your Application: Sits on top of the OS layer. Both checks must pass for your instance to be considered fully healthy.
Why a Reboot Does NOT Fix a System Status Check Failure
This is the most critical point. A reboot restarts the OS on the same physical host. If the underlying hardware is degraded, you're restarting on broken infrastructure — the system status check will fail again immediately.
A Stop → Start is fundamentally different: stopping releases the instance from its current host, and starting provisions it on a new, healthy physical host. This is the correct remediation for a System Status Check failure.
- Reboot path (wrong fix): The instance restarts but remains on the same degraded physical host. The System Status Check fails again.
- Stop → Start path (correct fix): AWS de-provisions the instance from the faulty host and schedules it on a new, healthy host. The System Status Check passes.
- Critical caveat: Stop → Start only works for EBS-backed instances. The root volume persists on EBS and reattaches to the new host. Instance store-backed instances cannot be stopped — they must be terminated and relaunched. You can verify the root device type with the quick check below.
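If you're not sure whether an instance is EBS-backed, a quick look at its root device type settles it. A minimal sketch, using the same placeholder instance ID as the rest of this post:
# Returns "ebs" for EBS-backed instances, "instance-store" otherwise
aws ec2 describe-instances \
--instance-ids i-0abcdef1234567890 \
--region us-east-1 \
--query 'Reservations[].Instances[].RootDeviceType' \
--output text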
Step-by-Step: How to Diagnose and Resolve
Step 1 — Confirm Which Check Is Failing
In the EC2 Console, select your instance → Status checks tab. Identify whether it's the System Status Check, Instance Status Check, or both.
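If you'd rather script the check, the same information is available from the CLI (the --include-all-instances flag also returns instances that aren't in the running state):
# Report both checks; values are ok, impaired, insufficient-data,
# not-applicable, or initializing
aws ec2 describe-instance-status \
--instance-ids i-0abcdef1234567890 \
--region us-east-1 \
--include-all-instances \
--query 'InstanceStatuses[*].{System:SystemStatus.Status,Instance:InstanceStatus.Status}'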
Step 2 — Check for AWS Scheduled Events
AWS often detects host degradation before you do and schedules a maintenance event. Check the Events column in the EC2 console or use the CLI:
aws ec2 describe-instance-status \
--instance-ids i-0abcdef1234567890 \
--region us-east-1 \
--query 'InstanceStatuses[*].Events'
If AWS has already scheduled a stop, retirement, or maintenance event for the instance, that action will happen automatically at the scheduled time. You don't have to wait, though: a proactive Stop → Start migrates the instance to a healthy host immediately.
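As a sketch, you can also sweep an entire region for instances with pending events instead of checking one at a time. The event codes listed here are examples, not the full set:
# Find every instance in the region with a scheduled event
aws ec2 describe-instance-status \
--region us-east-1 \
--filters Name=event.code,Values=instance-stop,instance-retirement,system-maintenance,system-reboot \
--query 'InstanceStatuses[*].{Id:InstanceId,Events:Events[*].[Code,Description,NotBefore]}'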
Step 3 — Stop and Start the Instance (EBS-Backed)
Via AWS CLI:
# Stop the instance
aws ec2 stop-instances \
--instance-ids i-0abcdef1234567890 \
--region us-east-1
# Wait until stopped
aws ec2 wait instance-stopped \
--instance-ids i-0abcdef1234567890 \
--region us-east-1
# Start the instance on a new host
aws ec2 start-instances \
--instance-ids i-0abcdef1234567890 \
--region us-east-1
⚠️ Important: If your instance has a public IPv4 address (not an Elastic IP), it will change after Stop → Start. Assign an Elastic IP to avoid this.
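A minimal sketch for putting an Elastic IP in place before the Stop → Start (the eipalloc ID is a placeholder; use the AllocationId returned by the first command):
# Allocate a new Elastic IP in the account (prints an AllocationId)
aws ec2 allocate-address --domain vpc --region us-east-1
# Associate it with the instance
aws ec2 associate-address \
--instance-id i-0abcdef1234567890 \
--allocation-id eipalloc-0123456789abcdef0 \
--region us-east-1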
Step 4 — Automate Recovery with CloudWatch Alarms
Instead of manually monitoring, configure a CloudWatch alarm to automatically recover the instance when a System Status Check fails. The EC2 Instance Recovery action migrates the instance to a new host while preserving the instance ID, Elastic IP, and EBS volumes.
CloudWatch Alarm — Auto-Recovery JSON (CloudFormation)
{
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    "AlarmName": "EC2-SystemStatusCheck-AutoRecover",
    "AlarmDescription": "Trigger EC2 recovery on system status check failure",
    "Namespace": "AWS/EC2",
    "MetricName": "StatusCheckFailed_System",
    "Dimensions": [
      {
        "Name": "InstanceId",
        "Value": "i-0abcdef1234567890"
      }
    ],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 2,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": [
      "arn:aws:automate:us-east-1:ec2:recover"
    ]
  }
}
Note: The EC2 Auto-Recovery action (arn:aws:automate:<region>:ec2:recover) is supported only for instances whose storage is entirely EBS; instance store-backed instances can't use it. It also requires a supported instance type, so check the official documentation before relying on it.
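If you're not managing the instance with CloudFormation, the equivalent alarm can be created directly from the CLI. Here is a sketch using the same name, metric, and thresholds as the template above:
# Create the auto-recovery alarm with the CloudWatch CLI
aws cloudwatch put-metric-alarm \
--alarm-name EC2-SystemStatusCheck-AutoRecover \
--alarm-description "Trigger EC2 recovery on system status check failure" \
--namespace AWS/EC2 \
--metric-name StatusCheckFailed_System \
--dimensions Name=InstanceId,Value=i-0abcdef1234567890 \
--statistic Maximum \
--period 60 \
--evaluation-periods 2 \
--threshold 1 \
--comparison-operator GreaterThanOrEqualToThreshold \
--alarm-actions arn:aws:automate:us-east-1:ec2:recover \
--region us-east-1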
What Causes System Status Check Failures?
According to AWS documentation, System Status Check failures are caused by problems with the underlying AWS infrastructure, including:
- Loss of network connectivity to the physical host
- Loss of system power on the physical host
- Software issues on the underlying host (hypervisor-level)
- Hardware issues affecting network reachability of the physical host
These are entirely within AWS's responsibility boundary. You cannot SSH into the host to fix them — the only lever you have is migrating your instance to a new host via Stop → Start or the Recovery action.
Prevention: Architect for Host-Level Failures
A single EC2 instance is a single point of failure. System Status Check failures are a reminder that physical hardware degrades. Design accordingly:
- Auto Scaling Group (ASG) with min=1: If the instance fails, the ASG automatically launches a replacement. Pair with a CloudWatch alarm on StatusCheckFailed_System (see the sketch after this list).
- Multi-AZ deployment: Distribute instances across Availability Zones. A host failure in one AZ does not impact instances in another.
- Elastic Load Balancer (ELB): Routes traffic only to healthy instances. Combined with ASG, this provides seamless failover.
- CloudWatch Auto-Recovery: For single-instance workloads (e.g., a dev server), the recovery alarm is a lightweight safety net that preserves the instance identity.
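As a rough sketch of the ASG pattern for a self-healing single instance (the group name, launch template, and subnet IDs below are placeholders for illustration):
# Single-instance Auto Scaling group spread across two AZs; with the
# default EC2 health check, a failed instance is replaced automatically
aws autoscaling create-auto-scaling-group \
--auto-scaling-group-name web-single-node \
--launch-template LaunchTemplateName=web-node-template,Version='$Latest' \
--min-size 1 --max-size 1 --desired-capacity 1 \
--vpc-zone-identifier "subnet-0aaa1111bbb2222cc,subnet-0ddd3333eee4444ff" \
--health-check-type EC2 \
--region us-east-1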
Glossary
| Term | Definition |
|---|---|
| System Status Check | An AWS-managed health probe that monitors the physical host and hypervisor infrastructure underlying your EC2 instance. |
| Instance Status Check | A health probe that monitors the virtual machine's OS, kernel, and network configuration — within your responsibility boundary. |
| EC2 Instance Recovery | A CloudWatch alarm action that migrates an EC2 instance to a new healthy host while preserving its instance ID, Elastic IPs, and EBS volumes. |
| EBS-Backed Instance | An EC2 instance whose root volume is an EBS volume, allowing it to be stopped and restarted (and thus migrated to a new host). |
| Elastic IP | A static public IPv4 address that remains associated with your AWS account and can be remapped to a new instance, surviving Stop → Start cycles. |
Next Steps
- 📖 AWS Docs: Monitor the status of your instances
- 📖 AWS Docs: Recover your instance
- 📖 AWS Docs: Scheduled events for your EC2 instances
Related Posts
- 📄 Why CloudWatch Doesn't Show EC2 Memory Usage (And How to Fix It)
- 📄 EC2 SSH Connection Timeout: The Exact Security Group Rules You Need to Fix It
- 📄 EC2 No Internet Access in Custom VPC: Attaching an Internet Gateway and Fixing Route Tables
- 📄 Understanding AWS T3 Burstable Instances: CPU Credits, Throttling, and When to Upgrade