EC2 High Network I/O but Low CPU: Diagnosing Traffic with CloudWatch and VPC Flow Logs

You're staring at an EC2 instance that feels sluggish: page loads are slow and API responses are dragging, but CloudWatch shows CPU sitting at 10%. The instinct is to blame the application, but the real bottleneck may be in the network layer. This post walks through how to use CloudWatch network metrics and VPC Flow Logs to trace where the traffic is going and what is causing the slowdown.

TL;DR: EC2 High Network I/O Diagnosis

| Step | Tool | What You're Looking For |
|---|---|---|
| 1. Confirm network saturation | CloudWatch EC2 metrics | NetworkIn/NetworkOut spike vs. baseline |
| 2. Check instance bandwidth cap | EC2 instance type specs | Baseline vs. burst network throughput |
| 3. Identify traffic direction | CloudWatch NetworkPacketsIn/Out | Asymmetric packet counts suggest specific patterns |
| 4. Trace source/destination | VPC Flow Logs | Top talkers, unexpected destinations, rejected traffic |
| 5. Correlate with application | CloudWatch + Flow Log timestamps | Align traffic spikes with request patterns |

How EC2 Network Performance Actually Works

EC2 network performance is not a flat line. Many instance sizes, typically the smaller ones advertised as 'up to' a given bandwidth, operate on a baseline throughput with a burst allowance governed by a network I/O credit mechanism. This is similar in concept to CPU credits on burstable instances, but the specifics differ by instance family. When an instance exhausts its burst allowance, throughput is throttled to baseline. This throttling is silent: no error, no alarm by default, just degraded latency and throughput that looks like an application problem.

The key distinction: NetworkIn and NetworkOut in CloudWatch measure bytes transferred, not bandwidth utilization percentage. There is no native 'network utilization %' metric. You have to correlate the byte counts against the documented baseline and burst limits for your specific instance type to determine whether you're hitting the ceiling.
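To make that correlation concrete, here is a minimal sketch of the arithmetic. The helper name bytes_to_gbps is made up for this post, and it assumes the byte count is the total for the whole period (for example a single 1-minute datapoint) and that "Gbps" means decimal gigabits (10^9 bits) per second.

```shell
# Hypothetical helper: average throughput for one CloudWatch period.
# Usage: bytes_to_gbps <bytes_in_period> <period_seconds>
bytes_to_gbps() {
  awk -v bytes="$1" -v period="$2" 'BEGIN {
    # bytes -> bits, spread over the period, scaled to gigabits (10^9 bits)
    printf "%.3f\n", (bytes * 8) / period / 1e9
  }'
}

# Example: 7.5 GB moved during a 60-second period
bytes_to_gbps 7500000000 60   # prints 1.000
```

Compare the result against the baseline and burst figures documented for your instance type. A value that sits at the baseline for many minutes in a row is a hint that burst credits may be gone.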

Additionally, EC2 Enhanced Networking (using the Elastic Network Adapter) is required to reach the higher throughput tiers on modern instance types. If your AMI or instance type does not have ENA enabled, you may be capped well below the advertised limit.

graph TD
    App["Application Layer<br/>generates/receives traffic"] --> ENA["ENA Driver<br/>Enhanced Networking"]
    ENA --> HV["Hypervisor / Network Fabric<br/>enforces bandwidth limits"]
    HV --> VPC["VPC Routing Layer"]
    VPC --> FL["VPC Flow Logs<br/>captures ENI-level flows"]
    VPC --> IGW["Internet Gateway / NAT / VPC Endpoint"]
    HV --> CW["CloudWatch<br/>NetworkIn / NetworkOut metrics"]
    style App fill:#4a90d9,color:#fff
    style ENA fill:#e8a838,color:#fff
    style HV fill:#d9534f,color:#fff
    style VPC fill:#5cb85c,color:#fff
    style FL fill:#9b59b6,color:#fff
    style CW fill:#9b59b6,color:#fff
    style IGW fill:#5cb85c,color:#fff

  1. Application layer generates or receives traffic — this is what you observe as slowness.
  2. EC2 network stack processes packets through the ENA driver. If ENA is not enabled, throughput is capped at a lower tier.
  3. Hypervisor / network fabric enforces per-instance bandwidth limits. Burst credits deplete silently.
  4. VPC routes traffic. Flow Logs capture accepted and rejected flows at the ENI level.
  5. CloudWatch aggregates NetworkIn/Out at 5-minute granularity with basic monitoring, or at 1-minute granularity when detailed monitoring is enabled. Spikes shorter than the period can be masked.

Step 1: Confirm Network Saturation with CloudWatch EC2 Metrics

Before blaming the network, verify it is actually the bottleneck. Pull NetworkIn and NetworkOut for the instance over the past hour at 1-minute granularity. A sustained high value relative to your instance type's baseline is your first confirmation signal. Also pull NetworkPacketsIn and NetworkPacketsOut — a high packet rate with moderate byte count points toward many small requests (connection overhead, keep-alive churn, or a scanning pattern) rather than bulk data transfer.

# Pull NetworkIn at 1-minute resolution for the past hour
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkIn \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --region us-east-1
# Pull NetworkPacketsIn to check packet rate alongside byte volume
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name NetworkPacketsIn \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time $(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Maximum \
  --region us-east-1

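Request Maximum rather than Average, but know what it can and cannot show. With detailed monitoring, EC2 publishes one sample per minute, so at a 60-second period Maximum, Average, and Sum all return the same per-minute byte total, and a 10-second burst inside that minute is already blended into it. Maximum earns its keep when you widen the period: at --period 300, Maximum returns the busiest single minute while Average flattens it. Without detailed monitoring, datapoints arrive only every five minutes regardless of the period you request.

Think of it like monitoring a highway: average cars per hour tells you nothing about the 30-second gridlock at the on-ramp. Maximum over a wider window catches the worst minute, but a jam shorter than the reporting interval still hides inside it.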
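The byte-versus-packet comparison from earlier in this step can be reduced to one number: average bytes per packet. The sketch below is illustrative only. The function name avg_packet_bytes is an assumption of this post, and it expects NetworkIn and NetworkPacketsIn values taken from the same datapoint (same timestamp, period, and statistic).

```shell
# Hypothetical helper: average packet size for one CloudWatch datapoint.
# Usage: avg_packet_bytes <NetworkIn_bytes> <NetworkPacketsIn_count>
avg_packet_bytes() {
  awk -v bytes="$1" -v packets="$2" 'BEGIN {
    if (packets == 0) { print "n/a"; exit }   # no traffic in this period
    printf "%.0f\n", bytes / packets
  }'
}

avg_packet_bytes 120000000 100000   # prints 1200 (looks like bulk transfer)
avg_packet_bytes 9000000 100000     # prints 90   (many small packets)
```

Averages in the low hundreds of bytes or below suggest connection churn or small requests. Averages close to the interface MTU suggest bulk data movement.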

Step 2: Check Your Instance Type's Network Bandwidth Ceiling

This is where most engineers waste an hour. They see high NetworkIn/Out and immediately start looking at the application — but the instance is simply hitting its documented throughput limit. Cross-reference your instance type against the EC2 instance type documentation for baseline and burst network bandwidth. For example, general-purpose and compute-optimized families each have different baseline-to-burst ratios.

# Describe the instance to confirm instance type
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].{InstanceType:InstanceType,EnaSupport:EnaSupport,NetworkInterfaces:NetworkInterfaces[0].NetworkInterfaceId}' \
  --region us-east-1

Check the EnaSupport field. If it returns false or is empty (null), the instance is not using ENA-based Enhanced Networking and will be capped below the advertised maximum. This is a silent cap: no CloudWatch alarm fires and no error appears in logs.

Pricing and exact bandwidth limits vary by instance type and generation — always verify against the official EC2 instance types page.

Step 3: Enable VPC Flow Logs to Identify Traffic Sources and Destinations

CloudWatch metrics tell you how much traffic. VPC Flow Logs tell you where it's going and who's sending it. If Flow Logs are not already enabled for the VPC or the specific ENI, enable them now. Sending to CloudWatch Logs gives you query capability via Logs Insights; sending to S3 gives you Athena query capability for larger datasets.

# Find the ENI attached to your instance
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[0].Instances[0].NetworkInterfaces[*].NetworkInterfaceId' \
  --region us-east-1
# Enable Flow Logs on the ENI, sending to CloudWatch Logs
aws ec2 create-flow-logs \
  --resource-type NetworkInterface \
  --resource-ids eni-0123456789abcdef0 \
  --traffic-type ALL \
  --log-destination-type cloud-watch-logs \
  --log-group-name /vpc/flowlogs/instance-debug \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/FlowLogsDeliveryRole \
  --region us-east-1

The IAM role referenced above must trust the vpc-flow-logs.amazonaws.com service principal and have permissions to create log streams and put log events in the target log group. A minimal policy for this role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/vpc/flowlogs/*"
    }
  ]
}

Trust policy for the role (set this as the role's trust relationship):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "vpc-flow-logs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

Step 4: Query VPC Flow Logs to Find Top Talkers

Once Flow Logs are delivering to CloudWatch Logs, use Logs Insights to identify the top source and destination IPs by byte volume. This is where the investigation gets concrete — you stop guessing and start reading actual flow records.

# CloudWatch Logs Insights query — run this in the console or via CLI
# against your flow log group
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, bytes, action
| filter interfaceId = 'eni-0123456789abcdef0'
| stats sum(bytes) as totalBytes by srcAddr, dstAddr
| sort totalBytes desc
| limit 20
# Run the same query via CLI
aws logs start-query \
  --log-group-name /vpc/flowlogs/instance-debug \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, srcAddr, dstAddr, bytes, action | filter interfaceId = "eni-0123456789abcdef0" | stats sum(bytes) as totalBytes by srcAddr, dstAddr | sort totalBytes desc | limit 20' \
  --region us-east-1
# Retrieve query results (set QUERY_ID to the queryId returned by start-query)
aws logs get-query-results \
  --query-id "$QUERY_ID" \
  --region us-east-1

Look for three patterns in the results:

  • Unexpected external destinations — traffic flowing to IPs outside your expected service topology. This can indicate data exfiltration, a misconfigured dependency, or a chatty SDK retry loop hitting an external endpoint.
  • High volume from a single source — one IP sending disproportionate traffic. Could be a legitimate client hammering the instance, or a scanning/DDoS pattern.
  • REJECT actions with high packet counts — Security Group or NACL rejections at volume indicate something is actively trying to reach the instance and being blocked, consuming network processing overhead.

graph TD
    FLQ["VPC Flow Log Query<br/>Logs Insights"] --> P1["Pattern: High srcAddr bytes<br/>from single external IP"]
    FLQ --> P2["Pattern: High dstAddr bytes<br/>to unexpected destination"]
    FLQ --> P3["Pattern: REJECT action<br/>high packet count"]
    P1 --> D1["Diagnosis: Inbound flood<br/>DDoS or aggressive client"]
    P2 --> D2["Diagnosis: Outbound saturation<br/>chatty dependency or exfil"]
    P3 --> D3["Diagnosis: SG/NACL blocking<br/>scanning or misconfigured peer"]
    D1 --> F1["Fix: WAF / rate limiting<br/>Security Group restriction"]
    D2 --> F2["Fix: Async I/O pattern<br/>caching / larger instance type"]
    D3 --> F3["Fix: Review SG rules<br/>check NACL ordering"]
    style FLQ fill:#4a90d9,color:#fff
    style P1 fill:#e8a838,color:#fff
    style P2 fill:#e8a838,color:#fff
    style P3 fill:#e8a838,color:#fff
    style D1 fill:#d9534f,color:#fff
    style D2 fill:#d9534f,color:#fff
    style D3 fill:#d9534f,color:#fff
    style F1 fill:#5cb85c,color:#fff
    style F2 fill:#5cb85c,color:#fff
    style F3 fill:#5cb85c,color:#fff
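
If your Flow Logs land in S3 or you have exported raw records, the same top-talker and REJECT analysis can be sketched with awk. This is an illustration, not a replacement for Logs Insights. It assumes the default (version 2) record format, whose space-separated fields are version, account-id, interface-id, srcaddr, dstaddr, srcport, dstport, protocol, packets, bytes, start, end, action, log-status. The function names and the sample records (documentation-range IPs) are invented for this example.

```shell
# Hypothetical helper: total bytes per (srcaddr, dstaddr) pair, largest first.
# Usage: top_talkers [limit] < flow_records
top_talkers() {
  awk '$14 == "OK" { total[$4 " " $5] += $10 }   # skip NODATA / SKIPDATA
       END { for (pair in total) printf "%.0f %s\n", total[pair], pair }' |
    sort -k1,1nr -k2 | head -n "${1:-20}"
}

# Hypothetical helper: packets rejected per source address, largest first.
rejected_by_source() {
  awk '$13 == "REJECT" { pk[$4] += $9 }
       END { for (src in pk) printf "%.0f %s\n", pk[src], src }' |
    sort -k1,1nr -k2
}

# Made-up sample records in the default format
sample_records='2 123456789012 eni-0123456789abcdef0 10.0.1.5 192.0.2.10 44321 443 6 900 1200000 1700000000 1700000060 ACCEPT OK
2 123456789012 eni-0123456789abcdef0 10.0.1.5 192.0.2.10 44322 443 6 600 800000 1700000060 1700000120 ACCEPT OK
2 123456789012 eni-0123456789abcdef0 203.0.113.7 10.0.1.5 51000 22 6 40 2400 1700000000 1700000060 REJECT OK
2 123456789012 eni-0123456789abcdef0 198.51.100.9 10.0.1.5 40000 443 6 50 60000 1700000000 1700000060 ACCEPT OK'

printf '%s\n' "$sample_records" | top_talkers 2
printf '%s\n' "$sample_records" | rejected_by_source
```

With the sample data, the instance (10.0.1.5) sending to 192.0.2.10 tops the list, which is the "high dstAddr bytes to unexpected destination" pattern, and 203.0.113.7 shows up as the only rejected source. If you use a custom log format, the field positions will differ and the column numbers need adjusting.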

Step 5: Correlate Flow Log Timestamps with Application Slowness

Raw traffic volume is not enough — you need to align the network spike with what the application was doing at that moment. This is where most investigations stall: engineers look at traffic in aggregate and miss that the spike is 90 seconds long and perfectly correlated with a scheduled job, a deployment event, or a specific API endpoint getting hammered.

# Narrow the Logs Insights query to a specific time window around the slowness
fields @timestamp, srcAddr, dstAddr, bytes, action
| filter interfaceId = 'eni-0123456789abcdef0'
| filter @timestamp >= 1700000000000 and @timestamp <= 1700003600000
| stats sum(bytes) as totalBytes by bin(1m), srcAddr
| sort @timestamp asc

The bin(1m) aggregation gives you per-minute byte totals broken down by source — this makes it straightforward to overlay against your application logs or CloudWatch custom metrics to find the correlation.
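
The window bounds in that query are epoch milliseconds, which are awkward to write by hand. A small sketch for producing them from a human-readable UTC time follows. It assumes GNU date (as the other commands in this post do), and the function name to_epoch_ms is made up for illustration.

```shell
# Hypothetical helper: UTC timestamp -> epoch milliseconds (GNU date assumed).
# Usage: to_epoch_ms 'YYYY-MM-DD HH:MM:SS'
to_epoch_ms() {
  # date prints whole seconds since the epoch; multiply to get milliseconds
  echo "$(( $(date -u -d "$1" +%s) * 1000 ))"
}

to_epoch_ms '2023-11-14 22:13:20'   # prints 1700000000000
to_epoch_ms '2023-11-14 23:13:20'   # prints 1700003600000
```

Take the start and end of the slow period from your application logs, convert both, and paste the results into the @timestamp filter.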

The Misdiagnosis That Costs an Hour

Here is a pattern that comes up repeatedly in production: an instance shows high NetworkIn, low CPU, and slow response times. The first assumption is that the application is receiving too much inbound traffic and can't process it fast enough. Engineers start looking at connection pool exhaustion, thread counts, and request queuing.

The actual cause: the instance is generating high outbound traffic — NetworkOut is the saturated metric, not NetworkIn. The application is making synchronous calls to an S3 bucket or an external API, waiting for responses, and the outbound bandwidth is throttled because the instance exhausted its burst credits 20 minutes ago. The application threads are blocked on network I/O, not CPU. CPU stays low because the threads are waiting, not computing.

The fix is not to scale reflexively. First identify which outbound calls are responsible, then either move the data transfer to an async pattern, cache the responses, or, if the traffic is genuinely needed, switch to a larger instance type with higher baseline bandwidth. VPC Flow Logs make this visible immediately: you see the instance IP as the srcAddr with high byte counts flowing to an S3 endpoint or external service.

Low CPU with high network I/O almost always means threads are blocked waiting on I/O — not that the instance is underloaded. The instance is fully occupied; it's just not doing compute work.
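
One way to check the inbound-versus-outbound assumption before chasing connection pools is to split accepted flow bytes by direction. The sketch below assumes default-format flow records on stdin (srcaddr in field 4, dstaddr in field 5, bytes in field 10, action in field 13, log-status in field 14) and a known private IP for the instance. The function name direction_bytes is invented for this post. Note that "outbound" here means bytes leaving the instance, including responses to inbound requests.

```shell
# Hypothetical helper: accepted bytes leaving vs. entering one instance IP.
# Usage: direction_bytes <instance_private_ip> < flow_records
direction_bytes() {
  awk -v ip="$1" '
    $14 != "OK" || $13 != "ACCEPT" { next }   # ignore NODATA/SKIPDATA and rejects
    $4 == ip { out += $10 }                   # instance is the source
    $5 == ip { inb += $10 }                   # instance is the destination
    END { printf "outbound=%.0f inbound=%.0f\n", out, inb }'
}

# Made-up sample records (documentation-range IPs)
printf '%s\n' \
  '2 123456789012 eni-0123456789abcdef0 10.0.1.5 192.0.2.10 44321 443 6 900 2000000 1700000000 1700000060 ACCEPT OK' \
  '2 123456789012 eni-0123456789abcdef0 198.51.100.9 10.0.1.5 40000 443 6 50 60000 1700000000 1700000060 ACCEPT OK' |
  direction_bytes 10.0.1.5   # prints outbound=2000000 inbound=60000
```

A lopsided outbound total on an instance you thought was drowning in inbound requests is the signal to look at what the application is sending, not what it is receiving.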

Setting Up CloudWatch Alarms for Network Saturation

Once you've resolved the immediate issue, instrument it so you catch the next occurrence before users report it. Since there is no native 'network utilization %' metric, the practical approach is to alarm on NetworkOut or NetworkIn exceeding a threshold derived from your instance type's baseline bandwidth converted to bytes per CloudWatch period.

# Create an alarm on NetworkOut — adjust threshold to match your instance baseline
aws cloudwatch put-metric-alarm \
  --alarm-name ec2-high-networkout-i-0123456789abcdef0 \
  --metric-name NetworkOut \
  --namespace AWS/EC2 \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 500000000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts \
  --region us-east-1

The threshold value is in bytes per period. Calculate it from your instance type's documented baseline bandwidth — convert Gbps to bytes per 60-second period. Alarm after 3 consecutive breaches to avoid false positives from legitimate short bursts.
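
As a rough sketch of that conversion, the helper below turns a baseline in Gbps into a byte count for one alarm period. The function name gbps_to_period_bytes and the 0.75 Gbps figure are illustrative assumptions, not values for any particular instance type, and "Gbps" is taken as 10^9 bits per second.

```shell
# Hypothetical helper: baseline bandwidth -> bytes per CloudWatch period.
# Usage: gbps_to_period_bytes <baseline_gbps> <period_seconds>
gbps_to_period_bytes() {
  awk -v gbps="$1" -v period="$2" 'BEGIN {
    # gigabits/s -> bits/s -> bytes/s -> bytes per period
    printf "%.0f\n", gbps * 1e9 / 8 * period
  }'
}

gbps_to_period_bytes 0.75 60   # prints 5625000000
gbps_to_period_bytes 1 60      # prints 7500000000
```

You may want to set the alarm somewhat below the computed value, for example at 80% of it, so the alarm fires while there is still headroom. The 500000000 in the command above is only a placeholder.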

Wrap-Up and Next Steps: EC2 Network I/O Diagnosis

High network I/O with low CPU on an EC2 instance is almost always one of three things: the instance hitting its bandwidth ceiling silently, unexpected traffic from a misconfigured dependency, or application threads blocked on outbound I/O. CloudWatch network metrics confirm the volume; VPC Flow Logs identify the actors. Neither tool alone closes the investigation.

Key actions to take now:

  • Enable VPC Flow Logs on critical instances proactively — waiting until an incident to enable them means you have no historical data.
  • Verify ENA support is enabled on your instance type. Check the Enhanced Networking ENA documentation for verification steps.
  • Set CloudWatch alarms on NetworkIn and NetworkOut with thresholds derived from your instance type's documented baseline.
  • Review the VPC Flow Logs documentation for custom log format options — the default format may not include all fields you need for advanced queries.

Glossary

| Term | Definition |
|---|---|
| NetworkIn / NetworkOut | CloudWatch EC2 metrics measuring bytes received and sent by an instance's network interface, aggregated per CloudWatch period. |
| VPC Flow Logs | A VPC feature that captures IP traffic metadata (source, destination, port, protocol, bytes, action) at the ENI, subnet, or VPC level. Does not capture packet payloads. |
| ENI (Elastic Network Interface) | A virtual network interface attached to an EC2 instance. Flow Logs can be scoped to a specific ENI for targeted capture. |
| ENA (Elastic Network Adapter) | The network driver required for Enhanced Networking on modern EC2 instance types. Must be enabled on both the instance and the AMI to reach higher throughput tiers. |
| Logs Insights | CloudWatch's query engine for log data, supporting aggregation, filtering, and time-series binning against log groups including VPC Flow Logs. |
