Scaling Reads with RDS Read Replicas: Architecture, Setup, and the Multi-AZ Difference
When your RDS instance starts showing high CPU utilization and query latency spikes during peak traffic, the root cause is often a flood of SELECT statements competing with write operations on a single database endpoint. Scaling reads with RDS Read Replicas is the standard architectural response — but knowing when to use a replica versus Multi-AZ, and how to wire your application correctly, is where most teams stumble.
TL;DR: RDS Read Replicas vs. Multi-AZ at a Glance
| Dimension | Read Replica | Multi-AZ |
|---|---|---|
| Primary purpose | Read scalability | High availability / failover |
| Serves read traffic? | Yes — dedicated endpoint | No — standby is not queryable |
| Replication type | Asynchronous | Synchronous |
| Failover promotion | Manual (or Aurora automatic) | Automatic |
| Separate endpoint? | Yes | No — same endpoint, DNS flips |
| Cross-region support | Yes | No (same region, different AZ) |
How RDS Read Replica Scaling Works
RDS Read Replicas use the database engine's native asynchronous replication to stream changes from the primary instance to one or more replica instances. For MySQL and MariaDB, this is binary log (binlog) replication. For PostgreSQL, it is streaming replication via write-ahead log (WAL). Oracle and SQL Server use their own engine-specific mechanisms. Because replication is asynchronous, replicas may lag behind the primary by a measurable amount — this is replica lag, and it is the central constraint your application design must account for.
Each replica gets its own DNS endpoint. Your application must be explicitly configured to route read queries to that endpoint. RDS does not automatically redirect reads — the routing decision lives entirely in your application layer or connection proxy.
[Diagram: the application sends writes (INSERT/UPDATE/DELETE) to the RDS primary instance and reads (SELECT) to Read Replica 1 and Read Replica 2, each on its own asynchronous endpoint. The primary replicates asynchronously (WAL / binlog) to both replicas and synchronously to a Multi-AZ standby, which is not queryable.]
- Write path: All INSERT, UPDATE, and DELETE operations go to the primary instance endpoint.
- Replication stream: The primary asynchronously ships WAL/binlog changes to each replica. Replica lag is observable via the ReplicaLag CloudWatch metric.
- Read path: SELECT queries are routed to a replica endpoint by the application or by a routing layer in front of it.
- Multi-AZ standby: The synchronous standby receives every write before the primary acknowledges it, but it never serves read traffic. It exists solely for automatic failover.
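The read and write paths above can be sketched as a small statement router. This is a minimal illustration rather than a production implementation: the endpoint hostnames and the `ReadWriteRouter` class are hypothetical, and classifying a statement by its first keyword is a deliberate simplification.

```python
import itertools

# Hypothetical endpoints; substitute the addresses returned by describe-db-instances.
PRIMARY = "my-primary-db.example.us-east-1.rds.amazonaws.com"
REPLICAS = [
    "my-primary-db-replica-1.example.us-east-1.rds.amazonaws.com",
    "my-primary-db-replica-2.example.us-east-1.rds.amazonaws.com",
]


class ReadWriteRouter:
    """Pick an endpoint per statement: plain SELECTs rotate across replicas,
    everything else goes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        # With no replicas configured, reads fall back to the primary.
        self._replica_cycle = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql):
        words = sql.lstrip().split(None, 1)
        first_word = words[0].upper() if words else ""
        is_read = first_word == "SELECT"
        # SELECT ... FOR UPDATE takes row locks, so it must run on the primary.
        if is_read and "FOR UPDATE" in sql.upper():
            is_read = False
        if is_read and self._replica_cycle is not None:
            return next(self._replica_cycle)
        return self.primary


router = ReadWriteRouter(PRIMARY, REPLICAS)
```

Keyword-based classification will misroute some statements (for example, a CTE that starts with `WITH`, or a `SELECT` that calls a function with side effects), so many teams prefer to choose the endpoint explicitly at the call site instead of inspecting SQL.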
Creating a Read Replica: Step-by-Step
Step 1: Verify Automated Backups Are Enabled
Read Replica creation requires automated backups to be enabled on the source instance — RDS uses the backup mechanism to seed the initial replica snapshot. If backups are disabled, the create-db-instance-read-replica call will fail. This is the most common reason teams hit an error on their first attempt and waste time looking at network or IAM issues instead.
aws rds describe-db-instances \
--db-instance-identifier my-primary-db \
--query 'DBInstances[0].BackupRetentionPeriod' \
--region us-east-1
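A return value of 0 means backups are off. Enable them before proceeding. Note that changing the retention period from 0 to a nonzero value can cause a brief outage, so schedule the change accordingly: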
aws rds modify-db-instance \
--db-instance-identifier my-primary-db \
--backup-retention-period 7 \
--apply-immediately \
--region us-east-1
Step 2: Create the Read Replica
With backups confirmed, create the replica. Choose an instance class that matches or exceeds the read workload profile — under-sizing the replica is a common mistake that simply moves the bottleneck rather than eliminating it.
aws rds create-db-instance-read-replica \
--db-instance-identifier my-primary-db-replica-1 \
--source-db-instance-identifier my-primary-db \
--db-instance-class db.r6g.large \
--availability-zone us-east-1b \
--no-publicly-accessible \
--region us-east-1
The replica enters creating state, then available. Provisioning time depends on database size.
Step 3: Retrieve the Replica Endpoint
Once available, retrieve the replica's endpoint address. This is the value your application's read connection string must point to.
aws rds describe-db-instances \
--db-instance-identifier my-primary-db-replica-1 \
--query 'DBInstances[0].Endpoint.Address' \
--region us-east-1
Step 4: Monitor Replica Lag
Replica lag is the operational heartbeat of your read scaling strategy. A rising lag means the replica is falling behind — reads from it may return stale data. Wire this metric into your alerting before routing production traffic.
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReplicaLag \
--dimensions Name=DBInstanceIdentifier,Value=my-primary-db-replica-1 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T01:00:00Z \
--period 60 \
--statistics Average \
--region us-east-1
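As a rough sketch of how the command's JSON output might be consumed before wiring up alerting, the helper below extracts the most recent and the peak lag. The function name `summarize_lag` and the sample payload are illustrative assumptions; only the `Datapoints`, `Timestamp`, and `Average` fields of the CloudWatch response are relied on.

```python
import json


def summarize_lag(cli_output):
    """Return (latest_seconds, peak_seconds) from get-metric-statistics JSON,
    or None when the response holds no datapoints."""
    datapoints = json.loads(cli_output).get("Datapoints", [])
    if not datapoints:
        return None
    # CloudWatch does not guarantee datapoint order, so sort by timestamp first.
    # ISO 8601 strings in one response share a format, so string order is time order.
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    latest = ordered[-1]["Average"]
    peak = max(d["Average"] for d in ordered)
    return latest, peak


# Illustrative payload shaped like the CLI's JSON output; the values are made up.
sample = json.dumps({
    "Label": "ReplicaLag",
    "Datapoints": [
        {"Timestamp": "2024-01-01T00:02:00+00:00", "Average": 0.4, "Unit": "Seconds"},
        {"Timestamp": "2024-01-01T00:00:00+00:00", "Average": 0.2, "Unit": "Seconds"},
        {"Timestamp": "2024-01-01T00:01:00+00:00", "Average": 7.5, "Unit": "Seconds"},
    ],
})

latest, peak = summarize_lag(sample)
print(f"latest={latest}s peak={peak}s")
```

An empty `Datapoints` list is worth treating as its own condition rather than as zero lag: it usually means the metric was not reported for that window, not that the replica is caught up.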
Step 5: Configure Security Group Access
The replica inherits the VPC of the source, but do not assume its security group already admits your application tier. Verify that the replica's security group has an inbound rule permitting traffic from the application's security group on the database port (5432 for PostgreSQL, as in the command below; MySQL and MariaDB default to 3306). Forgetting this step results in connection timeouts that look identical to an endpoint misconfiguration.
aws ec2 authorize-security-group-ingress \
--group-id sg-0replica1234567890 \
--protocol tcp \
--port 5432 \
--source-group sg-0app1234567890 \
--region us-east-1
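To tell a blocked port apart from a wrong endpoint string, a small probe run from the application host can help. This is a sketch using only the standard library; the `diagnose` function and its result labels are hypothetical, and the mapping from each failure to a likely cause is a rule of thumb, not a guarantee.

```python
import socket


def diagnose(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'dns_failure', 'timeout',
    'refused', or 'other_error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # The hostname did not resolve: suspect the endpoint string, not the security group.
        return "dns_failure"
    except socket.timeout:
        # Packets were silently dropped: typical of a missing security group ingress rule.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered but nothing listens on that port: check the port number.
        return "refused"
    except OSError:
        return "other_error"
```

Calling `diagnose(replica_endpoint, 5432)` with the address retrieved in Step 3 and getting `"timeout"` points toward the security group, while `"dns_failure"` points toward the connection string.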
IAM Policy for Read Replica Management
If your deployment pipeline or application needs to describe or manage replicas programmatically, scope the policy to the minimum required actions. rds:DescribeDBInstances requires "Resource": "*" — this is a documented constraint in the RDS Service Authorization Reference.
IAM policy for replica management:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DescribeReplicas",
      "Effect": "Allow",
      "Action": [
        "rds:DescribeDBInstances",
        "rds:DescribeDBLogFiles"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ManageSpecificReplica",
      "Effect": "Allow",
      "Action": [
        "rds:CreateDBInstanceReadReplica",
        "rds:ModifyDBInstance",
        "rds:DeleteDBInstance",
        "rds:RebootDBInstance"
      ],
      "Resource": "arn:aws:rds:us-east-1:123456789012:db:my-primary-db-replica-1"
    }
  ]
}
The Misdiagnosis That Costs Hours: Replica Lag Under Write Bursts
Here is a failure pattern that appears in production more often than it should. The symptom: after routing reads to the replica, users intermittently see stale or missing records — records they just wrote. The initial diagnosis is almost always a bug in the application's read routing logic. Engineers spend time auditing connection strings, checking ORM configurations, and verifying endpoint DNS resolution.
The actual cause is replica lag spiking during write bursts. The application writes to the primary and immediately reads from the replica — but the replica hasn't yet received the change. The read returns the pre-write state. This is not a bug. It is the expected behavior of asynchronous replication, and it surfaces specifically when the application assumes read-your-own-writes consistency across two different endpoints.
The fix has two parts. First, for operations where a user must immediately read what they just wrote, route that specific read back to the primary endpoint. Second, add lag-aware logic: if ReplicaLag exceeds a threshold, temporarily redirect all reads to the primary until the replica catches up. A proxy or shared data-access layer can centralize this routing, but RDS Proxy does not split reads from writes across RDS Read Replicas on its own. The consistency model decision must be made at the application design level, and no proxy resolves it automatically.
A Read Replica is not a cache — it is an eventually consistent copy. Design your read routing with the same care you'd apply to any distributed system where writes and reads can diverge.
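The two-part fix described above might look like the following sketch. The `LagAwareRouter` class, its default thresholds, and the idea of pinning a session to the primary for a short window after a write are all illustrative assumptions; the caller is expected to supply the current lag, for example from the ReplicaLag metric.

```python
import time


class LagAwareRouter:
    """Choose the primary or the replica for each read, based on replica lag
    and on whether the session wrote recently."""

    def __init__(self, primary, replica, max_lag_seconds=5.0, pin_seconds=2.0,
                 clock=time.monotonic):
        self.primary = primary
        self.replica = replica
        self.max_lag_seconds = max_lag_seconds  # assumed threshold; tune per workload
        self.pin_seconds = pin_seconds          # assumed read-your-own-writes window
        self._clock = clock
        self._last_write = {}                   # session id -> time of last write

    def record_write(self, session_id):
        self._last_write[session_id] = self._clock()

    def read_endpoint(self, session_id, current_lag_seconds):
        # Second part of the fix: a lagging (or unmeasured) replica sends all reads
        # to the primary.
        if current_lag_seconds is None or current_lag_seconds > self.max_lag_seconds:
            return self.primary
        # First part of the fix: a session that just wrote keeps reading from the
        # primary until the replica has had time to apply the change.
        last = self._last_write.get(session_id)
        if last is not None:
            pin_for = max(self.pin_seconds, current_lag_seconds)
            if self._clock() - last < pin_for:
                return self.primary
        return self.replica
```

Because CloudWatch reports ReplicaLag at one-minute granularity, the lag value fed into `read_endpoint` is always somewhat stale, which is why the sketch also pins recent writers to the primary instead of relying on the metric alone.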
Depth: The Interaction Between Multi-AZ and Read Replicas
A non-obvious behavior: when you enable Multi-AZ on a source instance that already has Read Replicas, replication to the replicas continues from the primary — not from the standby. However, during a Multi-AZ failover, the old primary is replaced by the standby, and RDS automatically re-points the replication source for existing replicas to the new primary. This re-pointing can cause a brief increase in replica lag immediately after failover as the new primary re-establishes the replication stream. If your application has tight lag thresholds in its routing logic, a Multi-AZ failover event can temporarily trigger a fallback to primary-only reads — which is the correct behavior, but it must be anticipated in capacity planning.
- Normal state: Primary serves writes; standby receives synchronous replication; replicas receive asynchronous replication from the primary.
- Failover triggered: Primary becomes unavailable; RDS promotes the standby to primary (DNS endpoint unchanged).
- Post-failover: Replicas re-establish their replication stream from the new primary. A transient lag spike is expected during this re-establishment window.
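One way to keep a transient post-failover spike from making the routing flap between primary and replicas is to add hysteresis to the fallback decision. The sketch below is an assumption layered on the behavior described above, not something RDS provides: the `ReadFallback` class, its thresholds, and the sample lag values are all made up for illustration.

```python
class ReadFallback:
    """Track whether reads should fall back to the primary, using hysteresis
    so a brief lag spike does not cause rapid switching."""

    def __init__(self, enter_above=10.0, exit_below=2.0, healthy_samples_needed=3):
        # All three values are illustrative assumptions, not AWS recommendations.
        self.enter_above = enter_above
        self.exit_below = exit_below
        self.healthy_samples_needed = healthy_samples_needed
        self.on_primary = False
        self._healthy_streak = 0

    def observe(self, lag_seconds):
        """Feed one lag sample; return True while reads should go to the primary."""
        if not self.on_primary:
            if lag_seconds > self.enter_above:
                self.on_primary = True
                self._healthy_streak = 0
        elif lag_seconds < self.exit_below:
            self._healthy_streak += 1
            if self._healthy_streak >= self.healthy_samples_needed:
                self.on_primary = False
        else:
            # Any unhealthy sample resets the recovery streak.
            self._healthy_streak = 0
        return self.on_primary


# Made-up lag samples (seconds) resembling a post-failover spike and recovery.
samples = [0.3, 0.4, 45.0, 20.0, 1.5, 6.0, 1.0, 0.8, 0.5, 0.4]
fallback = ReadFallback()
decisions = [fallback.observe(s) for s in samples]
```

The length of the window in which `observe` returns True is the period during which the primary must absorb the full read load, which is the figure to carry into capacity planning.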
How Many Read Replicas Can You Create?
For MySQL, MariaDB, and PostgreSQL RDS instances, AWS currently documents a limit of up to 15 Read Replicas per source instance; RDS for Oracle and SQL Server allow up to 5. Aurora supports up to 15 Aurora Replicas within a cluster. Pricing and exact limits vary by engine and version, so always check the official AWS RDS documentation for current quotas before designing your scaling architecture.
Stacking replicas beyond what your replication bandwidth supports will increase lag across all replicas, not just the newest one. Monitor ReplicaLag on all replicas when adding new ones — a rising baseline lag across the fleet is the signal that you've reached the replication throughput ceiling for that instance class.
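A simple way to watch for a rising fleet-wide baseline is to compare a summary of lag before and after adding a replica. The helper names, the median-of-medians summary, and the 1.5x tolerance below are assumptions chosen for illustration, and the sample values are invented.

```python
from statistics import median


def fleet_baseline(lag_by_replica):
    """Median of each replica's median lag: one number for the fleet's baseline."""
    return median(median(samples) for samples in lag_by_replica.values())


def baseline_rose(before, after, tolerance=1.5):
    """True if the fleet baseline grew by more than `tolerance` times (an assumed factor)."""
    return fleet_baseline(after) > fleet_baseline(before) * tolerance


# Invented ReplicaLag samples in seconds, keyed by replica identifier.
before = {
    "replica-1": [0.2, 0.3, 0.25],
    "replica-2": [0.3, 0.2, 0.4],
}
after = {
    "replica-1": [1.1, 0.9, 1.3],
    "replica-2": [1.0, 1.2, 0.8],
    "replica-3": [1.4, 1.5, 1.2],
}

if baseline_rose(before, after):
    print("Baseline lag rose across the fleet; the primary may be near its replication ceiling.")
```

Using medians rather than means keeps a single write burst from dominating the comparison; a sustained shift in the median across every replica is the pattern the paragraph above warns about.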
Wrap-Up: Scaling Reads with RDS Read Replicas — Next Steps
Read Replicas solve a specific problem: too many SELECT queries saturating a single instance. They do not solve high availability — that is Multi-AZ's job. The two features are complementary and can be used together. The operational discipline required is routing reads correctly, monitoring replica lag continuously, and designing your application's consistency model before the first replica goes live.
For teams running connection-heavy workloads, evaluate RDS Proxy as a connection pooling layer. Be aware that for non-Aurora RDS instances a proxy can be associated only with the writer instance, not with a Read Replica, so read/write splitting still has to live in application code. For Aurora-based workloads, Aurora's reader endpoint automatically load-balances across all Aurora Replicas, removing the need to manage individual replica endpoints in application code.
Official reference: AWS RDS Read Replicas documentation.
Glossary
| Term | Definition |
|---|---|
| Read Replica | A read-only copy of an RDS instance that receives changes asynchronously from the source and serves SELECT traffic via its own endpoint. |
| Replica Lag | The delay between a write being committed on the primary and that change appearing on the replica. Measured in seconds via the ReplicaLag CloudWatch metric. |
| Multi-AZ | An RDS configuration that maintains a synchronous standby in a different Availability Zone for automatic failover. The standby does not serve read traffic. |
| Asynchronous Replication | A replication model where the primary acknowledges a write before confirming the replica has received it. Enables lower write latency but allows temporary data divergence. |
| Read-Your-Own-Writes Consistency | A consistency guarantee where a client that performs a write can immediately read that write back. Not guaranteed when reads and writes target different endpoints with asynchronous replication. |