When to Use ElastiCache Redis: Fixing Slow RDS Read Performance

Your RDS instance is under siege — the same product catalog, user session, or leaderboard query fires hundreds of times per second, each one hitting the database cold. Adding an ElastiCache Redis layer intercepts those repetitive reads before they ever reach RDS, slashing response times from tens of milliseconds to sub-millisecond.

TL;DR

| Scenario | Without Redis Cache | With ElastiCache Redis |
| --- | --- | --- |
| Repeated read query | Hits RDS every time | Served from in-memory cache |
| Response latency | 10–100 ms (DB round-trip) | Sub-millisecond cache hit |
| RDS CPU/load | Scales with read traffic | Dramatically reduced |
| Scalability | Vertical scaling or read replicas | Horizontal cache scaling |
| Cost | Larger RDS instance needed | Smaller RDS + cache node |

Why RDS Slows Down Under Repeated Reads

Relational databases are optimized for durability and consistency, not raw read throughput at scale. Every query — even a cache-friendly SELECT — incurs connection overhead, query parsing, buffer pool lookups, and network round-trips. When the same data is requested thousands of times per minute, you are paying that full cost repeatedly for zero new information.

The root cause is almost always one of three patterns:

  • Hot rows: A small subset of rows (e.g., top-10 products) accounts for the majority of reads.
  • Expensive aggregations: COUNT, SUM, or JOIN-heavy queries that are deterministic over short windows.
  • Session/token lookups: Auth tokens or user sessions validated on every API request.
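Each of these patterns maps naturally to a deterministic cache key. A minimal sketch of key construction — the prefixes and the time-bucket window are illustrative choices, not from any library:

```python
from datetime import datetime, timezone

def product_key(product_id: int) -> str:
    """Hot-row lookup: one key per row."""
    return f"product:{product_id}"

def leaderboard_key(window_seconds: int = 60) -> str:
    """Expensive aggregation: the key embeds the current time bucket,
    so one cached result serves the whole window, then rolls over."""
    bucket = int(datetime.now(timezone.utc).timestamp()) // window_seconds
    return f"leaderboard:top10:{bucket}"

def session_key(token: str) -> str:
    """Session/token lookup: one key per token."""
    return f"session:{token}"
```

Because the keys are pure functions of the request, every instance of the application computes the same key for the same data — a prerequisite for cache hits across a fleet.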

The Cache-Aside Pattern: How Redis Intercepts Reads

The most common and operationally safe caching strategy is Cache-Aside (also called Lazy Loading). The application owns the cache logic — Redis is never written to directly by the database.

```mermaid
sequenceDiagram
    participant C as "Client"
    participant A as "Application"
    participant R as "Redis (ElastiCache)"
    participant D as "RDS Database"
    C->>A: "Read Request (e.g., GET /product/42)"
    A->>R: "GET product:42"
    alt "Cache Hit"
        R-->>A: "Returns cached JSON"
        A-->>C: "Response (sub-ms)"
    else "Cache Miss"
        R-->>A: "nil"
        A->>D: "SELECT * FROM products WHERE id=42"
        D-->>A: "Row data"
        A->>R: "SETEX product:42 300 '{...}'"
        A-->>C: "Response (DB latency)"
    end
```
  1. Client Request: The client sends a read request to the application.
  2. Cache Check: The application queries Redis first using a deterministic cache key (e.g., product:42).
  3. Cache Hit: If the key exists, Redis returns the value immediately. The application responds to the client — RDS is never touched.
  4. Cache Miss: If the key is absent (first request or TTL expired), the application falls through to RDS.
  5. Populate Cache: The application writes the RDS result back into Redis with a TTL, then responds to the client.
  6. Subsequent Requests: All future requests for the same key hit Redis until the TTL expires.
Analogy: Think of Redis as a whiteboard next to your desk. The first time a colleague asks you a complex question, you research the answer from the filing cabinet (RDS) and write it on the whiteboard. Every subsequent colleague asking the same question gets the answer from the whiteboard instantly — the filing cabinet stays closed.
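The read path can be sketched end-to-end with no infrastructure at all. Here `TtlDict` is a hypothetical in-memory stand-in for Redis (not a real library class), exposing the same `get`/`setex` shape used later in this post:

```python
import time

class TtlDict:
    """Stand-in for Redis: a dict with per-key expiry."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0))
        return value if time.monotonic() < expires_at else None

    def setex(self, key, ttl, value):
        self._store[key] = (value, time.monotonic() + ttl)

def cache_aside_get(cache, key, ttl, load_from_db):
    cached = cache.get(key)          # step 2: check cache first
    if cached is not None:
        return cached                # step 3: hit — the DB is never touched
    value = load_from_db()           # step 4: miss — fall through to the DB
    cache.setex(key, ttl, value)     # step 5: populate with a TTL
    return value
```

Two consecutive calls for the same key invoke `load_from_db` only once; every later call is served from memory until the TTL lapses.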

Cache Invalidation: Keeping Data Consistent

Stale cache data is the primary operational risk. The application must invalidate or update the cache whenever the underlying data changes. The application acts as the orchestrator for both the write to RDS and the subsequent cache invalidation.

```mermaid
sequenceDiagram
    participant C as "Client"
    participant A as "Application"
    participant D as "RDS Database"
    participant R as "Redis (ElastiCache)"
    C->>A: "Write Request (e.g., PUT /product/42)"
    A->>D: "UPDATE products SET ... WHERE id=42"
    D-->>A: "Write Confirmed"
    A->>R: "DEL product:42"
    R-->>A: "Key Deleted"
    A-->>C: "200 OK"
    Note over A,R: "Next read for product:42 will be a cache miss,<br/>fetching fresh data from RDS and repopulating Redis"
```
  1. Write Request: A mutation (INSERT/UPDATE/DELETE) arrives at the application.
  2. Persist to RDS: The application writes the authoritative data to RDS first. This ensures durability before any cache operation.
  3. Invalidate Cache: After a successful RDS write, the application issues a DEL command to Redis, removing the now-stale key.
  4. Next Read Repopulates: The next read request for that key will be a cache miss, fetch fresh data from RDS, and re-populate Redis.

Why delete instead of update? Deleting is safer than writing the new value directly to Redis after a write. It avoids race conditions where two concurrent writes could populate the cache with an older value.
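The race is easy to see in a simulated interleaving. Two writers update the same row; the first writer's cache operation is delayed and lands last. This is a pure simulation with plain dicts — no Redis involved:

```python
def simulate(cache_op):
    """Interleave two writers with a delayed first writer.
    Returns (db_value, cache_value) after both finish."""
    db, cache = {}, {}
    db["product:42"] = "v1"                  # writer A commits to the DB
    db["product:42"] = "v2"                  # writer B commits — B is the final truth
    if cache_op == "set":
        cache["product:42"] = "v2"           # B writes its value to the cache
        cache["product:42"] = "v1"           # A's delayed SET arrives — stale value wins
    else:  # "delete"
        cache.pop("product:42", None)        # B deletes the key
        cache.pop("product:42", None)        # A's delayed DEL is a harmless no-op
    return db["product:42"], cache.get("product:42")
```

With SET, the cache holds `v1` while the database holds `v2` until the TTL expires; with DEL, the worst case is one extra cache miss that repopulates with fresh data.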

AWS Infrastructure Architecture

In a production AWS environment, ElastiCache Redis and RDS reside in private subnets. Security Groups enforce least-privilege network access — only the application tier can reach the cache, and only the application tier can reach the database.

```mermaid
graph LR
    subgraph "VPC"
        subgraph "Public Subnet"
            ALB["Application<br/>Load Balancer"]
        end
        subgraph "Private Subnet - App Tier"
            APP["Application<br/>(EC2 / ECS / Lambda)<br/>AppSG"]
        end
        subgraph "Private Subnet - Data Tier"
            REDIS["ElastiCache Redis<br/>Replication Group<br/>CacheSG :6379"]
            RDS["Amazon RDS<br/>(Primary + Standby)<br/>RDSSG :5432"]
        end
    end
    Internet(["Internet"]) --> ALB
    ALB --> APP
    APP -->|"Inbound: AppSG only"| REDIS
    APP -->|"Inbound: AppSG only"| RDS
```
  1. VPC Isolation: All components live inside a VPC. ElastiCache and RDS are in private subnets with no public internet access.
  2. AppSG (Application Security Group): Attached to your EC2/ECS/Lambda compute. It is the only source allowed to connect to both the cache and the database.
  3. CacheSG: Allows inbound TCP on port 6379 only from AppSG. No other traffic is permitted.
  4. RDSSG: Allows inbound TCP on port 5432 (PostgreSQL) or 3306 (MySQL) only from AppSG.
  5. No direct Cache-to-RDS path: The cache and database have no network path between them. The application is the sole orchestrator.
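The CacheSG rule from step 3 can be expressed in code. This sketch builds the parameters for EC2's `authorize_security_group_ingress` call; the ingress source is a Security Group reference, not a CIDR block, and the `sg-` IDs in the usage comment are placeholders:

```python
def cache_sg_ingress_params(cache_sg_id, app_sg_id):
    """Ingress rule for CacheSG: TCP 6379 from AppSG only."""
    return {
        "GroupId": cache_sg_id,
        "IpPermissions": [{
            "IpProtocol": "tcp",
            "FromPort": 6379,
            "ToPort": 6379,
            # Source is the app tier's Security Group, not a CIDR range
            "UserIdGroupPairs": [{
                "GroupId": app_sg_id,
                "Description": "Redis access from app tier only",
            }],
        }],
    }

# Applying it with boto3 (requires AWS credentials; IDs are placeholders):
# import boto3
# ec2 = boto3.client("ec2")
# ec2.authorize_security_group_ingress(**cache_sg_ingress_params("sg-cache-example", "sg-app-example"))
```

The same shape, with port 5432 or 3306, covers the RDSSG rule in step 4.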

Implementation: Python (boto3 + redis-py)

The following snippet demonstrates the Cache-Aside pattern with TTL-based expiry. Replace the placeholder connection strings with your actual ElastiCache and RDS endpoints.

cache_aside.py — Cache-Aside Pattern Implementation
import redis
import psycopg2
import json
import os

# --- Connection Setup ---
# Retrieve endpoints from environment variables (never hardcode)
REDIS_HOST = os.environ["ELASTICACHE_ENDPOINT"]  # e.g., my-cluster.abc123.ng.0001.use1.cache.amazonaws.com
REDIS_PORT = 6379
CACHE_TTL_SECONDS = 300  # 5-minute TTL

RDS_HOST = os.environ["RDS_ENDPOINT"]
RDS_DB   = os.environ["RDS_DB_NAME"]
RDS_USER = os.environ["RDS_USER"]
RDS_PASS = os.environ["RDS_PASSWORD"]

# Initialize Redis client (use SSL for production ElastiCache)
cache = redis.Redis(
    host=REDIS_HOST,
    port=REDIS_PORT,
    ssl=True,          # Required for ElastiCache in-transit encryption
    decode_responses=True
)

def get_db_connection():
    return psycopg2.connect(
        host=RDS_HOST, dbname=RDS_DB,
        user=RDS_USER, password=RDS_PASS
    )

# --- Cache-Aside Read ---
def get_product(product_id: int) -> dict | None:
    cache_key = f"product:{product_id}"

    # 1. Check cache first
    cached_value = cache.get(cache_key)
    if cached_value:
        print(f"[CACHE HIT] {cache_key}")
        return json.loads(cached_value)

    # 2. Cache miss — query RDS
    print(f"[CACHE MISS] {cache_key} — querying RDS")
    conn = get_db_connection()
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT id, name, price FROM products WHERE id = %s", (product_id,))
            row = cur.fetchone()
            if not row:
                return None
            product = {"id": row[0], "name": row[1], "price": float(row[2])}
    finally:
        conn.close()

    # 3. Populate cache with TTL
    cache.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(product))
    return product

# --- Write with Cache Invalidation ---
def update_product(product_id: int, name: str, price: float) -> None:
    cache_key = f"product:{product_id}"

    # 1. Write to RDS first (source of truth)
    conn = get_db_connection()
    try:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE products SET name = %s, price = %s WHERE id = %s",
                (name, price, product_id)
            )
        conn.commit()
    finally:
        conn.close()

    # 2. Invalidate cache AFTER successful DB write
    deleted = cache.delete(cache_key)
    print(f"[CACHE INVALIDATED] {cache_key} — deleted={deleted}")

IAM & Security: Least Privilege for ElastiCache

ElastiCache Redis access is controlled at the network layer (Security Groups) and, for ElastiCache with AUTH or RBAC enabled, at the application layer. For the AWS control plane (creating/describing clusters), your application's IAM role should follow least privilege.

IAM Policy — Least Privilege for ElastiCache Describe Operations
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ElastiCacheReadOnly",
      "Effect": "Allow",
      "Action": [
        "elasticache:DescribeCacheClusters",
        "elasticache:DescribeReplicationGroups"
      ],
      "Resource": "arn:aws:elasticache:us-east-1:123456789012:replicationgroup:my-redis-cluster"
    }
  ]
}

Key security practices for ElastiCache Redis in production:

  • Enable in-transit encryption (TLS) and at-rest encryption when creating the cluster.
  • Enable Redis AUTH or use ElastiCache RBAC (Role-Based Access Control) for user-level authentication.
  • Place ElastiCache in a private subnet — never expose port 6379 to the internet.
  • Restrict Security Group ingress to only the application's Security Group ID, not a CIDR block.
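These practices map directly onto the parameters of ElastiCache's `create_replication_group` API. A sketch of the parameter set, built with stdlib only — the node type, subnet group name, and Security Group ID are illustrative placeholders, and the AUTH token is read from the environment rather than hardcoded:

```python
import os

def secure_replication_group_params(cluster_id):
    """Replication-group parameters enforcing encryption, AUTH,
    private networking, and Multi-AZ failover."""
    return {
        "ReplicationGroupId": cluster_id,
        "ReplicationGroupDescription": "Cache-aside layer for RDS reads",
        "Engine": "redis",
        "CacheNodeType": "cache.t4g.small",            # placeholder size
        "NumCacheClusters": 2,                         # primary + 1 replica
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
        "TransitEncryptionEnabled": True,              # TLS in transit
        "AtRestEncryptionEnabled": True,               # encryption at rest
        "AuthToken": os.environ.get("REDIS_AUTH_TOKEN", ""),  # Redis AUTH
        "CacheSubnetGroupName": "private-cache-subnets",      # private subnets only
        "SecurityGroupIds": ["sg-cache-example"],      # CacheSG: AppSG ingress only
    }

# Creating the cluster (requires AWS credentials):
# import boto3
# boto3.client("elasticache").create_replication_group(
#     **secure_replication_group_params("my-redis-cluster"))
```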

Choosing the Right TTL Strategy

| Data Type | Recommended TTL | Invalidation Trigger |
| --- | --- | --- |
| Product catalog | 5–15 minutes | On product update event |
| User session / auth token | Match session expiry | On logout or token revocation |
| Leaderboard / aggregation | 30–60 seconds | TTL expiry (acceptable staleness) |
| Configuration / feature flags | 1–5 minutes | On config change deployment |
| Real-time inventory | Very short or no cache | Not suitable for caching |
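The table can be encoded as a simple per-prefix policy, so every call site picks its TTL from one place. The concrete second values here are illustrative midpoints of the ranges above; `None` marks data that should not be cached at all:

```python
from typing import Optional

# Seconds per key prefix; None = do not cache
TTL_POLICY = {
    "product": 10 * 60,      # product catalog: 5–15 min range
    "leaderboard": 45,       # 30–60 s acceptable staleness
    "config": 3 * 60,        # 1–5 min range
    "inventory": None,       # real-time — not suitable for caching
}

def ttl_for(key: str) -> Optional[int]:
    """Look up the TTL by key prefix, e.g. 'product:42' -> 600.
    Session keys are handled separately: their TTL should match
    the session expiry, so it is set per token, not per prefix."""
    prefix = key.split(":", 1)[0]
    return TTL_POLICY.get(prefix)
```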

When ElastiCache Redis Is NOT the Answer

  • Write-heavy workloads: If your traffic is predominantly writes, a cache adds invalidation overhead with minimal read benefit.
  • Highly dynamic data: Data that changes on every request (e.g., real-time stock prices) will have a near-zero cache hit rate.
  • Missing indexes: If your RDS slowness is caused by missing indexes or unoptimized queries, fix the query first. A cache on top of a broken query is a band-aid.
  • Strong consistency requirements: If your application cannot tolerate even seconds of stale data, cache-aside with TTL may not be appropriate without a robust invalidation strategy.

Wrap-Up & Next Steps

ElastiCache Redis is the right tool when your RDS pain is caused by repetitive reads of relatively stable data. The Cache-Aside pattern keeps your application in control, your database load low, and your response times in the sub-millisecond range. Start with a single-node cluster for development, then graduate to a Multi-AZ replication group for production resilience.

Glossary

| Term | Definition |
| --- | --- |
| Cache-Aside (Lazy Loading) | A caching pattern where the application checks the cache before querying the database, and populates the cache on a miss. |
| TTL (Time-To-Live) | An expiry duration set on a cache key after which Redis automatically evicts it, forcing a fresh database read. |
| Cache Hit / Cache Miss | A hit means the requested key was found in Redis; a miss means it was absent and the database must be queried. |
| Cache Invalidation | The process of removing or updating a stale cache entry after the underlying data changes in the database. |
| Replication Group | An ElastiCache construct that manages a primary Redis node and one or more read replicas across Availability Zones for high availability. |
