SQS Visibility Timeout: Why Your Messages Are Being Processed Twice (And How to Fix It)

When multiple consumers are pulling from the same SQS queue, a subtle misconfiguration in Visibility Timeout can cause the same message to be processed by two different workers simultaneously — leading to duplicate side effects, corrupted state, or double-charged transactions in production systems.

TL;DR

Concept | What It Means | Impact of Getting It Wrong
Visibility Timeout | Duration a message is hidden from other consumers after being received | Too short → message reappears → duplicate processing
Default Value | 30 seconds | Fine for fast jobs; fatal for slow ones
Max Value | 12 hours | Too long delays retry on genuine consumer failure
The Fix | Set timeout > your max processing time; extend dynamically if needed | Eliminates race conditions between consumers

How SQS Visibility Timeout Actually Works

SQS is not a traditional message broker that "locks" a message to a consumer. Instead, it uses a lease-based visibility model. When Consumer A calls ReceiveMessage, SQS does two things atomically:

  • Returns the message payload to Consumer A.
  • Starts a countdown timer (the Visibility Timeout). During this window, the message is invisible to all other ReceiveMessage calls.

If Consumer A calls DeleteMessage before the timer expires, the message is permanently removed. If the timer expires first — whether because processing took too long, the consumer crashed, or the timeout was simply too short — SQS makes the message visible again, and any available consumer can pick it up. This is the root cause of your duplicate processing.

sequenceDiagram
    participant SQS as SQS Queue
    participant C1 as Consumer 1
    participant C2 as Consumer 2
    C1->>SQS: ReceiveMessage
    SQS-->>C1: Message M1 (Visibility Timer starts: 30s)
    Note over SQS: M1 is hidden from others
    Note over C1: Processing takes 45s...
    Note over SQS: Timer expires at 30s!
    SQS->>SQS: M1 becomes visible again
    C2->>SQS: ReceiveMessage
    SQS-->>C2: Message M1 (duplicate delivery!)
    C1->>SQS: DeleteMessage (too late - M1 already re-delivered)
    C2->>SQS: DeleteMessage
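The lease behaviour can be reproduced without AWS. The following is a minimal in-memory sketch, not the SQS API: `ToyQueue`, its methods, and the explicit `now` clock argument are hypothetical names invented for illustration. It shows the same message being handed to a second consumer once the lease lapses.

```python
import itertools


class ToyQueue:
    """In-memory stand-in for a lease-based queue (illustrative, not the SQS API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._bodies = {}       # message_id -> body
        self._visible_at = {}   # message_id -> earliest time it may be received
        self._handles = {}      # receipt_handle -> message_id
        self._ids = itertools.count(1)

    def send(self, message_id, body):
        self._bodies[message_id] = body
        self._visible_at[message_id] = 0

    def receive(self, now):
        """Lease one visible message; return (receipt_handle, body) or None."""
        for message_id, body in self._bodies.items():
            if self._visible_at[message_id] <= now:
                # Hide the message until the lease runs out.
                self._visible_at[message_id] = now + self.visibility_timeout
                handle = f"handle-{next(self._ids)}"
                self._handles[handle] = message_id
                return handle, body
        return None

    def delete(self, handle):
        """Remove the message that this receipt handle was issued for."""
        message_id = self._handles.pop(handle)
        self._bodies.pop(message_id, None)
        self._visible_at.pop(message_id, None)


queue = ToyQueue(visibility_timeout=30)
queue.send("M1", "charge customer 42")

first = queue.receive(now=0)     # Consumer 1 leases M1 until t=30
hidden = queue.receive(now=10)   # Consumer 2 sees nothing while the lease holds
second = queue.receive(now=30)   # lease lapsed: Consumer 2 gets M1 as well

print(first, hidden, second)
```

Both consumers now hold the same body, which is the duplicate delivery shown in the diagram. Had Consumer 1 called `delete` before t=30, the later receive would have returned `None`.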

The Mechanics: What Controls the Timer

Three settings control the timer. A value passed on ReceiveMessage overrides the queue-level default:

  • Queue-level default: Set via the VisibilityTimeout queue attribute. Applies to every ReceiveMessage call that doesn't specify an override.
  • Per-receive override: Passed as VisibilityTimeout in the ReceiveMessage API call; it applies to every message returned by that call. Useful when different message types have different processing SLAs.
  • Dynamic extension: Call ChangeMessageVisibility mid-processing to set a new timeout before the current one expires. The new value counts from the moment of the call, and the total cannot exceed 12 hours from the original receive.
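One detail is worth working through. Per the AWS documentation, the timeout passed to ChangeMessageVisibility replaces whatever time was left rather than being added to it, and the total is capped at 12 hours from the original receive. The sketch below is plain arithmetic under that reading; `new_visibility_deadline` and its parameters are hypothetical names, with all times in seconds on one shared clock.

```python
MAX_VISIBILITY_SECONDS = 12 * 60 * 60  # ceiling measured from the original receive


def new_visibility_deadline(received_at, called_at, new_timeout):
    """Time at which the message reappears after a ChangeMessageVisibility call.

    All arguments are seconds on the same clock. The new timeout replaces the
    remaining time; it is not added to it.
    """
    deadline = called_at + new_timeout
    if deadline - received_at > MAX_VISIBILITY_SECONDS:
        raise ValueError("total visibility would exceed 12 hours from receive")
    return deadline


# Received at t=0 with a 120 s timeout; a heartbeat at t=60 asks for 120 s.
deadline = new_visibility_deadline(received_at=0, called_at=60, new_timeout=120)
print(deadline)  # 180, not 240
```

In other words, a heartbeat that fires every 60 seconds and requests 120 seconds always keeps roughly one spare interval of headroom, no matter how long the job runs (up to the 12-hour cap).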

Real-World Analogy: Think of Visibility Timeout like a library book checkout. When you check out a book, it's removed from the shelf (invisible to others) for a fixed loan period. If you return it (DeleteMessage) before the due date, it's gone permanently. If you don't return it in time, the library puts it back on the shelf — and another patron can check it out. The library doesn't know or care that you still have the book at home. SQS behaves identically: it has no knowledge of your consumer's internal state, only the clock.

Diagnosing the Root Cause

The duplicate processing pattern almost always traces back to one of three failure modes:

  • Timeout too short: Your processing logic (DB writes, API calls, file transforms) takes longer than the configured timeout.
  • Consumer crash mid-flight: The consumer dies after receiving but before deleting. This is actually correct behavior — SQS is doing its job by retrying. Your processing logic must be idempotent.
  • Batch size mismatch: Receiving 10 messages but processing them serially — the first message's timeout expires while you're still on message 3.
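The batch-size failure mode is easy to check with arithmetic. Every message in a batch starts its timer at the moment of the receive, so with serial processing the message in position i finishes at roughly i × (per-message time). The sketch below assumes a constant per-message duration; the function names and the default 1.5 safety factor are illustrative choices, not AWS recommendations.

```python
import math


def messages_at_risk(batch_size, seconds_per_message, visibility_timeout):
    """1-based positions in a serially processed batch that finish after the timeout."""
    return [
        position
        for position in range(1, batch_size + 1)
        if position * seconds_per_message > visibility_timeout
    ]


def batch_safe_timeout(batch_size, seconds_per_message, safety_factor=1.5):
    """Timeout (whole seconds) that covers the last message in the batch, plus a buffer."""
    return math.ceil(batch_size * seconds_per_message * safety_factor)


at_risk = messages_at_risk(batch_size=10, seconds_per_message=12, visibility_timeout=30)
timeout = batch_safe_timeout(batch_size=10, seconds_per_message=12)

print(at_risk)   # [3, 4, 5, 6, 7, 8, 9, 10]
print(timeout)   # 180
```

With the 30-second default, a batch of 10 messages at 12 seconds each leaves eight of them exposed to redelivery; either shrink the batch, process in parallel, or size the timeout for the whole batch.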

Implementation: Setting and Extending Visibility Timeout

Use the AWS CLI to inspect and update your queue's current timeout:

# Check current visibility timeout
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/my-queue \
  --attribute-names VisibilityTimeout

# Update queue-level timeout to 5 minutes
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/my-queue \
  --attributes VisibilityTimeout=300

For long-running jobs, implement a heartbeat pattern — a background thread that periodically extends the visibility timeout while the main thread processes:

Python: Heartbeat Pattern for Long-Running SQS Consumers
import boto3
import threading
import time

sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/my-queue'
EXTENSION_INTERVAL = 60   # extend every 60s
EXTENSION_AMOUNT = 120    # new timeout set on each heartbeat (2 minutes from the call)

def heartbeat(queue_url, receipt_handle, stop_event):
    """Background thread: keeps message invisible while processing."""
    # Event.wait() returns True as soon as stop_event is set, so the thread
    # exits promptly instead of sleeping out the full interval.
    while not stop_event.wait(EXTENSION_INTERVAL):
        try:
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=EXTENSION_AMOUNT
            )
            # The new timeout counts from now; it is not added to the time remaining.
            print(f'[Heartbeat] Visibility timeout reset to {EXTENSION_AMOUNT}s from now')
        except sqs.exceptions.MessageNotInflight:
            # Message was already deleted or timeout expired
            break

def process_message(message):
    """Simulate a long-running job (replace with real logic)."""
    print(f'Processing message: {message["MessageId"]}')
    time.sleep(90)  # Simulates a 90-second job
    print('Processing complete.')

def consume():
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        VisibilityTimeout=120  # Initial window: 2 minutes
    )

    messages = response.get('Messages', [])
    if not messages:
        print('Queue empty.')
        return

    message = messages[0]
    receipt_handle = message['ReceiptHandle']

    stop_event = threading.Event()
    hb_thread = threading.Thread(
        target=heartbeat,
        args=(QUEUE_URL, receipt_handle, stop_event),
        daemon=True
    )
    hb_thread.start()

    try:
        process_message(message)
        # Success: delete the message
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt_handle)
        print('Message deleted successfully.')
    except Exception as e:
        print(f'Processing failed: {e}. Message will reappear for retry.')
        # Do NOT delete — let SQS retry after timeout
    finally:
        stop_event.set()  # Signal heartbeat thread to stop
        hb_thread.join()

if __name__ == '__main__':
    consume()

Architecture: The Full Visibility Lifecycle

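[Diagram: the Message Producer sends message M1 to the SQS Queue (SendMessage). Consumer 1 receives it (ReceiveMessage, visibility timer starts), checks the dedup key in the Idempotency Store (not seen before), marks the message as processed, and calls DeleteMessage before the timeout. If the timer expires first, the message reappears; Consumer 2 checks the dedup key, finds it already processed, and skips it. Messages that exceed maxReceiveCount move to the Dead Letter Queue.]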
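The dedup-key check in this flow can be sketched with a uniqueness constraint. This is a minimal example assuming the side effect lives in the same database as the dedup record, so both commit in one transaction. The table names, `handle_once`, and the use of SQLite are assumptions for illustration; a side effect in an external system would need its own idempotency key.

```python
import sqlite3


def make_store():
    """Create an in-memory database holding dedup keys and the side-effect table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (message_id TEXT, amount_cents INTEGER)")
    conn.commit()
    return conn


def handle_once(conn, message_id, amount_cents):
    """Apply the charge at most once per message_id; return True if it was applied."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            # The primary key rejects a second insert of the same message_id.
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)", (message_id,)
            )
            conn.execute(
                "INSERT INTO charges (message_id, amount_cents) VALUES (?, ?)",
                (message_id, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate delivery: the work was already done, so skip it.
        return False


conn = make_store()
first_attempt = handle_once(conn, "M1", 500)    # first delivery
second_attempt = handle_once(conn, "M1", 500)   # duplicate delivery after a timeout
total = conn.execute(
    "SELECT COALESCE(SUM(amount_cents), 0) FROM charges"
).fetchone()[0]

print(first_attempt, second_attempt, total)  # True False 500
```

In both cases the consumer can safely call DeleteMessage afterwards: the charge has been recorded exactly once whether this delivery applied it or an earlier one did.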

Key Configuration Rules of Thumb

  • Set timeout = (max expected processing time) × 1.5 as a safety buffer.
  • Use the heartbeat pattern for any job that could exceed 5 minutes or has variable duration.
  • Design for idempotency regardless — consumer crashes are unavoidable. Your processing logic must produce the same result if executed twice (use a deduplication key in your DB or downstream service).
  • Configure a Dead Letter Queue (DLQ) with maxReceiveCount to catch messages that repeatedly fail, preventing infinite retry loops.
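  • For FIFO queues: Visibility Timeout works the same way for each message, but while a message is in flight, later messages in the same message group are not delivered, so an oversized timeout stalls the whole group. The deduplication window (5 minutes) is a separate concept; don't conflate them.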
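The DLQ rule above comes down to one queue attribute, RedrivePolicy, whose value is a JSON string naming the dead-letter queue's ARN and a maxReceiveCount. The helper below only builds that attribute map. `redrive_attributes`, the example ARN, and the default count of 5 are placeholders chosen for illustration.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=5):
    """Build the Attributes map that attaches a dead-letter queue to a source queue."""
    if max_receive_count < 1:
        raise ValueError("max_receive_count must be at least 1")
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy itself as a JSON-encoded string value.
    return {"RedrivePolicy": json.dumps(policy)}


# Placeholder ARN; substitute the ARN of your own dead-letter queue.
attributes = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:my-queue-dlq")
print(attributes["RedrivePolicy"])
```

The result can be passed as `Attributes` to boto3's `set_queue_attributes(QueueUrl=..., Attributes=attributes)` on the source queue. A low count moves poison-pill messages aside quickly; a higher one tolerates more transient failures before giving up.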

IAM: Minimum Required Permissions

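A consumer that needs to receive, extend, and delete messages requires only the first three actions below. sqs:GetQueueAttributes is included so the consumer can read the queue's configured timeout; drop it if your consumer never does: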

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "sqs:ReceiveMessage",
      "sqs:DeleteMessage",
      "sqs:ChangeMessageVisibility",
      "sqs:GetQueueAttributes"
    ],
    "Resource": "arn:aws:sqs:us-east-1:123456789:my-queue"
  }]
}

Wrap-up & Next Steps

The core insight: SQS Visibility Timeout is a lease, not a lock. If your consumer doesn't finish and delete the message before the clock runs out, SQS will hand that message to the next available worker. That makes idempotent processing and correctly sized timeouts non-negotiable in any distributed consumer architecture.

Next Steps:

  • Audit your current queue timeout against your p99 processing latency in CloudWatch.
  • Implement the heartbeat pattern for any consumer with variable job duration.
  • Configure a DLQ to catch poison-pill messages that repeatedly exceed the timeout.
  • 📖 Official AWS Docs: SQS Visibility Timeout

Glossary

  • Visibility Timeout: The period (0 seconds to 12 hours) during which a received SQS message is hidden from other consumers, giving the current consumer time to process and delete it.
  • ReceiptHandle: A temporary, unique token returned with each received message, required to delete or extend visibility of that specific message instance. A new handle is issued every time the message is received.
  • Idempotency: The property of an operation that produces the same result whether executed once or multiple times — essential for safe SQS message processing.
  • Dead Letter Queue (DLQ): A secondary SQS queue that receives messages exceeding a configured maxReceiveCount, isolating repeatedly-failing messages for inspection.
  • ChangeMessageVisibility: An SQS API action that resets the visibility timeout clock for an in-flight message, enabling the heartbeat pattern for long-running jobs.
