SQS Visibility Timeout: Why Your Messages Are Being Processed Twice (And How to Fix It)
When multiple consumers are pulling from the same SQS queue, a subtle misconfiguration in Visibility Timeout can cause the same message to be processed by two different workers simultaneously — leading to duplicate side effects, corrupted state, or double-charged transactions in production systems.
TL;DR
| Concept | What It Means | Impact of Getting It Wrong |
|---|---|---|
| Visibility Timeout | Duration a message is hidden from other consumers after being received | Too short → message reappears → duplicate processing |
| Default Value | 30 seconds | Fine for fast jobs; fatal for slow ones |
| Max Value | 12 hours | Too long delays retry on genuine consumer failure |
| The Fix | Set timeout > your max processing time; extend dynamically if needed | Prevents timeout-driven duplicates; consumer crashes still cause redelivery, so processing must stay idempotent |
How SQS Visibility Timeout Actually Works
SQS is not a traditional message broker that "locks" a message to a consumer. Instead, it uses a lease-based visibility model. When Consumer A calls ReceiveMessage, SQS does two things atomically:
- Returns the message payload to Consumer A.
- Starts a countdown timer (the Visibility Timeout). During this window, the message is invisible to all other ReceiveMessage calls.
If Consumer A calls DeleteMessage before the timer expires, the message is permanently removed. If the timer expires first — whether because processing took too long, the consumer crashed, or the timeout was simply too short — SQS makes the message visible again, and any available consumer can pick it up. This is the root cause of your duplicate processing.
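The lease behavior described above can be illustrated with a toy in-memory model. This is a minimal sketch, not the real service: the `LeaseQueue` class and its methods are hypothetical names invented for illustration, and time is passed in explicitly so the example runs instantly.

```python
from dataclasses import dataclass, field


@dataclass
class LeaseQueue:
    """Toy in-memory model of SQS visibility (hypothetical, not the real API)."""
    visibility_timeout: float
    # message id -> time at which the message becomes visible again
    _visible_at: dict = field(default_factory=dict)

    def send(self, message_id: str) -> None:
        self._visible_at[message_id] = 0.0  # visible immediately

    def receive(self, now: float):
        """Return one visible message id and start its lease, or None."""
        for message_id, visible_at in self._visible_at.items():
            if visible_at <= now:
                self._visible_at[message_id] = now + self.visibility_timeout
                return message_id
        return None

    def delete(self, message_id: str) -> None:
        self._visible_at.pop(message_id, None)


queue = LeaseQueue(visibility_timeout=30)
queue.send("order-1")

first = queue.receive(now=0)    # Consumer A receives; lease runs until t=30
hidden = queue.receive(now=10)  # Consumer B sees nothing while the lease holds
second = queue.receive(now=45)  # A never deleted; lease expired, B gets it

print(first, hidden, second)
```

Consumer A never called `delete`, so once the 30-second lease lapsed the same message was handed to Consumer B. The queue has no idea whether A is still working.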
The Mechanics: What Controls the Timer
The Visibility Timeout can be set at two levels, with the per-call value overriding the queue-level default, and it can also be changed while a message is in flight:
- Queue-level default: Set via queue attributes. Applies to every ReceiveMessage call that doesn't specify an override.
- Per-call override: Passed as VisibilityTimeout in the ReceiveMessage API call, and applies to the messages returned by that call. Useful when different message types have different processing SLAs.
- Dynamic extension: Call ChangeMessageVisibility mid-processing to reset the clock before it expires.
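As a rough sketch, the three controls map onto three boto3 SQS client calls. The wrapper function names below are hypothetical; `sqs` is assumed to be a client created with `boto3.client('sqs')`, and the timeout values are whatever suits your workload.

```python
def set_queue_default(sqs, queue_url: str, seconds: int) -> None:
    """Queue-level default, used when a receive call gives no override."""
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"VisibilityTimeout": str(seconds)},  # attribute values are strings
    )


def receive_with_override(sqs, queue_url: str, seconds: int) -> list:
    """Per-call override: applies only to the messages this call returns."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        VisibilityTimeout=seconds,
    )
    return response.get("Messages", [])


def extend(sqs, queue_url: str, receipt_handle: str, seconds: int) -> None:
    """Dynamic extension: the new timeout counts from the time of this call."""
    sqs.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=receipt_handle,
        VisibilityTimeout=seconds,
    )
```

Passing the client in as a parameter keeps the helpers easy to exercise against a stub in unit tests, without touching a real queue.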
Real-World Analogy: Think of Visibility Timeout like a library book checkout. When you check out a book, it's removed from the shelf (invisible to others) for a fixed loan period. If you return it (DeleteMessage) before the due date, it's gone permanently. If you don't return it in time, the library puts it back on the shelf — and another patron can check it out. The library doesn't know or care that you still have the book at home. SQS behaves identically: it has no knowledge of your consumer's internal state, only the clock.
Diagnosing the Root Cause
The duplicate processing pattern almost always traces back to one of three failure modes:
- Timeout too short: Your processing logic (DB writes, API calls, file transforms) takes longer than the configured timeout.
- Consumer crash mid-flight: The consumer dies after receiving but before deleting. This is actually correct behavior — SQS is doing its job by retrying. Your processing logic must be idempotent.
- Batch size mismatch: Receiving 10 messages but processing them serially. Every message in the batch starts its timer at receive time, so the timeout on the later messages can expire while you're still working through the earlier ones.
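The batch failure mode is easy to check with arithmetic. The sketch below assumes strictly serial processing with a fixed per-message duration (a simplification), and the function name `at_risk_positions` is made up for illustration.

```python
def at_risk_positions(batch_size: int, seconds_per_message: float,
                      visibility_timeout: float) -> list:
    """Return 1-based batch positions that finish after their lease has lapsed.

    All messages in a batch become invisible at receive time (t=0), so the
    message at position n finishes at n * seconds_per_message.
    """
    return [
        position
        for position in range(1, batch_size + 1)
        if position * seconds_per_message > visibility_timeout
    ]


# 10 messages, 5 seconds each, default 30-second timeout
print(at_risk_positions(10, 5, 30))  # [7, 8, 9, 10]
```

With the 30-second default, positions 7 through 10 are still unprocessed or mid-flight when their timeout expires, so another consumer can receive them.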
Implementation: Setting and Extending Visibility Timeout
Use the AWS CLI to inspect and update your queue's current timeout:
# Check current visibility timeout
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/my-queue \
  --attribute-names VisibilityTimeout

# Update queue-level timeout to 5 minutes
aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789/my-queue \
  --attributes VisibilityTimeout=300
For long-running jobs, implement a heartbeat pattern — a background thread that periodically extends the visibility timeout while the main thread processes:
Python: Heartbeat Pattern for Long-Running SQS Consumers
import boto3
import threading
import time

sqs = boto3.client('sqs', region_name='us-east-1')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/my-queue'
EXTENSION_INTERVAL = 60  # send a heartbeat every 60s
EXTENSION_AMOUNT = 120   # each heartbeat resets the timeout to 2 minutes from now


def heartbeat(queue_url, receipt_handle, stop_event):
    """Background thread: keeps message invisible while processing."""
    # wait() returns True as soon as stop_event is set, so the thread exits
    # promptly instead of sleeping through a full interval.
    while not stop_event.wait(EXTENSION_INTERVAL):
        try:
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=EXTENSION_AMOUNT
            )
            print(f'[Heartbeat] Visibility timeout reset to {EXTENSION_AMOUNT}s')
        except sqs.exceptions.MessageNotInflight:
            # Message was already deleted or timeout expired
            break


def process_message(message):
    """Simulate a long-running job (replace with real logic)."""
    print(f'Processing message: {message["MessageId"]}')
    time.sleep(90)  # Simulates a 90-second job
    print('Processing complete.')


def consume():
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        VisibilityTimeout=120  # Initial window: 2 minutes
    )
    messages = response.get('Messages', [])
    if not messages:
        print('Queue empty.')
        return

    message = messages[0]
    receipt_handle = message['ReceiptHandle']
    stop_event = threading.Event()
    hb_thread = threading.Thread(
        target=heartbeat,
        args=(QUEUE_URL, receipt_handle, stop_event),
        daemon=True
    )
    hb_thread.start()

    try:
        process_message(message)
        # Success: delete the message
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=receipt_handle)
        print('Message deleted successfully.')
    except Exception as e:
        print(f'Processing failed: {e}. Message will reappear for retry.')
        # Do NOT delete: let SQS retry after the timeout
    finally:
        stop_event.set()  # Signal heartbeat thread to stop
        hb_thread.join()


if __name__ == '__main__':
    consume()
Architecture: The Full Visibility Lifecycle
Key Configuration Rules of Thumb
- Set timeout = (max expected processing time) × 1.5 as a safety buffer.
- Use the heartbeat pattern for any job that could exceed 5 minutes or has variable duration.
- Design for idempotency regardless — consumer crashes are unavoidable. Your processing logic must produce the same result if executed twice (use a deduplication key in your DB or downstream service).
- Configure a Dead Letter Queue (DLQ) with maxReceiveCount to catch messages that repeatedly fail, preventing infinite retry loops.
- For FIFO queues: The timeout mechanics are the same, although an in-flight message also holds back later messages in the same message group. The deduplication window (5 minutes) is a separate concept, so don't conflate the two.
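The sizing rule and the DLQ setting can be combined into a small helper that builds queue attributes. This is a sketch under stated assumptions: the function names, the DLQ ARN, and the `maxReceiveCount` of 5 are illustrative choices, not values from this article. The resulting dict is in the shape accepted by boto3's `set_queue_attributes(Attributes=...)`.

```python
import json
import math

MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60  # SQS upper limit: 12 hours, in seconds


def recommended_timeout(max_processing_seconds: float, buffer: float = 1.5) -> int:
    """Apply the 1.5x rule of thumb, capped at the SQS maximum."""
    return min(math.ceil(max_processing_seconds * buffer), MAX_VISIBILITY_TIMEOUT)


def queue_attributes(max_processing_seconds: float, dlq_arn: str,
                     max_receive_count: int = 5) -> dict:
    """Build an Attributes dict with a sized timeout and a DLQ redrive policy."""
    return {
        "VisibilityTimeout": str(recommended_timeout(max_processing_seconds)),
        # RedrivePolicy is passed to SQS as a JSON-encoded string
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receive_count,
        }),
    }


# Hypothetical DLQ ARN, for illustration only
attrs = queue_attributes(200, "arn:aws:sqs:us-east-1:123456789:my-queue-dlq")
print(attrs["VisibilityTimeout"])  # 300
```

A job with a 200-second worst case gets a 300-second timeout. Anything that would exceed the 12-hour cap is clamped, which is a signal to split the job or rely on the heartbeat pattern instead.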
IAM: Minimum Required Permissions
A consumer that needs to receive, extend, and delete messages requires only these actions — nothing broader:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "sqs:ReceiveMessage",
      "sqs:DeleteMessage",
      "sqs:ChangeMessageVisibility",
      "sqs:GetQueueAttributes"
    ],
    "Resource": "arn:aws:sqs:us-east-1:123456789:my-queue"
  }]
}
Wrap-up & Next Steps
The core insight: SQS Visibility Timeout is a lease, not a lock — if your consumer doesn't finish and delete the message before the clock runs out, SQS will hand that message to the next available worker, making idempotent processing and correctly-sized timeouts non-negotiable in any distributed consumer architecture.
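One way to make a handler idempotent is to record a deduplication key and skip messages already seen. The sketch below is deliberately minimal and rests on assumptions: it uses an in-memory set and the SQS `MessageId` as the key. A real system would more likely use a durable store with an atomic conditional write, and possibly a business-level key instead. The function name `make_idempotent_handler` is hypothetical.

```python
def make_idempotent_handler(handler, processed_ids: set):
    """Wrap a handler so a redelivered message does not repeat its side effects."""
    def handle(message: dict) -> bool:
        key = message["MessageId"]
        if key in processed_ids:
            return False  # duplicate delivery: skip side effects
        handler(message)
        processed_ids.add(key)  # record only after the handler succeeds
        return True
    return handle


charges = []
handle = make_idempotent_handler(lambda m: charges.append(m["Body"]), set())
message = {"MessageId": "abc-123", "Body": "charge $10"}

handle(message)  # first delivery: side effect runs
handle(message)  # redelivery after a timeout: skipped
print(charges)   # ['charge $10']
```

Note that the check-then-record step here is not atomic across processes, so two workers racing on the same message could both pass the check. That is why a shared store with conditional writes is the usual production choice.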
Next Steps:
- Audit your current queue timeout against your p99 processing latency in CloudWatch.
- Implement the heartbeat pattern for any consumer with variable job duration.
- Configure a DLQ to catch poison-pill messages that repeatedly exceed the timeout.
- 📖 Official AWS Docs: SQS Visibility Timeout
Glossary
- Visibility Timeout: The period (0s–12h) during which a received SQS message is hidden from other consumers, giving the current consumer time to process and delete it.
- ReceiptHandle: A temporary, unique token returned with each message by a ReceiveMessage call, required to delete or extend visibility of that specific message instance.
- Idempotency: The property of an operation that produces the same result whether executed once or multiple times; essential for safe SQS message processing.
- Dead Letter Queue (DLQ): A secondary SQS queue that receives messages exceeding a configured maxReceiveCount, isolating repeatedly-failing messages for inspection.
- ChangeMessageVisibility: An SQS API action that resets the visibility timeout clock for an in-flight message, enabling the heartbeat pattern for long-running jobs.