
The Problem
You’ve built a nice little API using AWS API Gateway, serving as a webhook endpoint for an external Order Management System. It takes orders from your web app, saves them to a database, and calls a few downstream services. It runs through this stack: API Gateway → Lambda → Database.
It works perfectly on a normal day.
Then one day…
- Your marketing team runs a big sale.
- Or a partner’s integration script gets stuck in a loop.
- Or a new feature upstream pushes thousands of requests by mistake.
Suddenly you’re facing an event storm - a huge, unexpected spike in requests hitting your API all at once.
What’s an Event Storm?
An event storm is when the number of incoming requests in a short time is way higher than your system normally handles. It’s not a slow increase. It’s a flood.
Example triggers:
- Black Friday or Diwali sale on your e-commerce site.
- A cron job gone rogue sending thousands of API calls per second.
- A botnet hammering your public endpoint.
- A sudden surge in user signups after media coverage.
Quick note on Lambda scaling: Lambda doesn’t go from 0 → 100K concurrent executions instantly. Each region has a burst concurrency limit (for example ~3000 in us-east-1) and then scales up by about 500–1000 per minute.
So when I say “100K requests hit,” I mean the inbound spike hitting API Gateway, not the actual concurrency Lambda reached. The scaling ramp-up actually makes the need for a buffer more obvious: SQS can absorb the full burst immediately, while Lambda drains it at the pace it can realistically scale to.
Why It’s a Problem in AWS Lambda
AWS Lambda scales fast, which sounds great until you realise:
- Every Lambda execution counts toward your account concurrency limit.
- If you hit that limit, other Lambdas in your account start getting throttled too, since they share the same regional concurrency pool.
- Downstream systems (like a database) may crash from too many parallel requests.
With API Gateway directly invoking Lambda, an event storm can burn through concurrency in seconds.
The Safer Approach
We’ll redesign the flow so that:
- API Gateway sends requests to SQS instead of Lambda directly.
- If SQS accepts it, API Gateway returns 202 Accepted with an ack id (the SQS MessageId).
HTTP/1.1 202 Accepted

{"status":"accepted","messageId":"c1c7...","receivedAt":1735296623123}
- Processing happens later via the consumer Lambda.
- Lambda reads from SQS in controlled batches and upserts into the database.
- Concurrency is capped so we never overload the database.
- Failed messages go to a DLQ for later reprocessing.
This way, the storm is buffered in SQS, and you can process at a steady pace without losing data.
Architecture Overview
- API Gateway -> receives incoming HTTP requests.
- SQS (Standard Queue) -> buffers all requests.
- Lambda Consumer -> pulls messages from SQS in small batches.
- DLQ -> stores failed messages for later replay.
Don’t forget about caching at the API layer. If your API is serving repetitive data (like catalog lookups, product info, or configuration files), you can enable API Gateway caching. This stores responses in an in-memory cache so repeated requests don’t even reach Lambda or SQS. In my case the requests were mostly unique and user-specific, so caching didn’t help much. But for many APIs, caching at API GW can drastically reduce Lambda concurrency and costs.
Step 0: Recommended Starting Settings
You can tweak these later, but start here:
- Lambda timeout: 30s
- Lambda reserved concurrency (the max number of concurrent executions this function can ever use): 50
- SQS batch size: 10
- Max batching window (how long Lambda waits to gather more records before invoking the function): 2s
- SQS visibility timeout (the time a message stays hidden after being picked up, before it can be retried): 90s
- DLQ maxReceiveCount (the number of times a message can fail before being sent to the DLQ): 5
Why these help:
- Max parallel work = 50 (Lambda reserved concurrency) × 10 (SQS batch size) = 500 messages at once.
- SQS buffers the rest during a spike.
- Visibility timeout prevents duplicate processing.
- DLQ ensures nothing is lost.
Step 1: Create SQS and DLQ
Create DLQ
aws sqs create-queue --queue-name orders-dlq
Create main queue with DLQ attached
DLQ_URL=$(aws sqs get-queue-url --queue-name orders-dlq --query 'QueueUrl' --output text)
DLQ_ARN=$(aws sqs get-queue-attributes --queue-url $DLQ_URL --attribute-names QueueArn --query 'Attributes.QueueArn' --output text)

aws sqs create-queue \
  --queue-name orders \
  --attributes '{
    "RedrivePolicy":"{\"deadLetterTargetArn\":\"'"$DLQ_ARN"'\",\"maxReceiveCount\":\"5\"}",
    "VisibilityTimeout":"90"
  }'
Step 2: Connect API Gateway to SQS
Option A: Direct integration
- No Lambda in between.
- API Gateway sends request body directly to SQS.
IAM policy for the API Gateway integration role
{ "Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Action": ["sqs:SendMessage"], "Resource": "arn:aws:sqs:<REGION>:<ACCOUNT_ID>:orders" }]}
Option B: Producer Lambda
If you need validation/auth before pushing to SQS:
// producer/index.mjs
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.QUEUE_URL;

export const handler = async (event) => {
  const body = JSON.parse(event.body || "{}");
  if (!body.orderId) {
    return { statusCode: 400, body: JSON.stringify({ error: "orderId required" }) };
  }
  await sqs.send(new SendMessageCommand({
    QueueUrl: QUEUE_URL,
    MessageBody: JSON.stringify(body)
  }));
  return { statusCode: 202, body: JSON.stringify({ status: "accepted" }) };
};
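Either way, once the route is deployed, a client call looks roughly like this (the URL and payload are made up for illustration):

curl -i -X POST "https://<API_ID>.execute-api.<REGION>.amazonaws.com/prod/orders" \
  -H "Content-Type: application/json" \
  -d '{"orderId":"ord-123","quantity":2}'

# Expected: HTTP/1.1 202 Accepted with a small JSON ack body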
API Gateway caching reminder: If traffic includes repeated requests, enable API Gateway caching. It cuts load before it even reaches SQS or Lambda.
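If caching does fit your traffic, here is a minimal sketch of enabling it on a REST API stage (the stage name and the 0.5 GB cache size are assumptions; per-method TTLs can be tuned in the stage's method settings):

aws apigateway update-stage \
  --rest-api-id <REST_API_ID> \
  --stage-name prod \
  --patch-operations \
    op=replace,path=/cacheClusterEnabled,value=true \
    op=replace,path=/cacheClusterSize,value=0.5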
Step 3: Create the Consumer Lambda
// consumer/index.mjs
export const handler = async (event) => {
  for (const record of event.Records) {
    try {
      const msg = JSON.parse(record.body);
      await processOrder(msg);
    } catch (err) {
      console.error("Failed for messageId", record.messageId, err);
      throw err; // forces SQS to retry or send to DLQ
    }
  }
};

async function processOrder(msg) {
  await new Promise(r => setTimeout(r, 150)); // simulate work
}
Event Source Mapping
- Batch size: 10
- Max batching window: 2s
- Reserved concurrency: 50 (prevents overload) - see the CLI sketch below.
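A sketch of applying these settings with the CLI, assuming the consumer function is called orders-consumer (adjust names and ARNs):

# Cap how many instances of the consumer can run at once
aws lambda put-function-concurrency \
  --function-name orders-consumer \
  --reserved-concurrent-executions 50

# Poll the queue in batches of 10, waiting up to 2s to fill a batch
aws lambda create-event-source-mapping \
  --function-name orders-consumer \
  --event-source-arn arn:aws:sqs:<REGION>:<ACCOUNT_ID>:orders \
  --batch-size 10 \
  --maximum-batching-window-in-seconds 2

Optionally add --function-response-types ReportBatchItemFailures so only the failed messages in a batch are retried instead of the whole batch; that requires the handler to return a batchItemFailures list rather than throwing.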
Step 3.1: Don’t kill your database connections
One hidden problem when Lambda talks to RDS: each Lambda instance may open a new DB connection. During a spike, you can exhaust DB connections fast.
The fix is to use RDS Proxy. It sits between Lambda and the database, pools connections, and reuses them across Lambda invocations. This keeps your database safe from connection storms.
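A rough sketch of creating one with the CLI - the proxy name, secret ARN, IAM role, and subnets below are placeholders, and the proxy expects the database credentials to live in Secrets Manager:

aws rds create-db-proxy \
  --db-proxy-name orders-proxy \
  --engine-family POSTGRESQL \
  --auth AuthScheme=SECRETS,SecretArn=<DB_SECRET_ARN>,IAMAuth=DISABLED \
  --role-arn <PROXY_ROLE_ARN> \
  --vpc-subnet-ids <SUBNET_ID_1> <SUBNET_ID_2> \
  --require-tls

aws rds register-db-proxy-targets \
  --db-proxy-name orders-proxy \
  --db-instance-identifiers <DB_INSTANCE_ID>

Then point the consumer Lambda's DB host environment variable at the proxy endpoint instead of the instance endpoint.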
Alternatives:
- For PostgreSQL/MySQL, some teams also run PgBouncer on ECS/Fargate.
- Or, for workloads with unpredictable spikes, consider DynamoDB which doesn’t have connection limits.
Step 4: Tune for Storms
- Visibility timeout ≥ 3 × Lambda timeout.
- Do throughput math:
parallelism = reserved_concurrency × batch_size
records/sec ≈ parallelism / avg_processing_seconds
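Worked example with the Step 0 defaults: parallelism = 50 × 10 = 500. If each record takes ~0.15 s and records within a batch are processed in parallel, that is roughly 500 / 0.15 ≈ 3,300 records/sec. The consumer sketched above processes its batch sequentially, so its effective rate is closer to 50 / 0.15 ≈ 330 records/sec - either way, SQS simply holds whatever exceeds that rate until capacity frees up.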
Step 5: Alerts
Queue depth alarm
aws cloudwatch put-metric-alarm \
  --alarm-name "orders-queue-depth-high" \
  --metric-name ApproximateNumberOfMessagesVisible \
  --namespace AWS/SQS \
  --dimensions Name=QueueName,Value=orders \
  --statistic Average \
  --period 60 \
  --threshold 50000 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 5 \
  --alarm-actions <SNS_TOPIC_ARN>
Oldest message age alarm -> warns you when messages have been sitting in the queue for too long (the metric is measured in seconds).
# threshold 600 -> warn if the oldest message is more than 10 minutes old
# evaluation-periods 5 with period 60 -> breached for 5 minutes straight
aws cloudwatch put-metric-alarm \
  --alarm-name "orders-oldest-message-too-old" \
  --namespace "AWS/SQS" \
  --metric-name "ApproximateAgeOfOldestMessage" \
  --dimensions "Name=QueueName,Value=orders" \
  --statistic "Maximum" \
  --period 60 \
  --threshold 600 \
  --comparison-operator "GreaterThanThreshold" \
  --evaluation-periods 5 \
  --treat-missing-data "notBreaching" \
  --unit "Seconds" \
  --alarm-actions "<SNS_TOPIC_ARN>"
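It's also worth alarming on the DLQ itself - anything landing there means a message failed repeatedly. A sketch in the same style (the threshold of 1 is an assumption):

aws cloudwatch put-metric-alarm \
  --alarm-name "orders-dlq-not-empty" \
  --namespace AWS/SQS \
  --metric-name ApproximateNumberOfMessagesVisible \
  --dimensions Name=QueueName,Value=orders-dlq \
  --statistic Maximum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --evaluation-periods 1 \
  --treat-missing-data notBreaching \
  --alarm-actions <SNS_TOPIC_ARN>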
Step 6: DLQ Replay
Fix the bug → replay messages from DLQ to main queue with a small Lambda.
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const DLQ_URL = process.env.DLQ_URL;
const MAIN_URL = process.env.MAIN_URL;

export const handler = async () => {
  // Pull up to 10 messages per run; invoke again until the DLQ is empty.
  const res = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: DLQ_URL,
    MaxNumberOfMessages: 10,
    WaitTimeSeconds: 1
  }));
  if (!res.Messages) return;
  for (const m of res.Messages) {
    await sqs.send(new SendMessageCommand({ QueueUrl: MAIN_URL, MessageBody: m.Body }));
    await sqs.send(new DeleteMessageCommand({ QueueUrl: DLQ_URL, ReceiptHandle: m.ReceiptHandle }));
  }
};
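One way to drive it, assuming the replay function is named dlq-replay and the DLQ_URL variable from Step 1 is still set - each run drains at most 10 messages, so invoke it repeatedly (or put it on an EventBridge schedule) until the DLQ is empty:

# How many messages are waiting in the DLQ?
aws sqs get-queue-attributes \
  --queue-url "$DLQ_URL" \
  --attribute-names ApproximateNumberOfMessagesVisible

# Run one replay pass
aws lambda invoke --function-name dlq-replay /dev/null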
- For the oldest-message alarm, pick the threshold (600 in the command above) well below your SQS message retention (default 4 days) but high enough to avoid noise. Start with 600-1800 seconds for most APIs, or align it with your SLA.
- Use Maximum (not Average) - you care about the single oldest message.
- Pair this with the queue depth alarm you already have.
Important Pointers
- Don’t let Lambda’s scaling run wild in a spike.
- Do buffer in SQS and process at a pace your system can handle.
- Always set a DLQ - losing data is worse than slow processing.
- Monitor queue depth and message age - that’s your early warning.
- Use RDS Proxy when Lambdas talk to relational databases.
- Add API Gateway caching if traffic is repetitive.
One last note:
This whole design works best for asynchronous APIs - places where you can acknowledge a request and process it later.
But what if your API is synchronous and the client needs a response right away? In that case, queues won’t help. You need other tools like rate limiting, provisioned concurrency, and sometimes containers.
I wrote a follow-up on handling synchronous API traffic spikes here: https://medium.com/aws-in-plain-english/surviving-traffic-surges-in-sync-apis-rate-limits-warm-lambdas-and-smart-scaling-d04488ad94db?sk=6a2f4645f254fd28119b2f5ab263269d
Together, these two posts cover both sides of the coin:
- Async APIs → buffer with SQS.
- Sync APIs → throttle, pre-warm, or containerize.
☕️ Found This Helpful?
If this guide helped you tame an event storm or saved you from a wild goose chase, you can buy me a coffee.