Every engineer has a deploy story.
The one that went wrong at the worst possible time. Friday afternoon. Holiday weekend. 10 minutes before a big demo.
Mine was a Tuesday. 16:47. I pushed a config change to production. By 17:03 we had lost payment processing for 8,000 users.
The change was one line. A timeout value. I was “just” increasing it from 5 seconds to 15.
That one line, combined with an already-degraded upstream dependency, caused connection pool exhaustion that took down checkout completely.
The deploy took 30 seconds. The recovery took 47 minutes.
And the pre-deploy check that would have caught it? Five minutes.
Why deploys fail
Not because engineers are careless. Because a deploy is the highest-risk moment in a system’s lifecycle, and most teams treat it like a routine operation.
Every deploy is a change to a running system. Running systems are in equilibrium. Changes break equilibrium.
The question isn’t whether your deploy will cause problems. It’s whether you’ll catch the problems before they reach users.
That’s what a pre-deploy checklist does.
The 5-minute checklist
I built this after my payment processing incident. Every item on it came from a real incident — mine or someone else’s.
1. What exactly is changing? (60 seconds)
Write it down. One sentence.
“Increasing Stripe client timeout from 5s to 15s in payment-service.”
If you can’t write it in one sentence, you don’t understand the change well enough to deploy it.
This sounds obvious. It isn’t. I’ve seen engineers deploy changes they couldn’t articulate clearly — and discover mid-incident that they didn’t fully understand what they’d changed.
2. What’s the blast radius? (60 seconds)
If this goes wrong, what breaks?
One endpoint? One service? Everything downstream of this service?
If the blast radius is large, you need a canary deploy or a feature flag. Not because something will go wrong — because if something does go wrong, you want to limit the damage to 5% of users, not 100%.
3. Do you have a rollback plan? (60 seconds)
Not “I can roll back.” The exact command.
```
kubectl set image deployment/payment-service payment-service=payment-service:v2.4.0
```
Written down. Ready to run. Not something you’ll reconstruct under pressure at 17:15 when checkout is down.
If you don’t have a rollback plan, you don’t have a deploy plan.
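For the timeout change above, a written-down rollback can be as small as this. A minimal sketch, assuming a Kubernetes deployment named payment-service as in the command above; names and tags are illustrative.

```bash
#!/usr/bin/env bash
# rollback.sh: the exact, pre-written rollback for the timeout change above
set -euo pipefail

# Pin the previous known-good image explicitly
kubectl set image deployment/payment-service \
  payment-service=payment-service:v2.4.0

# Don't declare victory until the rollback has actually rolled out
kubectl rollout status deployment/payment-service --timeout=120s
```

If you didn’t record the previous tag, kubectl rollout undo deployment/payment-service gets you back one revision, but the explicit version leaves nothing to reconstruct at 17:15.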
4. What does success look like? (60 seconds)
What metrics will you watch for the first 10 minutes after deploy?
Error rate. Latency. Specific business metrics affected by this change.
If you’re changing payment timeout: watch payment success rate, connection pool utilization, Stripe API response times.
You can’t know something is wrong if you don’t know what right looks like.
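If your metrics live in Prometheus, “watch the error rate” can be a concrete query instead of a vibe. A minimal sketch, assuming an http_requests_total counter labeled by service and status; the address and metric names are placeholders for whatever your stack actually exposes.

```bash
PROM="http://prometheus:9090"   # hypothetical Prometheus address
QUERY='sum(rate(http_requests_total{service="payment-service",status=~"5.."}[5m]))
     / sum(rate(http_requests_total{service="payment-service"}[5m]))'

# Current 5xx ratio for payment-service
curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1]'
```

Run it once before the deploy, too. The baseline is what makes the post-deploy number mean anything.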
5. Is now the right time? (60 seconds)
Three questions:
Is traffic at a normal level? (Don’t deploy during peak hours unless you have to.)
Is the system healthy right now? (A degraded dependency + your change = amplified risk.)
Is someone available to help if this goes wrong? (Friday at 17:45 is not the time to find out your change has a problem.)
If any answer is no, wait.
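“Is the system healthy right now?” doesn’t have to be a feeling either. A quick spot check, assuming kubectl access, a metrics-server for kubectl top, and an illustrative app=payment-service label:

```bash
# Any recent restarts or CrashLoopBackOff? That's a "wait" signal.
kubectl get pods -l app=payment-service

# Unusual CPU or memory before you've changed anything? Also a "wait" signal.
kubectl top pods -l app=payment-service
```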
The deploy process that prevents most incidents
The checklist is the before. The deploy process is the during.
Step 1: Deploy to 5% of traffic first
Canary deploys catch problems before they affect everyone. If error rate spikes on 5% of traffic, you’ve prevented an incident for the other 95%.
If you don’t have canary infrastructure: use a feature flag. Deploy the code to everyone, enable the behavior for 5% of users.
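If you don’t have a flag service either, the core mechanic is small enough to sketch: hash a stable user ID into a bucket from 0 to 99 and enable the new behavior below a cutoff. ROLLOUT_PERCENT and new_timeout_enabled are illustrative names, not a real flag library.

```bash
ROLLOUT_PERCENT=5

# Stable hash -> bucket 0..99, so the same user always gets the same answer
new_timeout_enabled() {
  local user_id="$1"
  local bucket=$(( 0x$(printf '%s' "$user_id" | md5sum | cut -c1-6) % 100 ))
  (( bucket < ROLLOUT_PERCENT ))
}

if new_timeout_enabled "user-8412"; then
  timeout_seconds=15   # new behavior: the 5%
else
  timeout_seconds=5    # old behavior: everyone else
fi
```

In practice this lives in your application code rather than a shell script, but the bucketing is the same, and ramping to 100% is just raising the cutoff.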
Step 2: Watch for 10 minutes before going to 100%
The metrics you identified in step 4. Watch them. Don’t walk away.
Most deploy-related incidents show their signal within the first 5–10 minutes. If you’re watching, you’ll catch them before they become outages.
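“Don’t walk away” can be enforced by a loop. A sketch that polls the error-rate query from item 4 every 30 seconds for 10 minutes; PROM, QUERY, and the 1% threshold are the placeholders from that sketch.

```bash
# check_error_rate: the curl/jq query from item 4, wrapped as a function
check_error_rate() {
  curl -sG "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
    | jq -r '.data.result[0].value[1]'
}

# 20 checks, 30 seconds apart = 10 minutes of actually watching
for i in $(seq 1 20); do
  rate=$(check_error_rate)
  echo "$(date -u +%H:%M:%S) 5xx ratio: $rate"
  if awk "BEGIN { exit !($rate > 0.01) }"; then
    echo "Error rate above 1% on the canary. Run the rollback plan." >&2
    exit 1
  fi
  sleep 30
done
echo "10 minutes clean. Safe to ramp to 100%."
```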
Step 3: Document what you deployed
One line in your team’s deploy log: what changed, when, who deployed it.
In 6 months when something breaks and someone asks “what changed before this started?” — that log is the answer.
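This doesn’t need tooling. An append-only file is enough to start; the path and format here are only a suggestion.

```bash
# One line per deploy: when, what, who
echo "$(date -u +%Y-%m-%dT%H:%MZ) | payment-service v2.5.0 | Stripe timeout 5s -> 15s | devrim" \
  >> deploys.log
```

Six months later, grep payment-service deploys.log answers “what changed?” in seconds.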
The deploys that always go wrong
After enough incidents you start seeing patterns.
Friday afternoon deploys. Something goes wrong, your team is leaving for the weekend, you’re debugging alone.
“Quick” config changes. The ones that don’t seem to need review. They do.
Deploys to a degraded system. Your cache is running slow. This is the worst time to change anything — your change will get blamed for every pre-existing problem.
Deploys without monitoring. You shipped it, assumed it worked, found out 3 hours later from a support ticket.
Deploys with no rollback plan. The ones where the database migration can’t be reversed. Now you’re committed regardless of what happens.
What changes after you use this checklist
Not zero incidents. Incidents still happen.
But two things change.
First: the incidents that do happen are caught faster. Because you know exactly what changed, you know where to look first.
Second: the incidents that would have been caused by your deploy — don’t happen. Because you caught the risk before shipping.
The 5 minutes before a deploy are the highest-leverage 5 minutes in a deployment cycle.
Senior engineers know this. They never skip them.
If you want the complete pre-deploy system:
The full checklist, rollback templates, and the monitoring setup we use before every significant deploy.
👉 Production Incident Prevention Kit
And if you want the full production engineering system:
I write about production engineering, system design, and backend development every week.
→ Subscribe here: Devrim’s Engineering Notes