Every backend engineer has been there: your code works perfectly in staging. Tests pass. You deploy to production. Then, at 2 AM, a customer reports that "everything is broken."
Except this time, you can't just SSH into the server. The production environment is the customer's infrastructure — their cloud account, their VPC, their security policies, their firewall rules. You're debugging blind.
This is the reality for engineers working in customer-managed environments: solutions engineers doing POCs, implementation engineers deploying enterprise software, and forward deployed engineers embedding with customers. Your debugging playbook doesn't work when you don't control the infrastructure.
Here's how to debug production issues when you're operating in someone else's environment.
Why Customer Infrastructure Breaks Your Normal Debugging Flow
When you debug in your own infrastructure:
- Full access to logs, metrics, traces
- Can restart services, roll back deploys, modify configs
- Know the architecture (because you built it)
- Can replicate locally
When you debug in customer infrastructure:
- Limited access — Customer security policies restrict what you can see
- Can't touch production — No restart permissions, no config changes without approval
- Unknown architecture — Every customer has different setups (on-prem, AWS, Azure, hybrid cloud, air-gapped)
- Can't replicate locally — Customer data is proprietary/regulated
You're essentially doing surgery with mittens on.
The 5-Step Framework for Debugging Without Full Access
Step 1: Establish a Communication Channel (Before You Debug)
The worst debugging sessions start with:
- Engineer: "Can you send me the logs?"
- Customer (12 hours later): "What logs?"
- Engineer: "Application logs from the last hour."
- Customer (6 hours later): sends a screenshot of a log file
Before you start debugging, set up:
1. A dedicated Slack channel or video call. Don't debug over email. You need real-time back-and-forth.
2. Screen-sharing access (if security allows). Even if you can't SSH in, watching the customer run commands is 10x faster than asking them to copy-paste output.
3. A designated customer contact who has infrastructure access. Don't get stuck waiting for the customer's IT team to wake up.
Real example: When I was debugging a deployment failure for a healthcare customer, their security policy blocked all external SSH. I got on a Zoom call, shared my screen with a debugging checklist, and walked their DevOps engineer through each command. We found the issue (misconfigured network policy) in 20 minutes instead of 2 days of async email.
Step 2: Map the Architecture (Even If You Can't See It)
You need a mental model of the customer's infrastructure. Ask:
Infrastructure questions:
- "What cloud provider are you using? (AWS/Azure/GCP/on-prem)"
- "Is this Kubernetes, Docker, bare metal, or serverless?"
- "How is our application deployed? (Helm chart, Docker Compose, manual install)"
Networking questions:
- "Is the application in a private VPC?"
- "Are there firewalls or proxies between our service and external APIs?"
- "Do you use a load balancer or reverse proxy?"
Data questions:
- "Where is the database hosted? (same VPC, managed RDS, on-prem)"
- "Are there any data residency or compliance requirements? (HIPAA, GDPR, SOC2)"
Draw this on paper. Seriously. A 30-second sketch prevents hours of confusion.
Example architecture map (financial services customer):
[Our App Container] → [Internal Load Balancer] → [Customer Firewall]
↓
[Customer PostgreSQL (on-prem)]
↓
[Proxy Server] → [Internet] → [Our API]
Once I drew this, I realized: the customer's proxy was blocking outbound HTTPS. That's why our API calls were timing out.
Step 3: Debug with Indirect Signals (Not Direct Logs)
When you can't access logs directly, you need proxy signals.
A. Use Health Check Endpoints
If your app exposes /health or /metrics endpoints, these are gold.
Ask the customer to run:
curl http://localhost:8080/health
A good health check returns:
{
  "status": "degraded",
  "database": "connected",
  "redis": "connected",
  "external_api": "timeout"
}
Boom. Now you know it's the external API, not the database.
Pro tip: Build health checks that test every dependency (database, cache, message queue, external APIs). This is your debugging insurance policy.
B. Request Specific Log Snippets (Not Full Logs)
Don't ask for "all logs." You'll get a 500MB file you can't parse.
Instead, ask for targeted snippets:
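- "Can you run: docker logs <container> 2>&1 | grep ERROR | tail -50" (the 2>&1 matters because docker logs sends the container's stderr to stderr, which grep would otherwise never see)
- "Can you check the application logs between 2:00 and 2:05 AM UTC?"
- "Can you search the logs for the string 'connection refused'?"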
Give them the exact command. Don't assume they know grep.
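If the customer can't easily run shell pipelines, the same targeted filtering could be shipped as a tiny script. This is only a sketch: it assumes every log line begins with a timestamp shaped like `2024-01-01T02:03:00`, and `filter_log_lines` is a hypothetical helper whose format string you would adjust to your real logs.

```python
from datetime import datetime
from typing import Iterable, List


def filter_log_lines(lines: Iterable[str], start: datetime, end: datetime,
                     needle: str = "ERROR", limit: int = 50) -> List[str]:
    """Return the last `limit` lines inside [start, end] that contain `needle`."""
    matches = []
    for line in lines:
        stamp = line[:19]  # assumed layout, e.g. "2024-01-01T02:03:00"
        try:
            when = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading timestamp (stack traces, etc.)
        if start <= when <= end and needle in line:
            matches.append(line.rstrip("\n"))
    return matches[-limit:]


if __name__ == "__main__":
    sample = [
        "2024-01-01T01:59:59 ERROR too early",
        "2024-01-01T02:03:00 ERROR connection refused",
        "2024-01-01T02:04:00 INFO request served",
    ]
    window = (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 5))
    print(filter_log_lines(sample, *window))
```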
C. Use Application-Level Instrumentation
If you anticipated this scenario, you built observability into your app:
- Structured logging (JSON logs with level, timestamp, trace_id, message)
- Metrics export (Prometheus, StatsD, CloudWatch)
- Distributed tracing (OpenTelemetry, Jaeger)
If your app exports metrics to a URL the customer can access (e.g., http://localhost:9090/metrics), you can debug without touching logs.
Example metric:
database_query_duration_seconds{query="get_user"} 12.4
If queries are taking 12 seconds, you found your issue.
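For the structured logging item in the list above, one possible shape using only Python's standard `logging` module is sketched below. The `JsonFormatter` class, the `build_logger` helper, and the `trace_id` field name are illustrative assumptions, not part of any particular library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id is supplied per call via `extra=`; None when absent.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    build_logger().error("connection refused", extra={"trace_id": "abc123"})
```

One JSON object per line is what lets a customer search by `trace_id` or level with ordinary tools and send you just the matching lines.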
Step 4: Reproduce in a Similar Environment (Not Theirs)
You can't break the customer's production. So spin up a mirror environment.
Ask the customer:
- "What OS version are you running?"
- "What database version?"
- "What environment variables are set?"
- "Can you share your deployment manifest? (Kubernetes YAML, Docker Compose file)"
Then replicate:
# Create a similar environment
docker run -e ENV_VAR=value -v /data:/data <customer-os-image>
# Install the same database version
docker run -e POSTGRES_PASSWORD=<password> postgres:14.2 # Match their version exactly; the official image won't start without a password
Often, the issue reproduces locally once you match their setup.
Real bug I found this way: Customer was on PostgreSQL 12, we tested on PostgreSQL 14. Postgres 12 didn't support a JSON function we used. The query worked in our tests but failed in production. Reproducing with Postgres 12 caught it instantly.
Step 5: The "Isolation Elimination" Technique
When you're stuck, systematically eliminate variables.
Start from the outside in:
1. Can the customer reach external services?
curl https://api.example.com
If this fails, it's a network/firewall issue, not your app.
2. Can the app reach the database? (Run this from the host the app runs on.)
psql -h <db-host> -U <db-user> -d <db-name>
If this fails, it's a database connection issue.
3. Can the app start without processing requests? Deploy the app but don't send traffic. If it crashes on startup, it's a config issue, not a load issue.
4. Can the app process a single request? Send one test request. If it works, the problem likely appears only under load (race condition, connection pool exhaustion, etc.), so test at scale next.
Work backward until you isolate the failure point.
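The network-level part of that sequence can be scripted so the customer runs one command instead of several. The sketch below uses plain TCP connects from the Python standard library; `can_connect` and `first_failure` are hypothetical helper names, and a successful TCP connect only shows the port is reachable, not that the service behind it is healthy.

```python
import socket
from typing import List, Optional, Tuple


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False


def first_failure(steps: List[Tuple[str, str, int]], timeout: float = 3.0) -> Optional[str]:
    """Check (label, host, port) steps in order and return the first failing label.

    Returns None when every step connects.
    """
    for label, host, port in steps:
        if not can_connect(host, port, timeout):
            return label
    return None
```

For example, `first_failure([("external API", "api.example.com", 443), ("database", "<db-host>", 5432)])` returns the label of the first hop that can't be reached, or `None` if all of them connect. Ordering the steps from the outside in keeps the result aligned with the elimination order described above.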
Common "Customer Infrastructure" Gotchas
Here are the bugs I see repeatedly when debugging in customer environments:
1. Firewall Rules Blocking Outbound Traffic
Your app needs to call an external API. Works in staging. Fails in customer prod.
Cause: Customer firewall blocks outbound HTTPS on port 443.
How to detect:
# Ask customer to run:
curl -v https://api.example.com
# If timeout or "connection refused" → firewall issue
Fix: Ask customer to whitelist your API domain.
2. Environment Variables Not Set (Or Wrong)
Your app expects DATABASE_URL=postgres://.... Customer sets DB_URL=postgres://....
Cause: Deployment docs had a typo. Or customer assumed variable names.
How to detect:
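# Ask customer to run (in the same shell or container the app runs in):
env | cut -d= -f1 | grep DATABASE
# Prints variable names only, so credentials stay out of chat. If empty → variable not set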
Fix: Provide an exact .env template file. Don't make customers guess.
3. Database Connection Pool Exhaustion
App works fine for 10 users. Crashes at 100 users.
Cause: Customer's database only allows 20 concurrent connections. Your app opens 50.
How to detect:
-- Ask customer to run (in psql):
SELECT count(*) FROM pg_stat_activity;
SHOW max_connections;
-- If the count is at or near max_connections → connection pool issue
Fix: Cap your app's connection pool below the database's limit (e.g., a maximum pool size of 10 and an idle timeout of 30s).
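The sizing arithmetic behind that fix is simple enough to write down. This is a sketch, not a rule: `max_pool_size_per_instance` is a hypothetical helper, and the default of 3 reserved connections (headroom for admin sessions, migrations, and monitoring) is an arbitrary illustrative value.

```python
def max_pool_size_per_instance(server_max_connections: int, app_instances: int,
                               reserved_connections: int = 3) -> int:
    """Largest per-instance pool size that keeps the total under the server limit."""
    if app_instances < 1:
        raise ValueError("app_instances must be at least 1")
    usable = server_max_connections - reserved_connections
    return max(usable // app_instances, 0)


if __name__ == "__main__":
    # The scenario above: the database allows 20 connections, two app instances.
    print(max_pool_size_per_instance(20, app_instances=2))  # 8
```

The point of doing the division explicitly is that the limit applies to the database as a whole, so every additional app instance shrinks the pool each one can safely hold.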
4. Clock Skew (Yes, Really)
Your app generates JWT tokens. Customer reports "token expired" errors immediately.
Cause: Customer server clock is 10 minutes behind.
How to detect:
# Ask customer to run:
date -u
# Compare to your own system time in UTC
Fix: Customer runs NTP sync. Or you allow a small clock-skew leeway when validating token expiry.
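To make the leeway idea concrete, here is a minimal sketch of an expiry check that tolerates some disagreement between clocks. The function names and the 60-second default are assumptions for illustration; many JWT libraries expose a comparable leeway setting, and using theirs is preferable to hand-rolling validation.

```python
import time
from typing import Optional


def is_expired(exp: float, leeway_seconds: float = 60.0, now: Optional[float] = None) -> bool:
    """Return True if a token's `exp` (Unix seconds) has passed, allowing for clock skew."""
    current = time.time() if now is None else now
    return current > exp + leeway_seconds


def clock_skew_seconds(remote_unix_time: float, local_unix_time: Optional[float] = None) -> float:
    """Positive result means the remote clock is ahead of the local one."""
    local = time.time() if local_unix_time is None else local_unix_time
    return remote_unix_time - local


if __name__ == "__main__":
    # A token that expired 30 seconds ago is still accepted with 60s of leeway.
    print(is_expired(exp=time.time() - 30))
```

If the customer runs `date +%s` and reads you the number, `clock_skew_seconds` tells you how far apart the two clocks are. A 10-minute skew like the one above is far beyond any reasonable leeway, so in that case the clock itself needs fixing.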
5. Permissions Issues (User/Group/SELinux)
App crashes with "permission denied" when writing to /data.
Cause: App runs as user appuser, but /data is owned by root.
How to detect:
# Ask customer to run:
ls -la /data
# Check owner/group
Fix: chown -R appuser:appuser /data
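The same check can run inside the app at startup, so the failure shows up as a readable report rather than a crash. This sketch assumes a POSIX system (it calls `os.getuid`), and `describe_write_access` is a hypothetical helper. It reports ordinary ownership and mode bits only, so SELinux contexts may still need to be inspected separately.

```python
import os
import stat
import tempfile


def describe_write_access(path: str) -> dict:
    """Summarize who owns `path` and whether the current process can write to it."""
    info = os.stat(path)
    return {
        "path": path,
        "owner_uid": info.st_uid,
        "group_gid": info.st_gid,
        "mode": stat.filemode(info.st_mode),  # e.g. "drwxr-xr-x"
        "running_as_uid": os.getuid(),        # POSIX only
        "writable": os.access(path, os.W_OK),
    }


if __name__ == "__main__":
    # Demonstrated on a temporary directory; point it at /data in practice.
    with tempfile.TemporaryDirectory() as tmp:
        print(describe_write_access(tmp))
```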
The "Debug Toolkit" You Should Build Into Your App
If you anticipate customer-environment debugging, build these features:
1. Admin Debug Endpoints (Secured)
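from flask import Flask

app = Flask(__name__)

# require_auth and the check_*/get_*/list_* helpers are assumed to be
# defined elsewhere in your application.
@app.route('/admin/debug', methods=['GET'])
@require_auth
def debug_info():
    # Flask serializes a returned dict to a JSON response.
    return {
        "app_version": "1.2.3",
        "database_status": check_db_connection(),
        "external_api_status": check_api_connection(),
        "disk_usage": get_disk_usage(),
        "memory_usage": get_memory_usage(),
        "env_vars_set": list_env_vars(),  # Names only; don't expose values
    }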
Now the customer can hit GET /admin/debug and send you a JSON summary.
2. Verbose Logging Mode (Togglable)
Let customers enable DEBUG-level logs without redeploying:
# Customer runs:
curl -X POST http://localhost:8080/admin/logging -H 'Content-Type: application/json' -d '{"level": "DEBUG"}'
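On the server side, the handler behind such an endpoint mostly needs to validate the payload and call `setLevel`. The sketch below shows only that core logic with the standard library; `apply_log_level`, `ALLOWED_LEVELS`, and the `app` logger name are assumptions, and the HTTP routing and authentication around it are left to your framework.

```python
import json
import logging

ALLOWED_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR"}


def apply_log_level(request_body: str, logger_name: str = "app") -> str:
    """Parse a body like '{"level": "DEBUG"}' and apply it to the named logger.

    Returns the level name that was applied; raises ValueError on bad input.
    """
    try:
        level = str(json.loads(request_body)["level"]).upper()
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError("body must be JSON with a 'level' field") from exc
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported level: {level}")
    logging.getLogger(logger_name).setLevel(level)  # setLevel accepts level names
    return level


if __name__ == "__main__":
    print(apply_log_level('{"level": "DEBUG"}'))
```

Restricting the accepted values to a fixed set keeps a typo in the customer's request from silently leaving the logger at its old level.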
3. Self-Test Command
Include a self-test CLI command:
./app self-test
# Output:
✅ Database connection: OK
✅ Redis connection: OK
❌ External API: TIMEOUT
✅ Disk space: OK
Customer runs this before opening a support ticket.
When to Escalate (Don't Debug Forever)
Sometimes the issue isn't your app. Know when to escalate:
Escalate to customer IT if:
- Network connectivity issues (can't reach external APIs)
- Firewall/proxy/VPN configuration
- Infrastructure resource limits (CPU, memory, disk)
Escalate to your SRE team if:
- The issue reproduces in your staging environment
- Suspected bug in your application code
- Database schema migration failed
Escalate to product/sales if:
- Customer is using your product in an unsupported way
- Customer needs features you don't have
- Customer environment is fundamentally incompatible (e.g., they require on-prem but you're cloud-only)
Don't waste 3 days debugging a firewall rule. Recognize infrastructure issues and escalate in the first hour.
The Meta-Skill: Debugging Through a Proxy
The hardest part of debugging in customer environments isn't the technical problem. It's debugging through another human.
You're not running commands. You're asking someone else to run commands and interpret output for you.
This requires:
- Clear communication (exact commands, no jargon)
- Patience (the customer engineer might not be familiar with your stack)
- Trust-building (customers get defensive when things break)
I've debugged issues where the root cause was clear in 5 minutes, but it took 2 hours because the customer engineer didn't trust my proposed fix and needed to "run it by security."
That's fine. Debugging in customer environments is 50% technical, 50% stakeholder management.
Key Takeaways
- Set up real-time communication before debugging (Slack/Zoom, not email)
- Map the architecture even if you can't see it
- Debug with indirect signals (health checks, targeted logs, metrics)
- Reproduce in a similar environment (match their OS, DB version, config)
- Isolate variables systematically (network → database → app → load)
- Build debugging tools into your app (debug endpoints, self-tests, toggleable logging)
- Know when to escalate (don't debug firewall rules for 3 days)
If you're building enterprise software, customer-environment debugging is inevitable. The engineers who master it become invaluable — because most engineers can't (or won't) do it.
For engineers working in these high-stakes customer-facing roles, FDE Academy offers specialized training that covers production debugging, stakeholder management, and the unique challenges of embedded engineering work.