Most bad outages get worse because we start with guesses and broad changes. Over time I moved to a stricter sequence: contain first, then investigate.

Step 1: Contain impact before root cause hunting

If users are down, I prioritize reducing the blast radius:

  • isolate the failing node or upstream
  • route traffic to a known-good path
  • freeze non-essential config changes

This is not avoidance. It creates stable ground so diagnosis is based on facts, not moving targets.
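The "route traffic to a known-good path" step can be sketched in a few lines. This is a minimal illustration, not my actual tooling: a probe that walks a candidate list of upstreams (hostnames and ports here are hypothetical) and returns the first one that still accepts a TCP connection, so traffic can be pointed at it while diagnosis continues.

```python
import socket

def first_healthy(upstreams, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection, else None.

    A crude liveness probe: it only proves the port is reachable, not that
    the application behind it is serving correct responses.
    """
    for host, port in upstreams:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # connection refused or timed out: treat as unhealthy, keep looking
            continue
    return None
```

A real containment step would flip the proxy's upstream config to the returned target; the probe just makes the "known-good" choice explicit instead of guessed.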

Step 2: Compare working vs failing paths

I always want one good reference and one bad reference. For network and app-edge issues that usually means:

  • compare headers at the proxy boundary
  • compare DNS resolution from two network segments
  • compare TLS handshake behavior from internal and external clients

Diffing two concrete states beats scanning random logs hoping for a clue.
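The DNS comparison above can be done with nothing but the standard library. A sketch, assuming you have one known-good and one failing hostname (the function names are mine, and comparing across two network segments would mean running this from a host in each segment):

```python
import socket

def resolve(name, port=443):
    """Return the set of IPs the local resolver hands back for name."""
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return {f"error: {exc}"}
    return {info[4][0] for info in infos}

def diff_refs(good_name, bad_name):
    """Diff a known-good reference against the failing one."""
    g, b = resolve(good_name), resolve(bad_name)
    return {"only_good": g - b, "only_bad": b - g, "shared": g & b}
```

An empty `shared` set, or an `error:` entry on only one side, is exactly the kind of concrete diff that beats log scanning: it points at the resolver path rather than the application.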

Step 3: Kill assumptions quickly

Assumptions are where time disappears. Common examples:

  • "DNS is fine" without checking authoritative and recursive responses
  • "cert is valid" without validating full chain and host match
  • "upstream is healthy" based only on container status

I write the assumption down, run one command that could disprove it, then move on.
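The "cert is valid" assumption is one command away from tested. A sketch using the standard `ssl` module (the function name is mine; this needs network reachability to the target, so in an air-gapped segment the failure reason will say so):

```python
import socket
import ssl

def cert_assumption_holds(host, port=443, timeout=5.0):
    """Try to disprove 'the cert is valid'.

    create_default_context() verifies the full chain against the system
    trust store, and server_hostname= enforces the host match — the two
    checks people skip when they eyeball a single certificate.
    Returns (True, subject) on success, (False, reason) on failure.
    """
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.getpeercert().get("subject")
    except (ssl.SSLError, OSError) as exc:
        return False, str(exc)
```

Either outcome is progress: a `True` crosses the assumption off the list, a `False` carries the exact reason (expired, wrong host, incomplete chain, unreachable) instead of a guess.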

Step 4: Keep a decision log during incident work

A short running log saves me from loops:

  • timestamp
  • action
  • observed result
  • next decision

That record is more useful than a polished postmortem written from memory the next day.
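The log does not need tooling: one append-only file of JSON lines covers all four fields. A minimal sketch (path and field names are my own choices):

```python
import json
import time
from pathlib import Path

def log_decision(path, action, result, next_step):
    """Append one timestamped entry to an incident decision log.

    JSON lines: one object per line, append-only, trivially greppable,
    and safe to write to mid-incident without locking or a schema.
    """
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "action": action,
        "result": result,
        "next": next_step,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because each entry records the *next* decision, rereading the tail of the file is enough to notice you are about to repeat an action you already tried — which is the loop the log exists to break.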

A real pattern that repeats

Recent reverse-proxy incidents in my lab showed different symptoms but shared the same root shape: stale assumptions about DNS or upstream health checks. The fix was not heroics. It was enforcing the same checklist every time.

Operational discipline is not glamorous, but it scales. It also makes handoff easier when life happens and someone else needs to step in.