Most bad outages get worse because we start with guesses and broad changes. Over time I moved to a stricter sequence: contain first, then investigate.

Step 1: Contain impact before root cause hunting

If users are down, I prioritize reducing the blast radius:

  • isolate the failing node or upstream
  • route traffic to a known-good path
  • freeze non-essential config changes

This is not avoidance. It creates stable ground so diagnosis is based on facts, not moving targets.
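The "route traffic to a known-good path" step can be sketched in a few lines. This is a minimal illustration, not my actual tooling: a probe that walks a candidate list of upstreams (hostnames and ports here are hypothetical) and returns the first one that still accepts a TCP connection, so traffic can be pointed at it while diagnosis continues.

```python
import socket

def first_healthy(upstreams, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection, else None.

    A crude liveness probe: it only proves the port is reachable, not that
    the application behind it is serving correct responses.
    """
    for host, port in upstreams:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # connection refused or timed out: treat as unhealthy, keep looking
            continue
    return None
```

A real containment step would flip the proxy's upstream config to the returned target; the probe just makes the "known-good" choice explicit instead of guessed.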

Step 2: Compare working vs failing paths

I always want one good reference and one bad reference. For network and app-edge issues that usually means:

  • compare headers at the proxy boundary
  • compare DNS resolution from two network segments
  • compare TLS handshake behavior from internal and external clients

Diffing two concrete states beats scanning random logs hoping for a clue.
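The DNS comparison above can be done with nothing but the standard library. A sketch, assuming you have one known-good and one failing hostname (the function names are mine, and comparing across two network segments would mean running this from a host in each segment):

```python
import socket

def resolve(name, port=443):
    """Return the set of IPs the local resolver hands back for name."""
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        return {f"error: {exc}"}
    return {info[4][0] for info in infos}

def diff_refs(good_name, bad_name):
    """Diff a known-good reference against the failing one."""
    g, b = resolve(good_name), resolve(bad_name)
    return {"only_good": g - b, "only_bad": b - g, "shared": g & b}
```

An empty `shared` set, or an `error:` entry on only one side, is exactly the kind of concrete diff that beats log scanning: it points at the resolver path rather than the application.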

Step 3: Kill assumptions quickly

Assumptions are where time disappears. Common examples:

  • "DNS is fine" without checking authoritative and recursive responses
  • "cert is valid" without validating full chain and host match
  • "upstream is healthy" based only on container status

I write the assumption down, run one command that could disprove it, then move on.
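The "cert is valid" assumption is one command away from tested. A sketch using the standard `ssl` module (the function name is mine; this needs network reachability to the target, so in an air-gapped segment the failure reason will say so):

```python
import socket
import ssl

def cert_assumption_holds(host, port=443, timeout=5.0):
    """Try to disprove 'the cert is valid'.

    create_default_context() verifies the full chain against the system
    trust store, and server_hostname= enforces the host match — the two
    checks people skip when they eyeball a single certificate.
    Returns (True, subject) on success, (False, reason) on failure.
    """
    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                return True, tls.getpeercert().get("subject")
    except (ssl.SSLError, OSError) as exc:
        return False, str(exc)
```

Either outcome is progress: a `True` crosses the assumption off the list, a `False` carries the exact reason (expired, wrong host, incomplete chain, unreachable) instead of a guess.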

Step 4: Keep a decision log during incident work

A short running log saves me from loops:

  • timestamp
  • action
  • observed result
  • next decision

That record is more useful than a polished postmortem written from memory the next day.
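The log does not need tooling: one append-only file of JSON lines covers all four fields. A minimal sketch (path and field names are my own choices):

```python
import json
import time
from pathlib import Path

def log_decision(path, action, result, next_step):
    """Append one timestamped entry to an incident decision log.

    JSON lines: one object per line, append-only, trivially greppable,
    and safe to write to mid-incident without locking or a schema.
    """
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "action": action,
        "result": result,
        "next": next_step,
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

Because each entry records the *next* decision, rereading the tail of the file is enough to notice you are about to repeat an action you already tried — which is the loop the log exists to break.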

A real pattern that repeats

Recent reverse-proxy incidents in my lab showed different symptoms but shared the same root shape: stale assumptions about DNS or upstream health checks. The fix was not heroics. It was enforcing the same checklist every time.

Operational discipline is not glamorous, but it scales. It also makes handoff easier when life happens and someone else needs to step in.