## Why this playbook exists
Incident-time failures are usually process failures first:
- unclear ownership
- weak evidence snapshots
- late rollback decisions
This playbook keeps Codex incident response deterministic while preserving recovery speed.
## Severity matrix
| Level | Typical impact | Default response posture |
|---|---|---|
| SEV-1 | user safety, auth, billing, or data correctness at risk | rollback-first evaluation |
| SEV-2 | major workflow degradation | bounded hotfix with tight verification |
| SEV-3 | isolated non-critical regression | scheduled corrective release |
Declare severity at incident start and revise only with evidence.
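The severity matrix above can be encoded so tooling defaults the posture instead of relying on memory under pressure. A minimal sketch; the function name `default_posture` is illustrative, and the mapping is taken directly from the table:

```python
# Severity -> default response posture, from the severity matrix.
SEVERITY_POSTURE = {
    "SEV-1": "rollback-first evaluation",
    "SEV-2": "bounded hotfix with tight verification",
    "SEV-3": "scheduled corrective release",
}


def default_posture(severity: str) -> str:
    """Return the default response posture for a declared severity.

    Raising on unknown levels enforces "declare severity at incident
    start" rather than silently continuing with no posture.
    """
    if severity not in SEVERITY_POSTURE:
        raise ValueError(f"unknown severity: {severity}")
    return SEVERITY_POSTURE[severity]
```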
## First-15-minutes protocol
- assign incident owner and deputy
- freeze unrelated merges in impacted surfaces
- capture baseline evidence (failing commands, logs, blast radius)
- open lane tracker with explicit next owner per lane
No next owner means no reliable continuity.
## Four-lane recovery model
Run in parallel with explicit boundaries:
- Triage lane: classify severity and confidence
- Mitigation lane: prepare feature-flag disable / hotfix / rollback
- Verification lane: run reproducible reruns and safety checks
- Comms lane: update internal/external stakeholders on cadence
If lanes edit the same files, define one merge owner immediately.
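The shared-file rule can be checked mechanically. A hypothetical sketch (the `Lane` shape and field names are assumptions, not part of the playbook): detect when two lanes touch the same files, which is the trigger for naming one merge owner.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    name: str
    next_owner: str  # "no next owner means no reliable continuity"
    files: set[str] = field(default_factory=set)


def needs_merge_owner(lanes: list[Lane]) -> bool:
    """True if any two lanes edit the same file (four-lane rule trigger)."""
    seen: dict[str, str] = {}
    for lane in lanes:
        for path in lane.files:
            if path in seen and seen[path] != lane.name:
                return True
            seen[path] = lane.name
    return False
```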
## Reversible decision ladder
Prefer the safest reversible option first:
- disable risky path via flag/config
- targeted patch with bounded scope
- full rollback to last known-good revision
Document rejected alternatives, not only chosen action.
## Evidence packet standard
Every major checkpoint must include:
- current severity + owner + timestamp
- impacted systems/users
- commands executed and outputs
- residual risk statement
- next checkpoint time + owner
This packet is the source of truth for handoffs.
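A handoff formatter can enforce the packet standard by refusing incomplete packets. A minimal sketch; the field keys mirror the bullet list above, and `render_packet` is an illustrative name:

```python
# Required evidence-packet fields, from the standard above.
REQUIRED_FIELDS = (
    "severity", "owner", "timestamp",
    "impacted", "commands", "residual_risk",
    "next_checkpoint", "next_owner",
)


def render_packet(packet: dict[str, str]) -> str:
    """Render an evidence packet for handoff; refuse incomplete packets."""
    missing = [f for f in REQUIRED_FIELDS if not packet.get(f)]
    if missing:
        raise ValueError(f"incomplete evidence packet, missing: {missing}")
    return "\n".join(f"- {f}: {packet[f]}" for f in REQUIRED_FIELDS)
```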
## Recovery completion gate
Close only when all are true:
- mitigation applied
- fresh rerun evidence is green
- residual risks documented
- follow-up hardening task assigned with owner/date
"Looks stable" is not closure criteria.
## Post-incident hardening loop
Within 24 hours:
- publish incident timeline with key decisions
- add at least one guardrail automation
- add one regression test for root-cause class
- assign follow-up owner and deadline
Incidents are expensive. Convert them into durable controls.
## Advanced anti-patterns

### Hero mode in one lane
Fast initially, fragile during handoff and fatigue windows.

### Mitigation without communication cadence
Code changes ship, but teams operate on stale assumptions.

### Closure without rerun proof
Confidence language cannot replace command evidence.
## Quick checklist
Before declaring resolved:
- severity + ownership recorded
- mitigation path justified
- fresh rerun evidence attached
- hardening follow-up assigned
Codex can accelerate recovery. This playbook keeps recovery governable.
## Scenario library (real pressure patterns)
Use this library to practice the playbook under realistic pressure instead of ideal paths.
### Scenario A — Auth token failure after release
- Signal: sudden spike in 401/403 from one service after deployment.
- Risk: support load surge + customer trust degradation.
- Decision branch: feature-flag rollback vs targeted hotfix.
- Evidence minimum: failing endpoint list, first bad deploy hash, rollback safety check output.
- Owner handoff: investigation owner → mitigation owner → verification owner.
### Scenario B — Billing state mismatch
- Signal: dashboard revenue events diverge from transaction ledger.
- Risk: financial and legal exposure.
- Decision branch: immediate write freeze vs selective replay.
- Evidence minimum: mismatch sample size, blast radius estimate, safe replay boundary.
- Owner handoff: data owner + incident commander dual sign-off required.
### Scenario C — Permission leak in edge flow
- Signal: low-volume but severe authorization bypass report.
- Risk: security incident escalation.
- Decision branch: emergency rollback and access revocation first, then patch.
- Evidence minimum: reproducible minimal case, revocation completion proof, audit trail snapshot.
## War-room artifact bundle (copy/paste)
### Incident Command Snapshot
- Incident ID:
- Severity:
- Incident commander:
- Current phase: detection | containment | recovery | hardening
- Latest stable revision:
- Candidate mitigation path:
- Risk if we wait 30 minutes:
- Next checkpoint at:
- Next owner:

### Checkpoint Decision Record
- Timestamp:
- Evidence reviewed:
- Decision: continue mitigation | rollback | escalate+pause
- Why this decision now:
- Rejected alternatives:
- Owner for execution:
- Owner for verification:

## Escalation message templates
### Internal engineering escalation
```text
[INCIDENT][SEV-X] <short summary>
Impact: <users/systems>
Current decision: <path>
Immediate ask: <approval/resource>
Next update: <time>
Owner: <name>
```

### Stakeholder update
```text
Status: investigating/mitigating/recovered
Customer impact: <plain language>
Current mitigation: <plain language>
Known unknowns: <top 1-2>
Next committed update time: <time>
```

## Closure acceptance criteria (hard gate)
Close incident response only when all conditions are true:
- mitigation path executed and verified with fresh command evidence
- rollback trigger and owner are still documented for 24h watch window
- residual risks are explicit (not "none" by default)
- hardening backlog item owners and due dates are assigned
- communications timeline is complete and reviewable
If any condition is missing, the incident stays active.
## 30-minute post-closure review
Run a short review immediately after closure:
- what signal was first but ignored?
- which decision checkpoint took longest?
- where did ownership become ambiguous?
- what one guardrail would have reduced resolution time most?
- what one metric should be added or redefined?
Turn answers into backlog items before context fades.
## Provider-specific readiness checks
Before running this playbook, enforce these prechecks:
- Repository state: identify last known-good revision and confirm rollback path is executable now.
- Owner graph: commander, mitigation owner, verifier owner, and communications owner are all named.
- Evidence channel: one canonical thread/document where all checkpoint records are written.
### Minimal precheck command set
```shell
git rev-parse --short HEAD
git log --oneline -n 5
# add your service health check here
```

Store command output in the first checkpoint record.
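The precheck outputs can be folded into the first checkpoint record mechanically. A minimal sketch, assuming you capture each command's output yourself (the function name `precheck_record` is illustrative):

```python
def precheck_record(results: dict[str, str]) -> str:
    """Format precheck command outputs into a checkpoint-record section.

    `results` maps the command string (e.g. "git rev-parse --short HEAD")
    to its captured stdout.
    """
    lines = ["### Precheck evidence"]
    for cmd, output in results.items():
        lines.append(f"$ {cmd}")
        lines.append(output.strip())
    return "\n".join(lines)
```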
## Decision quality guardrails
For each checkpoint decision, require three statements:
- reversibility statement — how quickly can we undo this decision?
- blast-radius statement — what can get worse if this is wrong?
- verification statement — what exact signal proves this worked?
If one statement is missing, decision quality is below advanced standard.
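The three-statement requirement is easy to gate in review tooling. A hypothetical sketch (key names are assumptions mirroring the bullets above):

```python
# The three statements every checkpoint decision must carry.
REQUIRED_STATEMENTS = ("reversibility", "blast_radius", "verification")


def missing_statements(decision: dict[str, str]) -> list[str]:
    """Return which guardrail statements are absent or empty.

    A non-empty result means the decision is below advanced standard.
    """
    return [s for s in REQUIRED_STATEMENTS if not decision.get(s)]
```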
## Handoff acceptance test
Before handing to next owner, verify:
- scope boundary is explicit (what is included/excluded)
- unresolved unknowns are listed (not hidden in chat history)
- next checkpoint time is committed
- failure trigger for immediate rollback is written
Use this quick test to reduce context loss across lanes.
## 60-minute command timeline blueprint
Use this timeline when severity is unclear but impact is real.
- T+00–10: confirm severity hypothesis, freeze risky merges, assign commander + deputy.
- T+10–20: establish mitigation branch options and rollback readiness.
- T+20–35: execute one branch decisively; avoid parallel contradictory mitigations.
- T+35–50: run verification and compare against pre-incident baseline.
- T+50–60: publish decision note and commit the next checkpoint schedule.
The point is not speed alone; it is synchronized decision quality under pressure.
## Dependency risk matrix
Classify affected dependencies before choosing mitigation.
| Dependency class | Failure symptom | Default incident posture |
|---|---|---|
| Auth/identity | access denial or bypass | rollback-first + audit capture |
| Billing/ledger | transaction mismatch | write freeze + reconciliation boundary |
| Messaging/queue | lag and duplicate processing | flow throttling + replay guard |
| Observability | missing/late signals | conservative rollback threshold |
This matrix prevents overconfident “single-fix” decisions.
## Contradiction handling protocol
If two owners provide conflicting mitigation recommendations:
- capture both recommendations in one decision record
- score reversibility and blast radius for each
- choose the more reversible path unless evidence disproves it
- schedule a rapid reassessment checkpoint (<=10 minutes)
Conflicts are normal. Unstructured conflicts are dangerous.
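The "choose the more reversible path" rule can be made explicit with simple scoring. A hypothetical sketch (the `Recommendation` shape and 1-5 scales are assumptions, not part of the protocol):

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    owner: str
    action: str
    reversibility: int  # 1 = hard to undo .. 5 = instant undo
    blast_radius: int   # 1 = wide damage if wrong .. 5 = well contained


def resolve_conflict(a: Recommendation, b: Recommendation) -> Recommendation:
    """Choose the more reversible path; break ties on smaller blast radius.

    Both recommendations should already be captured in one decision
    record before this comparison is made.
    """
    return max((a, b), key=lambda r: (r.reversibility, r.blast_radius))
```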
## Decision confidence ladder
Tag each checkpoint decision with confidence:
- L1 (low): incomplete evidence, reversible action only
- L2 (medium): partial evidence, bounded mitigation allowed
- L3 (high): consistent evidence, broader rollout or closure allowed
Never close incident response with L1 confidence.
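The ladder maps directly to an allow-list per confidence level, which makes the "never close with L1" rule enforceable. A minimal sketch derived from the three rungs above (action labels are illustrative):

```python
# Actions permitted at each confidence level, per the ladder.
ALLOWED_ACTIONS = {
    "L1": {"reversible action"},
    "L2": {"reversible action", "bounded mitigation"},
    "L3": {"reversible action", "bounded mitigation", "broader rollout", "closure"},
}


def allowed(confidence: str, action: str) -> bool:
    """Check a checkpoint action against the decision confidence ladder."""
    return action in ALLOWED_ACTIONS.get(confidence, set())
```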
## Recovery branch strategy matrix
Choose branch strategy deliberately:
| Branch type | When to use | Risk |
|---|---|---|
| Hotfix branch | isolated code path regression | hidden side-effect risk |
| Rollback branch | broad uncertainty or safety risk | reintroducing known debt |
| Containment branch | partial mitigation while investigating | prolonged temporary state |
Do not keep all three branch types active without a single branch owner.
## Incident closure review package
Before final closure, produce one package containing:
- timeline (key timestamps + decisions)
- mitigation diff summary
- verification command bundle
- residual risk register
- hardening backlog links
This package should let a new owner understand the incident in 10 minutes.
## Leadership handoff note template
### Incident Leadership Note
- What happened:
- Why this decision path was chosen:
- Current confidence level (L1/L2/L3):
- What remains uncertain:
- What we need from leadership:
- Owner for next 24h watch:

Short, explicit leadership notes reduce re-litigation of decisions.