Why resilience drills are non-optional at scale
Incident plans prove intent. Drills prove capability.
If teams never rehearse under pressure, recovery quality is a guess.
Drill maturity tiers
| Tier | Scope | Cadence | Pass criteria |
|---|---|---|---|
| Decision tabletop | ownership and branching logic | weekly | no ambiguous decision ownership |
| Service simulation | one system or lane | biweekly | target recovery window met with evidence |
| Full-fidelity simulation | multi-lane coordinated recovery | monthly | mitigation + verification + comms + follow-up complete |
Scenario design packet
Every drill starts with:
- scenario hypothesis
- trigger mechanism
- blast-radius boundary
- abort criteria
- commander and score owner
Missing packet fields produce noisy results.
Lane orchestration
- Injection lane: trigger controlled failure
- Response lane: execute mitigation decision
- Verification lane: validate restored behavior
- Comms lane: run timeline and escalation updates
One drill commander keeps checkpoint timing strict.
Resilience scorecard
Binary score per row:
- detection latency within target
- ownership remained explicit
- mitigation stayed reversible
- verification evidence was fresh
- follow-up owners assigned with deadlines
Minimum passing score: 4/5.
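The binary scorecard above can be sketched as a small checklist function. The dimension keys and helper name below are assumptions for illustration, not a prescribed schema:

```python
# Hypothetical scorecard check mirroring the five binary rows above.
DIMENSIONS = [
    "detection_latency_within_target",
    "ownership_remained_explicit",
    "mitigation_stayed_reversible",
    "verification_evidence_fresh",
    "followup_owners_with_deadlines",
]

def drill_passes(scorecard: dict[str, bool], minimum: int = 4) -> bool:
    """Each dimension scores 0 or 1; pass when at least `minimum` of 5 hold."""
    return sum(1 for d in DIMENSIONS if scorecard.get(d, False)) >= minimum
```

Keeping each row binary avoids partial-credit debates during the debrief.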
Checkpoint decision discipline
At each checkpoint, require one explicit decision:
- continue mitigation
- rollback to stable revision
- escalate and pause rollout
Implicit decisions create hidden failure paths.
Scenario mutation policy
Do not repeat identical simulations.
Mutate at least one variable per cycle:
- fault timing
- dependency class
- owner availability
- communication constraints
Mutation builds adaptive resilience.
Quarterly drill program
- run one full-fidelity simulation minimum
- rotate commander and observers
- review repeated low-scoring dimensions
- retire controls that do not move score trends
High-quality drills beat high-volume drills.
Advanced anti-patterns
Score inflation without evidence
Optimized metrics without proof give false confidence.
Commander overloaded with lane ownership
Decision quality drops when one person holds all signals.
Follow-ups logged without due dates
Undated work is latent incident risk.
Quick checklist
Before closing a drill cycle:
- scorecard archived
- checkpoint decisions recorded
- scenario mutation documented
- follow-up owners and deadlines assigned
Codex accelerates response execution. Drills verify response reliability.
Drill scenario catalog (starter set)
Rotate scenarios to avoid memorized responses.
Reliability scenario set
- Dependency timeout storm — primary API latency spikes beyond SLO.
- Config drift release — one environment receives stale flag values.
- Queue backlog saturation — processing lag creates cascading failures.
- Observability blackout — one critical dashboard panel fails during an incident.
- Owner unavailable — primary on-call is unreachable at the first checkpoint.
Each cycle: pick one technical failure + one coordination failure.
Observer scoring pack
Observers should score behavior, not personality.
| Dimension | What to observe |
|---|---|
| Detection quality | Was the first signal recognized and triaged correctly? |
| Decision quality | Was a reversible decision made quickly? |
| Ownership clarity | Did every checkpoint name a next owner? |
| Evidence quality | Were commands/logs captured at each checkpoint? |
| Communication cadence | Were updates sent on promised cadence? |
Add evidence links for every score.
Drill timeline template (45 minutes)
- 00:00–05:00 scenario brief + success criteria
- 05:00–15:00 first signal + triage decision
- 15:00–30:00 mitigation path execution
- 30:00–40:00 verification and stability checks
- 40:00–45:00 debrief capture + follow-up assignment
If the timeline overruns, log the reason as process debt.
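The 45-minute template above can be turned into wall-clock checkpoint windows for the commander's brief. The agenda structure and helper below are a sketch of one way to do this:

```python
from datetime import datetime, timedelta

# Hypothetical agenda mirroring the 45-minute template above (offsets in minutes).
AGENDA = [
    (0, 5, "scenario brief + success criteria"),
    (5, 15, "first signal + triage decision"),
    (15, 30, "mitigation path execution"),
    (30, 40, "verification and stability checks"),
    (40, 45, "debrief capture + follow-up assignment"),
]

def schedule(start: datetime) -> list[tuple[str, str, str]]:
    """Convert relative offsets into (start, end, phase) wall-clock windows."""
    return [
        ((start + timedelta(minutes=a)).strftime("%H:%M"),
         (start + timedelta(minutes=b)).strftime("%H:%M"),
         phase)
        for a, b, phase in AGENDA
    ]
```

Publishing the concrete times up front makes checkpoint overruns visible rather than negotiable.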
Communication scripts for pressure moments
First 5-minute update
Incident drill started at <time>
Observed signal: <summary>
Current branch: triage/mitigate/rollback
Next checkpoint: <time>
Commander: <name>
Escalation checkpoint update
Escalation reason: <threshold breach>
Decision: continue | rollback | pause
Immediate owner: <name>
Verification owner: <name>
Next update at: <time>
Debrief decision matrix
After drill, classify every finding:
- Fix now (high-risk + low effort)
- Schedule next cycle (high-risk + medium effort)
- Observe (unclear impact; collect more evidence)
- Drop (no measurable reliability value)
Never leave findings uncategorized.
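The risk/effort matrix above can be expressed as a classifier that never returns an uncategorized finding. The category labels and input values below are assumptions; the fallback to "observe" implements the rule that no finding is left unclassified:

```python
def classify_finding(risk: str, effort: str) -> str:
    """Map a finding onto the four debrief categories above.

    risk: "high" | "low" | "unclear"; effort: "low" | "medium" | "high".
    """
    if risk == "unclear":
        return "observe"              # unclear impact: collect more evidence
    if risk == "high" and effort == "low":
        return "fix_now"
    if risk == "high" and effort == "medium":
        return "schedule_next_cycle"
    if risk == "low":
        return "drop"                 # no measurable reliability value
    return "observe"                  # fallback: never leave a finding uncategorized
```

High-risk, high-effort findings fall through to "observe" here, on the assumption that they need more evidence before committing a large investment.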
Mutation planning for next cycle
Design next drill by mutating one factor intentionally:
- failure starts 10 minutes earlier/later
- key dependency fails differently
- comms channel is delayed
- backup owner must lead
Write mutation rationale so score changes are interpretable.
Drill completion gate
A drill cycle is complete only when:
- scorecard + evidence links are archived
- at least one follow-up action is owner-assigned
- next mutation scenario is drafted
- commander signs off on decision quality notes
Without this gate, drills become one-off events.
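The completion gate can be checked mechanically before a cycle is closed. The field names below are assumptions about how a team might record the gate items:

```python
# Hypothetical gate record; field names mirror the four conditions above.
REQUIRED = (
    "scorecard_and_evidence_archived",
    "followup_action_owner_assigned",
    "next_mutation_scenario_drafted",
    "commander_signoff_on_decisions",
)

def cycle_complete(record: dict[str, bool]) -> bool:
    """A drill cycle closes only when every gate item is true."""
    return all(record.get(k, False) for k in REQUIRED)
```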
Drill scoring normalization
To compare drills across weeks, normalize scores:
- weight detection and decision quality higher for SEV-1 style scenarios
- weight comms cadence higher for multi-stakeholder scenarios
- always publish both raw score and weighted score
Example weighting
- detection quality: 30%
- decision quality: 25%
- ownership clarity: 20%
- evidence quality: 15%
- communication cadence: 10%
Adjust weights by scenario class, but document changes.
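The publish-both-scores rule can be sketched directly from the example weights above. The dimension keys are assumptions matching the observer scoring pack:

```python
# Example weights from the list above; dimension keys are assumptions.
WEIGHTS = {
    "detection_quality": 0.30,
    "decision_quality": 0.25,
    "ownership_clarity": 0.20,
    "evidence_quality": 0.15,
    "communication_cadence": 0.10,
}

def score_drill(binary_scores: dict[str, bool],
                weights: dict[str, float] = WEIGHTS) -> tuple[float, float]:
    """Return (raw, weighted); both are published, per the policy above."""
    raw = sum(binary_scores.get(d, False) for d in weights) / len(weights)
    weighted = sum(w for d, w in weights.items() if binary_scores.get(d, False))
    return raw, weighted
```

A drill that passes only detection and decision quality scores 0.40 raw but 0.55 weighted, which is exactly the gap the raw/weighted comparison is meant to surface.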
Commander playbook for stalled drills
If the drill stalls for more than 5 minutes without a decision:
- freeze additional discussion
- force explicit branch selection (continue/rollback/escalate)
- assign execution owner immediately
- schedule next checkpoint in 5 minutes
This prevents analysis paralysis during rehearsal.
Debrief conversion rule
Every debrief output must become one of:
- merged control
- scheduled control with owner/date
- documented rejection with reason
No orphan findings.
Quarterly drill campaign structure
Run a campaign, not isolated events.
- Month 1: response-speed emphasis (detection + decision latency)
- Month 2: coordination emphasis (handoff and communication integrity)
- Month 3: recovery-quality emphasis (verification depth + follow-up closure)
Campaign design makes score trends interpretable.
Stress modifiers for realism
Add one stress modifier to each drill:
- delayed signal visibility
- partial owner availability
- conflicting stakeholder requests
- degraded observability channel
Stress modifiers reveal brittle processes hidden by “clean” simulations.
Drill evidence minimum
Each drill output must include:
- timeline with decision timestamps
- command-level verification snippets
- owner handoff chain
- comms updates sent vs promised
- follow-up action mapping
Without this, scorecards are storytelling, not evidence.
Calibration review after every 3 drills
After every third drill, run calibration:
- compare weighted score trends
- identify over-weighted dimensions
- adjust scoring weights with rationale
- publish changed rubric before next cycle
Transparent calibration prevents metric gaming.
Multi-team drill federation model
For larger orgs, run drills with federation:
- platform team owns shared infrastructure scenarios
- product team owns customer-path scenarios
- security team injects trust-boundary failures
Federation exposes cross-team coupling early.
Drill quality KPIs
Measure program quality, not just single drill scores:
- % drills with complete evidence bundle
- % follow-up actions closed by due date
- median time to first explicit checkpoint decision
- recurrence rate of identical failure mode findings
If a KPI trend worsens, narrow scenario scope and restore rigor.
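The four program KPIs above can be computed from per-drill records. The record fields below are assumptions about what a team might archive per drill:

```python
from statistics import median

def program_kpis(drills: list[dict]) -> dict:
    """Sketch of the four program-level KPIs above.

    Assumed per-drill record fields:
      evidence_complete: bool, followups_total: int,
      followups_closed_on_time: int, minutes_to_first_decision: float,
      failure_modes: set[str]
    """
    total_fu = sum(d["followups_total"] for d in drills)
    closed_fu = sum(d["followups_closed_on_time"] for d in drills)
    seen, repeats, total_modes = set(), 0, 0
    for d in drills:
        for mode in d["failure_modes"]:
            total_modes += 1
            if mode in seen:
                repeats += 1          # identical failure mode found again
            seen.add(mode)
    return {
        "pct_complete_evidence": sum(d["evidence_complete"] for d in drills) / len(drills),
        "pct_followups_on_time": closed_fu / total_fu if total_fu else 1.0,
        "median_minutes_to_first_decision": median(d["minutes_to_first_decision"] for d in drills),
        "recurrence_rate": repeats / total_modes if total_modes else 0.0,
    }
```

Tracking these at the program level catches drift that individual drill scorecards hide.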
Observer bias controls
Reduce scoring bias:
- rotate observers each cycle
- require evidence links for low/high scores
- blind one observer to team names when feasible
Better scoring quality improves downstream hardening decisions.