## Why drills are mandatory at advanced maturity
A runbook that is never exercised is only documentation.
Resilience drills convert written policy into verified team behavior.
## Drill tier model
| Tier | Scope | Frequency | Success signal |
|---|---|---|---|
| Tabletop | decision flow + ownership | weekly | all owners make correct decision handoffs |
| Partial simulation | one lane or one subsystem | biweekly | target SLO restored within agreed window |
| Full simulation | cross-lane end-to-end recovery | monthly | recovery + communication + follow-up all completed |
## Pre-drill design packet
Prepare before every drill:
- scenario statement
- expected failure signature
- blast-radius boundary
- stop condition
- observers and scoring owner
No design packet means results cannot be compared over time.
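The packet fields above can be captured in a small template so drills stay comparable over time. This is a minimal sketch; the class name, field names, and example values are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, fields

@dataclass
class DrillDesignPacket:
    """One packet per drill; identical fields make results comparable across cycles."""
    scenario_statement: str
    expected_failure_signature: str
    blast_radius_boundary: str
    stop_condition: str
    observers: list
    scoring_owner: str

    def is_complete(self) -> bool:
        # Every field must be filled in before the drill starts.
        return all(getattr(self, f.name) for f in fields(self))

packet = DrillDesignPacket(
    scenario_statement="Primary API latency exceeds SLO for 10 minutes",
    expected_failure_signature="p99 latency alert plus elevated 5xx rate",
    blast_radius_boundary="staging environment only",
    stop_condition="error rate above 5% outside the boundary",
    observers=["observer-a", "observer-b"],
    scoring_owner="observer-a",
)
print(packet.is_complete())  # True when no field is empty
```

Rejecting incomplete packets before execution is what makes cross-drill comparison possible.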
## Execution lane split
- Failure injection lane: trigger scenario safely
- Response lane: execute containment and mitigation
- Verification lane: run checks and evidence capture
- Comms lane: timeline updates and escalation notices
Assign one owner per lane and one global drill commander.
## Resilience scorecard
Score each drill (0/1 per row):
- detection happened in target window
- ownership handoff stayed explicit
- mitigation decision was reversible
- verification evidence was fresh
- post-drill follow-up owner assigned
Score below 4/5 means retry with narrowed scope.
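The five binary dimensions and the 4/5 retry threshold can be sketched as a small scoring function; dimension key names here are assumptions chosen to mirror the list above:

```python
SCORECARD_DIMENSIONS = [
    "detection_in_target_window",
    "ownership_handoff_explicit",
    "mitigation_decision_reversible",
    "verification_evidence_fresh",
    "followup_owner_assigned",
]

def score_drill(results: dict) -> tuple:
    """Sum the 0/1 dimensions; a total below 4/5 means retry with narrowed scope."""
    missing = [d for d in SCORECARD_DIMENSIONS if d not in results]
    if missing:
        # An unscored dimension is worse than a zero: it hides the gap.
        raise ValueError(f"unscored dimensions: {missing}")
    total = sum(1 for d in SCORECARD_DIMENSIONS if results[d])
    verdict = "pass" if total >= 4 else "retry with narrowed scope"
    return total, verdict

total, verdict = score_drill({
    "detection_in_target_window": True,
    "ownership_handoff_explicit": True,
    "mitigation_decision_reversible": False,
    "verification_evidence_fresh": True,
    "followup_owner_assigned": True,
})
print(total, verdict)  # 4 pass
```

Raising on unscored dimensions keeps the scorecard honest: a skipped row cannot silently count as a pass.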
## Decision checkpoint protocol
At each checkpoint, force a clear choice:
- continue mitigation
- rollback to stable state
- escalate and pause rollout
Unstated decisions are hidden risks.
## Failure-handling rule
If drill execution diverges from scenario assumptions:
- stop simulation
- record divergence
- convert divergence into next drill input
Never smooth over unexpected behavior.
## Quarterly resilience program
Every quarter:
- run at least one full-scope drill
- rotate drill commander role
- review recurring low-score dimensions
- retire controls that never move reliability
Resilience improves through iteration quality, not drill count.
## Advanced anti-patterns

### Drill becomes performance theater
If people optimize for optics, the reliability signal collapses.

### Same scenario repeated with no mutation
Teams memorize one path instead of building adaptive capability.

### Post-drill actions tracked without owners
Unowned actions become reliability debt.
## Quick checklist
Before closing a drill cycle:
- scorecard captured
- decision checkpoints logged
- follow-up owners assigned
- next scenario drafted
Claude can help teams respond faster. Drills prove teams can respond correctly.
## Drill scenario catalog (starter set)
Rotate scenarios to avoid memorized responses.
### Reliability scenario set
- Dependency timeout storm — primary API latency spikes beyond SLO.
- Config drift release — one environment receives stale flag values.
- Queue backlog saturation — processing lag creates cascading failures.
- Observability blackout — one critical dashboard panel fails during incident.
- Owner unavailable — primary on-call unavailable at first checkpoint.
Each cycle: pick one technical failure + one coordination failure.
## Observer scoring pack
Observers should score behavior, not personality.
| Dimension | What to observe |
|---|---|
| Detection quality | Was the first signal recognized and triaged correctly? |
| Decision quality | Was a reversible decision made quickly? |
| Ownership clarity | Did every checkpoint name a next owner? |
| Evidence quality | Were commands/logs captured at each checkpoint? |
| Communication cadence | Were updates sent on promised cadence? |
Add evidence links for every score.
## Drill timeline template (45 minutes)
- 00:00–05:00 scenario brief + success criteria
- 05:00–15:00 first signal + triage decision
- 15:00–30:00 mitigation path execution
- 30:00–40:00 verification and stability checks
- 40:00–45:00 debrief capture + follow-up assignment
If the timeline overruns, log the reason as process debt.
## Communication scripts for pressure moments

### First 5-minute update
- Incident drill started at `<time>`
- Observed signal: `<summary>`
- Current branch: triage / mitigate / rollback
- Next checkpoint: `<time>`
- Commander: `<name>`

### Escalation checkpoint update
- Escalation reason: `<threshold breach>`
- Decision: continue | rollback | pause
- Immediate owner: `<name>`
- Verification owner: `<name>`
- Next update at: `<time>`

## Debrief decision matrix
After the drill, classify every finding:
- Fix now (high-risk + low effort)
- Schedule next cycle (high-risk + medium effort)
- Observe (unclear impact; collect more evidence)
- Drop (no measurable reliability value)
Never leave findings uncategorized.
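The four buckets above can be encoded as a classification rule so no finding stays uncategorized. A sketch under assumptions: the bucket names, the risk/effort labels, and the default branch for combinations the matrix does not name (e.g. high risk plus high effort) are illustrative choices, not part of the matrix itself:

```python
def classify_finding(risk: str, effort: str, has_reliability_value: bool) -> str:
    """Map a debrief finding into one of the four buckets.

    risk: "high" | "unclear" (plus anything else, treated as needing evidence)
    effort: "low" | "medium" | "high"
    """
    if not has_reliability_value:
        return "drop"                    # no measurable reliability value
    if risk == "unclear":
        return "observe"                 # unclear impact; collect more evidence
    if risk == "high" and effort == "low":
        return "fix_now"
    if risk == "high" and effort == "medium":
        return "schedule_next_cycle"
    # Unnamed combinations default to evidence collection rather than silence.
    return "observe"

print(classify_finding("high", "low", True))  # fix_now
```

Forcing every finding through the function is the point: the default branch guarantees an explicit category even for cases the matrix leaves open.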
## Mutation planning for next cycle
Design next drill by mutating one factor intentionally:
- failure starts 10 minutes earlier/later
- key dependency fails differently
- comms channel is delayed
- backup owner must lead
Write mutation rationale so score changes are interpretable.
## Drill completion gate
A drill cycle is complete only when:
- scorecard + evidence links are archived
- at least one follow-up action is owner-assigned
- next mutation scenario is drafted
- commander signs off on decision quality notes
Without this gate, drills become one-off events.
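The four gate conditions can be checked mechanically before a cycle is allowed to close. A minimal sketch; the record field names (`evidence_links`, `followups`, `next_mutation`, `commander_signoff`) are assumed, not a required schema:

```python
def unmet_gate_conditions(cycle: dict) -> list:
    """Return the list of unmet completion-gate conditions; empty means the cycle may close."""
    gates = {
        "scorecard and evidence links archived": bool(cycle.get("evidence_links")),
        "at least one follow-up action owner-assigned": any(
            action.get("owner") for action in cycle.get("followups", [])
        ),
        "next mutation scenario drafted": bool(cycle.get("next_mutation")),
        "commander sign-off on decision quality notes": bool(cycle.get("commander_signoff")),
    }
    return [name for name, met in gates.items() if not met]

complete_cycle = {
    "evidence_links": ["https://example.invalid/drill-7/evidence"],  # placeholder link
    "followups": [{"owner": "alice", "action": "tighten alert threshold"}],
    "next_mutation": "backup owner must lead",
    "commander_signoff": True,
}
print(unmet_gate_conditions(complete_cycle))  # []
```

Returning the list of unmet conditions, rather than a bare boolean, tells the team exactly what still blocks closure.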
## Drill scoring normalization
To compare drills across weeks, normalize scores:
- weight detection and decision quality higher for SEV-1 style scenarios
- weight comms cadence higher for multi-stakeholder scenarios
- always publish both raw score and weighted score
### Example weighting
- detection quality: 30%
- decision quality: 25%
- ownership clarity: 20%
- evidence quality: 15%
- communication cadence: 10%
Adjust weights by scenario class, but document changes.
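The example weighting can be applied as follows, publishing both raw and weighted scores as the section requires. Assumptions in this sketch: each dimension is scored 0–5, and the dictionary key names mirror the observer scoring pack:

```python
DEFAULT_WEIGHTS = {
    "detection_quality": 0.30,
    "decision_quality": 0.25,
    "ownership_clarity": 0.20,
    "evidence_quality": 0.15,
    "communication_cadence": 0.10,
}

def normalized_scores(raw: dict, weights: dict = DEFAULT_WEIGHTS) -> dict:
    """Return both the raw total and the weighted score for one drill."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        # Documented weight changes per scenario class must still sum to 1.0.
        raise ValueError("weights must sum to 1.0")
    weighted = sum(raw[dim] * w for dim, w in weights.items())
    return {"raw_total": sum(raw.values()), "weighted": round(weighted, 2)}

scores = normalized_scores({
    "detection_quality": 5,
    "decision_quality": 4,
    "ownership_clarity": 3,
    "evidence_quality": 2,
    "communication_cadence": 1,
})
print(scores)  # {'raw_total': 15, 'weighted': 3.5}
```

Publishing both numbers keeps re-weighting honest: a weight change moves the weighted score but never the raw one.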
## Commander playbook for stalled drills
If the drill stalls for more than 5 minutes without a decision:
- freeze additional discussion
- force explicit branch selection (continue/rollback/escalate)
- assign execution owner immediately
- schedule next checkpoint in 5 minutes
This prevents analysis paralysis during rehearsal.
## Debrief conversion rule
Every debrief output must become one of:
- merged control
- scheduled control with owner/date
- documented rejection with reason
No orphan findings.
## Quarterly drill campaign structure
Run a campaign, not isolated events.
- Month 1: response-speed emphasis (detection + decision latency)
- Month 2: coordination emphasis (handoff and communication integrity)
- Month 3: recovery-quality emphasis (verification depth + follow-up closure)
Campaign design makes score trends interpretable.
## Stress modifiers for realism
Add one stress modifier to each drill:
- delayed signal visibility
- partial owner availability
- conflicting stakeholder requests
- degraded observability channel
Stress modifiers reveal brittle processes hidden by “clean” simulations.
## Drill evidence minimum
Each drill output must include:
- timeline with decision timestamps
- command-level verification snippets
- owner handoff chain
- comms updates sent vs promised
- follow-up action mapping
Without this, scorecards are storytelling, not evidence.
## Calibration review after every 3 drills
After every third drill, run calibration:
- compare weighted score trends
- identify over-weighted dimensions
- adjust scoring weights with rationale
- publish changed rubric before next cycle
Transparent calibration prevents metric gaming.
## Multi-team drill federation model
For larger orgs, run drills with federation:
- platform team owns shared infrastructure scenarios
- product team owns customer-path scenarios
- security team injects trust-boundary failures
Federation exposes cross-team coupling early.
## Drill quality KPIs
Measure program quality, not just single drill scores:
- % drills with complete evidence bundle
- % follow-up actions closed by due date
- median time to first explicit checkpoint decision
- recurrence rate of identical failure mode findings
If KPI trend worsens, simplify scenario scope and restore rigor.
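The four program KPIs above can be computed from per-drill records. A sketch under assumptions: the record field names are invented for illustration, and each drill record carries its own follow-up and finding counts:

```python
from statistics import median

def program_kpis(drills: list) -> dict:
    """Compute program-level KPIs across a set of drill records."""
    total_followups = sum(d["followups_total"] for d in drills)
    total_findings = sum(d["total_findings"] for d in drills)
    return {
        # % drills with complete evidence bundle
        "pct_complete_evidence": sum(d["evidence_complete"] for d in drills) / len(drills),
        # % follow-up actions closed by due date
        "pct_followups_on_time": sum(d["followups_closed_on_time"] for d in drills) / total_followups,
        # median time to first explicit checkpoint decision
        "median_minutes_to_first_decision": median(d["minutes_to_first_decision"] for d in drills),
        # recurrence rate of identical failure mode findings
        "repeat_finding_rate": sum(d["repeat_findings"] for d in drills) / max(1, total_findings),
    }

kpis = program_kpis([
    {"evidence_complete": True, "followups_closed_on_time": 2, "followups_total": 2,
     "minutes_to_first_decision": 6, "repeat_findings": 0, "total_findings": 3},
    {"evidence_complete": False, "followups_closed_on_time": 1, "followups_total": 2,
     "minutes_to_first_decision": 10, "repeat_findings": 1, "total_findings": 2},
])
print(kpis["pct_followups_on_time"])  # 0.75
```

Tracking these as trends, not snapshots, is what triggers the "simplify scope and restore rigor" rule.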
## Observer bias controls
Reduce scoring bias:
- rotate observers each cycle
- require evidence links for low/high scores
- blind one observer to team names when feasible
Better scoring quality improves downstream hardening decisions.