Official References: Best Practices · Review · Worktrees · Automations
Why hardening is a separate operating loop
Incident response gets you back to green. Hardening determines whether you stay green.
Treat hardening as a dedicated loop, not an afterthought.
Hardening timeline model
| Window | Objective | Mandatory artifact |
|---|---|---|
| 0–24h | preserve decision fidelity | incident timeline + evidence packet |
| 24–72h | block dominant failure class | one automation control + one regression test |
| 3–14d | institutionalize learning | runbook/process update + ownership confirmation |
Control portfolio buckets
Track hardening actions across:
- Detection controls: alert routing, signal quality, dashboard clarity
- Prevention controls: tests, policy checks, static safeguards
- Recovery controls: rollback rehearsals, handoff contracts, escalation maps
Each action needs owner, deadline, and verification evidence.
Lane-based execution model
- Controls lane: implement enforcement and automation
- Quality lane: lock regression coverage for root-cause class
- Operations lane: update runbooks and communication protocol
Assign one merge owner whenever lanes must touch shared files.
Reliability scorecard
Use binary scoring per control:
- root-cause class now test-covered
- rollback path rehearsed and timed
- escalation ownership documented
- monitoring/alerts updated
- runbook update acknowledged by owners
Minimum closure threshold: 4/5.
Hardening closure gate
Hardening is complete only when:
- scorecard threshold is met
- controls are active in routine delivery workflow
- owner adoption is confirmed
- no unresolved high-risk follow-up remains
Completion evidence bundle
Archive one bundle containing:
- incident id + short summary
- controls shipped
- test evidence links
- runbook/process diffs
- residual action owners + due dates
Advanced anti-patterns
Recovery declared “done” with no hardening backlog
This guarantees recurring incidents.
Controls added but not exercised
Unrehearsed controls fail under real pressure.
Documentation updated without ownership contract
The document exists, but execution accountability is still missing.
Quick checklist
Before closing hardening loop:
- one automation control merged
- one regression test merged
- runbook/process update published
- ownership confirmations recorded
Codex helps teams recover quickly. Hardening is how teams recover better over time.
Hardening portfolio prioritization matrix
Prioritize hardening actions with two axes:
- Risk reduction magnitude (how much future incident probability drops)
- Time to reliable adoption (how quickly teams can run this in normal flow)
| Priority band | Risk reduction | Adoption speed | Typical action |
|---|---|---|---|
| P0 | high | fast | add blocking policy check, add rollback validation gate |
| P1 | high | medium | add regression suite for root-cause class |
| P2 | medium | fast | improve runbook clarity, handoff templates |
| P3 | medium/low | slow | broader architecture cleanups |
Start with P0/P1 before polishing documentation language.
Hardening experiment design
Treat each hardening action as a small experiment:
- Hypothesis: "If we add X control, Y failure class drops by Z%."
- Measure: define metric and observation window.
- Guardrail: define rollback condition for the control itself.
- Owner: one execution owner and one verifier owner.
- Review date: fixed date, not "later".
Example
- hypothesis: mandatory pre-release rollback rehearsal reduces SEV-1 MTTR variance
- measure: p95 MTTR over next 4 incidents/drills
- guardrail: if release lead time regresses >20% for 2 cycles, recalibrate gate
Ownership operating rhythm
Weekly reliability standup (20 minutes)
- review overdue hardening items
- review items with weak evidence links
- reassign blocked items immediately
- close items only with proof artifact links
Monthly control audit
- confirm controls are still running (not silently bypassed)
- confirm owners are still valid (team changes happen)
- retire controls that no longer reduce observed risk
Hardening quality rubric (0–2 per row)
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Clarity | vague task | partial scope | precise change + boundary |
| Evidence | none | screenshot only | command/log/proof links |
| Ownership | missing | single owner only | owner + verifier + due date |
| Adoption | unverified | anecdotal | observed in real workflow |
| Risk impact | unknown | assumed | measured trend impact |
Minimum passing total: 7/10. Otherwise keep loop open.
Hardening backlog template
### Hardening Item
- Incident reference:
- Failure class:
- Proposed control:
- Priority band: P0/P1/P2/P3
- Execution owner:
- Verifier owner:
- Due date:
- Evidence link(s):
- Adoption check date:
- Status: open | in_progress | verified | retiredCommon failure modes in hardening programs
- backlog closes with no measurable effect on metrics
- controls are added but not socialized to operators
- runbook updates happen without updating on-call checklists
- ownership stays with a departed engineer
Solve these with explicit monthly audit, not ad-hoc memory.
From incident finding to control design
Convert findings with this mapping:
| Finding type | Control candidate | Verification style |
|---|---|---|
| missed detection | alert rule + routing update | synthetic alert replay |
| late decision | checkpoint SLA + commander script | drill timing audit |
| slow recovery | rollback rehearsal gate | timed rollback run |
| repeated regression | test harness + policy check | regression suite trend |
This mapping keeps hardening concrete.
14-day hardening sprint template
- Day 1–2: rank findings by risk and reversibility
- Day 3–5: implement top P0/P1 controls
- Day 6–8: add verification coverage and evidence links
- Day 9–11: run adoption checks in real workflow
- Day 12–14: publish summary + residual risk owners
If schedule slips, reduce scope but keep verification intact.
Residual risk register format
### Residual Risk
- Risk statement:
- Why not fully solved yet:
- Temporary control in place:
- Trigger for re-prioritization:
- Owner:
- Target resolution date:Residual risk without trigger/date is unmanaged risk.
Control economics framework
Every hardening control has cost. Make tradeoffs explicit.
| Control type | Reliability gain horizon | Ongoing cost | Recommended use |
|---|---|---|---|
| Blocking policy gate | immediate | medium | high-severity repeat failures |
| Regression suite expansion | short/medium | medium/high | root-cause class recurrence |
| Runbook + template upgrade | short | low | ownership ambiguity problems |
| Architecture simplification | medium/long | high | chronic complex failure chains |
Choose controls that improve signal-to-effort ratio first.
Adoption hardening checklist
A control is not complete at merge time. Confirm adoption explicitly.
- owners can execute the new control without side-channel knowledge
- failure path for the control itself is documented
- on-call checklist includes the new control point
- dashboard or log evidence proves the control actually runs
Merged-but-unused controls create false confidence.
Hardening QA gate
Before marking a hardening item verified, require:
- implementation evidence (diff + command output)
- behavioral evidence (drill or incident replay signal)
- ownership evidence (named operator + verifier signoff)
Missing any one keeps status at in_progress.
Control sunset protocol
Controls should not live forever by default.
Retire a control when:
- it no longer improves target metrics
- it duplicates a stronger control
- operational cost exceeds measured risk reduction
Retirement requires a short note with replacement or rationale.
Control rollout sequencing
Roll out new controls in three phases:
- Shadow mode — observe behavior without blocking flow
- Warn mode — surface violations with owner attribution
- Enforce mode — block on defined high-risk conditions
This sequencing avoids abrupt disruption while increasing safety.
Hardening adoption score
Track adoption with a simple score (0–5):
- control enabled in target environment
- owners trained and acknowledged
- runbook updated with execution steps
- dashboard signal confirms runtime execution
- one real incident/drill used the control
Score below 4 means adoption is incomplete.
Reliability debt ledger
Keep a dedicated ledger for deferred hardening work:
| Debt item | Why deferred | Risk if delayed | Revisit date |
|---|
Deferred work without revisit dates tends to become permanent risk.