Official References: Best Practices · Hooks · Security · GitHub Actions
Why this loop matters
Incident response restores service. Hardening prevents repeat incidents.
Without a hardening loop, every incident stays expensive.
Time-boxed hardening timeline
| Window | Objective | Required output |
|---|---|---|
| 0–24h | preserve signal | timeline + evidence packet |
| 24–72h | close the top failure mode | one guardrail automation + one regression test |
| 3–14d | reduce class-level risk | runbook/process update + owner-confirmed checklist |
Hardening backlog model
Classify actions into three buckets:
- Detect faster: alert thresholds, dashboard hints, pager routing
- Prevent recurrence: tests, static checks, policy hooks
- Recover faster: rollback script reliability, handoff templates, owner maps
Each item must have one owner and one due date.
Three-lane hardening execution
- Controls lane: adds hooks, checks, and CI guardrails
- Quality lane: adds regression coverage for root-cause class
- Operations lane: updates runbook + communication protocol
Keep lanes independent when possible to reduce merge friction.
Readiness scorecard
Use a compact scorecard (0 or 1 per row):
- root-cause class covered by test
- rollback path rehearsed
- escalation owner documented
- dashboard/alerts updated
- runbook updated and acknowledged
A score below 4 (out of 5) means hardening is incomplete.
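One way to keep the scorecard honest is to compute it rather than eyeball it. The sketch below is a minimal illustration, assuming hypothetical row keys that mirror the five rows above; missing rows count as 0.

```python
# Hypothetical keys; they mirror the five scorecard rows listed above.
SCORECARD_ROWS = (
    "root_cause_class_covered_by_test",
    "rollback_path_rehearsed",
    "escalation_owner_documented",
    "dashboard_alerts_updated",
    "runbook_updated_and_acknowledged",
)
PASSING_SCORE = 4  # scores below 4 mean hardening is incomplete


def readiness_score(answers: dict[str, bool]) -> int:
    """Count rows marked true; a missing row scores 0."""
    return sum(1 for row in SCORECARD_ROWS if answers.get(row, False))


def scorecard_passes(answers: dict[str, bool]) -> bool:
    return readiness_score(answers) >= PASSING_SCORE


answers = {row: True for row in SCORECARD_ROWS}
answers["rollback_path_rehearsed"] = False
print(readiness_score(answers), scorecard_passes(answers))  # 4 True
```

A passing scorecard is only one input to the closure decision; it does not close the loop by itself.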
Closure decision rule
Close hardening only when all are true:
- readiness scorecard reaches its target (at least 4 of 5)
- new guardrails are running in normal workflow
- follow-up owners confirm adoption
- no unresolved high-risk action remains
Evidence packet template
At hardening completion, archive:
- incident id + summary
- implemented controls
- test evidence links
- runbook diff
- owners + due dates for residual work
Advanced anti-patterns
Retro written, controls not shipped
Learning without controls does not change risk.
Test added without ownership update
The system improves, but the team contract stays weak.
Marking done before adoption check
Guardrails not used in real workflow do not protect production.
Quick checklist
Before closing the hardening loop:
- one new automation guardrail merged
- one regression test merged
- runbook update published
- owner acknowledgement recorded
Claude accelerates recovery work. Hardening is what compounds reliability.
Hardening portfolio prioritization matrix
Prioritize hardening actions with two axes:
- Risk reduction magnitude (how much future incident probability drops)
- Time to reliable adoption (how quickly teams can run this in normal flow)
| Priority band | Risk reduction | Adoption speed | Typical action |
|---|---|---|---|
| P0 | high | fast | add blocking policy check, add rollback validation gate |
| P1 | high | medium | add regression suite for root-cause class |
| P2 | medium | fast | improve runbook clarity, handoff templates |
| P3 | medium/low | slow | broader architecture cleanups |
Start with P0/P1 before polishing documentation language.
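The matrix can be applied mechanically when ranking a backlog. This is a sketch under one stated assumption: any combination the matrix does not list (for example high risk reduction with slow adoption) falls back to P3. The action names are illustrative.

```python
# Bands copied from the matrix; keys are (risk_reduction, adoption_speed).
BANDS = {
    ("high", "fast"): "P0",
    ("high", "medium"): "P1",
    ("medium", "fast"): "P2",
}


def priority_band(risk_reduction: str, adoption_speed: str) -> str:
    # Assumption: combinations the matrix does not list are treated as P3.
    return BANDS.get((risk_reduction, adoption_speed), "P3")


# Hypothetical backlog entries: (action, risk reduction, adoption speed).
actions = [
    ("broader architecture cleanup", "medium", "slow"),
    ("blocking policy check", "high", "fast"),
    ("runbook clarity pass", "medium", "fast"),
]

# "P0" < "P1" < "P2" < "P3" sorts correctly as plain strings.
ranked = sorted(actions, key=lambda a: priority_band(a[1], a[2]))
for name, risk, speed in ranked:
    print(priority_band(risk, speed), name)
```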
Hardening experiment design
Treat each hardening action as a small experiment:
- Hypothesis: "If we add X control, Y failure class drops by Z%."
- Measure: define metric and observation window.
- Guardrail: define rollback condition for the control itself.
- Owner: one execution owner and one verifier owner.
- Review date: fixed date, not "later".
Example
- hypothesis: mandatory pre-release rollback rehearsal reduces SEV-1 MTTR variance
- measure: p95 MTTR over next 4 incidents/drills
- guardrail: if release lead time regresses >20% for 2 cycles, recalibrate gate
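The guardrail in this example can be checked with a few lines of code. The sketch assumes that "2 cycles" means the two most recent consecutive cycles and that the baseline is the lead time measured before the gate was added; the function name and parameters are hypothetical.

```python
def should_recalibrate(
    baseline_hours: float,
    cycle_lead_times: list[float],
    threshold: float = 0.20,
    cycles: int = 2,
) -> bool:
    """True when each of the last `cycles` lead times exceeds the baseline
    by more than `threshold` (20% by default)."""
    if len(cycle_lead_times) < cycles:
        return False  # not enough observations to decide yet
    limit = baseline_hours * (1 + threshold)
    return all(t > limit for t in cycle_lead_times[-cycles:])


print(should_recalibrate(10.0, [11.0, 12.5, 13.0]))  # True
print(should_recalibrate(10.0, [13.0, 11.0]))        # False
```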
Ownership operating rhythm
Weekly reliability standup (20 minutes)
- review overdue hardening items
- review items with weak evidence links
- reassign blocked items immediately
- close items only with proof artifact links
Monthly control audit
- confirm controls are still running (not silently bypassed)
- confirm owners are still valid (team changes happen)
- retire controls that no longer reduce observed risk
Hardening quality rubric (0–2 per row)
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Clarity | vague task | partial scope | precise change + boundary |
| Evidence | none | screenshot only | command/log/proof links |
| Ownership | missing | single owner only | owner + verifier + due date |
| Adoption | unverified | anecdotal | observed in real workflow |
| Risk impact | unknown | assumed | measured trend impact |
Minimum passing total: 7/10. Otherwise, keep the loop open.
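A small helper can total the rubric and reject scores outside the 0–2 range. This is an illustrative sketch; the dimension keys are hypothetical names for the five table rows.

```python
# Hypothetical keys for the five rubric dimensions in the table above.
DIMENSIONS = ("clarity", "evidence", "ownership", "adoption", "risk_impact")
MIN_PASSING_TOTAL = 7  # out of a possible 10


def rubric_total(scores: dict[str, int]) -> int:
    """Sum the five dimension scores, insisting each one is 0, 1, or 2."""
    for name in DIMENSIONS:
        if scores.get(name) not in (0, 1, 2):
            raise ValueError(f"{name} must be scored 0, 1, or 2")
    return sum(scores[name] for name in DIMENSIONS)


def rubric_passes(scores: dict[str, int]) -> bool:
    return rubric_total(scores) >= MIN_PASSING_TOTAL


scores = {"clarity": 2, "evidence": 2, "ownership": 1, "adoption": 1, "risk_impact": 1}
print(rubric_total(scores), rubric_passes(scores))  # 7 True
```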
Hardening backlog template
### Hardening Item
- Incident reference:
- Failure class:
- Proposed control:
- Priority band: P0/P1/P2/P3
- Execution owner:
- Verifier owner:
- Due date:
- Evidence link(s):
- Adoption check date:
- Status: open | in_progress | verified | retired
Common failure modes in hardening programs
- backlog closes with no measurable effect on metrics
- controls are added but not socialized to operators
- runbook updates happen without updating on-call checklists
- ownership stays with a departed engineer
Solve these with an explicit monthly audit, not ad-hoc memory.
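Part of that audit can be scripted. The sketch below is one possible shape, assuming a hypothetical `HardeningItem` record based loosely on the backlog template and a roster of active engineers supplied by the caller; the names and dates are made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class HardeningItem:  # hypothetical record, loosely based on the backlog template
    title: str
    owner: str
    due: date
    status: str = "open"
    evidence_links: list[str] = field(default_factory=list)


def audit(items, active_engineers: set[str], today: date) -> dict[str, list[str]]:
    """Return findings keyed by item title; clean items are omitted."""
    findings: dict[str, list[str]] = {}
    for item in items:
        problems = []
        if item.owner not in active_engineers:
            problems.append("owner no longer on the team")
        if item.status in ("open", "in_progress") and item.due < today:
            problems.append("overdue")
        if item.status == "verified" and not item.evidence_links:
            problems.append("verified without evidence links")
        if problems:
            findings[item.title] = problems
    return findings


items = [
    HardeningItem("rollback validation gate", "dana", date(2024, 5, 1)),
    HardeningItem("alert routing update", "lee", date(2024, 7, 1), status="verified"),
]
report = audit(items, active_engineers={"lee"}, today=date(2024, 6, 1))
for title, problems in report.items():
    print(title, "->", ", ".join(problems))
```

Passing `today` explicitly keeps the audit reproducible when it is rerun against an archived backlog.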
From incident finding to control design
Convert findings with this mapping:
| Finding type | Control candidate | Verification style |
|---|---|---|
| missed detection | alert rule + routing update | synthetic alert replay |
| late decision | checkpoint SLA + commander script | drill timing audit |
| slow recovery | rollback rehearsal gate | timed rollback run |
| repeated regression | test harness + policy check | regression suite trend |
This mapping keeps hardening concrete.
14-day hardening sprint template
- Day 1–2: rank findings by risk and reversibility
- Day 3–5: implement top P0/P1 controls
- Day 6–8: add verification coverage and evidence links
- Day 9–11: run adoption checks in real workflow
- Day 12–14: publish summary + residual risk owners
If the schedule slips, reduce scope but keep verification intact.
Residual risk register format
### Residual Risk
- Risk statement:
- Why not fully solved yet:
- Temporary control in place:
- Trigger for re-prioritization:
- Owner:
- Target resolution date:
Residual risk without trigger/date is unmanaged risk.
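A register entry can be checked for the two fields that make a residual risk managed. This is a minimal sketch; the dictionary keys are hypothetical stand-ins for the template fields.

```python
# Hypothetical keys standing in for the "Trigger for re-prioritization"
# and "Target resolution date" fields of the template.
REQUIRED_FIELDS = ("trigger_for_reprioritization", "target_resolution_date")


def missing_fields(risk: dict[str, str]) -> list[str]:
    """List required fields that are absent or blank."""
    return [name for name in REQUIRED_FIELDS if not risk.get(name, "").strip()]


def is_managed(risk: dict[str, str]) -> bool:
    return not missing_fields(risk)


risk = {
    "risk_statement": "rollback script untested on the largest shard",
    "trigger_for_reprioritization": "any failed rollback drill",
    "target_resolution_date": "",
}
print(missing_fields(risk))  # ['target_resolution_date']
```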
Control economics framework
Every hardening control has cost. Make tradeoffs explicit.
| Control type | Reliability gain horizon | Ongoing cost | Recommended use |
|---|---|---|---|
| Blocking policy gate | immediate | medium | high-severity repeat failures |
| Regression suite expansion | short/medium | medium/high | root-cause class recurrence |
| Runbook + template upgrade | short | low | ownership ambiguity problems |
| Architecture simplification | medium/long | high | chronic complex failure chains |
Choose first the controls that deliver the most reliability gain for the least ongoing cost.
Adoption hardening checklist
A control is not complete at merge time. Confirm adoption explicitly.
- owners can execute the new control without side-channel knowledge
- failure path for the control itself is documented
- on-call checklist includes the new control point
- dashboard or log evidence proves the control actually runs
Merged-but-unused controls create false confidence.
Hardening QA gate
Before marking a hardening item verified, require:
- implementation evidence (diff + command output)
- behavioral evidence (drill or incident replay signal)
- ownership evidence (named operator + verifier signoff)
Missing any one keeps status at in_progress.
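The gate reduces to a simple status transition. The sketch below assumes evidence is collected as lists of links per kind; the keys and example links are hypothetical.

```python
# The three evidence kinds required by the QA gate above.
REQUIRED_EVIDENCE = ("implementation", "behavioral", "ownership")


def missing_evidence(evidence: dict[str, list[str]]) -> list[str]:
    """Evidence kinds with no links attached."""
    return [kind for kind in REQUIRED_EVIDENCE if not evidence.get(kind)]


def gate_status(evidence: dict[str, list[str]]) -> str:
    # Missing any one kind keeps the item at in_progress.
    return "in_progress" if missing_evidence(evidence) else "verified"


# Hypothetical evidence links for one hardening item.
evidence = {
    "implementation": ["diff-link", "command-output-link"],
    "behavioral": ["drill-report-link"],
    "ownership": [],
}
print(gate_status(evidence), missing_evidence(evidence))
# in_progress ['ownership']
```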
Control sunset protocol
Controls should not live forever by default.
Retire a control when:
- it no longer improves target metrics
- it duplicates a stronger control
- operational cost exceeds measured risk reduction
Retirement requires a short note with replacement or rationale.
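The sunset criteria can be encoded so that a retirement is rejected when no criterion applies or when the note is missing. This is a sketch under assumptions: `ControlReview` is a hypothetical record, and cost and risk reduction are assumed to be expressed in the same unit so they can be compared.

```python
from dataclasses import dataclass


@dataclass
class ControlReview:  # hypothetical review record for one control
    improves_target_metrics: bool
    duplicated_by_stronger_control: bool
    operational_cost: float         # assumption: same unit as risk reduction
    measured_risk_reduction: float  # e.g. engineer-hours saved per month


def retirement_reasons(review: ControlReview) -> list[str]:
    reasons = []
    if not review.improves_target_metrics:
        reasons.append("no longer improves target metrics")
    if review.duplicated_by_stronger_control:
        reasons.append("duplicates a stronger control")
    if review.operational_cost > review.measured_risk_reduction:
        reasons.append("operational cost exceeds measured risk reduction")
    return reasons


def retire(review: ControlReview, note: str) -> list[str]:
    """Return the reasons for retirement, or raise if retirement is not justified."""
    reasons = retirement_reasons(review)
    if not reasons:
        raise ValueError("no sunset criterion is met; keep the control")
    if not note.strip():
        raise ValueError("retirement requires a note with replacement or rationale")
    return reasons


review = ControlReview(True, False, operational_cost=12.0, measured_risk_reduction=5.0)
print(retire(review, "Replaced by the blocking policy gate."))
```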
Control rollout sequencing
Roll out new controls in three phases:
- Shadow mode — observe behavior without blocking flow
- Warn mode — surface violations with owner attribution
- Enforce mode — block on defined high-risk conditions
This sequencing avoids abrupt disruption while increasing safety.
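The three phases can share one code path that differs only in how a violation is handled. The sketch below is illustrative: the `Mode` enum and `handle_violation` function are hypothetical, and it assumes that in enforce mode a violation outside the defined high-risk conditions still produces a warning rather than a block.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"
    WARN = "warn"
    ENFORCE = "enforce"


def handle_violation(mode: Mode, owner: str, high_risk: bool) -> tuple[bool, str]:
    """Return (blocked, message) for one detected violation."""
    if mode is Mode.SHADOW:
        # Observe only: record the event, never interrupt the flow.
        return False, "shadow: violation recorded"
    if mode is Mode.ENFORCE and high_risk:
        return True, f"blocked: high-risk violation, owner {owner}"
    # Warn mode, or enforce mode on a lower-risk violation (an assumption).
    return False, f"warning: violation attributed to owner {owner}"


for mode in Mode:
    print(mode.value, handle_violation(mode, "release-team", high_risk=True))
```

Keeping the mode as configuration rather than code makes promotion from shadow to enforce a reviewable one-line change.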
Hardening adoption score
Track adoption with a simple score (0–5, one point per item):
- control enabled in target environment
- owners trained and acknowledged
- runbook updated with execution steps
- dashboard signal confirms runtime execution
- one real incident/drill used the control
Score below 4 means adoption is incomplete.
Reliability debt ledger
Keep a dedicated ledger for deferred hardening work:
| Debt item | Why deferred | Risk if delayed | Revisit date |
|---|---|---|---|
Deferred work without revisit dates tends to become permanent risk.