Claude Post-Incident Hardening Loop

Official References: Best Practices · Hooks · Security · GitHub Actions

Why this loop matters

Incident response restores service. Hardening prevents repeat incidents.

Without a hardening loop, every incident stays expensive.

Time-boxed hardening timeline

Window	Objective	Required output
0–24h	preserve signal	timeline + evidence packet
24–72h	close the top failure mode	one guardrail automation + one regression test
3–14d	reduce class-level risk	runbook/process update + owner-confirmed checklist

Hardening backlog model

Classify actions into three buckets:

Detect faster: alert thresholds, dashboard hints, pager routing
Prevent recurrence: tests, static checks, policy hooks
Recover faster: rollback script reliability, handoff templates, owner maps

Each item must have one owner and one due date.

Three-lane hardening execution

Controls lane: adds hooks, checks, and CI guardrails
Quality lane: adds regression coverage for root-cause class
Operations lane: updates runbook + communication protocol

Keep lanes independent when possible to reduce merge friction.

Readiness scorecard

Use a compact scorecard (0 or 1 per row):

root-cause class covered by test
rollback path rehearsed
escalation owner documented
dashboard/alerts updated
runbook updated and acknowledged

Score < 4 means hardening is incomplete.

Closure decision rule

Close hardening only when all are true:

scorecard reaches target
new guardrails are running in normal workflow
follow-up owners confirm adoption
no unresolved high-risk action remains

Evidence packet template

At hardening completion, archive:

incident id + summary
implemented controls
test evidence links
runbook diff
owners + due dates for residual work

Advanced anti-patterns

Retro written, controls not shipped

Learning without controls does not change risk.

Test added without ownership update

The system improves, but the team contract stays weak.

Marking done before adoption check

Guardrails not used in real workflow do not protect production.

Quick checklist

Before closing hardening loop:

one new automation guardrail merged
one regression test merged
runbook update published
owner acknowledgement recorded

Claude accelerates recovery work. Hardening is what compounds reliability.

Hardening portfolio prioritization matrix

Prioritize hardening actions with two axes:

Risk reduction magnitude (how much future incident probability drops)
Time to reliable adoption (how quickly teams can run this in normal flow)

Priority band	Risk reduction	Adoption speed	Typical action
P0	high	fast	add blocking policy check, add rollback validation gate
P1	high	medium	add regression suite for root-cause class
P2	medium	fast	improve runbook clarity, handoff templates
P3	medium/low	slow	broader architecture cleanups

Start with P0/P1 before polishing documentation language.

Hardening experiment design

Treat each hardening action as a small experiment:

Hypothesis: "If we add X control, Y failure class drops by Z%."
Measure: define metric and observation window.
Guardrail: define rollback condition for the control itself.
Owner: one execution owner and one verifier owner.
Review date: fixed date, not "later".

Example

hypothesis: mandatory pre-release rollback rehearsal reduces SEV-1 MTTR variance
measure: p95 MTTR over next 4 incidents/drills
guardrail: if release lead time regresses >20% for 2 cycles, recalibrate gate

Ownership operating rhythm

Weekly reliability standup (20 minutes)

review overdue hardening items
review items with weak evidence links
reassign blocked items immediately
close items only with proof artifact links

Monthly control audit

confirm controls are still running (not silently bypassed)
confirm owners are still valid (team changes happen)
retire controls that no longer reduce observed risk

Hardening quality rubric (0–2 per row)

Dimension	0	1	2
Clarity	vague task	partial scope	precise change + boundary
Evidence	none	screenshot only	command/log/proof links
Ownership	missing	single owner only	owner + verifier + due date
Adoption	unverified	anecdotal	observed in real workflow
Risk impact	unknown	assumed	measured trend impact

Minimum passing total: 7/10. Otherwise keep loop open.

Hardening backlog template

### Hardening Item
- Incident reference:
- Failure class:
- Proposed control:
- Priority band: P0/P1/P2/P3
- Execution owner:
- Verifier owner:
- Due date:
- Evidence link(s):
- Adoption check date:
- Status: open | in_progress | verified | retired

Common failure modes in hardening programs

backlog closes with no measurable effect on metrics
controls are added but not socialized to operators
runbook updates happen without updating on-call checklists
ownership stays with a departed engineer

Solve these with explicit monthly audit, not ad-hoc memory.

From incident finding to control design

Convert findings with this mapping:

Finding type	Control candidate	Verification style
missed detection	alert rule + routing update	synthetic alert replay
late decision	checkpoint SLA + commander script	drill timing audit
slow recovery	rollback rehearsal gate	timed rollback run
repeated regression	test harness + policy check	regression suite trend

This mapping keeps hardening concrete.

14-day hardening sprint template

Day 1–2: rank findings by risk and reversibility
Day 3–5: implement top P0/P1 controls
Day 6–8: add verification coverage and evidence links
Day 9–11: run adoption checks in real workflow
Day 12–14: publish summary + residual risk owners

If schedule slips, reduce scope but keep verification intact.

Residual risk register format

### Residual Risk
- Risk statement:
- Why not fully solved yet:
- Temporary control in place:
- Trigger for re-prioritization:
- Owner:
- Target resolution date:

Residual risk without trigger/date is unmanaged risk.

Control economics framework

Every hardening control has cost. Make tradeoffs explicit.

Control type	Reliability gain horizon	Ongoing cost	Recommended use
Blocking policy gate	immediate	medium	high-severity repeat failures
Regression suite expansion	short/medium	medium/high	root-cause class recurrence
Runbook + template upgrade	short	low	ownership ambiguity problems
Architecture simplification	medium/long	high	chronic complex failure chains