Official References: Get Started · CLI Commands · Sub-agents · Skills
Why this loop matters
An incident can be solved in hours and still repeat next week. Hardening is the loop that prevents recurrence.
Hardening windows
| Window | Priority | Mandatory output |
|---|---|---|
| 0–24h | preserve evidence quality | timeline + lane decisions |
| 24–72h | block the dominant failure mode | one automation guardrail + one regression test |
| 3–14d | strengthen operational contract | runbook/process update + owner adoption check |
Control backlog buckets
- Detection: alert rules, triage visibility, ownership routing
- Prevention: policy checks, static checks, test coverage
- Recovery: rollback reliability, handoff templates, escalation maps
All backlog items need owner, due date, and verification condition.
Hardening lane architecture
- Controls lane: guardrails and policy enforcement
- Quality lane: regression and reproducibility improvements
- Operations lane: runbook/communication updates
Avoid shared-file collisions unless a merge owner is assigned.
Reliability scorecard
Track 0/1 per control:
- root-cause class now test-covered
- rollback path rehearsal completed
- escalation ownership documented
- alerting/dashboard adjusted
- runbook updated + acknowledged
Score 5/5 is ideal. 4/5 is minimum closure threshold.
Closure gate
Declare hardening complete only when:
- scorecard threshold is met
- controls are active in routine workflow
- follow-up owners acknowledge adoption
- no unresolved high-risk action remains
Evidence bundle
Store one completion bundle:
- incident id + short narrative
- controls shipped
- test evidence links
- runbook diff references
- residual action owners + deadlines
Advanced anti-patterns
"We already fixed production"
Fixing production is recovery, not hardening.
Controls without ownership
An unowned guardrail will decay silently.
Closing with no adoption proof
If teams do not use the update, risk stays.
Quick checklist
Before closure:
- one automation control merged
- one regression test merged
- runbook update shared
- owner adoption recorded
Gemini CLI can speed recovery. Hardening makes that speed reliable.
Hardening portfolio prioritization matrix
Prioritize hardening actions with two axes:
- Risk reduction magnitude (how much future incident probability drops)
- Time to reliable adoption (how quickly teams can run this in normal flow)
| Priority band | Risk reduction | Adoption speed | Typical action |
|---|---|---|---|
| P0 | high | fast | add blocking policy check, add rollback validation gate |
| P1 | high | medium | add regression suite for root-cause class |
| P2 | medium | fast | improve runbook clarity, handoff templates |
| P3 | medium/low | slow | broader architecture cleanups |
Start with P0/P1 before polishing documentation language.
Hardening experiment design
Treat each hardening action as a small experiment:
- Hypothesis: "If we add X control, Y failure class drops by Z%."
- Measure: define metric and observation window.
- Guardrail: define rollback condition for the control itself.
- Owner: one execution owner and one verifier owner.
- Review date: fixed date, not "later".
Example
- hypothesis: mandatory pre-release rollback rehearsal reduces SEV-1 MTTR variance
- measure: p95 MTTR over next 4 incidents/drills
- guardrail: if release lead time regresses >20% for 2 cycles, recalibrate gate
Ownership operating rhythm
Weekly reliability standup (20 minutes)
- review overdue hardening items
- review items with weak evidence links
- reassign blocked items immediately
- close items only with proof artifact links
Monthly control audit
- confirm controls are still running (not silently bypassed)
- confirm owners are still valid (team changes happen)
- retire controls that no longer reduce observed risk
Hardening quality rubric (0–2 per row)
| Dimension | 0 | 1 | 2 |
|---|---|---|---|
| Clarity | vague task | partial scope | precise change + boundary |
| Evidence | none | screenshot only | command/log/proof links |
| Ownership | missing | single owner only | owner + verifier + due date |
| Adoption | unverified | anecdotal | observed in real workflow |
| Risk impact | unknown | assumed | measured trend impact |
Minimum passing total: 7/10. Otherwise keep loop open.
Hardening backlog template
### Hardening Item
- Incident reference:
- Failure class:
- Proposed control:
- Priority band: P0/P1/P2/P3
- Execution owner:
- Verifier owner:
- Due date:
- Evidence link(s):
- Adoption check date:
- Status: open | in_progress | verified | retiredCommon failure modes in hardening programs
- backlog closes with no measurable effect on metrics
- controls are added but not socialized to operators
- runbook updates happen without updating on-call checklists
- ownership stays with a departed engineer
Solve these with explicit monthly audit, not ad-hoc memory.
From incident finding to control design
Convert findings with this mapping:
| Finding type | Control candidate | Verification style |
|---|---|---|
| missed detection | alert rule + routing update | synthetic alert replay |
| late decision | checkpoint SLA + commander script | drill timing audit |
| slow recovery | rollback rehearsal gate | timed rollback run |
| repeated regression | test harness + policy check | regression suite trend |
This mapping keeps hardening concrete.
14-day hardening sprint template
- Day 1–2: rank findings by risk and reversibility
- Day 3–5: implement top P0/P1 controls
- Day 6–8: add verification coverage and evidence links
- Day 9–11: run adoption checks in real workflow
- Day 12–14: publish summary + residual risk owners
If schedule slips, reduce scope but keep verification intact.
Residual risk register format
### Residual Risk
- Risk statement:
- Why not fully solved yet:
- Temporary control in place:
- Trigger for re-prioritization:
- Owner:
- Target resolution date:Residual risk without trigger/date is unmanaged risk.
Control economics framework
Every hardening control has cost. Make tradeoffs explicit.
| Control type | Reliability gain horizon | Ongoing cost | Recommended use |
|---|---|---|---|
| Blocking policy gate | immediate | medium | high-severity repeat failures |
| Regression suite expansion | short/medium | medium/high | root-cause class recurrence |
| Runbook + template upgrade | short | low | ownership ambiguity problems |
| Architecture simplification | medium/long | high | chronic complex failure chains |
Choose controls that improve signal-to-effort ratio first.
Adoption hardening checklist
A control is not complete at merge time. Confirm adoption explicitly.
- owners can execute the new control without side-channel knowledge
- failure path for the control itself is documented
- on-call checklist includes the new control point
- dashboard or log evidence proves the control actually runs
Merged-but-unused controls create false confidence.
Hardening QA gate
Before marking a hardening item verified, require:
- implementation evidence (diff + command output)
- behavioral evidence (drill or incident replay signal)
- ownership evidence (named operator + verifier signoff)
Missing any one keeps status at in_progress.
Control sunset protocol
Controls should not live forever by default.
Retire a control when:
- it no longer improves target metrics
- it duplicates a stronger control
- operational cost exceeds measured risk reduction
Retirement requires a short note with replacement or rationale.
Control rollout sequencing
Roll out new controls in three phases:
- Shadow mode — observe behavior without blocking flow
- Warn mode — surface violations with owner attribution
- Enforce mode — block on defined high-risk conditions
This sequencing avoids abrupt disruption while increasing safety.
Hardening adoption score
Track adoption with a simple score (0–5):
- control enabled in target environment
- owners trained and acknowledged
- runbook updated with execution steps
- dashboard signal confirms runtime execution
- one real incident/drill used the control
Score below 4 means adoption is incomplete.
Reliability debt ledger
Keep a dedicated ledger for deferred hardening work:
| Debt item | Why deferred | Risk if delayed | Revisit date |
|---|
Deferred work without revisit dates tends to become permanent risk.