## Why resilience metrics matter
Drills and runbooks are necessary but not sufficient. Without metrics, teams cannot tell whether resilience is improving or degrading.
## Core resilience metric stack

| Metric | What it measures | Typical source |
|---|---|---|
| MTTD (mean time to detect) | detection latency | alert timeline |
| MTTC (mean time to contain) | time to containment decision | incident decision log |
| MTTR (mean time to restore) | time to restore stable service | deploy + verification logs |
| Verification freshness | age of final proof before closure | command evidence records |
| Follow-up closure rate | % of hardening items closed on time | hardening backlog |
## SLO model by severity class
- SEV-1: containment decision and rollback path triggered under strict time budget
- SEV-2: user-impacting degradation stabilized within team-defined window
- SEV-3: corrective release completed within planned cycle
Write SLO targets as explicit numbers, not adjectives.
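One way to keep targets explicit is to store them as literal numbers in version control. A minimal sketch: the severity classes mirror the list above, but the phase names and minute values here are illustrative placeholders, not recommendations.

```python
# Hypothetical SLO targets in minutes -- explicit numbers, not adjectives.
# Phase names and values are placeholders for a team's own budget.
SLO_TARGETS_MIN = {
    "SEV-1": {"containment": 15, "rollback_trigger": 30},
    "SEV-2": {"stabilization": 240},
    "SEV-3": {"corrective_release": 7 * 24 * 60},  # one planned weekly cycle
}

def within_slo(severity: str, phase: str, elapsed_min: float) -> bool:
    """Return True if elapsed time meets the numeric target for this phase."""
    return elapsed_min <= SLO_TARGETS_MIN[severity][phase]
```

Because the targets are plain data, the same table can drive dashboards, alerts, and review docs without drifting apart.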
## Data collection protocol
For every incident, record:
- start timestamp
- first alert timestamp
- first mitigation decision timestamp
- stable-state confirmation timestamp
- closure timestamp
Missing timestamps invalidate trend analysis.
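The five timestamps above are enough to derive the core durations. A sketch, assuming hypothetical record keys and ISO-8601 strings; a missing timestamp is rejected outright rather than silently skipped, since gaps invalidate trend analysis:

```python
from datetime import datetime

def resilience_durations(record: dict) -> dict:
    """Compute MTTD/MTTC/MTTR in minutes from one incident's timestamps.

    Assumes hypothetical keys holding ISO-8601 strings: start, first_alert,
    first_mitigation_decision, stable_confirmed, closed.
    """
    required = ["start", "first_alert", "first_mitigation_decision",
                "stable_confirmed", "closed"]
    missing = [k for k in required if not record.get(k)]
    if missing:
        # Refuse to emit partial metrics: they would poison the trend data.
        raise ValueError(f"missing timestamps: {missing}")
    t = {k: datetime.fromisoformat(record[k]) for k in required}

    def minutes(a: str, b: str) -> float:
        return (t[b] - t[a]).total_seconds() / 60

    return {
        "mttd_min": minutes("start", "first_alert"),
        "mttc_min": minutes("start", "first_mitigation_decision"),
        "mttr_min": minutes("start", "stable_confirmed"),
    }
```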
## Weekly resilience review
Every week:
- review outliers in MTTD/MTTR
- inspect missed or delayed follow-ups
- map failures to control backlog buckets
- assign owner and deadline for top regressions
## Threshold-driven escalation rules
Define red/yellow/green thresholds for each metric. When red is hit:
- open escalation immediately
- assign reliability owner
- force next-week re-check
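The classification and the red-state actions can both be pure functions, which keeps the rules testable. A sketch assuming a higher-is-worse metric (such as MTTR in minutes); threshold values would come from the team's own definitions:

```python
def classify(value: float, yellow: float, red: float) -> str:
    """Map a metric value to green/yellow/red.

    Assumes higher is worse (e.g. MTTR in minutes); invert for
    rate-style metrics where lower is worse.
    """
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"

def on_breach(state: str) -> list[str]:
    """Return the mandatory red-state actions; empty list otherwise."""
    if state != "red":
        return []
    return [
        "open escalation",
        "assign reliability owner",
        "schedule next-week re-check",
    ]
```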
## Dashboard design rules
- show trend, not just latest value
- separate severity classes
- include denominator/context
- link each spike to incident record
Metrics without context create false narratives.
## Quarterly calibration
Every quarter:
- raise SLO targets only after stable attainment
- retire metrics that do not influence decisions
- add one metric for newly observed failure class
A smaller useful dashboard beats a large ignored dashboard.
## Advanced anti-patterns

**Reporting only averages.** Averages hide tail-risk behavior.

**SLO set without ownership.** Unowned SLOs become decorative numbers.

**Closing incidents with no freshness check.** Old evidence cannot support current closure confidence.
## Quick checklist
Before monthly reliability review:
- metric definitions documented
- severity-specific SLOs visible
- threshold breaches mapped to owners
- follow-up closure trend reviewed
Claude helps teams move fast. Metrics ensure they improve safely.
## Metric dictionary (required fields)
Define each metric with the same schema:
### Metric Definition
- Name:
- Purpose:
- Formula:
- Data source:
- Collection cadence:
- Owner:
- Red threshold:
- Yellow threshold:
- Expected action on breach:

Ambiguous metric definitions create endless debate during incidents.
## Error-budget style SLO policy
For each severity class, define an operational budget:
- allowed breach count per quarter
- mandatory escalation threshold
- freeze rule when budget is exhausted
### Example policy
- SEV-1: zero tolerance for missed containment window
- SEV-2: two breaches per quarter before mandatory control review
- SEV-3: tracked for trend, not immediate freeze
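This example policy can be encoded so the freeze rule fires mechanically rather than by debate. The `BREACH_BUDGET` values mirror the example above; `None` marks trend-only tracking:

```python
# Per-quarter breach budgets matching the example policy:
# SEV-1 zero tolerance, SEV-2 two breaches, SEV-3 trend-only (None).
BREACH_BUDGET = {"SEV-1": 0, "SEV-2": 2, "SEV-3": None}

def budget_state(severity: str, breaches_this_quarter: int) -> str:
    """Return 'ok', 'review', or 'freeze' given the quarter's breach count."""
    budget = BREACH_BUDGET[severity]
    if budget is None:
        return "ok"      # tracked for trend, never triggers a freeze
    if breaches_this_quarter > budget:
        return "freeze"  # budget exhausted: freeze rule applies
    if breaches_this_quarter == budget and budget > 0:
        return "review"  # at the limit: mandatory control review
    return "ok"
```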
## Trend review prompts (weekly)
Use consistent prompts in weekly review:
- Which metric moved most versus baseline?
- Is the movement signal or noise (sample size check)?
- Which owner needs to act this week?
- Which control portfolio bucket receives the action?
- What result should be visible by next review?
## Escalation mapping table
| Breach type | Immediate owner | Secondary owner | SLA for response |
|---|---|---|---|
| Detection breach (MTTD red) | observability owner | incident commander | 24h |
| Decision delay (MTTC red) | incident commander | release owner | same day |
| Recovery delay (MTTR red) | platform owner | service owner | 24h |
| Freshness breach | verifier owner | commander | same day |
| Follow-up closure breach | reliability owner | team lead | 72h |
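Keeping this table as a lookup in code makes routing deterministic instead of tribal knowledge. The breach-type keys below are hypothetical identifiers chosen for this sketch:

```python
# Escalation routing table: breach type -> (immediate owner, secondary, SLA).
# Keys are hypothetical identifiers; rows mirror the mapping table above.
ESCALATION = {
    "mttd_red":         ("observability owner", "incident commander", "24h"),
    "mttc_red":         ("incident commander", "release owner", "same day"),
    "mttr_red":         ("platform owner", "service owner", "24h"),
    "freshness":        ("verifier owner", "commander", "same day"),
    "followup_closure": ("reliability owner", "team lead", "72h"),
}

def route(breach_type: str) -> dict:
    """Return the owners and response SLA for a breach type."""
    owner, secondary, sla = ESCALATION[breach_type]
    return {
        "immediate_owner": owner,
        "secondary_owner": secondary,
        "response_sla": sla,
    }
```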
## Executive summary format (monthly)
### Monthly Resilience Summary
- Top improving metric:
- Top regressing metric:
- Repeated breach classes:
- Controls added this month:
- Controls retired this month:
- Ownership risks:
- Next-month focus:

Keep this short and decision-oriented.
## Data quality checks
Before trusting metric dashboards, verify:
- missing timestamps ratio
- duplicate incident IDs
- inconsistent severity labels
- stale data source refresh time
A precise metric on broken data is still misleading.
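A minimal sketch of these checks over a list of incident records, assuming hypothetical `id`, `start`, and `severity` fields:

```python
def data_quality_report(incidents: list[dict]) -> dict:
    """Run basic data-quality checks on raw incident records.

    Assumes hypothetical fields per record: id, start, severity.
    """
    n = len(incidents)
    ids = [i.get("id") for i in incidents]
    missing_ts = sum(1 for i in incidents if not i.get("start"))
    valid_sev = {"SEV-1", "SEV-2", "SEV-3"}
    bad_sev = sum(1 for i in incidents if i.get("severity") not in valid_sev)
    return {
        "missing_timestamp_ratio": missing_ts / n if n else 0.0,
        "duplicate_id_count": len(ids) - len(set(ids)),
        "inconsistent_severity_count": bad_sev,
    }
```

Running this before the weekly review turns "the dashboard looks wrong" into a concrete data-repair task.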
## Advanced anti-gaming rules
- never grade teams by single metric rank
- require evidence links for major metric improvements
- review tail percentiles before celebrating averages
- tie rewards to sustained trend, not one-week spikes
This preserves metric integrity under organizational pressure.
## Metric review board operating rule
Run a monthly reliability board with three outputs only:
- keep — metric still drives action
- change — metric definition/threshold needs revision
- remove — metric has no decision value
This avoids dashboard sprawl.
## Tail-risk tracking
In addition to average values, track:
- p90 / p95 / p99 for MTTD and MTTR
- longest open follow-up age
- worst-severity breach recurrence interval
Tail views expose the incidents that matter most.
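At incident-count sample sizes, tail percentiles need no external libraries. A nearest-rank sketch with illustrative MTTR samples (the minute values are made up for the example):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; adequate for small incident samples."""
    ranked = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Illustrative MTTR samples in minutes; note the single long tail incident.
mttr_samples = [12, 15, 20, 22, 30, 35, 40, 55, 90, 240]
tails = {p: percentile(mttr_samples, p) for p in (90, 95, 99)}
```

Here the average is about 56 minutes while p99 is 240: exactly the tail behavior that average-only reporting hides.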
## SLO breach playbook

When a breach occurs:
- open breach record within same day
- assign owner and verifier
- define corrective control candidate
- set review checkpoint within 7 days
Close breach records only with evidence of control effect.
## Metric retirement criteria
Retire a metric when all are true:
- no action taken from it for 2 quarters
- overlaps strongly with another metric
- stakeholders cannot explain how they use it
Retire with a note, not silent deletion.
## Metric-to-action contract
Every tracked metric must have a predefined action path.
| Metric state | Mandatory action | Owner |
|---|---|---|
| Green stable | monitor only | metric owner |
| Yellow drift | open investigation note | reliability owner |
| Red breach | execute escalation playbook | commander + service owner |
A metric without an action contract is decorative.
## SLO negotiation rubric

When teams disagree on SLO targets, resolve the dispute with this rubric:
- customer impact severity
- current system capability baseline
- reversibility of failures in that domain
- operational cost to meet tighter target
Choose SLOs by risk economics, not optimism.
## Data reliability checks for dashboards
Run weekly checks on the measurement system itself:
- timestamp completeness ratio
- severity label consistency
- duplicate incident record rate
- source refresh delay
A resilient team measures both service reliability and metric reliability.
## Executive narrative template
Tie metrics to action every month:
- what degraded
- what control was added
- what improved after control
- what remains high-risk
- who owns next correction
Leadership needs this chain to fund the right fixes.
## Metric ownership rotation policy
Rotate secondary metric owners quarterly while keeping one stable primary owner.
- primary owner keeps continuity
- rotating secondary owner provides fresh challenge and catches blind spots
This prevents metric stagnation.
## Forecasting with resilience trends
Add a monthly forecast section:
- expected MTTD/MTTR band next month
- top breach risk by severity class
- confidence level of forecast
- planned controls influencing forecast
Forecasting turns metrics from reporting into planning.
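A deliberately naive band forecast, assuming a few months of history is available: mean plus or minus k standard deviations. This ignores trend and seasonality, so treat it as a starting point for the forecast section, not a model.

```python
from statistics import mean, stdev

def forecast_band(monthly_values: list[float], k: float = 1.0) -> tuple[float, float]:
    """Naive next-month band for a duration metric (e.g. MTTR in minutes).

    Sketch only: mean +/- k*stdev over recent months, floored at zero.
    Needs at least two data points for stdev.
    """
    m, s = mean(monthly_values), stdev(monthly_values)
    return (max(0.0, m - k * s), m + k * s)
```

A wide band is itself a signal: it says the forecast confidence line in the monthly report should be low.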
## Alert-to-metric reconciliation

Reconcile weekly:
- alerts that triggered but did not map to incidents
- incidents discovered without corresponding alert
- breaches not represented on dashboard
Gaps here indicate monitoring-model drift.
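All three reconciliation gaps reduce to set differences once each system exports IDs. A sketch; every parameter name here is hypothetical:

```python
def reconcile(alerted: set[str], incidents: set[str],
              breaches: set[str], dashboard: set[str]) -> dict:
    """Weekly reconciliation as set differences.

    Hypothetical inputs: `alerted` = incident IDs reachable from alerts,
    `incidents` = all incident IDs, `breaches` = recorded breach IDs,
    `dashboard` = breach IDs currently shown on the dashboard.
    """
    return {
        "alerts_without_incident": alerted - incidents,
        "incidents_without_alert": incidents - alerted,
        "breaches_missing_from_dashboard": breaches - dashboard,
    }
```

Any non-empty set in the result is a concrete monitoring-model drift item for the weekly review.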