Why measurement is a resilience control
Reliability drifts gradually before it breaks visibly. Metrics are the early-warning system for that drift.
Core resilience metric set
| Metric | Decision question | Typical source |
|---|---|---|
| MTTD | are we detecting fast enough? | alert timeline |
| MTTC | are decisions delayed? | incident command log |
| MTTR | how fast do we restore stable service? | deploy/verification timeline |
| Evidence freshness | is closure proof current? | verification artifact log |
| Follow-up closure rate | are hardening commitments actually closing? | post-incident backlog |
Severity-based SLO design
- SEV-1: strict containment and rollback activation window
- SEV-2: major degradation stabilization window
- SEV-3: planned corrective release completion window
SLO definitions must be numeric and owner-assigned.
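The "numeric and owner-assigned" rule can be encoded directly, so an SLO cannot exist without a number and a name attached. A minimal sketch; the class name, window values, and owner roles are illustrative assumptions, not prescribed targets:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SeveritySLO:
    severity: str
    window_minutes: int   # numeric target: every SLO carries a number
    owner: str            # owner-assigned: every SLO carries a name

# Illustrative values only; real windows come from your own risk analysis.
SLOS = [
    SeveritySLO("SEV-1", window_minutes=30, owner="incident-commander"),
    SeveritySLO("SEV-2", window_minutes=240, owner="service-owner"),
    SeveritySLO("SEV-3", window_minutes=7 * 24 * 60, owner="release-owner"),
]
```

Making the fields mandatory at construction time is the point: a vague "fast containment" target simply cannot be registered.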
Measurement protocol
Record these timestamps every incident:
- incident start
- first detection
- first mitigation decision
- stable-state confirmation
- closure
Incomplete timestamp chains reduce analysis credibility.
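The five timestamps above are enough to derive the per-incident inputs behind MTTD, MTTC, and MTTR. A sketch assuming each incident is stored as a dict of `datetime` values; the field names are hypothetical:

```python
from datetime import datetime

def resilience_intervals(events: dict) -> dict:
    """Derive per-incident MTTD/MTTC/MTTR inputs from the timestamp chain."""
    start = events["incident_start"]
    return {
        "time_to_detect": events["first_detection"] - start,
        "time_to_command": events["first_mitigation_decision"] - start,
        "time_to_restore": events["stable_state_confirmation"] - start,
    }

incident = {
    "incident_start": datetime(2024, 5, 1, 10, 0),
    "first_detection": datetime(2024, 5, 1, 10, 12),
    "first_mitigation_decision": datetime(2024, 5, 1, 10, 25),
    "stable_state_confirmation": datetime(2024, 5, 1, 11, 40),
    "closure": datetime(2024, 5, 2, 9, 0),
}
print(resilience_intervals(incident)["time_to_detect"])  # 0:12:00
```

A `KeyError` here is a feature, not a bug: an incident missing a timestamp fails loudly instead of silently weakening the averages.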
Weekly reliability operating review
- examine threshold breaches
- inspect tail-latency outliers
- map misses to control portfolio buckets
- assign owner + due date for highest-impact regressions
Threshold escalation model
Per metric, define green/yellow/red boundaries. A red-boundary hit requires:
- immediate escalation record
- named reliability owner
- forced re-check in the next cycle
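For lower-is-better metrics such as MTTD or MTTR, the banding logic is a few lines, and the red path can return the three required escalation facts. A sketch with illustrative thresholds:

```python
def band(value: float, yellow: float, red: float) -> str:
    """Classify a lower-is-better metric value into green/yellow/red."""
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"

def on_red(metric: str, owner: str) -> dict:
    """Red boundary hit: escalation record, named owner, forced re-check."""
    return {"escalation_record": metric, "reliability_owner": owner,
            "recheck_next_cycle": True}

state = band(55.0, yellow=30.0, red=45.0)  # illustrative MTTR minutes
if state == "red":
    record = on_red("MTTR", "platform owner")
```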
Dashboard quality rules
- show trend and percentile bands
- separate severity classes
- include denominator and sample size
- link each spike to incident notes
Quarterly calibration loop
Every quarter:
- raise SLO targets only after sustained attainment
- retire non-actionable metrics
- add one metric for newly observed failure class
The goal is decision quality, not dashboard size.
Advanced anti-patterns
Averaging away tail risk
Mean values hide the events that cause incidents.
SLOs with no accountable owner
Unowned SLOs degrade into status theater.
Closing with stale evidence
Old proof cannot justify present-state confidence.
Quick checklist
Before monthly resilience review:
- metric definitions current
- severity SLOs published and owned
- red/yellow breaches explicitly assigned
- follow-up closure trend reviewed
Codex accelerates execution speed. Metrics preserve reliability integrity.
Metric dictionary (required fields)
Define each metric with the same schema:
### Metric Definition
- Name:
- Purpose:
- Formula:
- Data source:
- Collection cadence:
- Owner:
- Red threshold:
- Yellow threshold:
- Expected action on breach:

Ambiguous metric definitions create endless debate during incidents.
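One way to keep definitions unambiguous is to encode the schema so a metric cannot be registered with a field missing. A sketch using a dataclass; the MTTR values shown are illustrative, not recommended thresholds:

```python
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    """Required-field schema: construction fails if any field is omitted."""
    name: str
    purpose: str
    formula: str
    data_source: str
    collection_cadence: str
    owner: str
    red_threshold: float
    yellow_threshold: float
    action_on_breach: str

mttr = MetricDefinition(
    name="MTTR",
    purpose="Speed of restoring stable service",
    formula="stable_state_confirmation - incident_start",
    data_source="deploy/verification timeline",
    collection_cadence="per incident",
    owner="platform owner",
    red_threshold=120.0,    # minutes; illustrative
    yellow_threshold=60.0,  # minutes; illustrative
    action_on_breach="execute escalation playbook",
)
```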
Error-budget style SLO policy
For each severity class, define an operational budget:
- allowed breach count per quarter
- mandatory escalation threshold
- freeze rule when budget is exhausted
Example policy
- SEV-1: zero tolerance for missed containment window
- SEV-2: two breaches per quarter before mandatory control review
- SEV-3: tracked for trend, not immediate freeze
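The example policy can be expressed as a small function so budget state is computed, not debated mid-incident. A sketch mirroring the thresholds above; interpreting "zero tolerance" as freeze-on-first-breach is an assumption:

```python
def budget_state(severity: str, breaches_this_quarter: int) -> str:
    """Return 'ok', 'review', or 'freeze' under the example policy."""
    if severity == "SEV-1":
        # Zero tolerance: any breach exhausts the budget.
        return "freeze" if breaches_this_quarter > 0 else "ok"
    if severity == "SEV-2":
        # Two breaches allowed; the third forces a control review.
        return "review" if breaches_this_quarter > 2 else "ok"
    return "ok"  # SEV-3: tracked for trend, never an immediate freeze
```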
Trend review prompts (weekly)
Use consistent prompts in weekly review:
- Which metric moved most versus baseline?
- Is the movement signal or noise (sample size check)?
- Which owner needs to act this week?
- Which control portfolio bucket receives the action?
- What result should be visible by next review?
Escalation mapping table
| Breach type | Immediate owner | Secondary owner | SLA for response |
|---|---|---|---|
| Detection breach (MTTD red) | observability owner | incident commander | 24h |
| Decision delay (MTTC red) | incident commander | release owner | same day |
| Recovery delay (MTTR red) | platform owner | service owner | 24h |
| Freshness breach | verifier owner | commander | same day |
| Follow-up closure breach | reliability owner | team lead | 72h |
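Keeping the table in machine-readable form lets tooling route a breach to the right owners automatically. A sketch carrying the same rows; the dictionary key names are hypothetical:

```python
# Rows mirror the escalation mapping table: (immediate, secondary, SLA).
ESCALATION = {
    "mttd_red": ("observability owner", "incident commander", "24h"),
    "mttc_red": ("incident commander", "release owner", "same day"),
    "mttr_red": ("platform owner", "service owner", "24h"),
    "freshness": ("verifier owner", "commander", "same day"),
    "follow_up_closure": ("reliability owner", "team lead", "72h"),
}

def route(breach_type: str) -> dict:
    """Look up owners and response SLA for a breach type."""
    immediate, secondary, sla = ESCALATION[breach_type]
    return {"immediate": immediate, "secondary": secondary, "sla": sla}
```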
Executive summary format (monthly)
### Monthly Resilience Summary
- Top improving metric:
- Top regressing metric:
- Repeated breach classes:
- Controls added this month:
- Controls retired this month:
- Ownership risks:
- Next-month focus:

Keep this short and decision-oriented.
Data quality checks
Before trusting metric dashboards, verify:
- missing timestamps ratio
- duplicate incident IDs
- inconsistent severity labels
- stale data source refresh time
A precise metric on broken data is still misleading.
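The first two checks are straightforward to automate over raw incident records. A sketch assuming each record is a dict; field names are hypothetical:

```python
def data_quality(incidents: list) -> dict:
    """Missing-timestamp ratio and duplicate-ID count over raw records."""
    total = len(incidents)
    ids = [i["id"] for i in incidents]
    required = ("incident_start", "first_detection", "stable_state_confirmation")
    missing = sum(1 for i in incidents
                  if any(i.get(field) is None for field in required))
    return {
        "missing_timestamp_ratio": missing / total if total else 0.0,
        "duplicate_incident_ids": len(ids) - len(set(ids)),
    }

records = [
    {"id": "INC-1", "incident_start": "t0", "first_detection": "t1",
     "stable_state_confirmation": "t2"},
    {"id": "INC-1", "incident_start": "t0", "first_detection": None,
     "stable_state_confirmation": "t2"},
]
# One record missing a timestamp, one duplicated ID.
print(data_quality(records))
```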
Advanced anti-gaming rules
- never grade teams by single metric rank
- require evidence links for major metric improvements
- review tail percentiles before celebrating averages
- tie rewards to sustained trend, not one-week spikes
This preserves metric integrity under organizational pressure.
Metric review board operating rule
Run a monthly reliability board with three outputs only:
- keep — metric still drives action
- change — metric definition/threshold needs revision
- remove — metric has no decision value
This avoids dashboard sprawl.
Tail-risk tracking
In addition to average values, track:
- p90 / p95 / p99 for MTTD and MTTR
- longest open follow-up age
- worst-severity breach recurrence interval
Tail views expose the incidents that matter most.
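Percentiles are cheap to compute from raw per-incident values using the standard library alone. A sketch; note that high percentiles are unstable on small samples, so check sample size before acting on p99:

```python
from statistics import quantiles

def tail_view(values: list) -> dict:
    """p90/p95/p99 from per-incident values (e.g. MTTR in minutes).

    quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    """
    cuts = quantiles(values, n=100)
    return {"p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}
```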
SLO breach playbook
When a breach occurs:
- open breach record within same day
- assign owner and verifier
- define corrective control candidate
- set review checkpoint within 7 days
Close breach records only with evidence of control effect.
Metric retirement criteria
Retire a metric when all are true:
- no action taken from it for 2 quarters
- overlaps strongly with another metric
- stakeholders cannot explain how they use it
Retire with a note, not silent deletion.
Metric-to-action contract
Every tracked metric must have a predefined action path.
| Metric state | Mandatory action | Owner |
|---|---|---|
| Green stable | monitor only | metric owner |
| Yellow drift | open investigation note | reliability owner |
| Red breach | execute escalation playbook | commander + service owner |
No action contract means the metric is decorative.
SLO negotiation rubric
When teams disagree on SLO targets, resolve with this rubric:
- customer impact severity
- current system capability baseline
- reversibility of failures in that domain
- operational cost to meet tighter target
Choose SLOs by risk economics, not optimism.
Data reliability checks for dashboards
Run weekly checks on the measurement system itself:
- timestamp completeness ratio
- severity label consistency
- duplicate incident record rate
- source refresh delay
A resilient team measures both service reliability and metric reliability.
Executive narrative template
Tie metrics to action every month:
- what degraded
- what control was added
- what improved after control
- what remains high-risk
- who owns next correction
Leadership needs this chain to fund the right fixes.
Metric ownership rotation policy
Rotate secondary metric owners quarterly while keeping one stable primary owner.
- primary owner keeps continuity
- rotating secondary owner provides fresh challenge and catches blind spots
This prevents metric stagnation.
Forecasting with resilience trends
Add a monthly forecast section:
- expected MTTD/MTTR band next month
- top breach risk by severity class
- confidence level of forecast
- planned controls influencing forecast
Forecasting turns metrics from reporting into planning.
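Even a naive band beats no band, because it forces a falsifiable prediction. A deliberately simple baseline sketch using mean ± 2·stdev of recent monthly values; replace with a real model if you have one (the history values are illustrative):

```python
from statistics import mean, stdev

def forecast_band(history: list) -> dict:
    """Naive next-month band: mean +/- 2*stdev of recent monthly values."""
    mu, sigma = mean(history), stdev(history)
    return {"expected": mu,
            "low": max(0.0, mu - 2 * sigma),
            "high": mu + 2 * sigma}

mttr_history = [48.0, 52.0, 45.0, 50.0]  # minutes per month, illustrative
band = forecast_band(mttr_history)
```

Recording the band each month, then comparing it against the observed value, is what produces the "confidence level of forecast" line honestly.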
Alert-to-metric reconciliation
Weekly reconcile:
- alerts that triggered but did not map to incidents
- incidents discovered without corresponding alert
- breaches not represented on dashboard
Gaps here indicate monitoring-model drift.
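All three gaps reduce to set differences over IDs. A sketch assuming alerts carry IDs and each incident record optionally links one alert; the field names are hypothetical:

```python
def reconcile(fired_alerts: set, incidents: list,
              breaches: set, dashboard_breaches: set) -> dict:
    """Weekly reconciliation: the three gap classes listed above."""
    linked_alerts = {i["alert_id"] for i in incidents if i.get("alert_id")}
    return {
        "alerts_without_incident": fired_alerts - linked_alerts,
        "incidents_without_alert": [i["id"] for i in incidents
                                    if not i.get("alert_id")],
        "breaches_not_on_dashboard": breaches - dashboard_breaches,
    }

gaps = reconcile(
    fired_alerts={"A1", "A2"},
    incidents=[{"id": "INC-1", "alert_id": "A1"}, {"id": "INC-2"}],
    breaches={"B1", "B2"},
    dashboard_breaches={"B1"},
)
```

Any non-empty result is a monitoring-model drift signal worth a ticket, not just a dashboard footnote.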