## Why resilience metrics matter
Drills and runbooks are necessary but not sufficient. Without metrics, teams cannot tell whether resilience is improving or degrading.
## Core resilience metric stack

| Metric | What it measures | Typical source |
|---|---|---|
| MTTD (mean time to detect) | detection latency | alert timeline |
| MTTC (mean time to contain) | time to containment decision | incident decision log |
| MTTR (mean time to restore) | time to restore stable service | deploy + verification logs |
| Verification freshness | age of final proof before closure | command evidence records |
| Follow-up closure rate | % of hardening items closed on time | hardening backlog |
## SLO model by severity class
- SEV-1: containment decision and rollback path triggered under strict time budget
- SEV-2: user-impacting degradation stabilized within team-defined window
- SEV-3: corrective release completed within planned cycle
Write SLO targets as explicit numbers, not adjectives.
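One way to keep targets explicit is to store them as literal numbers in version control. A minimal sketch: the severity classes mirror the list above, but the phase names and minute values here are illustrative placeholders, not recommendations.

```python
# Hypothetical SLO targets in minutes -- explicit numbers, not adjectives.
# Phase names and values are placeholders for a team's own budget.
SLO_TARGETS_MIN = {
    "SEV-1": {"containment": 15, "rollback_trigger": 30},
    "SEV-2": {"stabilization": 240},
    "SEV-3": {"corrective_release": 7 * 24 * 60},  # one planned weekly cycle
}

def within_slo(severity: str, phase: str, elapsed_min: float) -> bool:
    """Return True if elapsed time meets the numeric target for this phase."""
    return elapsed_min <= SLO_TARGETS_MIN[severity][phase]
```

Because the targets are plain data, the same table can drive dashboards, alerts, and review docs without drifting apart.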
## Data collection protocol
For every incident, record:
- start timestamp
- first alert timestamp
- first mitigation decision timestamp
- stable-state confirmation timestamp
- closure timestamp
Missing timestamps invalidate trend analysis.
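The five timestamps above are enough to derive the core durations. A sketch, assuming hypothetical record keys and ISO-8601 strings; a missing timestamp is rejected outright rather than silently skipped, since gaps invalidate trend analysis:

```python
from datetime import datetime

def resilience_durations(record: dict) -> dict:
    """Compute MTTD/MTTC/MTTR in minutes from one incident's timestamps.

    Assumes hypothetical keys holding ISO-8601 strings: start, first_alert,
    first_mitigation_decision, stable_confirmed, closed.
    """
    required = ["start", "first_alert", "first_mitigation_decision",
                "stable_confirmed", "closed"]
    missing = [k for k in required if not record.get(k)]
    if missing:
        # Refuse to emit partial metrics: they would poison the trend data.
        raise ValueError(f"missing timestamps: {missing}")
    t = {k: datetime.fromisoformat(record[k]) for k in required}

    def minutes(a: str, b: str) -> float:
        return (t[b] - t[a]).total_seconds() / 60

    return {
        "mttd_min": minutes("start", "first_alert"),
        "mttc_min": minutes("start", "first_mitigation_decision"),
        "mttr_min": minutes("start", "stable_confirmed"),
    }
```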
## Weekly resilience review
Every week:
- review outliers in MTTD/MTTR
- inspect missed or delayed follow-ups
- map failures to control backlog buckets
- assign owner and deadline for top regressions
## Threshold-driven escalation rules
Define red/yellow/green thresholds for each metric. When red is hit:
- open escalation immediately
- assign reliability owner
- force next-week re-check
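The classification and the red-state actions can both be pure functions, which keeps the rules testable. A sketch assuming a higher-is-worse metric (such as MTTR in minutes); threshold values would come from the team's own definitions:

```python
def classify(value: float, yellow: float, red: float) -> str:
    """Map a metric value to green/yellow/red.

    Assumes higher is worse (e.g. MTTR in minutes); invert for
    rate-style metrics where lower is worse.
    """
    if value >= red:
        return "red"
    if value >= yellow:
        return "yellow"
    return "green"

def on_breach(state: str) -> list[str]:
    """Return the mandatory red-state actions; empty list otherwise."""
    if state != "red":
        return []
    return [
        "open escalation",
        "assign reliability owner",
        "schedule next-week re-check",
    ]
```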
## Dashboard design rules
- show trend, not just latest value
- separate severity classes
- include denominator/context
- link each spike to incident record
Metrics without context create false narratives.
## Quarterly calibration
Every quarter:
- raise SLO targets only after stable attainment
- retire metrics that do not influence decisions
- add one metric for newly observed failure class
A smaller useful dashboard beats a large ignored dashboard.
## Advanced anti-patterns

**Reporting only averages.** Averages hide tail-risk behavior.

**SLO set without ownership.** Unowned SLOs become decorative numbers.

**Closing incidents with no freshness check.** Old evidence cannot support current closure confidence.
## Quick checklist
Before monthly reliability review:
- metric definitions documented
- severity-specific SLOs visible
- threshold breaches mapped to owners
- follow-up closure trend reviewed
Claude helps teams move fast. Metrics ensure they improve safely.
## Metric dictionary (required fields)
Define each metric with the same schema:
### Metric Definition
- Name:
- Purpose:
- Formula:
- Data source:
- Collection cadence:
- Owner:
- Red threshold:
- Yellow threshold:
- Expected action on breach:

Ambiguous metric definitions create endless debate during incidents.
## Error-budget style SLO policy
For each severity class, define an operational budget:
- allowed breach count per quarter
- mandatory escalation threshold
- freeze rule when budget is exhausted
### Example policy
- SEV-1: zero tolerance for missed containment window
- SEV-2: two breaches per quarter before mandatory control review
- SEV-3: tracked for trend, not immediate freeze
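This example policy can be encoded so the freeze rule fires mechanically rather than by debate. The `BREACH_BUDGET` values mirror the example above; `None` marks trend-only tracking:

```python
# Per-quarter breach budgets matching the example policy:
# SEV-1 zero tolerance, SEV-2 two breaches, SEV-3 trend-only (None).
BREACH_BUDGET = {"SEV-1": 0, "SEV-2": 2, "SEV-3": None}

def budget_state(severity: str, breaches_this_quarter: int) -> str:
    """Return 'ok', 'review', or 'freeze' given the quarter's breach count."""
    budget = BREACH_BUDGET[severity]
    if budget is None:
        return "ok"      # tracked for trend, never triggers a freeze
    if breaches_this_quarter > budget:
        return "freeze"  # budget exhausted: freeze rule applies
    if breaches_this_quarter == budget and budget > 0:
        return "review"  # at the limit: mandatory control review
    return "ok"
```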
## Trend review prompts (weekly)
Use consistent prompts in weekly review:
- Which metric moved most versus baseline?
- Is the movement signal or noise (sample size check)?
- Which owner needs to act this week?
- Which control portfolio bucket receives the action?
- What result should be visible by next review?
## Escalation mapping table
| Breach type | Immediate owner | Secondary owner | SLA for response |
|---|---|---|---|
| Detection breach (MTTD red) | observability owner | incident commander | 24h |
| Decision delay (MTTC red) | incident commander | release owner | same day |
| Recovery delay (MTTR red) | platform owner | service owner | 24h |
| Freshness breach | verifier owner | commander | same day |
| Follow-up closure breach | reliability owner | team lead | 72h |
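Keeping this table as a lookup in code makes routing deterministic instead of tribal knowledge. The breach-type keys below are hypothetical identifiers chosen for this sketch:

```python
# Escalation routing table: breach type -> (immediate owner, secondary, SLA).
# Keys are hypothetical identifiers; rows mirror the mapping table above.
ESCALATION = {
    "mttd_red":         ("observability owner", "incident commander", "24h"),
    "mttc_red":         ("incident commander", "release owner", "same day"),
    "mttr_red":         ("platform owner", "service owner", "24h"),
    "freshness":        ("verifier owner", "commander", "same day"),
    "followup_closure": ("reliability owner", "team lead", "72h"),
}

def route(breach_type: str) -> dict:
    """Return the owners and response SLA for a breach type."""
    owner, secondary, sla = ESCALATION[breach_type]
    return {
        "immediate_owner": owner,
        "secondary_owner": secondary,
        "response_sla": sla,
    }
```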
## Executive summary format (monthly)
### Monthly Resilience Summary
- Top improving metric:
- Top regressing metric:
- Repeated breach classes:
- Controls added this month:
- Controls retired this month:
- Ownership risks:
- Next-month focus:

Keep this short and decision-oriented.
## Data quality checks
Before trusting metric dashboards, verify:
- missing timestamps ratio
- duplicate incident IDs
- inconsistent severity labels
- stale data source refresh time
A precise metric on broken data is still misleading.
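A minimal sketch of these checks over a list of incident records, assuming hypothetical `id`, `start`, and `severity` fields:

```python
def data_quality_report(incidents: list[dict]) -> dict:
    """Run basic data-quality checks on raw incident records.

    Assumes hypothetical fields per record: id, start, severity.
    """
    n = len(incidents)
    ids = [i.get("id") for i in incidents]
    missing_ts = sum(1 for i in incidents if not i.get("start"))
    valid_sev = {"SEV-1", "SEV-2", "SEV-3"}
    bad_sev = sum(1 for i in incidents if i.get("severity") not in valid_sev)
    return {
        "missing_timestamp_ratio": missing_ts / n if n else 0.0,
        "duplicate_id_count": len(ids) - len(set(ids)),
        "inconsistent_severity_count": bad_sev,
    }
```

Running this before the weekly review turns "the dashboard looks wrong" into a concrete data-repair task.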
## Advanced anti-gaming rules
- never grade teams by single metric rank
- require evidence links for major metric improvements
- review tail percentiles before celebrating averages
- tie rewards to sustained trend, not one-week spikes
This preserves metric integrity under organizational pressure.
## Metric review board operating rule
Run a monthly reliability board with three outputs only:
- keep — metric still drives action
- change — metric definition/threshold needs revision
- remove — metric has no decision value
This avoids dashboard sprawl.
## Tail-risk tracking
In addition to average values, track:
- p90 / p95 / p99 for MTTD and MTTR
- longest open follow-up age
- worst-severity breach recurrence interval
Tail views expose the incidents that matter most.
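At incident-count sample sizes, tail percentiles need no external libraries. A nearest-rank sketch with illustrative MTTR samples (the minute values are made up for the example):

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile; adequate for small incident samples."""
    ranked = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# Illustrative MTTR samples in minutes; note the single long tail incident.
mttr_samples = [12, 15, 20, 22, 30, 35, 40, 55, 90, 240]
tails = {p: percentile(mttr_samples, p) for p in (90, 95, 99)}
```

Here the average is about 56 minutes while p99 is 240: exactly the tail behavior that average-only reporting hides.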
## SLO breach playbook

When a breach occurs:
- open breach record within same day
- assign owner and verifier
- define corrective control candidate
- set review checkpoint within 7 days
Close breach records only with evidence of control effect.
## Metric retirement criteria
Retire a metric when all are true:
- no action taken from it for 2 quarters
- overlaps strongly with another metric
- stakeholders cannot explain how they use it
Retire with a note, not silent deletion.
## Metric-to-action contract
Every tracked metric must have a predefined action path.
| Metric state | Mandatory action | Owner |
|---|---|---|
| Green stable | monitor only | metric owner |
| Yellow drift | open investigation note | reliability owner |
| Red breach | execute escalation playbook | commander + service owner |
A metric without an action contract is decorative.
## SLO negotiation rubric

When teams disagree on SLO targets, resolve the dispute with this rubric:
- customer impact severity
- current system capability baseline
- reversibility of failures in that domain
- operational cost to meet tighter target
Choose SLOs by risk economics, not optimism.
## Data reliability checks for dashboards
Run weekly checks on the measurement system itself:
- timestamp completeness ratio
- severity label consistency
- duplicate incident record rate
- source refresh delay
A resilient team measures both service reliability and metric reliability.
## Executive narrative template
Tie metrics to action every month:
- what degraded
- what control was added
- what improved after control
- what remains high-risk
- who owns next correction
Leadership needs this chain to fund the right fixes.
## Metric ownership rotation policy
Rotate secondary metric owners quarterly while keeping one stable primary owner.
- primary owner keeps continuity
- rotating secondary owner provides fresh challenge and catches blind spots
This prevents metric stagnation.
## Forecasting with resilience trends
Add a monthly forecast section:
- expected MTTD/MTTR band next month
- top breach risk by severity class
- confidence level of forecast
- planned controls influencing forecast
Forecasting turns metrics from reporting into planning.
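A deliberately naive band forecast, assuming a few months of history is available: mean plus or minus k standard deviations. This ignores trend and seasonality, so treat it as a starting point for the forecast section, not a model.

```python
from statistics import mean, stdev

def forecast_band(monthly_values: list[float], k: float = 1.0) -> tuple[float, float]:
    """Naive next-month band for a duration metric (e.g. MTTR in minutes).

    Sketch only: mean +/- k*stdev over recent months, floored at zero.
    Needs at least two data points for stdev.
    """
    m, s = mean(monthly_values), stdev(monthly_values)
    return (max(0.0, m - k * s), m + k * s)
```

A wide band is itself a signal: it says the forecast confidence line in the monthly report should be low.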
## Alert-to-metric reconciliation

Reconcile weekly:
- alerts that triggered but did not map to incidents
- incidents discovered without corresponding alert
- breaches not represented on dashboard
Gaps here indicate monitoring-model drift.
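All three reconciliation gaps reduce to set differences once each system exports IDs. A sketch; every parameter name here is hypothetical:

```python
def reconcile(alerted: set[str], incidents: set[str],
              breaches: set[str], dashboard: set[str]) -> dict:
    """Weekly reconciliation as set differences.

    Hypothetical inputs: `alerted` = incident IDs reachable from alerts,
    `incidents` = all incident IDs, `breaches` = recorded breach IDs,
    `dashboard` = breach IDs currently shown on the dashboard.
    """
    return {
        "alerts_without_incident": alerted - incidents,
        "incidents_without_alert": incidents - alerted,
        "breaches_missing_from_dashboard": breaches - dashboard,
    }
```

Any non-empty set in the result is a concrete monitoring-model drift item for the weekly review.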