
Gemini Incident Recovery Playbook

Advanced incident response operating model for Gemini CLI teams handling production regressions with reversible controls.

Tags: advanced · incident-response · operations · release

Official References: Get Started · CLI Commands · Sub-agents · Skills

Why this playbook exists

Production incidents punish teams that cannot separate speed from control.

This playbook defines how Gemini CLI teams recover quickly without losing auditability.

Severity matrix

| Level | Typical impact | Default response |
| ----- | -------------- | ---------------- |
| SEV-1 | core user journey or data trust compromised | rollback-first evaluation |
| SEV-2 | major workflow degraded | controlled mitigation + hotfix lane |
| SEV-3 | isolated regression | scheduled corrective release |
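As a minimal sketch, the severity matrix can be encoded as a lookup so tooling can suggest the default posture automatically. The function name and the fallback-to-SEV-1 behavior for unknown levels are assumptions for illustration, not part of any Gemini CLI API.

```python
# Default response per severity level, taken from the matrix above.
DEFAULT_RESPONSE = {
    "SEV-1": "rollback-first evaluation",
    "SEV-2": "controlled mitigation + hotfix lane",
    "SEV-3": "scheduled corrective release",
}

def default_response(severity: str) -> str:
    """Return the default posture; unknown levels conservatively get SEV-1 handling."""
    return DEFAULT_RESPONSE.get(severity.upper(), DEFAULT_RESPONSE["SEV-1"])
```

Treating unrecognized severities as SEV-1 is deliberately conservative: a mislabeled incident should fail toward rollback-first, not toward a scheduled fix.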

First-15-minutes standard

  • assign incident owner and deputy
  • freeze risky merges touching the affected domain
  • capture baseline evidence (failing command, logs, blast radius)
  • open lane tracker with next-owner fields

If next-owner fields are missing, escalation quality collapses.
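The next-owner requirement is easy to enforce mechanically by validating lane tracker entries before they are accepted. The field names below are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class LaneEntry:
    lane: str        # e.g. "triage", "mitigation"
    owner: str       # current owner of this lane
    next_owner: str  # who picks this lane up at the next checkpoint

def validate_entry(entry: LaneEntry) -> list[str]:
    """Return a list of problems; an empty list means the entry is acceptable."""
    problems = []
    if not entry.owner.strip():
        problems.append("missing owner")
    if not entry.next_owner.strip():
        problems.append("missing next-owner field")
    return problems
```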

Four-lane incident structure

  • Triage lane: classify impact and confidence
  • Mitigation lane: feature flag / hotfix / rollback candidate
  • Verification lane: reproducible reruns and safety checks
  • Comms lane: update cadence for internal/external stakeholders

Keep lane write scopes disjoint whenever possible.
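Disjoint write scopes can be checked rather than hoped for. A sketch, assuming each lane declares the set of paths it intends to write:

```python
def scope_conflicts(lane_scopes: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Report pairs of lanes whose declared write scopes overlap."""
    conflicts = []
    lanes = sorted(lane_scopes)
    for i, a in enumerate(lanes):
        for b in lanes[i + 1:]:
            overlap = lane_scopes[a] & lane_scopes[b]
            if overlap:
                conflicts.append((a, b, overlap))
    return conflicts
```

Any non-empty result means two lanes will fight over the same files and an ownership call is needed before work starts.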

Decision policy

Prefer the most reversible option that restores user safety first.

Decision order:

  1. disable risky path by flag/config
  2. targeted patch with bounded scope
  3. full rollback to known-good revision
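The decision order can be applied mechanically: walk the options from most to least reversible and take the first one that is actually executable right now. The option names here are illustrative.

```python
# Candidate mitigations ordered most-reversible first, mirroring the decision order above.
DECISION_ORDER = ["flag_disable", "targeted_patch", "full_rollback"]

def choose_mitigation(available: set[str]) -> str:
    """Pick the most reversible option that is executable at this moment."""
    for option in DECISION_ORDER:
        if option in available:
            return option
    raise RuntimeError("no executable mitigation path; escalate immediately")
```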

Incident evidence packet

At every major decision point, save:

  • severity, owner, timestamp
  • impacted systems and users
  • executed commands and outputs
  • residual risk statement
  • next checkpoint time
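A completeness check keeps the evidence packet honest at each decision point. The field keys below are one possible encoding of the list above, not a mandated schema.

```python
# Required keys map to the evidence packet items above.
REQUIRED_FIELDS = (
    "severity", "owner", "timestamp",
    "impacted", "commands", "residual_risk", "next_checkpoint",
)

def missing_fields(packet: dict) -> list[str]:
    """List required evidence fields that are absent or empty."""
    return [f for f in REQUIRED_FIELDS if not packet.get(f)]
```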

Post-incident hardening loop

Within one day:

  • publish timeline with key decision pivots
  • add at least one guardrail automation
  • add one regression test for root cause class
  • assign follow-up owner and due date

No hardening task means the incident is likely to repeat.

Advanced anti-patterns

Parallel edits in shared files without ownership map

This creates merge pressure exactly when speed matters most.

Declaring "resolved" from intuition only

Use rerun evidence, not confidence language.

Communication updates only at incident end

Stakeholders need cadence, not one big summary.

Quick checklist

Before closure:

  • severity and owner logged
  • mitigation path justified
  • rerun evidence attached
  • hardening follow-up assigned

Gemini can accelerate response. This playbook keeps the response governable.

Scenario library (real pressure patterns)

Use this library to practice the playbook under realistic pressure instead of ideal paths.

Scenario A — Auth token failure after release

  • Signal: sudden spike in 401/403 from one service after deployment.
  • Risk: support load surge + customer trust degradation.
  • Decision branch: feature-flag rollback vs targeted hotfix.
  • Evidence minimum: failing endpoint list, first bad deploy hash, rollback safety check output.
  • Owner handoff: investigation owner → mitigation owner → verification owner.

Scenario B — Billing state mismatch

  • Signal: dashboard revenue events diverge from transaction ledger.
  • Risk: financial and legal exposure.
  • Decision branch: immediate write freeze vs selective replay.
  • Evidence minimum: mismatch sample size, blast radius estimate, safe replay boundary.
  • Owner handoff: data owner + incident commander dual sign-off required.

Scenario C — Permission leak in edge flow

  • Signal: low-volume but severe authorization bypass report.
  • Risk: security incident escalation.
  • Decision branch: emergency rollback and access revocation first, then patch.
  • Evidence minimum: reproducible minimal case, revocation completion proof, audit trail snapshot.

War-room artifact bundle (copy/paste)

### Incident Command Snapshot
- Incident ID:
- Severity:
- Incident commander:
- Current phase: detection | containment | recovery | hardening
- Latest stable revision:
- Candidate mitigation path:
- Risk if we wait 30 minutes:
- Next checkpoint at:
- Next owner:

### Checkpoint Decision Record
- Timestamp:
- Evidence reviewed:
- Decision: continue mitigation | rollback | escalate+pause
- Why this decision now:
- Rejected alternatives:
- Owner for execution:
- Owner for verification:

Escalation message templates

Internal engineering escalation

[INCIDENT][SEV-X] <short summary>
Impact: <users/systems>
Current decision: <path>
Immediate ask: <approval/resource>
Next update: <time>
Owner: <name>

Stakeholder update

Status: investigating/mitigating/recovered
Customer impact: <plain language>
Current mitigation: <plain language>
Known unknowns: <top 1-2>
Next committed update time: <time>

Closure acceptance criteria (hard gate)

Close incident response only when all conditions are true:

  1. mitigation path executed and verified with fresh command evidence
  2. rollback trigger and owner are still documented for 24h watch window
  3. residual risks are explicit (not "none" by default)
  4. hardening backlog item owners and due dates are assigned
  5. communications timeline is complete and reviewable

If one condition is missing, status stays active.
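The hard gate above is a conjunction, which makes it straightforward to automate: closure is allowed only when every condition is present and true. The gate names are illustrative.

```python
def may_close(gates: dict[str, bool]) -> bool:
    """Closure is allowed only when every hard-gate condition holds."""
    required = {
        "mitigation_verified",        # condition 1
        "rollback_watch_documented",  # condition 2
        "residual_risks_explicit",    # condition 3
        "hardening_owners_assigned",  # condition 4
        "comms_timeline_complete",    # condition 5
    }
    # Missing keys count as unmet, so an incomplete record cannot slip through.
    return required <= gates.keys() and all(gates[g] for g in required)
```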

30-minute post-closure review

Run a short review immediately after closure:

  • what signal was first but ignored?
  • which decision checkpoint took longest?
  • where did ownership become ambiguous?
  • what one guardrail would have reduced resolution time most?
  • what one metric should be added or redefined?

Turn answers into backlog items before context fades.

Provider-specific readiness checks

Before running this playbook, enforce these prechecks:

  • Repository state: identify last known-good revision and confirm rollback path is executable now.
  • Owner graph: commander, mitigation owner, verifier owner, and communications owner are all named.
  • Evidence channel: one canonical thread/document where all checkpoint records are written.

Minimal precheck command set

```shell
git rev-parse --short HEAD
git log --oneline -n 5
# add your service health check here
```

Store command output in the first checkpoint record.
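One way to fold the precheck output into that first checkpoint record is a small formatter; the record layout here is an assumption, shaped to match the war-room snapshot fields.

```python
from datetime import datetime, timezone

def first_checkpoint(head: str, recent_log: str, health: str) -> str:
    """Format precheck command output into the first checkpoint record."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return "\n".join([
        f"Timestamp: {stamp}",
        f"HEAD: {head.strip()}",
        "Recent commits:",
        recent_log.strip(),
        f"Service health: {health.strip()}",
    ])
```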

Decision quality guardrails

For each checkpoint decision, require three statements:

  1. reversibility statement — how quickly can we undo this decision?
  2. blast-radius statement — what can get worse if this is wrong?
  3. verification statement — what exact signal proves this worked?

If one statement is missing, decision quality is below advanced standard.
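The three-statement rule is another conjunction that a review bot or pre-merge hook could enforce; the keys below are assumed names for the three statements.

```python
# The three statements every checkpoint decision must carry.
GUARDRAIL_STATEMENTS = ("reversibility", "blast_radius", "verification")

def meets_advanced_standard(decision: dict[str, str]) -> bool:
    """A checkpoint decision qualifies only if all three statements are written."""
    return all(decision.get(s, "").strip() for s in GUARDRAIL_STATEMENTS)
```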

Handoff acceptance test

Before handing to next owner, verify:

  • scope boundary is explicit (what is included/excluded)
  • unresolved unknowns are listed (not hidden in chat history)
  • next checkpoint time is committed
  • failure trigger for immediate rollback is written

Use this quick test to reduce context loss across lanes.

60-minute command timeline blueprint

Use this timeline when severity is unclear but impact is real.

  • T+00–10: confirm severity hypothesis, freeze risky merges, assign commander + deputy.
  • T+10–20: establish mitigation branch options and rollback readiness.
  • T+20–35: execute one branch decisively; avoid parallel contradictory mitigations.
  • T+35–50: run verification and compare against pre-incident baseline.
  • T+50–60: publish decision note and commit the next checkpoint schedule.

The point is not speed alone; it is synchronized decision quality under pressure.

Dependency risk matrix

Classify affected dependencies before choosing mitigation.

| Dependency class | Failure symptom | Default incident posture |
| ---------------- | --------------- | ------------------------ |
| Auth/identity | access denial or bypass | rollback-first + audit capture |
| Billing/ledger | transaction mismatch | write freeze + reconciliation boundary |
| Messaging/queue | lag and duplicate processing | flow throttling + replay guard |
| Observability | missing/late signals | conservative rollback threshold |

This matrix prevents overconfident “single-fix” decisions.

Contradiction handling protocol

If two owners provide conflicting mitigation recommendations:

  1. capture both recommendations in one decision record
  2. score reversibility and blast radius for each
  3. choose the more reversible path unless evidence disproves it
  4. schedule a rapid reassessment checkpoint (<=10 minutes)

Conflicts are normal. Unstructured conflicts are dangerous.
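Steps 2 and 3 of the protocol can be sketched as a simple scoring rule: rate each recommendation for reversibility and blast radius, then prefer the path that is easier to undo. The 1-5 scales and field names are assumptions for illustration.

```python
def pick_recommendation(a: dict, b: dict) -> dict:
    """Choose between two conflicting mitigation recommendations.

    Each dict carries 'reversibility' (1-5, higher = easier to undo) and
    'blast_radius' (1-5, higher = worse if wrong).
    """
    def score(rec: dict) -> int:
        # Reward reversibility, penalize blast radius (protocol steps 2-3).
        return rec["reversibility"] - rec["blast_radius"]
    return max((a, b), key=score)
```

The chosen path still gets a rapid reassessment checkpoint; the score settles the tie, it does not end the conversation.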

Decision confidence ladder

Tag each checkpoint decision with confidence:

  • L1 (low): incomplete evidence, reversible action only
  • L2 (medium): partial evidence, bounded mitigation allowed
  • L3 (high): consistent evidence, broader rollout or closure allowed

Never close incident response with L1 confidence.
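The "never close with L1" rule translates directly into a gate: closure needs at least L2 confidence, and an unrecognized tag should fail closed. A minimal sketch:

```python
CONFIDENCE_LEVELS = {"L1": 1, "L2": 2, "L3": 3}

def closure_allowed(confidence: str) -> bool:
    """Closure requires at least L2; unknown tags fail closed (treated below L1)."""
    return CONFIDENCE_LEVELS.get(confidence, 0) >= 2
```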

Recovery branch strategy matrix

Choose branch strategy deliberately:

| Branch type | When to use | Risk |
| ----------- | ----------- | ---- |
| Hotfix branch | isolated code path regression | hidden side-effect risk |
| Rollback branch | broad uncertainty or safety risk | reintroducing known debt |
| Containment branch | partial mitigation while investigating | prolonged temporary state |

Do not keep all three branch types active at once unless each has a designated owner.

Incident closure review package

Before final closure, produce one package containing:

  • timeline (key timestamps + decisions)
  • mitigation diff summary
  • verification command bundle
  • residual risk register
  • hardening backlog links

This package should let a new owner understand the incident in 10 minutes.

Leadership handoff note template

### Incident Leadership Note
- What happened:
- Why this decision path was chosen:
- Current confidence level (L1/L2/L3):
- What remains uncertain:
- What we need from leadership:
- Owner for next 24h watch:

Short, explicit leadership notes reduce re-litigation of decisions.
