The Hard Parts.dev
EP-25 · Operations · Engineering Playbook

Run an incident review that actually helps

Turn an incident review into a system-learning exercise that explains what happened, why it made sense at the time, what conditions enabled it, and what changes will reduce recurrence.

Difficulty
medium-high
Time horizon
1 to 5 business days after containment, with follow-through over weeks
Primary owner
incident lead
Confidence
high
At a glance · EP-25
Situation
An incident happened and the team needs to learn from it usefully.
Goal
Improve the system, operating model, and team judgment after an incident rather than producing blame, theater, or generic action items.
Do not use when
the incident is still live and response work is not finished
Roles involved

  • incident lead
  • service owner
  • engineering manager
  • tech lead
  • operations or SRE
  • people directly involved in diagnosis and response
  • stakeholder representative when useful

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • A service incident, degraded release, security event, or serious near-miss occurred
  • The same class of issue has happened more than once
  • Stakeholders want to know what changed after the event
  • The team needs a shared factual account of what happened

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

A weak incident review teaches the wrong lesson: hide uncertainty, compress the story, blame a person, move on. A strong review increases operational truth, system understanding, and future response quality.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs that the playbook was run well

  • The review explains the timeline clearly without oversimplifying
  • Human decisions are described in context, not as hindsight caricatures
  • Technical causes and organizational conditions are both visible
  • Actions are few, concrete, and connected to the real failure chain
  • People leave with more trust in the learning process, not less

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Incident timeline
  • Logs, traces, alerts, dashboards
  • Deployment and config changes
  • Communication channels used during the incident
  • Customer or stakeholder impact summary
  • Previous incidents of similar shape

Prerequisites

Conditions that should be true for this to work.

  • The incident is contained or stable
  • Basic evidence has been preserved
  • Someone can facilitate neutrally enough to keep the review truthful

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theater.

  1. Reconstruct the timeline before debating causes

    Create a shared factual base before interpretation diverges.

    Actions

    • Build a timeline of events, signals, decisions, and observable impact
    • Separate what was known at the time from what became clear later
    • Include detection, escalation, mitigation, rollback, and communication events

    Outputs

    • Incident timeline
  2. Explain why the failure path was possible

    Move beyond the final error into enabling conditions.

    Actions

    • Identify technical contributors, missing controls, unclear ownership, or weak observability
    • Ask why the system allowed this path to remain open
    • Distinguish trigger, amplifier, and recovery friction

    Outputs

    • Failure chain analysis
  3. Describe human behavior in context

    Preserve learning without blame reductionism.

    Actions

    • Describe what responders saw, believed, and prioritized at each stage
    • Capture ambiguity, overload, handoff confusion, and missing context
    • Avoid rewriting decisions with hindsight certainty

    Outputs

    • Human context analysis
  4. Choose a small set of corrective moves

    Make the review operationally useful.

    Actions

    • Select changes that close or reduce the highest-leverage failure paths
    • Separate immediate fixes from systemic follow-up
    • Assign owners and evidence for completion

    Outputs

    • Incident action set
  5. Publish the learning clearly

    Turn the review into shared operational memory.

    Actions

    • Write a concise review artifact for engineers and stakeholders
    • Link actions to observed failure conditions
    • Review follow-through at agreed checkpoints

    Outputs

    • Incident review document
    • Follow-up tracker
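Step 1's core move, merging events from several sources into one chronological record while keeping hindsight facts separate, can be sketched as a small script. This is a minimal sketch, not a prescribed tool: the event fields, sources, and sample incidents below are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEvent:
    at: datetime
    source: str                     # e.g. "alert", "chat", "deploy", "dashboard"
    description: str
    known_at_the_time: bool = True  # False for facts established only in hindsight

def build_timeline(*event_streams):
    """Merge events from several sources into one chronological timeline."""
    merged = [event for stream in event_streams for event in stream]
    return sorted(merged, key=lambda event: event.at)

# Hypothetical events from three sources
alerts = [TimelineEvent(datetime(2024, 5, 1, 14, 2), "alert", "p99 latency breach")]
deploys = [TimelineEvent(datetime(2024, 5, 1, 13, 55), "deploy", "config change shipped")]
hindsight = [TimelineEvent(datetime(2024, 5, 1, 13, 55), "analysis",
                           "config change removed a retry limit",
                           known_at_the_time=False)]

timeline = build_timeline(alerts, deploys, hindsight)
for event in timeline:
    tag = "" if event.known_at_the_time else " [hindsight]"
    print(f"{event.at:%H:%M} {event.source}: {event.description}{tag}")
```

Tagging hindsight facts explicitly is what lets the review separate "what was known at the time" from "what became clear later" without rewriting the record.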

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • What was the trigger versus the enabling condition?
  • Which conditions most increased impact or slowed recovery?
  • What action would most reduce recurrence or blast radius?
  • What lessons should change guidance, tooling, or ownership norms?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What did we know at each stage versus what we know now?
  • Why did this make sense to responders in the moment?
  • What single condition most increased impact or slowed recovery?

Common mistakes

Patterns that surface across teams running this playbook.

  • Stopping at a shallow root cause
  • Writing an incident story that makes responders look irrational in hindsight
  • Creating too many weak actions
  • Equating document completion with learning completion
  • Publishing a sanitized version upward and a different truth locally

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • The review blames one person or one deploy and ends there
  • Actions sound generic, such as "improve monitoring" or "communicate better"
  • The same class of incident keeps recurring with new labels
  • People become less willing to speak openly after reviews

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Incident timeline
  • Failure chain analysis
  • Human context analysis
  • Incident action set
  • Incident review document
  • Follow-up tracker
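The follow-up tracker artifact can be as lightweight as one structured record per action, carrying the owner, a checkpoint date, and the evidence of completion that step 4 asks for. A minimal sketch, with hypothetical field names and sample actions:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class FollowUpAction:
    description: str
    owner: str
    checkpoint: date                # when follow-through is reviewed
    evidence: Optional[str] = None  # link or note proving completion

    @property
    def complete(self) -> bool:
        # "Done" means evidence exists, not that someone remembers doing it
        return self.evidence is not None

# Hypothetical action set from a review
actions = [
    FollowUpAction("Add rollback guard to config pipeline", "service owner",
                   date(2024, 6, 1), evidence="PR merged, canary verified"),
    FollowUpAction("Clarify on-call escalation ownership", "engineering manager",
                   date(2024, 6, 15)),
]

overdue = [a for a in actions if not a.complete and a.checkpoint < date.today()]
completion_rate = sum(a.complete for a in actions) / len(actions)
```

Requiring evidence rather than a checkbox is what makes "action completion rate with evidence" (see the metrics below) measurable at all.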

Success signals

Observable changes that mean the playbook landed.

  • Repeat incidents in the same class decline
  • Responders feel the account is fair and accurate
  • Stakeholders can see what changed after the incident
  • Future response gets faster or clearer in the affected areas

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Fold durable learnings into runbooks, onboarding, and release guidance
  • Promote repeated review themes into architecture or ownership work
  • Check whether action items reduced the targeted risk in practice

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Repeat incident rate by class
  • Time to detect and mitigate in similar incidents
  • Action completion rate with evidence
  • Stakeholder confidence in incident handling
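The first metric, repeat incident rate by class, only works if incidents carry a class label. A minimal sketch of the computation, assuming a simple tagged incident log; the class names and dates are illustrative:

```python
from collections import Counter
from datetime import date

# Hypothetical incident log: (date, incident class) pairs
incidents = [
    (date(2024, 1, 10), "config-rollout"),
    (date(2024, 2, 3),  "config-rollout"),
    (date(2024, 2, 20), "capacity"),
    (date(2024, 4, 1),  "config-rollout"),
]

def repeat_rate_by_class(incidents):
    """Fraction of incidents in each class beyond the first occurrence.

    A class seen once scores 0.0; every recurrence raises the rate toward 1.0.
    """
    counts = Counter(cls for _, cls in incidents)
    return {cls: (n - 1) / n for cls, n in counts.items()}

rates = repeat_rate_by_class(incidents)
# config-rollout: 2 repeats out of 3 incidents -> ~0.67; capacity: no repeats -> 0.0
for cls, rate in sorted(rates.items()):
    print(f"{cls}: {rate:.2f}")
```

Tracking this over time (per quarter, per class) is what reveals whether the same class of incident "keeps recurring with new labels", which the warning signs above call out.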

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Assembling timelines from chat, logs, and deployment events
  • Clustering symptoms and correlating signals across large evidence sets
  • Drafting first-pass incident summaries and action templates

AI can make worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Compressing nuance into overly clean narratives
  • Masking uncertainty with confident summaries
  • Making a weak review look complete before the hard causal work is done

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.