The Hard Parts.dev
EP-26 · Operations · Engineering Playbook

Write a useful runbook

Write a runbook for real operational use: quick orientation, clear triggers, diagnostic paths, safe actions, escalation criteria, and links to trustworthy deeper context.

Difficulty
medium
Time horizon
hours to days for an initial version, ongoing refinement afterward
Primary owner
service owner
Confidence
high
At a glance · EP-26
Situation
A team needs operational guidance that works under pressure.
Goal
Enable responders to act more safely and consistently during operational work, especially when stress, ambiguity, or unfamiliarity is high.
Do not use when
the task is so unstable that a fixed runbook would be fiction
Primary owner
service owner
Roles involved

  • service owner
  • on-call responders
  • SRE or operations partner
  • tech lead
  • engineering manager when ownership or escalation is unclear

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • A service or recurring operational process lacks clear response guidance
  • On-call responders depend too much on one expert
  • Incidents repeatedly start with the same confusion
  • New engineers need operational ramp-up support

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

A runbook is not a technical essay. Under pressure, responders need orientation, triage clues, safe next moves, and escalation boundaries. Bad runbooks create false confidence or go unread; good ones reduce ambiguity and time-to-useful-action.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

  • A responder can scan the runbook quickly under pressure
  • The runbook distinguishes diagnosis from mitigation from escalation
  • It points to trustworthy dashboards, logs, and owners
  • It matches how the team actually operates
  • It gets used and updated after real events

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Current operational workflow or incident type
  • Frequent failure modes
  • Dashboards, logs, and alert paths
  • Team escalation model
  • Existing tribal knowledge or scattered notes

Prerequisites

Conditions that should be true for this to work.

  • The operational workflow or incident class is understood well enough to describe
  • The team knows who owns the service and who escalates when
  • There is at least a minimal observability surface to reference

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Start with the operational question

    Design the runbook around the situation a responder is in.

    Actions

    • State what event or symptom this runbook is for
    • Describe when to use it and when not to
    • Put the first-minute triage cues near the top

    Outputs

    • Runbook scope and trigger section
  2. Lay out the response path in order

    Make the document usable under pressure.

    Actions

    • Define what to check first, second, and third
    • Separate observation steps from mutation steps
    • Flag dangerous or irreversible actions clearly

    Outputs

    • Ordered response flow
  3. Link to the real tools and owners

    Anchor guidance in live operational reality.

    Actions

    • Link relevant dashboards, logs, tracing, and feature flag controls
    • Name ownership and escalation contacts or roles
    • Clarify what conditions require escalation immediately

    Outputs

    • Tool and escalation references
  4. Add context that helps judgment

    Prevent the runbook from becoming blindly procedural.

    Actions

    • Explain common failure modes and false positives briefly
    • Note what normal looks like where useful
    • Include known caveats from prior incidents

    Outputs

    • Judgment notes section
  5. Test and evolve it in real use

    Keep it operationally honest.

    Actions

    • Use the runbook in shadow or live operational contexts
    • Update it after incidents and confusing handoffs
    • Remove stale, low-trust, or over-detailed content

    Outputs

    • Validated runbook
    • Runbook improvement log
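
One way to keep the five steps honest is to hold the runbook as structured data and render the scan-friendly text from it. The sketch below is a minimal illustration, not a prescribed format: the `Runbook` and `Step` types, the field names, the `checkout-api` service, and the example.com URL are all hypothetical. It shows a trigger section at the top, an ordered response path, observation steps kept distinct from mutation steps, and irreversible or owner-only actions flagged in the rendered output.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    OBSERVE = "observe"  # read-only check: dashboards, logs, queries
    MUTATE = "mutate"    # changes system state: restarts, flag flips, failovers


@dataclass
class Step:
    text: str
    kind: Kind
    irreversible: bool = False  # dangerous action, flagged explicitly
    owner_only: bool = False    # not for a first responder acting alone


@dataclass
class Runbook:
    trigger: str             # what event or symptom this runbook is for
    not_for: str             # when not to use it
    first_minute: list[str]  # triage cues, kept near the top
    steps: list[Step]        # ordered response path
    escalate_if: list[str]   # conditions that require immediate escalation
    links: dict[str, str] = field(default_factory=dict)  # label -> URL


def render(rb: Runbook) -> str:
    """Render the runbook as plain text, most urgent information first."""
    lines = [f"USE WHEN: {rb.trigger}", f"DO NOT USE WHEN: {rb.not_for}"]
    lines += ["", "FIRST MINUTE:"]
    lines += [f"  - {cue}" for cue in rb.first_minute]
    lines += ["", "ESCALATE IMMEDIATELY IF:"]
    lines += [f"  - {cond}" for cond in rb.escalate_if]
    lines += ["", "STEPS:"]
    for n, step in enumerate(rb.steps, start=1):
        tags = [step.kind.value.upper()]
        if step.irreversible:
            tags.append("IRREVERSIBLE")
        if step.owner_only:
            tags.append("OWNER ONLY")
        lines.append(f"  {n}. [{' | '.join(tags)}] {step.text}")
    if rb.links:
        lines += ["", "LINKS:"]
        lines += [f"  - {label}: {url}" for label, url in rb.links.items()]
    return "\n".join(lines)


# Hypothetical example content, for illustration only.
example = Runbook(
    trigger="checkout-api 5xx rate alert fires",
    not_for="planned maintenance windows",
    first_minute=[
        "Is the error rate still rising?",
        "Did a deploy land in the last 30 minutes?",
    ],
    steps=[
        Step("Open the error-rate dashboard and note when the rise began", Kind.OBSERVE),
        Step("Disable the most recently enabled feature flag", Kind.MUTATE),
        Step("Fail over the primary database", Kind.MUTATE,
             irreversible=True, owner_only=True),
    ],
    escalate_if=["Error rate above 20% for 10 minutes", "Any sign of data loss"],
    links={"Error-rate dashboard": "https://dashboards.example.com/checkout-api"},
)

print(render(example))
```

Rendering escalation conditions above the steps is a choice made for this sketch; a team may prefer a different order, as long as the first-minute cues stay near the top.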

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • What belongs in the top section versus linked context?
  • What actions are safe for first responders versus only for owners?
  • Which escalation conditions should be explicit?
  • How much explanation is enough without slowing usage?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What does a responder need to know in the first five minutes?
  • Which actions are safe, and which require explicit escalation?
  • What confusion from past incidents should this runbook remove?

Common mistakes

Patterns that surface across teams running this playbook.

  • Writing the runbook like documentation instead of an operational guide
  • Burying the first useful action below too much context
  • Including commands without safety boundaries
  • Letting the runbook drift away from current dashboards or ownership
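
Two of these mistakes can be caught mechanically, at least in part. The sketch below assumes a plain-text runbook that follows conventions invented for this example: steps are numbered `1.`, `2.`, and so on; command lines start with `$ `; and every command is immediately preceded by a line starting with `SAFE:` or `DANGER:`. The `workerctl` command is hypothetical. A check like this cannot judge whether a label is correct, only whether one is present.

```python
import re

SAFETY_LABELS = ("SAFE:", "DANGER:")    # assumed labelling convention
STEP_PATTERN = re.compile(r"^\d+\.\s")  # assumed: steps are numbered "1. ..."


def lint_runbook(text: str, max_lines_before_first_step: int = 15) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    findings = []

    # Mistake: burying the first useful action below too much context.
    first_step = next(
        (i for i, ln in enumerate(lines) if STEP_PATTERN.match(ln)), None
    )
    if first_step is None:
        findings.append("no numbered step found")
    elif first_step >= max_lines_before_first_step:
        findings.append(f"first step is buried under {first_step} lines of context")

    # Mistake: including commands without safety boundaries.
    for i, ln in enumerate(lines):
        if ln.startswith("$ "):
            previous = lines[i - 1] if i > 0 else ""
            if not previous.startswith(SAFETY_LABELS):
                findings.append(f"command without safety label: {ln}")

    return findings


# Hypothetical draft: the second command has no SAFE:/DANGER: line above it.
draft = """\
Use when: the queue depth alert fires for orders-worker
1. Check which consumers are attached
SAFE: read-only, lists consumers
$ workerctl list-consumers
2. Restart stuck consumers
$ workerctl restart --all
"""

findings = lint_runbook(draft)
print(findings)
```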

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • Responders still go directly to a hero instead of the runbook
  • The runbook is long but rarely used under pressure
  • Links are stale or tools no longer match the text
  • The document explains the system but does not help with the operational decision

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Runbook scope and trigger section
  • Ordered response flow
  • Tool and escalation references
  • Judgment notes section
  • Runbook improvement log

Success signals

Observable changes that mean the playbook landed.

  • Responders reach useful first actions faster
  • Handoffs are clearer during incidents
  • Runbook usage increases and hero dependence declines
  • Updates happen after real operational events

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Connect the runbook to onboarding and on-call practice
  • Split the runbook if it grows into multiple distinct incident classes
  • Review whether repeated runbook confusion signals deeper architecture or ownership issues

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Time to first useful action for common incidents
  • Runbook usage during incidents
  • Stale link or stale ownership findings
  • Number of hero escalations for covered scenarios
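
These signals are simple to compute once incidents are recorded with a few timestamps and flags. The sketch below is one possible shape, under stated assumptions: the `Incident` record, its field names, the 90-day staleness threshold, and all sample data are hypothetical, and deciding what counts as the "first useful action" remains a human judgment made during review.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from statistics import median


@dataclass
class Incident:
    alerted_at: datetime
    first_useful_action_at: datetime  # judged by a human during review
    runbook_used: bool
    escalated_to_expert: bool         # a "hero escalation" for a covered scenario


def summarize(incidents: list[Incident]) -> dict[str, float]:
    """Summarize the watch-list signals for a non-empty list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    minutes = [
        (i.first_useful_action_at - i.alerted_at).total_seconds() / 60
        for i in incidents
    ]
    return {
        "median_minutes_to_first_action": median(minutes),
        "runbook_usage_rate": sum(i.runbook_used for i in incidents) / len(incidents),
        "hero_escalations": sum(i.escalated_to_expert for i in incidents),
    }


def stale_references(last_verified: dict[str, date], today: date,
                     max_age_days: int = 90) -> list[str]:
    """Names of links or owners not re-verified within max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, checked in last_verified.items() if checked < cutoff)


# Hypothetical sample data.
t0 = datetime(2024, 5, 1, 9, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), runbook_used=True, escalated_to_expert=False),
    Incident(t0, t0 + timedelta(minutes=10), runbook_used=True, escalated_to_expert=False),
    Incident(t0, t0 + timedelta(minutes=25), runbook_used=False, escalated_to_expert=True),
]
summary = summarize(incidents)

stale = stale_references(
    {
        "latency dashboard": date(2024, 5, 20),
        "escalation contact": date(2023, 11, 2),
        "log query": date(2024, 1, 15),
    },
    today=date(2024, 6, 1),
)

print(summary)
print(stale)
```

The median is used here rather than the mean so that one unusually slow incident does not dominate the trend; that is a choice for this sketch, not a requirement of the playbook.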

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Drafting first versions from incident history and scattered notes
  • Summarizing recurring diagnostic patterns
  • Turning long explanations into scan-friendly operational structure

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Producing plausible but unverified steps
  • Making an incomplete workflow sound authoritative
  • Adding too much polished context that gets in the way during stress

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.