Write a useful runbook · thehardparts.dev

Difficulty: medium
Time horizon: hours to days for an initial version, ongoing refinement afterward
Primary owner: service owner
Confidence: high

At a glance
EP-26

Situation: A team needs operational guidance that works under pressure.
Goal: Enable responders to act more safely and consistently during operational work, especially when stress, ambiguity, or unfamiliarity is high.
Do not use when: the task is so unstable that a fixed runbook would be fiction.
Roles involved: service owner, on-call responders, SRE or operations partner, tech lead, engineering manager when ownership or escalation is unclear

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

A service or recurring operational process lacks clear response guidance
On-call responders depend too much on one expert
Incidents repeatedly start with the same confusion
New engineers need operational ramp-up support

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

A runbook is not a technical essay. Under pressure, responders need orientation, triage clues, safe next moves, and escalation boundaries. Bad runbooks create false confidence or go unread; good ones reduce ambiguity and time-to-useful-action.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

A responder can scan the runbook quickly under pressure
The runbook distinguishes diagnosis from mitigation from escalation
It points to trustworthy dashboards, logs, and owners
It matches how the team actually operates
It gets used and updated after real events

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

Current operational workflow or incident type
Frequent failure modes
Dashboards, logs, and alert paths
Team escalation model
Existing tribal knowledge or scattered notes

Prerequisites

Conditions that should be true for this to work.

The operational workflow or incident class is understood well enough to describe
The team knows who owns the service and who escalates when
There is at least a minimal observability surface to reference

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

01
Start with the operational question
Design the runbook around the situation a responder is in.
Actions
- State what event or symptom this runbook is for
- Describe when to use it and when not to
- Put the first-minute triage cues near the top
Outputs
- Runbook scope and trigger section
02
Lay out the response path in order
Make the document usable under pressure.
Actions
- Define what to check first, second, and third
- Separate observation steps from mutation steps
- Flag dangerous or irreversible actions clearly
Outputs
- Ordered response flow
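
One way to keep observation and mutation steps visibly separate is to tag each step in the source of the runbook and render the tag next to the step. The sketch below is only an illustration: the `Step` and `Kind` names, the tag wording, and the example steps are all assumptions, not part of any existing tool.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    OBSERVE = "observe"  # read-only: look at a dashboard, read a log
    MUTATE = "mutate"    # changes system state: restart, scale, flip a flag


@dataclass(frozen=True)
class Step:
    text: str
    kind: Kind
    irreversible: bool = False  # only meaningful for MUTATE steps


def render(steps: list[Step]) -> list[str]:
    """Number the steps in order and prefix each with a scannable tag."""
    lines = []
    for number, step in enumerate(steps, start=1):
        if step.kind is Kind.OBSERVE:
            tag = "[CHECK]"
        elif step.irreversible:
            tag = "[CHANGE - IRREVERSIBLE, ESCALATE FIRST]"
        else:
            tag = "[CHANGE]"
        lines.append(f"{number}. {tag} {step.text}")
    return lines


# Hypothetical flow for illustration only.
example_flow = [
    Step("Check the error-rate dashboard", Kind.OBSERVE),
    Step("Restart the worker pool", Kind.MUTATE),
    Step("Purge the queue", Kind.MUTATE, irreversible=True),
]

for line in render(example_flow):
    print(line)
```

The point of the tag is that a responder scanning under pressure can tell at a glance which steps are safe to run without thinking and which ones change the system.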
03
Link to the real tools and owners
Anchor guidance in live operational reality.
Actions
- Link relevant dashboards, logs, tracing, and feature flag controls
- Name ownership and escalation contacts or roles
- Clarify what conditions require escalation immediately
Outputs
- Tool and escalation references
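
Immediate-escalation conditions are easier to follow when they are written as explicit rules rather than prose. The sketch below shows one possible shape. The signal names and both thresholds are invented for illustration; a real runbook would use whatever conditions the team has actually agreed on.

```python
from dataclasses import dataclass

# Hypothetical thresholds; replace with the team's agreed values.
ERROR_RATE_LIMIT = 0.25  # fraction of failed requests
MINUTES_LIMIT = 30       # time allowed before escalating without a mitigation


@dataclass(frozen=True)
class Signals:
    error_rate: float          # 0.0 to 1.0
    minutes_elapsed: int       # since the responder picked up the incident
    data_loss_suspected: bool


def escalation_reasons(signals: Signals) -> list[str]:
    """Return every reason to escalate now; an empty list means keep going."""
    reasons = []
    if signals.data_loss_suspected:
        reasons.append("possible data loss")
    if signals.error_rate >= ERROR_RATE_LIMIT:
        reasons.append(f"error rate at or above {ERROR_RATE_LIMIT:.0%}")
    if signals.minutes_elapsed >= MINUTES_LIMIT:
        reasons.append(f"no mitigation after {MINUTES_LIMIT} minutes")
    return reasons
```

Returning the reasons, not just a yes or no, gives the responder the first sentence of the escalation message.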
04
Add context that helps judgment
Prevent the runbook from becoming blindly procedural.
Actions
- Explain common failure modes and false positives briefly
- Note what normal looks like where useful
- Include known caveats from prior incidents
Outputs
- Judgment notes section
05
Test and evolve it in real use
Keep it operationally honest.
Actions
- Use the runbook in shadow or live operational contexts
- Update it after incidents and confusing handoffs
- Remove stale, low-trust, or over-detailed content
Outputs
- Validated runbook
- Runbook improvement log
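
Stale links are one kind of drift that can be checked mechanically. The following is a minimal sketch, assuming the runbook is available as plain text and its links are ordinary `http` or `https` URLs. The function names are made up for this example. Dashboards behind a login may reject an unauthenticated `HEAD` request, so a failed check is a prompt to look, not proof the link is dead.

```python
import http.client
import re
import urllib.request
from typing import Callable

URL_PATTERN = re.compile(r"https?://[^\s)>\"']+")


def extract_links(text: str) -> list[str]:
    """Return unique URLs in order of first appearance."""
    seen: dict[str, None] = {}
    for url in URL_PATTERN.findall(text):
        # Drop sentence punctuation that the pattern may have swallowed.
        seen.setdefault(url.rstrip(".,;"), None)
    return list(seen)


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """Send a HEAD request; treat any error or 4xx/5xx response as unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # urllib raises HTTPError/URLError (both OSError subclasses) on failure.
        return False


def stale_links(text: str, check: Callable[[str], bool] = is_reachable) -> list[str]:
    """Return the links in the text that the check function rejects."""
    return [url for url in extract_links(text) if not check(url)]
```

Passing the check as a parameter keeps the function testable without network access and lets a team swap in an authenticated client for internal tools.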

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

What belongs in the top section versus linked context?
What actions are safe for first responders versus only for owners?
Which escalation conditions should be explicit?
How much explanation is enough without slowing usage?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

What does a responder need to know in the first five minutes?
Which actions are safe, and which require explicit escalation?
What confusion from past incidents should this runbook remove?

Common mistakes

Patterns that surface across teams running this playbook.

Writing the runbook like documentation instead of an operational guide
Burying the first useful action below too much context
Including commands without safety boundaries
Letting the runbook drift away from current dashboards or ownership
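
For the mistake of including commands without safety boundaries, one possible guard is a small wrapper that refuses to run a state-changing command unless the responder explicitly confirms. This is a sketch under assumptions: `run_guarded` and its parameters are hypothetical, and a real team might enforce the same boundary through its deployment tooling instead.

```python
import shlex
import subprocess


def run_guarded(command: list[str], *, mutates: bool, confirm: bool = False) -> str:
    """Run read-only commands directly; dry-run mutating ones unless confirmed."""
    printable = shlex.join(command)
    if mutates and not confirm:
        # Nothing is executed on this path; the responder only sees the command.
        return f"DRY RUN (not executed): {printable}"
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout
```

With this shape, a command copied from the runbook defaults to showing what it would do, and running it for real requires passing `confirm=True` deliberately.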

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

Responders still go directly to a hero instead of the runbook
The runbook is long but rarely used under pressure
Links are stale or tools no longer match the text
The document explains the system but does not help with the operational decision

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

Runbook scope and trigger section
Ordered response flow
Tool and escalation references
Judgment notes section
Runbook improvement log

Success signals

Observable changes that mean the playbook landed.

Responders reach useful first actions faster
Handoffs are clearer during incidents
Runbook usage increases and hero dependence declines
Updates happen after real operational events

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

Connect the runbook to onboarding and on-call practice
Split the runbook if it grows into multiple distinct incident classes
Review whether repeated runbook confusion signals deeper architecture or ownership issues

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

Time to first useful action for common incidents
Runbook usage during incidents
Stale link or stale ownership findings
Number of hero escalations for covered scenarios
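
Time to first useful action can be tracked with very little tooling if incident records carry two timestamps. The sketch below assumes each incident is recorded as a pair of `datetime` values: when the responder was paged and when they took the first useful action. Both the record shape and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def minutes_to_first_action(paged_at: datetime, first_action_at: datetime) -> float:
    """Elapsed minutes between the page and the first useful action."""
    return (first_action_at - paged_at).total_seconds() / 60


def median_minutes(incidents: list[tuple[datetime, datetime]]) -> float:
    """Median time to first useful action across incidents."""
    return median(minutes_to_first_action(paged, acted) for paged, acted in incidents)


# Hypothetical incident records for illustration only.
start = datetime(2024, 1, 1, 12, 0)
example_incidents = [
    (start, start + timedelta(minutes=4)),
    (start, start + timedelta(minutes=10)),
    (start, start + timedelta(minutes=25)),
]
print(median_minutes(example_incidents))
```

The median is used here so that one unusually slow incident does not dominate the trend; a team that cares about worst cases could track the maximum as well.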

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

Drafting first versions from incident history and scattered notes
Summarizing recurring diagnostic patterns
Turning long explanations into scan-friendly operational structure

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

Producing plausible but unverified steps
Making an incomplete workflow sound authoritative
Adding too much polished context that gets in the way during stress

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.