The Hard Parts.dev
EP-27 Operations EP Engineering Playbook
Difficulty medium-high Owner · release owner

Build a practical rollback strategy

Design rollback as a practical recovery system, not a comforting word: understand what can be reversed, what cannot, how quickly, by whom, and under what evidence thresholds.

Difficulty
medium-high
Time horizon
days to establish, then revisited per major change
Primary owner
release owner
Confidence
high
At a glance · EP-27
Situation
A team needs credible recovery options when a release or change goes wrong.
Goal
Reduce damage and decision hesitation when a change must be paused, reversed, or contained.
Do not use when
The system is small and reversible enough that an ordinary redeploy is sufficient and well understood
Roles involved

  • release owner
  • tech lead
  • SRE or operations
  • data owner when schema or state is involved
  • service owner
  • business or product contact when rollout exposure matters

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • A service or product ships changes that can materially fail in production
  • Deployments involve risky schema, config, or contract changes
  • The team says rollback exists but has not tested it
  • Release confidence depends more on caution than on demonstrated reversibility

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Rollback only helps if it is real. Teams often discover too late that data changed irreversibly, downstream contracts shifted, or the rollback path is slower and less reliable than the forward path.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

  • The team knows which kinds of change are reversible, partially reversible, or one-way
  • Rollback actions are time-bounded and owned
  • Rollback decisions use explicit signals rather than panic or delay
  • Data and compatibility risks are accounted for honestly
  • The team can pause or contain even when full rollback is not possible

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Deployment model
  • Change types and typical failure modes
  • Data mutation paths
  • Feature flag and routing controls
  • Release windows and dependencies
  • Current incident response habits

Prerequisites

Conditions that should be true for this to work.

  • The team can describe its deployment and data mutation model
  • Someone owns release decisions operationally
  • The team is willing to label some changes as only partially reversible or one-way

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Classify change reversibility

    Stop pretending all rollback is equal.

    Actions

    • Classify code-only, config, schema, data, and contract changes separately
    • Name which parts are easy to reverse and which are not
    • Document containment options where full rollback is impossible

    Outputs

    • Reversibility matrix
  2. Define rollback mechanics and owners

    Make rollback executable under stress.

    Actions

    • Document who can trigger rollback and how
    • State the exact actions for pause, partial rollback, full rollback, and containment
    • Clarify prerequisites such as backups, flags, or deployment artifacts

    Outputs

    • Rollback procedure set
  3. Tie rollback to operational evidence

    Prevent hesitation and ambiguity during a bad rollout.

    Actions

    • Define signals and thresholds for hold, rollback, or contain
    • Make the decision logic visible in release prep
    • Distinguish user harm, system harm, and business harm triggers

    Outputs

    • Rollback decision sheet
  4. Exercise the path

    Test whether rollback is real or decorative.

    Actions

    • Rehearse rollback scenarios for representative change types
    • Verify tools, permissions, timing, and communication paths
    • Capture where the rollback story breaks down

    Outputs

    • Rollback rehearsal notes
  5. Improve forward design using rollback learnings

    Make future changes safer by architecture, not only procedure.

    Actions

    • Add flags, staged rollouts, decoupled schema paths, or compatibility buffers where needed
    • Reduce one-way changes when avoidable
    • Update release design guidance

    Outputs

    • Rollback-informed design improvements
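
Step 5's flags and staged rollouts can be sketched as a percentage gate. This is a minimal illustration, not a recommendation of any particular flag system: the function names `bucket` and `is_enabled` and the flag name `new-checkout` are hypothetical. The point is that a deterministic gate gives the team a containment lever (set the percentage to 0) that needs no redeploy.

```python
import hashlib


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(user_id: str, flag: str, rollout_pct: int) -> bool:
    """True if this user falls inside the current rollout percentage.

    Because the bucket is stable, raising rollout_pct only adds users
    and lowering it only removes them.
    """
    return bucket(user_id, flag) < rollout_pct


# Hypothetical staged rollout: 5% -> 25% -> 100%, with 0% as the containment setting.
users = [f"user-{i}" for i in range(1000)]
for pct in (0, 5, 25, 100):
    exposed = sum(is_enabled(u, "new-checkout", pct) for u in users)
    print(f"rollout {pct:>3}% -> {exposed} of {len(users)} users exposed")
```

Dropping the percentage to 0 stops new exposure but does not undo anything the new code path already wrote, so it counts as containment rather than rollback in the reversibility matrix.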

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • What does rollback mean for this change type?
  • When is containment safer than rollback?
  • How much reversibility is enough before shipping?
  • What forward-design changes would make future rollback easier?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What exactly can we reverse for this change, and what cannot be undone?
  • Who can trigger rollback, based on which signals?
  • If full rollback is impossible, what containment options exist?

Common mistakes

Patterns that surface across teams running this playbook.

  • Equating rollback with redeploying old code
  • Ignoring irreversible data or downstream side effects
  • Having a rollback plan nobody has rehearsed
  • Waiting too long to decide because the signal thresholds were never defined
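
The last mistake above is often cheapest to fix by writing the thresholds down as data before the release. The sketch below assumes a rollback decision sheet with one row per signal; the signal names and threshold values are invented for illustration, and `decide` is a hypothetical helper, not part of any tool.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    HOLD = "hold"
    ROLLBACK = "rollback"
    CONTAIN = "contain"


@dataclass(frozen=True)
class Trigger:
    signal: str        # hypothetical metric name
    harm: str          # "user", "system", or "business"
    hold_at: float     # pause the rollout at or above this value
    reverse_at: float  # roll back (or contain) at or above this value


def decide(triggers, observed, reversible):
    """Return the most severe action any trigger calls for, plus the reasons."""
    reverse_action = Action.ROLLBACK if reversible else Action.CONTAIN
    action, reasons = Action.PROCEED, []
    for t in triggers:
        value = observed.get(t.signal)
        if value is None:
            continue  # assumption: a missing signal is ignored; a team may prefer to hold
        if value >= t.reverse_at:
            action = reverse_action
            reasons.append(f"{t.harm} harm: {t.signal}={value} >= {t.reverse_at}")
        elif value >= t.hold_at:
            if action is Action.PROCEED:
                action = Action.HOLD
            reasons.append(f"{t.harm} harm: {t.signal}={value} >= {t.hold_at}")
    return action, reasons


# Illustrative sheet; real thresholds come from the team's own baselines.
SHEET = [
    Trigger("error_rate_pct", "user", hold_at=1.0, reverse_at=5.0),
    Trigger("p99_latency_ms", "system", hold_at=800, reverse_at=2000),
    Trigger("checkout_drop_pct", "business", hold_at=3.0, reverse_at=10.0),
]

action, reasons = decide(SHEET, {"error_rate_pct": 6.2, "p99_latency_ms": 900}, reversible=False)
print(action.value, reasons)
```

Passing `reversible` in from the reversibility matrix keeps the sheet honest: a one-way change can only be held or contained, so the function never offers rollback for it.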

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • The team says "we can always roll back" without specifying how
  • Schema or data changes have no containment story
  • Release incidents repeatedly involve indecision about whether to reverse
  • Rollback time is unknown or assumed

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Reversibility matrix
  • Rollback procedure set
  • Rollback decision sheet
  • Rollback rehearsal notes
  • Rollback-informed design improvements
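
As one possible shape for the first artifact, the reversibility matrix can be kept as structured data so that gaps are checkable rather than implied. The field names, the example rows, and the `gaps` helper below are assumptions for illustration; the five change types are the ones named in step 1.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reversibility(Enum):
    REVERSIBLE = "reversible"
    PARTIAL = "partially reversible"
    ONE_WAY = "one-way"


@dataclass(frozen=True)
class MatrixRow:
    change_type: str
    reversibility: Reversibility
    rollback_action: str                 # empty when no rollback exists
    containment: str                     # fallback when full rollback is impossible
    owner: str
    time_bound_minutes: Optional[int]    # None means unknown or merely assumed


# Illustrative rows only; each team fills in its own.
MATRIX = [
    MatrixRow("code-only", Reversibility.REVERSIBLE, "redeploy previous artifact",
              "turn off feature flag", "release owner", 10),
    MatrixRow("config", Reversibility.REVERSIBLE, "restore previous config version",
              "", "SRE or operations", 5),
    MatrixRow("schema", Reversibility.PARTIAL, "revert additive migration",
              "keep old columns readable", "data owner", None),
    MatrixRow("data", Reversibility.ONE_WAY, "", "", "data owner", None),
    MatrixRow("contract", Reversibility.PARTIAL, "re-enable previous API version",
              "pin consumers to old version", "service owner", 30),
]


def gaps(matrix):
    """List rows whose rollback story is incomplete."""
    problems = []
    for row in matrix:
        if row.reversibility is not Reversibility.REVERSIBLE and not row.containment:
            problems.append(f"{row.change_type}: no containment option")
        if row.time_bound_minutes is None:
            problems.append(f"{row.change_type}: rollback time unknown")
    return problems


for problem in gaps(MATRIX):
    print(problem)
```

The two checks mirror two of the warning signs listed earlier: schema or data changes with no containment story, and rollback time that is unknown or assumed.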

Success signals

Observable changes that mean the playbook landed.

  • Rollback or containment decisions happen faster and with less confusion
  • The team knows which changes require stronger rollout controls
  • Release fear decreases because reversibility is better understood
  • Future changes are designed with rollback reality in mind

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Refresh rollback guidance after major platform or data model changes
  • Review incidents where rollback was delayed or ineffective
  • Connect rollback improvements with rollout design and release confidence work

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Time to rollback or contain
  • Rollback success rate
  • Number of change types with rehearsed rollback paths
  • Incidents worsened by rollback hesitation
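
A rough sketch of how the first three signals could be computed from a log of rollback events. The record shape and the `summarize` helper are hypothetical, and the events are made up. Time from detection to decision is used here only as a proxy for hesitation; whether hesitation actually worsened an incident still needs a human review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class RollbackEvent:
    change_type: str
    detected_at: datetime             # when the trigger signal fired
    decided_at: datetime              # when someone chose to roll back or contain
    restored_at: Optional[datetime]   # None if the rollback did not restore service
    rehearsal: bool                   # True for drills, False for real incidents


def _minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60


def summarize(events):
    real = [e for e in events if not e.rehearsal]
    restored = [e for e in real if e.restored_at is not None]
    return {
        "success_rate": len(restored) / len(real) if real else None,
        "median_minutes_to_restore": (
            median(_minutes(e.restored_at - e.detected_at) for e in restored)
            if restored else None
        ),
        "median_minutes_to_decide": (
            median(_minutes(e.decided_at - e.detected_at) for e in real)
            if real else None
        ),
        "rehearsed_change_types": sorted({e.change_type for e in events if e.rehearsal}),
    }


t0 = datetime(2024, 1, 1, 12, 0)
EVENTS = [
    RollbackEvent("code-only", t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=15), False),
    RollbackEvent("schema", t0, t0 + timedelta(minutes=20), None, False),
    RollbackEvent("config", t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=8), True),
]
print(summarize(EVENTS))
```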

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Drafting reversibility matrices and procedure docs
  • Summarizing historical rollback pain points from incidents
  • Checking release plans for rollback blind spots

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Creating procedures that sound complete but were never validated
  • Oversimplifying one-way data paths
  • Encouraging false confidence in generated rollback steps

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.