The Hard Parts.dev
EP-15 · Architecture · Engineering Playbook

Design a safe rollout path

Design rollout as a control system: decide how to limit blast radius, observe early effects, pause or reverse safely, and learn from each stage before widening exposure.

Difficulty
medium-high
Time horizon
hours to days for planning, longer for complex staged rollout execution
Primary owner
tech lead
Confidence
high
At a glance · EP-15
Situation
A change is ready to ship, but rollout strategy matters as much as the code itself.
Goal
Reduce the chance that a valid code change becomes a production incident because exposure, detection, or reversal was poorly designed.
Do not use when
The change is routine and the team would be adding rollout complexity only by default.
Primary owner
tech lead
Roles involved

  • tech lead
  • release owner
  • SRE or operations partner
  • QA or quality owner
  • product or business owner if user rollout needs coordination

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • A change affects critical workflows or large user populations
  • Rollback or reversal is non-trivial
  • The change crosses data, infra, or contract boundaries
  • A progressive rollout could materially reduce risk

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Many failures attributed to bad code are really failures of exposure design. A safe rollout path makes uncertainty manageable, so the team does not have to hope that pre-production confidence covered everything.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

  • The first exposure is intentionally smaller than the full blast radius
  • Release stages map to observable signals and explicit decisions
  • Pause and rollback paths are credible and understood
  • The team knows when to continue, hold, or reverse
  • Rollout shape matches the risk shape of the change

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Change description
  • Affected systems and users
  • Rollback constraints
  • Release windows
  • Observability setup
  • Incident and failure scenarios

Prerequisites

Conditions that should be true for this to work.

  • Minimum observability exists
  • The team can control exposure in some meaningful way
  • Someone owns the operational decision during rollout

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Classify the change by reversibility and blast radius

    Match rollout design to real risk.

    Actions

    • Decide whether the change is reversible, partially reversible, or effectively one-way
    • Identify the most dangerous failure modes
    • Determine which users, traffic, data, or dependencies are most sensitive

    Outputs

    • Rollout risk profile
  2. Choose the exposure ladder

    Define how the change will expand safely.

    Actions

    • Choose a staged pattern such as internal-first, cohort-based, traffic percentage, regional, or feature-flag rollout
    • Ensure the early stage is small enough to learn safely
    • Prefer simpler ladders the team can operate confidently

    Outputs

    • Exposure ladder
  3. Define decision signals at each stage

    Prevent rollout from becoming intuition-driven.

    Actions

    • Name leading indicators and key alarms for each stage
    • Define hold, continue, and rollback thresholds
    • Tie stage progression to evidence, not elapsed time alone

    Outputs

    • Stage decision sheet
  4. Make pause and rollback credible

    Ensure the team can really stop or reverse.

    Actions

    • Clarify feature-flag, routing, or deploy reversal mechanisms
    • Identify what cannot be undone and how that affects stage size
    • Rehearse the decision and communication path

    Outputs

    • Rollback and pause plan
  5. Run and review the rollout

    Learn from the release as it progresses.

    Actions

    • Record decisions and observations at each stage
    • Avoid widening exposure while key signals are ambiguous
    • Capture lessons about whether the rollout shape matched the risk

    Outputs

    • Rollout execution log
    • Rollout review
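
Steps 2 and 3 can be sketched together as a ladder whose gates are evaluated from evidence rather than from the clock. This is a minimal illustration under stated assumptions, not a prescribed implementation: the stage names, exposure percentages, thresholds, and the names `Stage`, `LADDER`, `decide`, and `MIN_ERRORS_FOR_ROLLBACK` are all hypothetical, and a single error-rate signal stands in for whatever leading indicators the team actually names.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    HOLD = "hold"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class Stage:
    name: str
    exposure_pct: float         # share of traffic or users exposed at this stage
    min_requests: int           # evidence required before widening is allowed
    hold_error_rate: float      # at or above this: stop widening and investigate
    rollback_error_rate: float  # at or above this: reverse


# Hypothetical ladder; every number here is a placeholder to be replaced.
LADDER = [
    Stage("internal", 0.5, 500, 0.010, 0.050),
    Stage("cohort-5", 5.0, 5_000, 0.005, 0.020),
    Stage("half", 50.0, 50_000, 0.003, 0.010),
    Stage("full", 100.0, 100_000, 0.003, 0.010),
]

# Assumed guard so that one early failure does not trigger a reversal by itself.
MIN_ERRORS_FOR_ROLLBACK = 5


def decide(stage: Stage, requests: int, errors: int) -> Decision:
    """Return the gate decision for one stage from observed counts."""
    error_rate = errors / requests if requests else 0.0
    if errors >= MIN_ERRORS_FOR_ROLLBACK and error_rate >= stage.rollback_error_rate:
        return Decision.ROLLBACK  # a clear stop signal does not wait for a full sample
    if requests < stage.min_requests:
        return Decision.HOLD      # not enough evidence yet, however much time has passed
    if error_rate >= stage.hold_error_rate:
        return Decision.HOLD      # ambiguous signal: do not widen exposure
    return Decision.CONTINUE
```

Note that nothing in `decide` reads a timestamp: a stage that has been live for a day but has seen too little traffic still returns `HOLD`. Recording each call's inputs and result would give the raw material for the rollout execution log.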

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • How reversible is this change in practice?
  • What is the smallest safe first exposure?
  • What exact signals must be green before the next stage?
  • When is pause better than rollback?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What is the smallest exposure that still teaches us something useful?
  • What signal tells us to stop, not just to watch more closely?
  • What part of this change is actually hard to reverse?
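
The first question has a rough quantitative side. If the change fails independently on a fraction p of requests, the chance of seeing at least one failure in n exposed requests is 1 - (1 - p)^n, so a stage needs about ln(1 - c) / ln(1 - p) requests to surface the problem with confidence c. The sketch below turns that into a minimum exposure percentage. The independence assumption, the function names, and the traffic figures in the usage note are illustrative assumptions, not part of the playbook.

```python
import math


def requests_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Requests needed to observe at least one failure with the given probability."""
    if not 0 < failure_rate < 1 or not 0 < confidence < 1:
        raise ValueError("failure_rate and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_rate))


def min_exposure_pct(
    failure_rate: float,
    requests_per_hour: float,
    window_hours: float,
    confidence: float = 0.95,
) -> float:
    """Smallest exposure (percent) that gathers enough requests within the window."""
    needed = requests_needed(failure_rate, confidence)
    total = requests_per_hour * window_hours
    return min(100.0, 100.0 * needed / total)
```

For example, to catch a failure that affects 1% of requests with 95% confidence, `requests_needed(0.01)` gives 299 requests. On a hypothetical service handling 10,000 requests per hour, a one-hour stage would need roughly 3% exposure. A stage smaller than that is safe but teaches little about failures at that rate.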

Common mistakes

Patterns that surface across teams running this playbook.

  • Shipping in one big step because the team is impatient
  • Using feature flags without operational clarity on how to use them
  • Equating deploy success with rollout safety
  • Widening exposure despite ambiguous signals
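
One way to give a feature flag operational clarity is to make its three operations (widen, pause, reverse) explicit in the flag itself. The following is a minimal in-memory sketch, assuming a percentage rollout keyed on a user ID. `RolloutFlag` and its fields are hypothetical and do not correspond to any particular flag service.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RolloutFlag:
    name: str
    exposure_pct: float = 0.0  # share of users on the new path, 0 to 100
    paused: bool = False       # hold: keep current exposure, refuse to widen
    killed: bool = False       # reverse: nobody gets the new path

    def _bucket(self, user_id: str) -> int:
        """Stable bucket in [0, 10000), so a user keeps their assignment as stages widen."""
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % 10_000

    def is_enabled(self, user_id: str) -> bool:
        if self.killed:
            return False
        return self._bucket(user_id) < self.exposure_pct * 100

    def widen(self, new_pct: float) -> None:
        if self.paused or self.killed:
            raise RuntimeError("flag is paused or killed; resolve that before widening")
        if not self.exposure_pct <= new_pct <= 100.0:
            raise ValueError("widen only increases exposure, up to 100")
        self.exposure_pct = new_pct
```

Pause and reversal behave differently here on purpose: setting `paused` leaves already-exposed users on the new path while blocking further widening, whereas setting `killed` removes everyone. Which one is appropriate depends on whether continued exposure is itself causing harm.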

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • The team says "we will know if it is bad" but cannot define the signals
  • Stage gates are based only on time passed
  • Rollback exists technically but no one has walked through it
  • The rollout plan is more complicated than the team can operate calmly

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Rollout risk profile
  • Exposure ladder
  • Stage decision sheet
  • Rollback and pause plan
  • Rollout execution log
  • Rollout review

Success signals

Observable changes that mean the playbook landed.

  • Exposure grows in controlled steps
  • Teams make rollout decisions using agreed signals
  • Small anomalies are caught before large exposure
  • The rollout becomes a reusable model for similar changes

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Review whether better architecture or contracts could make future rollout simpler
  • Clean up temporary controls and flags after stabilization
  • Update release guidance with what the rollout taught the team

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Time to detect bad effect after stage entry
  • Rollback or pause decision latency
  • User or system impact per rollout stage
  • Number of rollout stages completed without surprise escalation
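
The first two metrics can be derived from the rollout execution log if each stage entry records three timestamps. The sketch below assumes a hypothetical log record shape. The field names are illustrative and would need to match whatever the team actually writes down during the rollout.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class StageRecord:
    stage: str
    entered_at: datetime                            # when exposure reached this stage
    anomaly_detected_at: Optional[datetime] = None  # first bad signal, if any
    decision_at: Optional[datetime] = None          # when pause or rollback was decided


def time_to_detect(record: StageRecord) -> Optional[timedelta]:
    """Time from stage entry to the first detected bad effect."""
    if record.anomaly_detected_at is None:
        return None
    return record.anomaly_detected_at - record.entered_at


def decision_latency(record: StageRecord) -> Optional[timedelta]:
    """Time from detection to the pause or rollback decision."""
    if record.anomaly_detected_at is None or record.decision_at is None:
        return None
    return record.decision_at - record.anomaly_detected_at
```

Tracking these two durations separately helps show whether a slow response came from weak observability (long time to detect) or from unclear ownership of the decision (long decision latency).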

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Drafting stage decision sheets and risk summaries
  • Finding similar past releases and failure patterns
  • Suggesting likely observability gaps from prior incidents

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Encouraging overdesigned rollout plans the team cannot actually operate
  • Producing confident rollback language without testing reality
  • Making release documents look stronger than the underlying controls

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.