
Stabilize a fragile service

Stabilize by making the service observable, reducing risky change surfaces, clarifying ownership, and fixing the few failure drivers that create most of the pain before chasing architectural perfection.

Difficulty: high
Time horizon: weeks to months
Primary owner: service owner
Confidence: high
At a glance (EP-24)

Situation: A service is unreliable, scary to change, or operationally noisy.
Goal: Move a fragile service from fear-based operation to manageable reliability.
Do not use when: the service is noisy mainly because of upstream instability that has not been isolated yet.
Primary owner: service owner
Roles involved

  • service owner
  • tech lead
  • operations or SRE
  • engineering manager
  • support partner if user-visible failures are frequent

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • The service causes frequent incidents or support pain
  • Teams avoid touching it
  • Release confidence is weak
  • One or two people carry most of the risk knowledge

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Fragile services create a double cost: user risk and organizational fear. Stabilization should first reduce operational volatility and fear, not chase ideal architecture immediately.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs of the playbook done well

  • The main reliability problems are visible and prioritized
  • Owners can explain the service’s top failure modes
  • Risky changes become narrower and more rehearsable
  • Operational response no longer depends on hero memory
  • Incident volume and fear both decline

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Incident history
  • Service dependency map
  • Alert and observability state
  • Release history
  • Ownership map
  • Change hotspot analysis (a git-based sketch follows this list)
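
If a change hotspot analysis does not already exist, a low-cost approximation is to count how often each file appears in recent commits. A minimal sketch, assuming a local git checkout and Python; the six-month window and the churn-as-risk proxy are assumptions, not part of this playbook.

```python
import subprocess
from collections import Counter

def change_hotspots(repo_path: str, since: str = "6 months ago", top_n: int = 10):
    """Rank files by how often recent commits touched them (a rough churn proxy)."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:"],
        capture_output=True, text=True, check=True,
    ).stdout
    touched = [line for line in log.splitlines() if line.strip()]
    return Counter(touched).most_common(top_n)

# Example: the ten most-touched files over the last six months.
for path, touches in change_hotspots("."):
    print(f"{touches:4d}  {path}")
```

Files that are both hot and incident-adjacent are the usual first candidates for the change-risk work in the procedure below.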

Prerequisites

Conditions that should be true for this to work.

  • Someone is explicitly accountable for the service
  • Incident history is available
  • Minimum production visibility exists or can be added quickly

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Define the fragility profile

    Turn vague fear into specific failure drivers; a minimal categorization sketch follows the procedure.

    Actions

    • Review recent incidents and recurring failure patterns
    • Separate reliability issues from change-safety issues
    • Identify the few causes behind most operational pain

    Outputs

    • Fragility profile
  2. Make the service visible

    Reduce time spent guessing; an alert-hygiene sketch follows the procedure.

    Actions

    • Improve key dashboards, alerts, logs, and traces
    • Link each alert to an accountable owner and an expected response
    • Document the first places to look during degradation

    Outputs

    • Service observability pack
  3. Reduce risky change surfaces

    Make normal changes less scary; a staged-rollout sketch follows the procedure.

    Actions

    • Identify hotspots, unstable seams, and oversized change bundles
    • Add guardrails such as flags, staged rollout, or stronger targeted tests
    • Remove or isolate the worst cross-cutting dependencies where practical

    Outputs

    • Change-risk reduction plan
  4. Clarify ownership and response

    Keep fragility from being a social problem as well as a technical one.

    Actions

    • Name the service owner and backup owners
    • Write or refresh runbooks for the top failure cases
    • Reduce hero dependence through pairing and rotation

    Outputs

    • Ownership matrix
    • Service runbook
  5. Stabilize before redesigning broadly

    Avoid using fragility as a justification for an uncontrolled rewrite.

    Actions

    • Fix the top failure drivers first
    • Review whether deeper architectural change is still needed after stability improves
    • Sequence structural work only after immediate fragility is down

    Outputs

    • Stabilization roadmap
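
To make step 1 concrete: the fragility profile is, at minimum, a ranked count of incident causes, split by whether each incident was change-driven. A minimal sketch, assuming incident records can be exported with a cause category; the field names here are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Incident:
    # Hypothetical export fields; real names depend on your incident tracker.
    id: str
    cause: str            # e.g. "dependency-timeout", "bad-deploy", "config-drift"
    change_related: bool  # was it triggered by a deploy or config change?

def fragility_profile(incidents: list[Incident], top_n: int = 3) -> None:
    """Surface the few causes behind most operational pain (step 1)."""
    change_driven = sum(i.change_related for i in incidents)
    print(f"{change_driven}/{len(incidents)} incidents were change-related")
    for cause, count in Counter(i.cause for i in incidents).most_common(top_n):
        print(f"  {count:3d}  {cause}")
```

The change-related split matters because it decides whether visibility (step 2) or change safety (step 3) should lead.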
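
For step 2, the cheapest visibility win is usually alert hygiene: every page names an owner and a first place to look. A minimal audit sketch; the inventory format is hypothetical and would in practice be exported from your alerting system.

```python
# Hypothetical alert inventory; export this from your alerting system in practice.
alerts = [
    {"name": "p99_latency_high", "owner": "checkout-team",
     "runbook": "https://runbooks.example/p99-latency"},
    {"name": "queue_depth_growing", "owner": None, "runbook": None},
]

def unactionable(alerts: list[dict]) -> list[str]:
    """Alerts failing the step 2 bar: no accountable owner or no runbook link."""
    return [a["name"] for a in alerts
            if not a.get("owner") or not a.get("runbook")]

print("Alerts needing an owner or runbook:", unactionable(alerts))
```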
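
For step 3, the guardrail most worth sketching is a staged rollout: risky change paths sit behind a flag whose exposure is a stable per-user percentage, so a bad change reaches a slice of traffic rather than all of it. A minimal sketch; the flag name is illustrative, and a real service would typically use an existing feature-flag system rather than hand-rolled hashing.

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically bucket a user into a percentage rollout.

    Hashing (flag, user_id) gives each flag an independent, stable bucket,
    so a user's exposure does not flip between requests.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:2], "big") % 100 < percent

# Start the risky path at 5 percent and widen only while signals stay clean.
if in_rollout("new-payment-path", "user-123", percent=5):
    ...  # new code path
else:
    ...  # existing behavior
```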

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • What are the top two or three fragility drivers?
  • Which fixes reduce fear fastest: visibility, ownership, rollout control, or refactoring?
  • What should be stabilized before any large redesign?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What are the top failure patterns in this service?
  • Why is the team afraid to change it?
  • What would reduce fragility fastest in the next two sprints?

Common mistakes

Patterns that surface across teams running this playbook.

  • Trying to redesign everything at once
  • Treating observability improvements as optional
  • Leaving service knowledge concentrated in one expert while attempting technical fixes
  • Solving every incident locally without building the fragility profile

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • People still describe the service as scary but cannot say why specifically
  • Incidents repeat in the same categories
  • The stabilization plan contains mostly architecture ideals and not immediate risk reducers
  • Rollouts still depend on who is online

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Fragility profile
  • Service observability pack
  • Change-risk reduction plan
  • Ownership matrix
  • Service runbook
  • Stabilization roadmap

Success signals

Observable changes that mean the playbook landed.

  • Incident rate or severity declines in the known categories
  • Release confidence improves
  • More than one engineer can safely work in the service
  • Service discussions shift from fear language to evidence language

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Reassess whether major redesign is still justified after stabilization
  • Fold fragility learnings into architecture and platform standards
  • Remove temporary protections when they stop being needed

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding; a small computation sketch follows the list.

  • Incident frequency by category
  • Change failure rate
  • Rollback frequency
  • Service alert noise level
  • Number of effective maintainers
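
Two of these, change failure rate and rollback frequency, fall out of the same release history gathered as an input earlier. A minimal sketch; the release record fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Release:
    # Hypothetical fields; real data would come from your deploy tooling.
    version: str
    caused_incident: bool
    rolled_back: bool

def change_failure_rate(releases: list[Release]) -> float:
    """Fraction of releases that caused an incident or had to be rolled back."""
    failed = sum(r.caused_incident or r.rolled_back for r in releases)
    return failed / len(releases) if releases else 0.0

def rollback_frequency(releases: list[Release]) -> float:
    """Fraction of releases that were rolled back."""
    return sum(r.rolled_back for r in releases) / len(releases) if releases else 0.0
```

Track these per month; the trend matters more than the absolute number.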

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Summarizing incident history into failure categories
  • Mapping dependency and hotspot patterns
  • Drafting runbooks and stabilization checklists
  • Highlighting likely risky code surfaces

AI can make worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Encouraging broad refactor output before the fragility is understood
  • Producing shallow fixes at high speed in risky areas
  • Making the service look well documented without making it actually safer

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.