The Hard Parts.dev
EP-28 Operations EP Engineering Playbook

Reduce operational dependence on heroes

Make operations less person-fragile by exposing where hero dependence exists, redistributing capability through practice and artifacts, and improving the system conditions that keep creating heroics.

Difficulty: high
Time horizon: weeks to months
Primary owner: engineering manager
Confidence: high
At a glance · EP-28
Situation
Operations, incidents, or risky changes depend on a small number of people.
Goal
Ensure operational safety and continuity do not depend on a single expert, rescuer, or informal escalation hub.
Do not use when

  • The dependence is temporary and an active handover is already underway
Roles involved

  • engineering manager
  • service owners
  • on-call responders
  • hero or key expert
  • SRE or operations partner

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • On-call or incident response depends heavily on one or two people
  • Certain deploys only feel safe when specific people are present
  • Knowledge is routed through memory and chat rather than durable operating materials
  • Heroic rescue work is normalised as strength

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Hero dependence makes operations look functional until the hero is overloaded, absent, or wrong. The hidden cost shows up as release anxiety, incident bottlenecks, and slow team maturity.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

  • Multiple responders can handle common operational work confidently
  • Critical workflows are documented and practiced
  • Escalations become rarer, more specific, and less person-centric
  • Operational load is distributed more evenly
  • The expert spends less time rescuing and more time improving the system

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Incident responder patterns
  • Release and rollback dependence patterns
  • Runbooks and docs
  • On-call rotation data
  • Hero-centric workflows and known risky areas

Prerequisites

Conditions that should be true for this to work.

  • The team can identify where hero dependence shows up operationally
  • The expert is willing and supported to transfer capability
  • Management treats transfer and practice as real work

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Map hero dependence in operations

    Make the invisible dependency visible and specific.

    Actions

    • List incident classes, deploys, and operational tasks that route to one person
    • Identify what those people know or do that others do not
    • Rank the dependency by risk and recurrence

    Outputs

    • Operational hero map
  2. Transfer capability through live participation

    Build real operational muscle, not passive familiarity.

    Actions

    • Pair on incidents, deployments, and recovery flows
    • Let secondary responders lead parts of the work with support
    • Rotate operational tasks intentionally

    Outputs

    • Capability transfer plan
    • Operational rotation plan
  3. Capture operational memory

    Move crucial actions and judgment cues out of heads.

    Actions

    • Write or improve runbooks, rollback notes, and service maps
    • Document why and when to escalate, not only how
    • Capture historical traps and false positives

    Outputs

    • Operational knowledge pack
  4. Reduce the need for heroics systemically

    Improve the environment that keeps producing hero work.

    Actions

    • Fix alert quality, ownership ambiguity, observability gaps, or release fragility
    • Remove unsafe manual steps where possible
    • Target the recurring conditions that summon the hero

    Outputs

    • Anti-heroics improvement plan
  5. Test for independence

    Confirm the team is genuinely less dependent.

    Actions

    • Simulate or observe normal operations without the hero in the lead role
    • Review where hesitation or confusion still appears
    • Repeat until the dependence narrows to true specialty, not default rescue

    Outputs

    • Independence check
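
Step 1's ranking by risk and recurrence can be roughed out from an incident log. The sketch below is one possible approach, not a prescribed method: it assumes a hypothetical log of (incident class, lead responder) pairs, and it uses the top responder's share of each class as a crude proxy for risk. The data, the names, and the function `hero_map` are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical incident log: (incident_class, lead_responder).
# Illustrative data only; a real log would come from your incident tracker.
INCIDENTS = [
    ("db-failover", "ana"),
    ("db-failover", "ana"),
    ("db-failover", "ana"),
    ("deploy-rollback", "ana"),
    ("deploy-rollback", "ben"),
    ("cache-eviction", "cy"),
    ("cache-eviction", "ben"),
]


def hero_map(incidents):
    """Return rows of (incident_class, top_responder, share, count).

    share is the fraction of incidents in that class led by the most
    frequent responder. Rows are sorted by share, then by count,
    highest first.
    """
    by_class = defaultdict(Counter)
    for incident_class, responder in incidents:
        by_class[incident_class][responder] += 1

    rows = []
    for incident_class, counts in by_class.items():
        total = sum(counts.values())
        top_responder, top_count = counts.most_common(1)[0]
        rows.append((incident_class, top_responder, top_count / total, total))

    # Concentration is the assumed risk proxy; count stands in for recurrence.
    rows.sort(key=lambda row: (row[2], row[3]), reverse=True)
    return rows


if __name__ == "__main__":
    for incident_class, responder, share, count in hero_map(INCIDENTS):
        print(f"{incident_class}: {responder} led {share:.0%} of {count} incidents")
```

A high share on a frequently recurring class is a candidate for early transfer. The numbers only show where work routes today; whether a dependency is dangerous or merely specialized remains a judgment call.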

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • Which operational dependencies are dangerous versus merely specialized?
  • What should be transferred first for the biggest risk reduction?
  • What system conditions are creating the repeated heroics?
  • When is the team truly independent enough?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • Which operational situations still require a specific person to feel safe?
  • What do those people know, do, or notice that others do not?
  • What system changes would most reduce the need for rescue heroics?

Common mistakes

Patterns that surface across teams running this playbook.

  • Focusing only on documentation and not real operational practice
  • Keeping the hero as final approver forever
  • Ignoring the structural reasons heroics are required
  • Treating the hero's availability as an acceptable mitigation

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • The same person still gets pulled into every serious event
  • Secondary responders can recite the steps but hesitate in live work
  • Docs improved but operational behavior did not
  • The hero remains overloaded by invisible review and reassurance work

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Operational hero map
  • Capability transfer plan
  • Operational rotation plan
  • Operational knowledge pack
  • Anti-heroics improvement plan
  • Independence check

Success signals

Observable changes that mean the playbook landed.

  • More responders can handle common incidents and deploys safely
  • Release and incident anxiety decreases when the hero is unavailable
  • The expert is interrupted less often for known scenarios
  • Operational learning becomes more distributed

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Keep rotation and shadowing alive after the initial transfer
  • Review new hero concentrations as systems and teams evolve
  • Link anti-heroics work to onboarding and runbook quality

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Incident dependence on specific individuals
  • Distribution of on-call escalations
  • Number of effective responders per critical service
  • Hero interruption frequency
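
Two of these signals can be approximated from on-call records. The sketch below is a minimal illustration under stated assumptions: the record shape (service, responder, whether the hero was pulled in), the threshold `min_unassisted`, and both function names are hypothetical, and "effective responder" is defined here as someone who has resolved at least that many incidents on a service without escalating to the hero.

```python
from collections import defaultdict


def effective_responders(records, min_unassisted=2):
    """Count responders per service with >= min_unassisted unescalated resolutions."""
    unassisted = defaultdict(lambda: defaultdict(int))
    services = set()
    for service, responder, escalated_to_hero in records:
        services.add(service)
        if not escalated_to_hero:
            unassisted[service][responder] += 1
    return {
        service: sum(1 for n in unassisted[service].values() if n >= min_unassisted)
        for service in sorted(services)
    }


def escalation_share(records):
    """Fraction of incidents per service that were escalated to the hero."""
    totals = defaultdict(int)
    escalated = defaultdict(int)
    for service, _responder, escalated_to_hero in records:
        totals[service] += 1
        if escalated_to_hero:
            escalated[service] += 1
    return {service: escalated[service] / totals[service] for service in sorted(totals)}


# Hypothetical records: (service, responder, escalated_to_hero). Illustrative only.
RECORDS = [
    ("payments", "ben", False),
    ("payments", "ben", False),
    ("payments", "cy", True),
    ("payments", "cy", False),
    ("search", "cy", True),
    ("search", "ben", True),
]

if __name__ == "__main__":
    print(effective_responders(RECORDS))
    print(escalation_share(RECORDS))
```

Tracked over several months, a rising count of effective responders and a falling escalation share suggest the dependence is receding. Neither number captures the invisible review and reassurance work noted among the warning signs, so read them alongside those.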

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Finding repeated rescue patterns in incident and chat history
  • Drafting runbooks and knowledge packs from scattered notes
  • Summarizing hidden dependence areas across repos and operational artifacts

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Concentrating AI-tool expertise into a new kind of hero role
  • Making operational summaries seem complete when practice is missing
  • Masking dependence with better-looking docs

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.