Reduce operational dependence on heroes

Difficulty: high
Time horizon: weeks to months
Primary owner: engineering manager
Confidence: high

At a glance
EP-28

Situation: Operations, incidents, or risky changes depend on a small number of people.
Goal: Ensure operational safety and continuity do not depend on a single expert, rescuer, or informal escalation hub.
Do not use when: The dependence is temporary and an active handover is already underway.
Roles involved: engineering manager, service owners, on-call responders, hero or key expert, SRE or operations partner

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

On-call or incident response depends heavily on one or two people
Certain deploys only feel safe when specific people are present
Knowledge is routed through memory and chat rather than durable operating materials
Heroic rescue work is normalised as strength

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Hero dependence makes operations look functional until the hero is overloaded, absent, or wrong. The hidden cost shows up as release anxiety, incident bottlenecks, and slow team maturity.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

Multiple responders can handle common operational work confidently
Critical workflows are documented and practiced
Escalations become rarer, more specific, and less person-centric
Operational load is more evenly distributed
The expert spends less time rescuing and more time improving the system

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

Incident responder patterns
Release and rollback dependence patterns
Runbooks and docs
On-call rotation data
Hero-centric workflows and known risky areas

Prerequisites

Conditions that should be true for this to work.

The team can identify where hero dependence shows up operationally
The expert is willing and supported to transfer capability
Management treats transfer and practice as real work

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

01
Map hero dependence in operations
Make the invisible dependency visible and specific.
Actions
- List incident classes, deploys, and operational tasks that route to one person
- Identify what those people know or do that others do not
- Rank the dependency by risk and recurrence
Outputs
- Operational hero map
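
One way to make the ranking in this step concrete is a small scoring sketch. It is only an illustration: the 1-5 scales, the `risk * recurrence` score, and the task and person names are all assumptions, not part of the playbook.

```python
from dataclasses import dataclass


@dataclass
class Dependency:
    task: str        # incident class, deploy, or operational task
    routed_to: str   # person the work currently routes to
    risk: int        # assumed 1-5 scale: impact if that person is unavailable
    recurrence: int  # assumed 1-5 scale: how often the task comes up

    @property
    def score(self) -> int:
        # Assumed scoring rule: simple product of risk and recurrence.
        return self.risk * self.recurrence


def rank_hero_map(deps: list[Dependency]) -> list[Dependency]:
    """Order dependencies from most to least urgent to transfer."""
    # Ties on score are broken by the higher risk value.
    return sorted(deps, key=lambda d: (d.score, d.risk), reverse=True)


# Hypothetical entries for illustration only.
hero_map = rank_hero_map([
    Dependency("database failover", "alex", risk=5, recurrence=2),
    Dependency("weekly release cut", "alex", risk=3, recurrence=5),
    Dependency("certificate renewal", "sam", risk=4, recurrence=1),
])

for dep in hero_map:
    print(f"{dep.score:>2}  {dep.task} -> {dep.routed_to}")
```

A team may prefer a different weighting, for example ranking rare but catastrophic tasks first. The point is to have an explicit, reviewable order rather than a gut feeling.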
02
Transfer capability through live participation
Build real operational muscle, not passive familiarity.
Actions
- Pair on incidents, deployments, and recovery flows
- Let secondary responders lead parts of the work with support
- Rotate operational tasks intentionally
Outputs
- Capability transfer plan
- Operational rotation plan
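
The intentional rotation described in this step could be sketched as a round-robin assignment in which secondary responders lead and the expert is listed only as support. The function name, the task list, and the people are hypothetical.

```python
from itertools import cycle


def build_rotation(tasks, secondaries, hero):
    """Assign each task a lead from the secondaries, round-robin.

    The hero is recorded as support, never as lead (an assumption of
    this sketch, reflecting "lead parts of the work with support").
    """
    if not secondaries:
        raise ValueError("need at least one secondary responder")
    leads = cycle(secondaries)
    return [{"task": t, "lead": next(leads), "support": hero} for t in tasks]


# Hypothetical tasks and people for illustration only.
plan = build_rotation(
    tasks=["deploy", "rollback drill", "incident triage", "deploy"],
    secondaries=["priya", "jordan", "lee"],
    hero="alex",
)

for slot in plan:
    print(f"{slot['task']}: lead={slot['lead']}, support={slot['support']}")
```

A real rotation would also account for availability and for which tasks each person has already practiced. This sketch only shows the basic shape of the plan.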
03
Capture operational memory
Move crucial actions and judgment cues out of heads.
Actions
- Write or improve runbooks, rollback notes, and service maps
- Document why and when to escalate, not only how
- Capture historical traps and false positives
Outputs
- Operational knowledge pack
04
Reduce the need for heroics systemically
Improve the environment that keeps producing hero work.
Actions
- Fix alert quality, ownership ambiguity, observability gaps, or release fragility
- Remove unsafe manual steps where possible
- Target the recurring conditions that summon the hero
Outputs
- Anti-heroics improvement plan
05
Test for independence
Confirm the team is genuinely less dependent.
Actions
- Simulate or observe normal operations without the hero in the lead role
- Review where hesitation or confusion still appears
- Repeat until the dependence narrows to true specialty, not default rescue
Outputs
- Independence check

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

Which operational dependencies are dangerous versus merely specialized?
What should be transferred first for the biggest risk reduction?
What system conditions are creating the repeated heroics?
When is the team truly independent enough?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

Which operational situations still require a specific person to feel safe?
What do those people know, do, or notice that others do not?
What system changes would most reduce the need for rescue heroics?

Common mistakes

Patterns that surface across teams running this playbook.

Focusing only on documentation and not real operational practice
Keeping the hero as final approver forever
Ignoring the structural reasons heroics are required
Treating the hero's availability as an acceptable mitigation

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

The same person still gets pulled into every serious event
Secondary responders can recite the steps but hesitate in live work
Docs improved but operational behavior did not
The hero remains overloaded by invisible review and reassurance work

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

Operational hero map
Capability transfer plan
Operational rotation plan
Operational knowledge pack
Anti-heroics improvement plan
Independence check

Success signals

Observable changes that mean the playbook landed.

More responders can handle common incidents and deploys safely
Release and incident anxiety decreases when the hero is unavailable
The expert is interrupted less often for known scenarios
Operational learning becomes more distributed

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

Keep rotation and shadowing alive after the initial transfer
Review new hero concentrations as systems and teams evolve
Link anti-heroics work to onboarding and runbook quality

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

Incident dependence on specific individuals
Distribution of on-call escalations
Number of effective responders per critical service
Hero interruption frequency
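
Two of these signals can be approximated from an incident log. The sketch below assumes the log is a list of `(service, resolver)` pairs and that a responder counts as "effective" for a service after resolving at least two incidents there. Both the data shape and the threshold are assumptions to adapt.

```python
from collections import Counter, defaultdict


def top_responder_share(incidents):
    """Return (name, share) for the person who resolved the most incidents."""
    counts = Counter(resolver for _, resolver in incidents)
    if not counts:
        return None, 0.0
    name, resolved = counts.most_common(1)[0]
    return name, resolved / len(incidents)


def effective_responders(incidents, min_resolved=2):
    """Count responders per service with at least min_resolved resolutions."""
    per_service = defaultdict(Counter)
    for service, resolver in incidents:
        per_service[service][resolver] += 1
    return {
        service: sum(1 for n in counts.values() if n >= min_resolved)
        for service, counts in per_service.items()
    }


# Hypothetical incident log for illustration only.
incidents = [
    ("payments", "alex"), ("payments", "alex"), ("payments", "priya"),
    ("search", "alex"), ("search", "jordan"), ("search", "jordan"),
]

print(top_responder_share(incidents))
print(effective_responders(incidents))
```

Tracked over time, a falling top-responder share and a rising count of effective responders per service would suggest the dependence is receding. Both numbers are rough proxies and should be read alongside the qualitative signals above.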

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

Finding repeated rescue patterns in incident and chat history
Drafting runbooks and knowledge packs from scattered notes
Summarizing hidden dependence areas across repos and operational artifacts

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

Concentrating AI-tool expertise into a new kind of hero role
Making operational summaries seem complete when practice is missing
Masking dependence with better-looking docs

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.