Reduce operational dependence on heroes
Make operations less person-fragile by exposing where hero dependence exists, redistributing capability through practice and artifacts, and improving the system conditions that keep creating heroics.
- Situation: Operations, incidents, or risky changes depend on a small number of people.
- Goal: Ensure operational safety and continuity do not depend on a single expert, rescuer, or informal escalation hub.
- Do not use when: The dependence is temporary and an active handover is already underway.
- Primary owner: Engineering manager
- Roles involved: Engineering manager, service owners, on-call responders, hero or key expert, SRE or operations partner
Context
The situation
Deciding whether to reach for this playbook: when it fits, and when it doesn't.
Use when
Conditions where this playbook is the right tool.
- On-call or incident response depends heavily on one or two people
- Certain deploys only feel safe when specific people are present
- Knowledge is routed through memory and chat rather than durable operating materials
- Heroic rescue work is normalised as strength
Do not use when
Contexts where this playbook will waste effort or make things worse.
- The dependence is temporary and an active handover is already underway
- The real problem is severe understaffing or missing operational ownership, not just knowledge concentration
- Leaders want to remove dependence without creating time for transfer or documentation
Stakes
Why this matters
What this playbook protects against, and why skipping or half-running it tends to be expensive.
Hero dependence makes operations look functional until the hero is overloaded, absent, or wrong. The hidden cost shows up as release anxiety, incident bottlenecks, and a team that matures slowly.
Quality bar
What good looks like
The observable qualities of a team or system that is actually doing this well, not just going through the motions.
Signs of the playbook done well
- Multiple responders can handle common operational work confidently
- Critical workflows are documented and practiced
- Escalations become rarer, more specific, and less person-centric
- Operational load is distributed more evenly
- The expert spends less time rescuing and more time improving the system
Preparation
Before you start
What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.
Inputs
Material you'll want to gather first.
- Incident responder patterns
- Release and rollback dependence patterns
- Runbooks and docs
- On-call rotation data
- Hero-centric workflows and known risky areas
Prerequisites
Conditions that should be true for this to work.
- The team can identify where hero dependence shows up operationally
- The expert is willing and supported to transfer capability
- Management treats transfer and practice as real work
Procedure
The procedure
Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.
Map hero dependence in operations
Make the invisible dependency visible and specific.
Actions
- List incident classes, deploys, and operational tasks that route to one person
- Identify what those people know or do that others do not
- Rank the dependency by risk and recurrence
Outputs
- Operational hero map
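One way to make the map concrete is to score each operational task by how concentrated its handling is and how often it recurs. The sketch below is a minimal illustration, not a prescribed method. The event shape (`task`, `handler`), the example names, the `risk_weights` values, and the scoring formula (top handler share times occurrences times risk weight) are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_hero_map(events, risk_weights):
    """Rank operational tasks by how strongly they route to one person.

    events: iterable of (task, handler) tuples, e.g. pulled from incident
        or deploy records (hypothetical shape).
    risk_weights: hypothetical per-task impact weight (1 = low, 3 = high).
    """
    handlers_by_task = defaultdict(Counter)
    for task, handler in events:
        handlers_by_task[task][handler] += 1

    rows = []
    for task, counts in handlers_by_task.items():
        recurrence = sum(counts.values())
        top_handler, top_count = counts.most_common(1)[0]
        share = top_count / recurrence  # fraction handled by one person
        # Illustrative score: concentration x recurrence x assumed risk weight.
        score = share * recurrence * risk_weights.get(task, 1)
        rows.append({
            "task": task,
            "top_handler": top_handler,
            "share": share,
            "recurrence": recurrence,
            "score": score,
        })
    return sorted(rows, key=lambda row: row["score"], reverse=True)


# Made-up sample data for illustration only.
events = [
    ("db-failover", "ana"), ("db-failover", "ana"), ("db-failover", "ana"),
    ("deploy-payments", "ana"), ("deploy-payments", "ben"),
    ("cert-renewal", "cy"),
]
risk_weights = {"db-failover": 3, "deploy-payments": 2, "cert-renewal": 1}

hero_map = build_hero_map(events, risk_weights)
for row in hero_map:
    print(row["task"], row["top_handler"], row["share"], row["score"])
```

A ranked list like this is only a starting point for conversation. The second action above (what those people know or do that others do not) still has to come from talking to the people involved.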
Transfer capability through live participation
Build real operational muscle, not passive familiarity.
Actions
- Pair on incidents, deployments, and recovery flows
- Let secondary responders lead parts of the work with support
- Rotate operational tasks intentionally
Outputs
- Capability transfer plan
- Operational rotation plan
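As a rough sketch of what an intentional rotation could look like, the snippet below assigns each operational task a lead from the non-expert responders in round-robin order and keeps the expert in a support role. The function name, task names, and responder names are hypothetical, and a real plan would also need to account for availability, task difficulty, and readiness.

```python
from itertools import cycle


def rotation_plan(tasks, responders, expert):
    """Give each task a non-expert lead, round-robin, with the expert as support."""
    leads = [r for r in responders if r != expert]
    if not leads:
        raise ValueError("need at least one responder besides the expert")
    lead_cycle = cycle(leads)
    return [
        {"task": task, "lead": next(lead_cycle), "support": expert}
        for task in tasks
    ]


# Made-up sample data for illustration only.
plan = rotation_plan(
    tasks=["db-failover drill", "payments deploy", "rollback rehearsal"],
    responders=["ana", "ben", "cy"],
    expert="ana",
)
for row in plan:
    print(row)
```

Keeping the expert in the support column rather than removing them entirely reflects the step's intent: secondary responders lead parts of the work, with support.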
Capture operational memory
Move crucial actions and judgment cues out of heads.
Actions
- Write or improve runbooks, rollback notes, and service maps
- Document why and when to escalate, not only how
- Capture historical traps and false positives
Outputs
- Operational knowledge pack
Reduce the need for heroics systemically
Improve the environment that keeps producing hero work.
Actions
- Fix alert quality, ownership ambiguity, observability gaps, or release fragility
- Remove unsafe manual steps where possible
- Target the recurring conditions that summon the hero
Outputs
- Anti-heroics improvement plan
Test for independence
Confirm the team is genuinely less dependent.
Actions
- Simulate or observe normal operations without the hero in the lead role
- Review where hesitation or confusion still appears
- Repeat until the dependence narrows to true specialty, not default rescue
Outputs
- Independence check
Judgment
Judgment calls and pitfalls
The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.
Decision points
Moments where judgment and trade-offs matter more than procedure.
- Which operational dependencies are dangerous versus merely specialized?
- What should be transferred first for the biggest risk reduction?
- What system conditions are creating the repeated heroics?
- When is the team independent enough?
Questions worth asking
Prompts to use on yourself, the team, or an AI assistant while running the procedure.
- Which operational situations still require a specific person to feel safe?
- What do those people know, do, or notice that others do not?
- What system changes would most reduce the need for rescue heroics?
Common mistakes
Patterns that surface across teams running this playbook.
- Focusing only on documentation and not real operational practice
- Keeping the hero as final approver forever
- Ignoring the structural reasons heroics are required
- Treating availability of the hero as acceptable mitigation
Warning signs you are doing it wrong
Signals that the playbook is being executed but not landing.
- The same person still gets pulled into every serious event
- Secondary responders can recite the steps but hesitate in live work
- Docs improved but operational behavior did not
- The hero remains overloaded by invisible review and reassurance work
Outcomes
Outcomes and signals
What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.
Artifacts to produce
Durable outputs the playbook should leave behind.
- Operational hero map
- Capability transfer plan
- Operational rotation plan
- Operational knowledge pack
- Anti-heroics improvement plan
- Independence check
Success signals
Observable changes that mean the playbook landed.
- More responders can handle common incidents and deploys safely
- Release and incident anxiety decreases when the hero is unavailable
- The expert is interrupted less often for known scenarios
- Operational learning becomes more distributed
Follow-up actions
Moves that keep the playbook's effects compounding after it finishes.
- Keep rotation and shadowing alive after the initial transfer
- Review new hero concentrations as systems and teams evolve
- Link anti-heroics work to onboarding and runbook quality
Metrics or signals to watch
Longer-horizon indicators that the underlying problem is receding.
- Incident dependence on specific individuals
- Distribution of on-call escalations
- Number of effective responders per critical service
- Hero interruption frequency
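Two of these signals can be approximated from incident records. The sketch below assumes each record carries a `service` and a `resolver` field, and treats anyone who has resolved at least `min_resolved` incidents on a service as an "effective responder". Both the field names and that threshold are assumptions for illustration, not definitions from this playbook.

```python
from collections import Counter, defaultdict


def dependence_metrics(incidents, min_resolved=2):
    """Summarise per-service dependence on individual resolvers.

    incidents: iterable of dicts with hypothetical keys 'service' and 'resolver'.
    min_resolved: assumed threshold for counting someone as an effective responder.
    """
    by_service = defaultdict(Counter)
    for incident in incidents:
        by_service[incident["service"]][incident["resolver"]] += 1

    metrics = {}
    for service, counts in by_service.items():
        total = sum(counts.values())
        metrics[service] = {
            "incidents": total,
            # Share of incidents resolved by the single most frequent resolver.
            "top_resolver_share": max(counts.values()) / total,
            # People who resolved at least `min_resolved` incidents here.
            "effective_responders": sum(
                1 for n in counts.values() if n >= min_resolved
            ),
        }
    return metrics


# Made-up sample data for illustration only.
incidents = (
    [{"service": "payments", "resolver": "ana"}] * 4
    + [{"service": "payments", "resolver": "ben"}]
    + [{"service": "search", "resolver": "ben"}] * 2
    + [{"service": "search", "resolver": "cy"}] * 2
)

metrics = dependence_metrics(incidents)
for service, values in metrics.items():
    print(service, values)
```

Tracked over time, a falling top-resolver share and a rising effective-responder count would be consistent with the dependence receding. The numbers say nothing about confidence in live work, so they complement rather than replace the independence check.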
AI impact
AI effects on this playbook
How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.
AI can help with
Where AI tooling genuinely reduces the cost of running this playbook well.
- Finding repeated rescue patterns in incident and chat history
- Drafting runbooks and knowledge packs from scattered notes
- Summarizing hidden dependence areas across repos and operational artifacts
AI can make it worse by
Distortions AI introduces that make the underlying problem harder to see.
- Concentrating AI-tool expertise into a new kind of hero role
- Making operational summaries seem complete when practice is missing
- Masking dependence with better-looking docs
AI synthesis
AI helps convert tribal knowledge into first-pass artifacts, but only repeated live practice reduces real hero dependence.
Relationships
Connected playbooks
Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.