The Hard Parts.dev
EP-30 Operations EP Engineering Playbook
Difficulty medium-high Owner · engineering manager

Triage operational debt

Triage operational debt by identifying recurring pain patterns, ranking them by impact and drag, and choosing what to eliminate, reduce, contain, or deliberately tolerate for now.

Difficulty: medium-high
Time horizon: days to assess, then ongoing prioritization over weeks or quarters
Primary owner: engineering manager
Confidence: high
At a glance (EP-30)
Situation: Operational pain is accumulating faster than the team can address it all at once.
Goal: Turn operational debt from background suffering into an explicit, prioritized reliability and delivery problem.
Do not use when: the team has almost no visibility into its operational pain and needs measurement first
Primary owner: engineering manager
Roles involved

  • engineering manager
  • service owners
  • operations or SRE
  • tech lead
  • delivery lead when debt affects roadmap predictability

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • The team lives with too many known operational rough edges
  • Incidents, noise, manual work, and brittle release paths accumulate faster than fixes
  • Teams repeatedly say "we know about it" but never get to it
  • Operational debt is clearly affecting delivery and morale

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Operational debt becomes normal very quickly. Once normalized, teams stop seeing it as a system cost even while it quietly eats roadmap time, responder attention, and trust.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well, not just going through the motions.

Signs of the playbook done well

  • Operational debt items are clustered into meaningful classes, not just a long complaint list
  • The team knows which debt creates the most repeat pain or risk
  • Some debt is explicitly accepted rather than silently carried
  • Debt reduction work is tied to actual operational outcomes

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Incident history
  • Manual operational tasks
  • Alert fatigue patterns
  • Release and rollback pain
  • Support escalations
  • Team-reported chronic friction

Prerequisites

Conditions that should be true for this to work.

  • The team can identify recurring operational pain with examples
  • There is space to reprioritize some capacity
  • Leadership accepts that some roadmap work may need to move

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Collect debt by pain pattern, not by random complaint

    Turn noise into operational categories.

    Actions

    • Gather recurring manual steps, noisy failures, fragile deploy paths, and repeat incidents
    • Group them into patterns such as alerting, ownership, release, observability, data recovery, or runbook gaps
    • Connect each to actual operational cost

    Outputs

    • Operational debt map
  2. Rank debt by impact and drag

    Prioritize where debt reduction matters most.

    Actions

    • Estimate impact on incidents, team load, roadmap drag, and risk exposure
    • Distinguish high-cost repeat pain from annoying but low-impact problems
    • Identify hidden dependencies where one debt class amplifies others

    Outputs

    • Debt ranking
  3. Choose treatment mode for each major item

    Avoid pretending every debt needs full elimination now.

    Actions

    • Label items as eliminate, reduce, contain, monitor, or accept
    • Record what evidence would justify keeping or escalating an item
    • Make acceptance explicit instead of silent

    Outputs

    • Debt treatment plan
  4. Allocate capacity intentionally

    Make debt reduction part of the operating plan.

    Actions

    • Reserve a realistic slice of capacity for operational debt work
    • Sequence debt fixes that unlock other fixes or reduce repeated toil fastest
    • Tie work to measurable operational gains where possible

    Outputs

    • Operational debt roadmap
  5. Review whether debt load is shrinking or shifting

    Keep the debt picture honest over time.

    Actions

    • Track whether targeted debt reduced incidents, noise, or manual work
    • Review new debt created by growth or platform change
    • Refresh ranking periodically

    Outputs

    • Debt review cycle
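
Steps 1 through 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a prescribed tool: the `DebtItem` fields, the 1-5 scoring scale, the weights, and the sample items are all hypothetical, and a team would substitute its own categories and evidence.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Treatment(Enum):
    # The five treatment modes named in step 3.
    ELIMINATE = "eliminate"
    REDUCE = "reduce"
    CONTAIN = "contain"
    MONITOR = "monitor"
    ACCEPT = "accept"


@dataclass
class DebtItem:
    name: str
    pattern: str                 # pain pattern, e.g. "alerting" or "release"
    impact: int                  # 1-5, incidents and risk exposure (assumed scale)
    drag: int                    # 1-5, team load and roadmap drag (assumed scale)
    treatment: Optional[Treatment] = None
    revisit_if: str = ""         # evidence that would justify escalating the item


def build_debt_map(items):
    """Step 1: group debt items by pain pattern."""
    debt_map = defaultdict(list)
    for item in items:
        debt_map[item.pattern].append(item)
    return dict(debt_map)


def score(item, impact_weight=3, drag_weight=2):
    """Step 2: combine impact and drag; the weights are arbitrary placeholders."""
    return impact_weight * item.impact + drag_weight * item.drag


def rank_debt(items):
    """Step 2: highest combined score first."""
    return sorted(items, key=score, reverse=True)


def unlabeled(items):
    """Step 3: items with no treatment yet are debt carried silently."""
    return [item for item in items if item.treatment is None]


# Hypothetical sample data.
items = [
    DebtItem("Flaky disk-space alert pages nightly", "alerting", impact=2, drag=5),
    DebtItem("Manual database failover", "data recovery", impact=5, drag=3),
    DebtItem("Deploy requires hand-edited config", "release", impact=4, drag=4),
    DebtItem("Stale dashboard nobody reads", "observability", impact=1, drag=1),
]

debt_map = build_debt_map(items)
ranked = rank_debt(items)

# Treatment is a human decision; the code only records it.
ranked[0].treatment = Treatment.ELIMINATE
ranked[-1].treatment = Treatment.ACCEPT
ranked[-1].revisit_if = "someone starts relying on this dashboard during incidents"

for item in ranked:
    label = item.treatment.value if item.treatment else "UNDECIDED"
    print(f"{score(item):>3}  {item.pattern:<14} {label:<10} {item.name}")
```

The `unlabeled` helper is the part worth keeping even if the scoring is replaced by discussion: any item it returns has been ranked but not decided, which is the silent carrying that step 3 is meant to prevent.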

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • Which debt is actually causing the most repeated pain?
  • What should be fixed now versus explicitly accepted for now?
  • Which debt classes unlock the biggest reduction in operational drag?
  • How much capacity can the team protect for this work?
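
The capacity question can be made concrete with a rough budget calculation. The sketch below assumes the team can estimate effort and recurring toil saved for each candidate fix; the `Fix` fields, the 20% reserved fraction, and all numbers are hypothetical. It uses a simple greedy heuristic (best toil saved per day of effort first), which is not guaranteed to be optimal and does not model fixes that unlock other fixes.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    name: str
    effort_days: float             # estimated effort (hypothetical)
    toil_hours_saved: float        # estimated recurring toil removed per month (hypothetical)


def plan_within_capacity(fixes, team_days, reserved_fraction):
    """Greedily pick fixes by toil saved per day of effort until the budget runs out."""
    remaining = team_days * reserved_fraction
    chosen = []
    by_payoff = sorted(
        fixes, key=lambda f: f.toil_hours_saved / f.effort_days, reverse=True
    )
    for fix in by_payoff:
        if fix.effort_days <= remaining:
            chosen.append(fix)
            remaining -= fix.effort_days
    return chosen, remaining


# Hypothetical candidates for one planning period.
fixes = [
    Fix("Automate database failover", effort_days=12, toil_hours_saved=30),
    Fix("Fix flaky disk-space alert", effort_days=2, toil_hours_saved=10),
    Fix("Rebuild deploy pipeline", effort_days=15, toil_hours_saved=45),
    Fix("Write rollback runbook", effort_days=2, toil_hours_saved=4),
]

chosen, left_over = plan_within_capacity(fixes, team_days=100, reserved_fraction=0.2)
for fix in chosen:
    print(f"{fix.name}: {fix.effort_days} days")
print(f"Unallocated days: {left_over}")
```

Anything that does not fit the protected budget is a candidate for an explicit contain, monitor, or accept label rather than an open-ended promise.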

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • Which recurring operational pain is costing us the most?
  • What debt should we explicitly accept versus keep pretending is temporary?
  • Which one or two fixes would reduce the most repeated toil or incident risk?

Common mistakes

Patterns that surface across teams running this playbook.

  • Keeping an unranked debt list forever
  • Treating all debt as equally urgent
  • Hiding accepted debt instead of naming it
  • Measuring debt reduction by tickets closed instead of pain removed

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • The debt list keeps growing but nothing becomes easier operationally
  • The same debt themes dominate incidents and retros quarter after quarter
  • Leaders say debt matters but capacity is never protected
  • Debt work produces artifacts without reducing operational burden

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Operational debt map
  • Debt ranking
  • Debt treatment plan
  • Operational debt roadmap
  • Debt review cycle

Success signals

Observable changes that mean the playbook landed.

  • Repeat pain and responder burden decline in targeted areas
  • The team can explain its operational debt posture clearly
  • Some accepted debt becomes visible rather than quietly normalized
  • Delivery stability improves because operational drag decreases

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Connect repeat debt classes to architecture and ownership changes
  • Keep accepted debt visible in planning conversations
  • Re-run the triage after major platform or team changes

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Manual operational hours
  • Repeat incident count by debt class
  • Alert fatigue indicators
  • Release friction indicators
  • Operational debt trend by category
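
As one way to track the second and last of these signals, the sketch below counts incidents per debt class in each review period and reports the change between two periods. The incident records, the period labels, and the rule that "repeat" means more than one incident in a period are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical incident log: (review period, debt class the incident traced back to).
incidents = [
    ("Q1", "alerting"), ("Q1", "alerting"), ("Q1", "alerting"),
    ("Q1", "release"), ("Q1", "release"),
    ("Q1", "data recovery"),
    ("Q2", "alerting"),
    ("Q2", "release"), ("Q2", "release"),
    ("Q2", "ownership"), ("Q2", "ownership"),
]


def counts_for(incidents, period):
    """Incident count per debt class within one review period."""
    return Counter(cls for p, cls in incidents if p == period)


def repeat_counts(incidents, period):
    """Debt classes that caused more than one incident in the period."""
    return {cls: n for cls, n in counts_for(incidents, period).items() if n > 1}


def trend(incidents, earlier, later):
    """Change in incident count per debt class; negative means the class is receding."""
    before = counts_for(incidents, earlier)
    after = counts_for(incidents, later)
    return {cls: after[cls] - before[cls] for cls in sorted(set(before) | set(after))}


print(repeat_counts(incidents, "Q1"))
print(trend(incidents, "Q1", "Q2"))
```

A positive entry for a class that was not on the previous ranking (here, the made-up "ownership" class) is the kind of shifted debt the review step is meant to catch.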

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Clustering incident and toil patterns into debt categories
  • Drafting debt rankings and treatment options
  • Finding recurring operational pain in chats, tickets, and runbooks

AI can make this worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Inflating debt inventories with low-value noise
  • Making the debt program look organized without real reprioritization
  • Encouraging summary over judgment

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.