The Hard Parts.dev
EP-29 · Operations · Engineering Playbook

Create meaningful alerts

Design alerts around actionable operational meaning: what is wrong, who should care, how urgent it is, and what first action or investigation path should follow.

Difficulty
medium-high
Time horizon
days to weeks, ongoing tuning afterward
Primary owner
service owner
Confidence
high
At a glance · EP-29
Situation
Alerting is noisy, weak, or not tied to meaningful operational action.
Goal
Reduce noisy, low-trust alerting and improve the team’s ability to detect and respond to important issues quickly.
Do not use when
the service lacks basic instrumentation and needs observability foundations first
Primary owner
service owner
Roles involved

  • service owner
  • SRE or operations partner
  • on-call responders
  • tech lead
  • product or business contact when user harm thresholds matter

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • On-call responders ignore, mute, or distrust alerts
  • Important incidents are detected late despite many alerts
  • Alerts do not map clearly to ownership or action
  • Teams argue over which alerts matter

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Alerts are supposed to convert system degradation into timely human attention. When they are noisy, vague, or ownerless, they train people to ignore the very system meant to protect them.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs of the playbook done well

  • Alerts correspond to user or system-impactful conditions
  • Responders know what each important alert means and who owns it
  • Alert volume is low enough that attention still has value
  • Alerts link naturally into runbooks or diagnostic paths
  • False positives and repeated low-value alerts are actively pruned

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Current alert inventory
  • Incident history
  • False positive and missed detection patterns
  • Service SLOs or practical health indicators
  • Ownership and escalation map

Prerequisites

Conditions that should be true for this to work.

  • The team can inspect current alerting and incident outcomes
  • There is enough telemetry to build meaningful conditions
  • Ownership and escalation paths are defined

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Audit the current alert set

    Measure what the alert system is actually teaching responders.

    Actions

    • Review current alerts, frequency, recipients, and outcomes
    • Identify noisy, ignored, duplicate, and low-value alerts
    • Find incidents that were missed or detected too late

    Outputs

    • Alert audit (a minimal audit sketch follows this procedure)
  2. Classify alerts by operational meaning

    Separate what matters from what is merely measurable.

    Actions

    • Group alerts into user-impact, service-health, dependency, capacity, and informational classes
    • Decide which alerts require action and which belong in dashboards only
    • Tie urgency to impact, not just threshold breach

    Outputs

    • Alert taxonomy (the catalog sketch after this procedure covers this step and the next)
  3. Align alerting with ownership and first action

    Make alerts useful the moment they fire.

    Actions

    • Assign ownership for each action-worthy alert
    • Link alerts to runbooks, dashboards, or diagnostic starting points
    • Clarify escalation expectations

    Outputs

    • Owner-mapped alert set (see the same catalog sketch after this procedure)
  4. Tune for trust, not volume

    Reduce alert fatigue while preserving detection.

    Actions

    • Remove or downgrade low-value alerts
    • Tighten thresholds and grouping where repeated noise exists
    • Prefer symptom and impact signals over endless internal chatter

    Outputs

    • Tuned alert configuration (a burn-rate tuning sketch follows this procedure)
  5. Review alerts after real events

    Keep the alert system grounded in operational outcomes.

    Actions

    • Check which alerts helped, confused, or failed to detect incidents
    • Adjust based on real operational use
    • Treat alert design as continuous operational design, not one-time setup

    Outputs

    • Post-incident alert review (a review-record sketch follows this procedure)
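
For step 1, a minimal audit sketch in Python. It assumes you can export recent alert firings as a CSV with columns alert_name, acked, and led_to_action; those column names, the file name, and the cut-off percentages are illustrative assumptions, not the schema of any particular alerting tool.

  # Summarize an exported firing history and flag candidates for demotion
  # or removal. Column names and thresholds are illustrative assumptions.
  import csv
  from collections import defaultdict

  def audit(path: str) -> None:
      stats = defaultdict(lambda: {"fired": 0, "acked": 0, "actioned": 0})
      with open(path, newline="") as f:
          for row in csv.DictReader(f):
              s = stats[row["alert_name"]]
              s["fired"] += 1
              s["acked"] += row["acked"] == "true"
              s["actioned"] += row["led_to_action"] == "true"

      for name, s in sorted(stats.items(), key=lambda kv: -kv[1]["fired"]):
          ack_rate = s["acked"] / s["fired"]
          action_rate = s["actioned"] / s["fired"]
          label = "review"
          if ack_rate < 0.5:
              label = "likely ignored"     # responders are tuning it out
          elif action_rate < 0.1:
              label = "noisy, low value"   # fires often, rarely drives action
          print(f"{name}: fired={s['fired']} ack={ack_rate:.0%} "
                f"action={action_rate:.0%} -> {label}")

  if __name__ == "__main__":
      audit("alert_firings.csv")

Even a crude table like this makes the "what is the alert system teaching responders" conversation concrete.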
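
For steps 2 and 3, a sketch of an alert catalog in which every action-worthy alert carries a class, an urgency decision, an owner, a runbook, and a first action. The field names, example alerts, and URLs are invented for illustration; the point is the validation rule, not the schema.

  # Alert catalog sketch: classification plus ownership and first action.
  from dataclasses import dataclass

  @dataclass
  class AlertSpec:
      name: str
      alert_class: str   # user-impact | service-health | dependency | capacity | informational
      pages_human: bool  # interrupt someone, or dashboard-only?
      owner: str         # team or rotation accountable for responding
      runbook_url: str   # diagnostic starting point
      first_action: str  # what the responder should do first

  CATALOG = [
      AlertSpec("checkout-error-rate-high", "user-impact", True,
                "payments-oncall", "https://runbooks.example/checkout-errors",
                "Check recent deploys, then error breakdown by dependency"),
      AlertSpec("queue-depth-rising", "capacity", False,
                "payments-oncall", "https://runbooks.example/queue-depth",
                "Review consumer lag trends on the capacity dashboard"),
  ]

  def validate(catalog: list[AlertSpec]) -> None:
      # Every paging alert must name an owner, a runbook, and a first action.
      for spec in catalog:
          if spec.pages_human and not (spec.owner and spec.runbook_url and spec.first_action):
              raise ValueError(f"{spec.name} pages a human but lacks owner, runbook, or first action")

  validate(CATALOG)

Keeping a catalog like this in version control alongside the alert definitions makes the ownership and first-action review part of normal code review.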
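
For step 4, a sketch of one common way to tie paging to impact rather than to single threshold spikes: a multi-window error-budget burn-rate check. The 99.9% objective and the 14.4 threshold (often cited for a one-hour window against a 30-day budget) are illustrative; substitute your own SLO and windows.

  # Page only when the error budget is burning fast over both a long and a
  # short window: the long window shows the problem is sustained, the short
  # window shows it is still happening now. Values are illustrative.
  SLO_TARGET = 0.999
  BUDGET = 1 - SLO_TARGET  # allowed error ratio over the SLO period

  def should_page(error_ratio_1h: float, error_ratio_5m: float,
                  burn_threshold: float = 14.4) -> bool:
      return (error_ratio_1h / BUDGET > burn_threshold
              and error_ratio_5m / BUDGET > burn_threshold)

  # Example: 2% of requests failing in both windows against a 99.9% objective.
  print(should_page(error_ratio_1h=0.02, error_ratio_5m=0.02))  # True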
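
For step 5, a sketch of a per-incident review record that captures how the alert set actually performed. Field names and example values are illustrative assumptions; the same records feed the metrics listed later in this playbook.

  # Post-incident alert review record: which alerts helped, confused, or
  # were missing, and how long the issue ran before a human noticed.
  from dataclasses import dataclass, field

  @dataclass
  class AlertReview:
      incident_id: str
      detected_by_alert: bool                 # or discovered via users or side channels
      minutes_onset_to_awareness: float
      alerts_that_helped: list[str] = field(default_factory=list)
      alerts_that_confused: list[str] = field(default_factory=list)
      alerts_missing: list[str] = field(default_factory=list)  # signals we wished existed

  review = AlertReview(
      incident_id="2024-05-12-checkout-outage",
      detected_by_alert=False,
      minutes_onset_to_awareness=42.0,
      alerts_that_confused=["queue-depth-rising"],
      alerts_missing=["checkout-success-rate-drop"],
  )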

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • Which conditions deserve interrupting a human?
  • What should be a dashboard signal instead of an alert?
  • What first action should follow this alert?
  • What threshold reflects meaningful risk rather than background noise?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • Which alerts actually changed responder behavior during real incidents?
  • What should wake a human versus remain visible in a dashboard?
  • What does this alert expect the responder to do first?

Common mistakes

Patterns that surface across teams running this playbook.

  • Alerting on everything measurable
  • Sending the same low-context alert to everyone
  • Confusing threshold breach with action-worthy degradation
  • Never pruning alerts once added

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • Responders mute or ignore frequent alert classes
  • Major incidents are still discovered through users or side channels
  • Alerts require tribal knowledge to interpret
  • The alert list grows but trust in alerting shrinks

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Alert audit
  • Alert taxonomy
  • Owner-mapped alert set
  • Tuned alert configuration
  • Post-incident alert review

Success signals

Observable changes that mean the playbook landed.

  • Important issues are detected earlier
  • Noise and false positives decline
  • Responders know what to do when key alerts fire
  • Alert discussions become more about meaning than fatigue

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Review alert health after every major incident
  • Fold changes into runbooks and service onboarding
  • Remove stale alerts as architecture and traffic patterns evolve

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Alert volume per responder
  • False positive rate
  • Missed detection count
  • Time from issue onset to useful human awareness
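
A sketch of how these four signals can be computed from two small record sets: one entry per alert firing and one per incident. The field names and sample data are illustrative assumptions; in practice the inputs would come from the alert audit and the post-incident reviews described in the procedure.

  # Watch metrics from hypothetical firing and incident records.
  firings = [
      {"alert": "checkout-error-rate-high", "responder": "alice", "false_positive": False},
      {"alert": "queue-depth-rising", "responder": "alice", "false_positive": True},
      {"alert": "queue-depth-rising", "responder": "bob", "false_positive": True},
  ]
  incidents = [
      {"id": "2024-05-12", "detected_by_alert": False, "minutes_to_awareness": 42.0},
      {"id": "2024-06-03", "detected_by_alert": True, "minutes_to_awareness": 6.0},
  ]

  responders = {f["responder"] for f in firings}
  volume_per_responder = len(firings) / len(responders)
  false_positive_rate = sum(f["false_positive"] for f in firings) / len(firings)
  missed_detections = sum(not i["detected_by_alert"] for i in incidents)
  mean_minutes_to_awareness = sum(i["minutes_to_awareness"] for i in incidents) / len(incidents)

  print(f"alert volume per responder: {volume_per_responder:.1f}")
  print(f"false positive rate:        {false_positive_rate:.0%}")
  print(f"missed detections:          {missed_detections}")
  print(f"minutes to awareness (avg): {mean_minutes_to_awareness:.1f}")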

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Grouping alert noise patterns
  • Correlating alerts with incident outcomes
  • Drafting alert rationales and runbook links

AI can make worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Encouraging more alerts because generation is cheap
  • Creating plausible descriptions for alerts that still lack action value
  • Masking noisy alerting behind better wording

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.