The Hard Parts.dev
EP-29 · Operations · Engineering Playbook

Create meaningful alerts

Design alerts around actionable operational meaning: what is wrong, who should care, how urgent it is, and what first action or investigation path should follow.

Difficulty
medium-high
Time horizon
days to weeks, ongoing tuning afterward
Primary owner
service owner
Confidence
high
At a glance · EP-29
Situation
Alerting is noisy, weak, or not tied to meaningful operational action.
Goal
Reduce noisy, low-trust alerting and improve the team’s ability to detect and respond to important issues quickly.
Do not use when
the service lacks basic instrumentation and needs observability foundations first
Primary owner
service owner
Roles involved

  • service owner
  • SRE or operations partner
  • on-call responders
  • tech lead
  • product or business contact when user harm thresholds matter

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • On-call responders ignore, mute, or distrust alerts
  • Important incidents are detected late despite many alerts
  • Alerts do not map clearly to ownership or action
  • Teams argue over which alerts matter

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Alerts are supposed to convert system degradation into timely human attention. When they are noisy, vague, or ownerless, they train people to ignore the very system meant to protect them.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs of the playbook done well

  • Alerts correspond to user or system-impactful conditions
  • Responders know what each important alert means and who owns it
  • Alert volume is low enough that attention still has value
  • Alerts link naturally into runbooks or diagnostic paths
  • False positives and repeated low-value alerts are actively pruned

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Current alert inventory
  • Incident history
  • False positive and missed detection patterns
  • Service SLOs or practical health indicators
  • Ownership and escalation map

Prerequisites

Conditions that should be true for this to work.

  • The team can inspect current alerting and incident outcomes
  • There is enough telemetry to build meaningful conditions
  • Ownership and escalation paths are defined

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Audit the current alert set

    Measure what the alert system is actually teaching responders.

    Actions

    • Review current alerts, frequency, recipients, and outcomes
    • Identify noisy, ignored, duplicate, and low-value alerts
    • Find incidents that were missed or detected too late

    Outputs

    • Alert audit (a minimal audit sketch follows this procedure)
  2. Classify alerts by operational meaning

    Separate what matters from what is merely measurable.

    Actions

    • Group alerts into user-impact, service-health, dependency, capacity, and informational classes
    • Decide which alerts require action and which belong in dashboards only
    • Tie urgency to impact, not just threshold breach

    Outputs

    • Alert taxonomy (the catalog sketch after this procedure covers this step and the next)
  3. Align alerting with ownership and first action

    Make alerts useful the moment they fire.

    Actions

    • Assign ownership for each action-worthy alert
    • Link alerts to runbooks, dashboards, or diagnostic starting points
    • Clarify escalation expectations

    Outputs

    • Owner-mapped alert set (see the same catalog sketch after this procedure)
  4. Tune for trust, not volume

    Reduce alert fatigue while preserving detection.

    Actions

    • Remove or downgrade low-value alerts
    • Tighten thresholds and grouping where repeated noise exists
    • Prefer symptom and impact signals over endless internal chatter

    Outputs

    • Tuned alert configuration (a burn-rate tuning sketch follows this procedure)
  5. Review alerts after real events

    Keep the alert system grounded in operational outcomes.

    Actions

    • Check which alerts helped, confused, or failed to detect incidents
    • Adjust based on real operational use
    • Treat alert design as continuous operational design, not one-time setup

    Outputs

    • Post-incident alert review (a review-record sketch follows this procedure)
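
For step 1, a minimal audit sketch in Python. It assumes you can export recent alert firings as a CSV with columns alert_name, acked, and led_to_action; those column names, the file name, and the cut-off percentages are illustrative assumptions, not the schema of any particular alerting tool.

  # Summarize an exported firing history and flag candidates for demotion
  # or removal. Column names and thresholds are illustrative assumptions.
  import csv
  from collections import defaultdict

  def audit(path: str) -> None:
      stats = defaultdict(lambda: {"fired": 0, "acked": 0, "actioned": 0})
      with open(path, newline="") as f:
          for row in csv.DictReader(f):
              s = stats[row["alert_name"]]
              s["fired"] += 1
              s["acked"] += row["acked"] == "true"
              s["actioned"] += row["led_to_action"] == "true"

      for name, s in sorted(stats.items(), key=lambda kv: -kv[1]["fired"]):
          ack_rate = s["acked"] / s["fired"]
          action_rate = s["actioned"] / s["fired"]
          label = "review"
          if ack_rate < 0.5:
              label = "likely ignored"     # responders are tuning it out
          elif action_rate < 0.1:
              label = "noisy, low value"   # fires often, rarely drives action
          print(f"{name}: fired={s['fired']} ack={ack_rate:.0%} "
                f"action={action_rate:.0%} -> {label}")

  if __name__ == "__main__":
      audit("alert_firings.csv")

Even a crude table like this makes the "what is the alert system teaching responders" conversation concrete.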
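
For steps 2 and 3, a sketch of an alert catalog in which every action-worthy alert carries a class, an urgency decision, an owner, a runbook, and a first action. The field names, example alerts, and URLs are invented for illustration; the point is the validation rule, not the schema.

  # Alert catalog sketch: classification plus ownership and first action.
  from dataclasses import dataclass

  @dataclass
  class AlertSpec:
      name: str
      alert_class: str   # user-impact | service-health | dependency | capacity | informational
      pages_human: bool  # interrupt someone, or dashboard-only?
      owner: str         # team or rotation accountable for responding
      runbook_url: str   # diagnostic starting point
      first_action: str  # what the responder should do first

  CATALOG = [
      AlertSpec("checkout-error-rate-high", "user-impact", True,
                "payments-oncall", "https://runbooks.example/checkout-errors",
                "Check recent deploys, then error breakdown by dependency"),
      AlertSpec("queue-depth-rising", "capacity", False,
                "payments-oncall", "https://runbooks.example/queue-depth",
                "Review consumer lag trends on the capacity dashboard"),
  ]

  def validate(catalog: list[AlertSpec]) -> None:
      # Every paging alert must name an owner, a runbook, and a first action.
      for spec in catalog:
          if spec.pages_human and not (spec.owner and spec.runbook_url and spec.first_action):
              raise ValueError(f"{spec.name} pages a human but lacks owner, runbook, or first action")

  validate(CATALOG)

Keeping a catalog like this in version control alongside the alert definitions makes the ownership and first-action review part of normal code review.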
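
For step 4, a sketch of one common way to tie paging to impact rather than to single threshold spikes: a multi-window error-budget burn-rate check. The 99.9% objective and the 14.4 threshold (often cited for a one-hour window against a 30-day budget) are illustrative; substitute your own SLO and windows.

  # Page only when the error budget is burning fast over both a long and a
  # short window: the long window shows the problem is sustained, the short
  # window shows it is still happening now. Values are illustrative.
  SLO_TARGET = 0.999
  BUDGET = 1 - SLO_TARGET  # allowed error ratio over the SLO period

  def should_page(error_ratio_1h: float, error_ratio_5m: float,
                  burn_threshold: float = 14.4) -> bool:
      return (error_ratio_1h / BUDGET > burn_threshold
              and error_ratio_5m / BUDGET > burn_threshold)

  # Example: 2% of requests failing in both windows against a 99.9% objective.
  print(should_page(error_ratio_1h=0.02, error_ratio_5m=0.02))  # True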
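
For step 5, a sketch of a per-incident review record that captures how the alert set actually performed. Field names and example values are illustrative assumptions; the same records feed the metrics listed later in this playbook.

  # Post-incident alert review record: which alerts helped, confused, or
  # were missing, and how long the issue ran before a human noticed.
  from dataclasses import dataclass, field

  @dataclass
  class AlertReview:
      incident_id: str
      detected_by_alert: bool                 # or discovered via users or side channels
      minutes_onset_to_awareness: float
      alerts_that_helped: list[str] = field(default_factory=list)
      alerts_that_confused: list[str] = field(default_factory=list)
      alerts_missing: list[str] = field(default_factory=list)  # signals we wished existed

  review = AlertReview(
      incident_id="2024-05-12-checkout-outage",
      detected_by_alert=False,
      minutes_onset_to_awareness=42.0,
      alerts_that_confused=["queue-depth-rising"],
      alerts_missing=["checkout-success-rate-drop"],
  )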

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • Which conditions deserve interrupting a human?
  • What should be a dashboard signal instead of an alert?
  • What first action should follow this alert?
  • What threshold reflects meaningful risk rather than background noise?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • Which alerts actually changed responder behavior during real incidents?
  • What should wake a human versus remain visible in a dashboard?
  • What does this alert expect the responder to do first?

Common mistakes

Patterns that surface across teams running this playbook.

  • Alerting on everything measurable
  • Sending the same low-context alert to everyone
  • Confusing threshold breach with action-worthy degradation
  • Never pruning alerts once added

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • Responders mute or ignore frequent alert classes
  • Major incidents are still discovered through users or side channels
  • Alerts require tribal knowledge to interpret
  • The alert list grows but trust in alerting shrinks

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Alert audit
  • Alert taxonomy
  • Owner-mapped alert set
  • Tuned alert configuration
  • Post-incident alert review

Success signals

Observable changes that mean the playbook landed.

  • Important issues are detected earlier
  • Noise and false positives decline
  • Responders know what to do when key alerts fire
  • Alert discussions become more about meaning than fatigue

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Review alert health after every major incident
  • Fold changes into runbooks and service onboarding
  • Remove stale alerts as architecture and traffic patterns evolve

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Alert volume per responder
  • False positive rate
  • Missed detection count
  • Time from issue onset to useful human awareness
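
A sketch of how these four signals can be computed from two small record sets: one entry per alert firing and one per incident. The field names and sample data are illustrative assumptions; in practice the inputs would come from the alert audit and the post-incident reviews described in the procedure.

  # Watch metrics from hypothetical firing and incident records.
  firings = [
      {"alert": "checkout-error-rate-high", "responder": "alice", "false_positive": False},
      {"alert": "queue-depth-rising", "responder": "alice", "false_positive": True},
      {"alert": "queue-depth-rising", "responder": "bob", "false_positive": True},
  ]
  incidents = [
      {"id": "2024-05-12", "detected_by_alert": False, "minutes_to_awareness": 42.0},
      {"id": "2024-06-03", "detected_by_alert": True, "minutes_to_awareness": 6.0},
  ]

  responders = {f["responder"] for f in firings}
  volume_per_responder = len(firings) / len(responders)
  false_positive_rate = sum(f["false_positive"] for f in firings) / len(firings)
  missed_detections = sum(not i["detected_by_alert"] for i in incidents)
  mean_minutes_to_awareness = sum(i["minutes_to_awareness"] for i in incidents) / len(incidents)

  print(f"alert volume per responder: {volume_per_responder:.1f}")
  print(f"false positive rate:        {false_positive_rate:.0%}")
  print(f"missed detections:          {missed_detections}")
  print(f"minutes to awareness (avg): {mean_minutes_to_awareness:.1f}")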

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Grouping alert noise patterns
  • Correlating alerts with incident outcomes
  • Drafting alert rationales and runbook links

AI can make worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Encouraging more alerts because generation is cheap
  • Creating plausible descriptions for alerts that still lack action value
  • Masking noisy alerting behind better wording

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.