The Hard Parts.dev
EP-07 · AI Engineering Playbook

Detect AI drift before users do

Build a drift-detection system based on task slices, baseline behavior, and operational signals so changes in model quality are noticed intentionally rather than through user frustration or vague team instinct.

Difficulty: high
Time horizon: weeks to establish, then continuous monitoring
Primary owner: evaluation owner
Confidence: high
At a glance · EP-07
Situation: An AI system can change over time through model, prompt, retrieval, or workflow drift.
Goal: Identify and localize meaningful behavior change before it becomes user-facing trust erosion or silent quality decay.
Do not use when: the system is too experimental to promise stable behavior yet.
Primary owner: evaluation owner
Roles involved

  • AI engineer
  • evaluation owner
  • product owner
  • domain expert
  • operations or platform owner, where deployment changes matter

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • The system depends on external models or evolving prompts
  • RAG, routing, or orchestration changes happen regularly
  • Users say the system "feels different" but the team cannot prove why
  • AI quality matters enough that silent degradation is costly

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

AI systems drift in ways traditional services often do not. The model may still be up, latency may still be fine, and infrastructure may still look healthy while the answer behavior quietly worsens on the exact tasks users care about.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs the playbook is done well

  • The team has task-based baselines rather than vague intuition alone
  • Behavior changes are noticed in slices, not only as global complaints
  • Model, prompt, retrieval, and workflow changes are trackable against outcomes
  • Teams can distinguish drift from random bad examples
  • Quality regressions are caught before widespread user distrust forms

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Representative task set
  • Model and prompt version history
  • Retrieval and toolchain changes
  • User feedback signals
  • Evaluation baselines
  • Release history for the AI system

Prerequisites

Conditions that should be true for this to work.

  • There is a task set that reflects real value
  • The team tracks model or system changes over time
  • Someone owns the question of quality stability

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.

  1. Choose the task slices worth protecting

    Focus monitoring on behavior users actually care about. (A slice-and-baseline sketch in code follows the procedure.)

    Actions

    • Define core tasks and high-risk slices
    • Include edge cases and ambiguity-heavy scenarios
    • Prioritize slices where errors are subtle or harmful

    Outputs

    • Drift protection task set
  2. Establish baseline behavior

    Create a reference point that can be compared over time. (The baseline format is included in the same sketch after the procedure.)

    Actions

    • Record expected performance by slice
    • Capture both score-based and exemplar-based baselines
    • Note known weak areas so future changes are interpreted correctly

    Outputs

    • Behavior baseline
  3. Track system changes that can cause drift

    Connect quality change to system change. (A change-log sketch follows the procedure.)

    Actions

    • Log model, prompt, retrieval, routing, and tool-use changes
    • Tie deployments and configuration shifts to evaluation checkpoints
    • Avoid silent changes that ship with no accompanying quality observation

    Outputs

    • Drift change log
  4. Monitor both evaluation and field signals

    Combine lab and production awareness. (A per-slice comparison sketch follows the procedure.)

    Actions

    • Run recurring task-slice evaluations
    • Watch user correction patterns, complaints, abstentions, and escalation behavior
    • Compare field signals to baseline shifts

    Outputs

    • Drift monitoring dashboard
  5. Respond with diagnosis, not panic

    Localize drift and choose the right fix. (A component-reversion sketch follows the procedure.)

    Actions

    • Identify whether drift came from model change, retrieval quality, prompt change, or surrounding workflow
    • Roll back or isolate the change where possible
    • Update baselines when an intended behavior improvement has genuinely occurred

    Outputs

    • Drift response playbook
    • Updated baseline when justified
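
To make steps 1 and 2 concrete, here is a minimal slice-and-baseline sketch in Python. Every name in it (`TaskCase`, `TaskSlice`, the slice names, the scores) is illustrative rather than a prescribed schema; the point is that slices are explicit objects and the baseline is pinned to a system version.

```python
from dataclasses import dataclass, field

@dataclass
class TaskCase:
    """One concrete input the system must keep handling well."""
    case_id: str
    prompt: str
    expected: str  # reference answer, rubric note, or exemplar output

@dataclass
class TaskSlice:
    """A named group of tasks worth protecting against drift."""
    name: str
    risk: str  # e.g. "high-harm", "subtle-error", "ambiguity-heavy"
    cases: list[TaskCase] = field(default_factory=list)

# Step 1: slices chosen for user value and error subtlety,
# not for evaluation convenience.
slices = [
    TaskSlice("refund-policy-questions", risk="high-harm"),
    TaskSlice("ambiguous-multi-part-requests", risk="subtle-error"),
]

# Step 2: the baseline records per-slice scores, tied to an explicit
# system version, with known weak areas noted so later changes are
# interpreted correctly. Exemplar outputs would be pinned alongside.
baseline = {
    "system_version": "model-x-2024-06 / prompt-v12 / index-2024-05",
    "slice_scores": {
        "refund-policy-questions": 0.91,
        "ambiguous-multi-part-requests": 0.74,
    },
    "known_weak_slices": ["ambiguous-multi-part-requests"],
}
```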
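
For step 3, one lightweight way to avoid silent changes is an append-only change log that every deployment writes to. A sketch, assuming a JSONL file; the component names and the `drift_change_log.jsonl` path are invented for illustration.

```python
import json
from datetime import datetime, timezone

def log_system_change(path: str, **components: str) -> None:
    """Append one change record so every deployment leaves a version
    manifest that later evaluation checkpoints can be tied back to."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "components": components,  # model, prompt, retrieval, routing, tools
        "eval_checkpoint": None,   # filled in once the eval run completes
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_system_change(
    "drift_change_log.jsonl",
    model="provider-x/model-2024-06",
    prompt="support-prompt-v13",
    retrieval="index-2024-05-28",
)
```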
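
For step 4, the recurring evaluation reduces to a per-slice comparison against the baseline rather than a single global score. A sketch with hypothetical numbers; a fixed tolerance is used here for simplicity, but it should really come from measured run-to-run variance (see the variance sketch under decision points).

```python
def check_drift(baseline_scores: dict[str, float],
                current_scores: dict[str, float],
                tolerance: float = 0.05) -> list[str]:
    """Return the slices whose score dropped past the tolerance,
    so degradation is caught per slice, not hidden in an average."""
    drifted = []
    for slice_name, base in baseline_scores.items():
        current = current_scores.get(slice_name)
        if current is None:
            drifted.append(f"{slice_name}: no longer evaluated")
        elif base - current > tolerance:
            drifted.append(f"{slice_name}: {base:.2f} -> {current:.2f}")
    return drifted

alerts = check_drift(
    {"refund-policy-questions": 0.91, "ambiguous-multi-part-requests": 0.74},
    {"refund-policy-questions": 0.90, "ambiguous-multi-part-requests": 0.61},
)
for alert in alerts:
    print("possible drift:", alert)  # cross-check against field signals
```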
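
For step 5, drift can often be localized by reverting one component at a time to its baseline version and re-running the slice evaluation: the reversion that restores quality points at the cause. A sketch; `run_eval` stands in for whatever evaluation harness the team already has, and the version strings are made up.

```python
def localize_drift(current: dict[str, str],
                   baseline: dict[str, str],
                   run_eval) -> dict[str, float]:
    """Revert one changed component at a time to its baseline version
    and report the score each reversion recovers."""
    recovered = {}
    for component, old_version in baseline.items():
        if current.get(component) == old_version:
            continue  # unchanged components cannot be the cause
        trial = dict(current, **{component: old_version})
        recovered[component] = run_eval(trial)
    return recovered

# Toy stand-in harness: pretend reverting the prompt restores quality.
def run_eval(config: dict[str, str]) -> float:
    return 0.90 if config["prompt"] == "support-prompt-v12" else 0.61

print(localize_drift(
    current={"model": "model-2024-06", "prompt": "support-prompt-v13"},
    baseline={"model": "model-2024-06", "prompt": "support-prompt-v12"},
    run_eval=run_eval,
))  # -> {'prompt': 0.9}: the prompt change explains the regression
```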

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • Which task slices are important enough to monitor continuously?
  • What counts as meaningful drift versus acceptable variance? (See the variance sketch after this list.)
  • Which upstream changes require a fresh evaluation checkpoint?
  • When should the team roll back, re-tune, or just re-baseline?
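
One way to make the drift-versus-variance call concrete: re-run the same evaluation several times against an unchanged system, estimate the run-to-run noise, and only treat score deltas beyond a few standard deviations as drift. A sketch with invented numbers; the k = 3 cutoff is a convention, not a rule.

```python
import statistics

def drift_threshold(repeat_scores: list[float], k: float = 3.0) -> float:
    """Estimate run-to-run noise from repeated evaluations of an
    unchanged system; deltas beyond k standard deviations count
    as meaningful drift rather than acceptable variance."""
    return k * statistics.stdev(repeat_scores)

# Five runs of the same slice against the same system version:
repeats = [0.73, 0.75, 0.74, 0.76, 0.74]
threshold = drift_threshold(repeats)  # ~0.034 with k = 3

delta = 0.74 - 0.61  # baseline score minus latest score on the slice
print("meaningful drift" if delta > threshold else "acceptable variance")
```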

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What task slices matter enough that we should detect regression before users do?
  • What upstream changes should automatically trigger re-evaluation?
  • How would we distinguish model drift from retrieval or workflow drift here?

Common mistakes

Patterns that surface across teams running this playbook.

  • Tracking overall score while missing slice-specific degradation
  • Not versioning prompts, retrieval logic, or model providers clearly
  • Waiting for user complaints to define drift
  • Treating every quality change as model drift when workflow changes caused it

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • Users say the system "feels worse" but the team has no comparison evidence
  • Quality changes are discussed socially rather than measured
  • Different parts of the AI stack change without evaluation checkpoints
  • The baseline quietly becomes stale and stops reflecting the real task

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Drift protection task set
  • Behavior baseline
  • Drift change log
  • Drift monitoring dashboard
  • Drift response playbook

Success signals

Observable changes that mean the playbook landed.

  • Quality shifts are caught before widespread user dissatisfaction
  • The team can localize likely causes of drift faster
  • Model or retrieval changes are accompanied by meaningful evaluation
  • Trust conversations move from vibes to evidence

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Refresh task slices as user workflows evolve
  • Retire stale baselines and replace them with living ones
  • Connect drift incidents back into release and evaluation design

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding.

  • Task-slice performance over time
  • User correction or override rate
  • Complaint frequency by slice
  • Time from quality shift to detection
  • Number of untracked system changes affecting quality

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Comparing outputs across model or prompt versions
  • Clustering drift examples by failure pattern
  • Summarizing which slices changed most between runs

AI can make it worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Over-explaining drift with plausible but unverified narratives
  • Normalizing drift because the model still sounds fluent
  • Encouraging baseline churn so no stable comparison remains

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.