Detect AI drift before users do
Build a drift-detection system based on task slices, baseline behavior, and operational signals, so that changes in model quality are caught deliberately rather than surfaced through user frustration or vague team instinct.
- Situation
- An AI system can change over time through model, prompt, retrieval, or workflow drift.
- Goal
- Identify and localize meaningful behavior change before it becomes user-facing trust erosion or silent quality decay.
- Do not use when
- the system is too experimental to promise stable behavior yet
- Primary owner
- evaluation owner
- Roles involved
- AI engineer
- evaluation owner
- product owner
- domain expert
- operations or platform owner where deployment changes matter
Context
The situation
Deciding whether to reach for this playbook: when it fits, and when it doesn't.
Use when
Conditions where this playbook is the right tool.
- The system depends on external models or evolving prompts
- RAG, routing, or orchestration changes happen regularly
- Users say it feels different but the team cannot prove why
- AI quality matters enough that silent degradation is costly
Do not use when
Contexts where this playbook will waste effort or make things worse.
- The system is too experimental to promise stable behavior yet
- No baseline or representative task set exists at all
- The team only wants generic monitoring and not task-grounded quality detection
Stakes
Why this matters
What this playbook protects against, and why skipping or half-running it tends to be expensive.
AI systems drift in ways traditional services often do not. The model may still be up, latency may still be fine, and infrastructure may still look healthy while the answer behavior quietly worsens on the exact tasks users care about.
Quality bar
What good looks like
The observable qualities of a team or system that is actually doing this well. Not just going through the motions.
Signs the playbook is being done well
- The team has task-based baselines rather than vague intuition only
- Behavior changes are noticed in slices, not only as global complaints
- Model, prompt, retrieval, and workflow changes are trackable against outcomes
- Teams can distinguish drift from random bad examples
- Quality regressions are caught before widespread user distrust forms
Preparation
Before you start
What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.
Inputs
Material you'll want to gather first.
- Representative task set
- Model and prompt version history
- Retrieval and toolchain changes
- User feedback signals
- Evaluation baselines
- Release history for the AI system
Prerequisites
Conditions that should be true for this to work.
- There is a task set that reflects real value
- The team tracks model or system changes over time
- Someone owns the question of quality stability
Procedure
The procedure
Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre.
Choose the task slices worth protecting
Focus monitoring on behavior users actually care about; a sketch of the resulting task set follows this step's outputs.
Actions
- Define core tasks and high-risk slices
- Include edge cases and ambiguity-heavy scenarios
- Prioritize slices where errors are subtle or harmful
Outputs
- Drift protection task set
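To make the output concrete, here is a minimal sketch of what a drift protection task set can look like in code. Everything here (TaskSlice, TaskExample, the risk labels, the example slices) is illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskExample:
    """One representative input the system must keep handling well."""
    prompt: str
    notes: str = ""  # why this example matters, e.g. known ambiguity

@dataclass
class TaskSlice:
    """A named group of tasks worth protecting against drift."""
    name: str
    risk: str  # e.g. "subtle errors", "harmful if wrong"
    examples: list[TaskExample] = field(default_factory=list)

# A tiny task set: one core slice plus one ambiguity-heavy,
# high-risk slice where errors are subtle.
TASK_SET = [
    TaskSlice(
        name="refund_policy_questions",
        risk="harmful if wrong",
        examples=[TaskExample("Can I return an opened item after 30 days?")],
    ),
    TaskSlice(
        name="ambiguous_multi_part_requests",
        risk="subtle errors",
        examples=[TaskExample("Cancel my order and also change my address")],
    ),
]
```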
Establish baseline behavior
Create a reference point that can be compared over time; one possible baseline format is sketched after this step's outputs.
Actions
- Record expected performance by slice
- Capture both score-based and exemplar-based baselines
- Note known weak areas so future changes are interpreted correctly
Outputs
- Behavior baseline
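One possible shape for the behavior baseline, assuming per-slice numeric scores plus a handful of exemplar outputs. The file layout, field names, and the record_baseline helper are all illustrative.

```python
import json
import statistics
from datetime import date

def record_baseline(slice_scores: dict[str, list[float]],
                    exemplars: dict[str, list[str]],
                    known_weak_areas: list[str],
                    path: str = "baseline.json") -> None:
    """Persist a per-slice baseline: summary statistics plus concrete
    exemplar outputs, so later runs can be compared both numerically
    and by eye."""
    baseline = {
        "recorded_on": date.today().isoformat(),
        # Known weak areas keep future dips from being misread as drift.
        "known_weak_areas": known_weak_areas,
        "slices": {
            name: {
                "mean_score": statistics.mean(scores),
                "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
                "exemplar_outputs": exemplars.get(name, []),
            }
            for name, scores in slice_scores.items()
        },
    }
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)

record_baseline(
    slice_scores={"refund_policy_questions": [0.91, 0.88, 0.93]},
    exemplars={"refund_policy_questions": ["Unopened items: yes, within 30 days."]},
    known_weak_areas=["ambiguous_multi_part_requests"],
)
```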
Track system changes that can cause drift
Connect quality change to system change; a minimal change-log sketch follows this step's outputs.
Actions
- Log model, prompt, retrieval, routing, and tool-use changes
- Tie deployments and configuration shifts to evaluation checkpoints
- Allow no silent changes that ship without any quality observation
Outputs
- Drift change log
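A minimal sketch of the drift change log as an append-only CSV. The columns and the log_change helper are hypothetical; any store that ties a system change to an evaluation checkpoint serves the same purpose.

```python
import csv
from datetime import datetime, timezone

CHANGE_LOG = "drift_change_log.csv"
FIELDS = ["timestamp", "component", "change", "eval_checkpoint"]

def log_change(component: str, change: str, eval_checkpoint: str) -> None:
    """Append one system change (model, prompt, retrieval, routing,
    tool use) with the evaluation checkpoint that covers it. An empty
    eval_checkpoint is exactly the silent change this step forbids."""
    with open(CHANGE_LOG, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # new or empty file: write the header first
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "component": component,
            "change": change,
            "eval_checkpoint": eval_checkpoint,
        })

# Example: a retrieval index rebuild tied to an evaluation run.
log_change("retrieval", "rebuilt index on refreshed corpus", "eval-run-042")
```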
Monitor both evaluation and field signals
Combine lab and production awareness; a baseline-comparison sketch follows this step's outputs.
Actions
- Run recurring task-slice evaluations
- Watch user correction patterns, complaints, abstentions, and escalation behavior
- Compare field signals to baseline shifts
Outputs
- Drift monitoring dashboard
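A sketch of the comparison that would feed the drift monitoring dashboard, assuming the baseline file from the earlier step. The two-standard-deviation tolerance is a starting default to tune per slice, not a rule.

```python
import json

def flag_drift(baseline_path: str,
               current_scores: dict[str, float],
               tolerance_stdevs: float = 2.0) -> list[str]:
    """Compare a fresh evaluation run against the stored baseline and
    return the slices that moved beyond the allowed tolerance."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    drifted = []
    for name, stats in baseline["slices"].items():
        if name not in current_scores:
            continue  # slice skipped this run; worth surfacing separately
        delta = current_scores[name] - stats["mean_score"]
        # A zero-variance baseline makes any movement a flag, on purpose.
        allowed = tolerance_stdevs * max(stats["stdev"], 1e-9)
        if abs(delta) > allowed:
            drifted.append(f"{name}: {delta:+.3f} versus baseline")
    return drifted

print(flag_drift("baseline.json", {"refund_policy_questions": 0.71}))
```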
Respond with diagnosis, not panic
Localize drift and choose the right fix; a diagnostic sketch follows this step's outputs.
Actions
- Identify whether drift came from model change, retrieval quality, prompt change, or surrounding workflow
- Roll back or isolate the change where possible
- Update baselines when intended behavior improvements genuinely occurred
Outputs
- Drift response playbook
- Updated baseline when justified
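A sketch of the first diagnostic move: pull every logged change in the window before the drift was first observed. The function name and window size are illustrative; the point is that diagnosis starts from the change log rather than from guesswork.

```python
import csv
from datetime import datetime, timedelta, timezone

def suspect_changes(change_log_path: str,
                    drift_first_seen: datetime,
                    window_days: int = 14) -> list[dict]:
    """Return change-log entries from the window before the drift was
    first observed: diagnostic leads (model, prompt, retrieval, or
    workflow), not proof of cause. drift_first_seen must be
    timezone-aware (UTC) to compare cleanly with the log timestamps."""
    window_start = drift_first_seen - timedelta(days=window_days)
    with open(change_log_path) as f:
        rows = list(csv.DictReader(f))
    return [row for row in rows
            if window_start
            <= datetime.fromisoformat(row["timestamp"])
            <= drift_first_seen]

# Example: drift confirmed on today's evaluation run.
leads = suspect_changes("drift_change_log.csv", datetime.now(timezone.utc))
```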
Judgment
Judgment calls and pitfalls
The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.
Decision points
Moments where judgment and trade-offs matter more than procedure.
- Which task slices are important enough to monitor continuously?
- What counts as meaningful drift versus acceptable variance? (one statistical sketch follows this list)
- Which upstream changes require a fresh evaluation checkpoint?
- When should the team roll back, re-tune, or just re-baseline?
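For the drift-versus-variance call, one rough way past intuition is a resampling test against the no-drift null. This is a minimal sketch, assuming slice scores as plain floats and a one-sided 5% threshold; real setups may prefer a proper permutation test or sequential monitoring.

```python
import random

def looks_like_drift(baseline_scores: list[float],
                     current_scores: list[float],
                     n_boot: int = 10_000,
                     seed: int = 0) -> bool:
    """Bootstrap check: is the observed drop in mean score larger than
    pooled resampling noise would produce? A crude way to separate
    drift from a handful of random bad examples."""
    rng = random.Random(seed)
    observed = (sum(current_scores) / len(current_scores)
                - sum(baseline_scores) / len(baseline_scores))
    pooled = baseline_scores + current_scores
    as_extreme = 0
    for _ in range(n_boot):
        a = [rng.choice(pooled) for _ in baseline_scores]
        b = [rng.choice(pooled) for _ in current_scores]
        diff = sum(b) / len(b) - sum(a) / len(a)
        if diff <= observed:  # a drop at least as large as the real one
            as_extreme += 1
    # Improvements (observed > 0) come out non-significant by design.
    return as_extreme / n_boot < 0.05
```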
Questions worth asking
Prompts to use on yourself, the team, or an AI assistant while running the procedure.
- What task slices matter enough that we should detect regression before users do?
- What upstream changes should automatically trigger re-evaluation?
- How would we distinguish model drift from retrieval or workflow drift here?
Common mistakes
Patterns that surface across teams running this playbook.
- Tracking overall score while missing slice-specific degradation
- Not versioning prompts, retrieval logic, or model providers clearly
- Waiting for user complaints to define drift
- Treating every quality change as model drift when workflow changes caused it
Warning signs you are doing it wrong
Signals that the playbook is being executed but not landing.
- Users say it feels worse but the team has no comparison evidence
- Quality changes are discussed socially rather than measured
- Different parts of the AI stack change without evaluation checkpoints
- The baseline quietly becomes stale and stops reflecting the real task
Outcomes
Outcomes and signals
What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.
Artifacts to produce
Durable outputs the playbook should leave behind.
- Drift protection task set
- Behavior baseline
- Drift change log
- Drift monitoring dashboard
- Drift response playbook
Success signals
Observable changes that mean the playbook landed.
- Quality shifts are caught before widespread user dissatisfaction
- The team can localize likely causes of drift faster
- Model or retrieval changes are accompanied by meaningful evaluation
- Trust conversations move from vibes to evidence
Follow-up actions
Moves that keep the playbook's effects compounding after it finishes.
- Refresh task slices as user workflows evolve
- Retire stale baselines and replace them with living ones
- Connect drift incidents back into release and evaluation design
Metrics or signals to watch
Longer-horizon indicators that the underlying problem is receding.
- Task-slice performance over time
- User correction or override rate
- Complaint frequency by slice
- Time from quality shift to detection
- Number of untracked system changes affecting quality
AI impact
AI effects on this playbook
How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.
AI can help with
Where AI tooling genuinely reduces the cost of running this playbook well.
- Comparing outputs across model or prompt versions
- Clustering drift examples by failure pattern
- Summarizing which slices changed most between runs
AI can make worse by
Distortions AI introduces that make the underlying problem harder to see.
- Over-explaining drift with plausible but unverified narratives
- Normalizing drift because the model still sounds fluent
- Encouraging baseline churn so no stable comparison remains
AI synthesis
Drift is not only a model problem. Prompt, corpus, routing, and tool-use drift often matter just as much. The monitoring system should reflect that wider truth.
Relationships
Connected playbooks
Failure modes this playbook tends to address, the decisions that lead into the situation, red flags that motivate running it, and neighboring playbooks.