Model drift is noticed informally, not measured
People sense that model or system behavior changed, but the organization lacks reliable measurement, alerts, or structured comparison.
- Where you see this
- vendor model APIs
- RAG systems
- prompt-evolving copilots
- Not necessarily a problem when
- a temporary exploratory prototype is explicitly not treated as stable
- Often mistaken for
- "we would know if the model meaningfully changed"
- Time horizon
- near-term
- Best placed to act
- AI engineer
- evaluation owner
- product owner
The signal
What you would actually notice
Field observation
Users say the system feels different or worse, but the team cannot quantify when, how, or why behavior shifted.
Also observed
- "It feels worse lately, but we do not know why."
- "Users noticed the change before our monitoring did."
Primary reading
What it usually indicates
Most likely underlying patterns when this signal shows up. Not a diagnosis, a starting hypothesis.
Usually indicates
- weak eval monitoring
- provider drift exposure
- poor baseline comparison
Not necessarily a problem when
Contexts where this signal is expected and does not indicate a deeper issue.
- a temporary exploratory prototype is explicitly not treated as stable
Stakes
Why it matters
Unmeasured drift weakens trust and delays corrective action.
Heuristic
If drift is discovered by vibes first, monitoring is underpowered.
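A minimal sketch of what adequately powered monitoring can mean here, assuming a fixed pass/fail eval set and a stored baseline from a known-good period. The file name, counts, and z-threshold are placeholders for whatever your own harness produces:

```python
# Minimal sketch: turn "it feels worse" into a measured comparison.
# Assumes a fixed eval set scored pass/fail; eval_baseline.json and the
# hardcoded "current" counts stand in for your own harness output.
import json
import math


def pass_rate_drifted(baseline_passes, baseline_n,
                      current_passes, current_n,
                      z_threshold=2.0):
    """Two-proportion z-test: did the pass rate drop beyond noise?"""
    pooled = (baseline_passes + current_passes) / (baseline_n + current_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / current_n))
    if se == 0:
        return False
    z = (baseline_passes / baseline_n - current_passes / current_n) / se
    return z > z_threshold  # positive z means the current run scored worse


if __name__ == "__main__":
    with open("eval_baseline.json") as f:  # e.g. {"passes": 412, "n": 500}
        baseline = json.load(f)
    current = {"passes": 371, "n": 500}    # today's scheduled eval run
    if pass_rate_drifted(baseline["passes"], baseline["n"],
                         current["passes"], current["n"]):
        print("ALERT: pass rate dropped beyond noise; investigate drift.")
```

Run on a schedule, this fires before the complaints do, which is the point of the heuristic.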
Inspection
What to check next
Deliberate steps to confirm or disconfirm the primary reading above. Not a checklist. An order of inspection.
- eval time series
- provider change logs
- prompt and workflow version history
- user complaint clusters
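The first three checks combine naturally. A hypothetical sketch, with stand-in dates and scores, that lines up the eval time series against logged change events and reports which events sit nearest the largest drop:

```python
# Hypothetical sketch: correlate a daily eval-score time series with logged
# change events (provider changelog entries, prompt/workflow version bumps).
# All dates, scores, and event labels below are stand-in data.
from datetime import date

scores = {  # daily mean eval score from the time series
    date(2024, 5, 1): 0.84,
    date(2024, 5, 2): 0.83,
    date(2024, 5, 3): 0.71,
    date(2024, 5, 4): 0.70,
}
events = {  # from provider change logs and prompt/workflow version history
    date(2024, 5, 2): "provider model update (changelog entry)",
    date(2024, 5, 3): "prompt v12 -> v13",
}

days = sorted(scores)
# Day with the largest drop relative to the previous day.
worst = max(days[1:], key=lambda d: scores[days[days.index(d) - 1]] - scores[d])
print(f"largest eval drop on {worst}")
for day, change in sorted(events.items()):
    if abs((worst - day).days) <= 1:  # events within a day of the drop
        print(f"candidate cause: {change} ({day})")
```

This only works if change events are recorded at all, which is why the version-history check sits next to the time-series check.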
Diagnostic questions
Questions to ask the team, or yourself, before concluding anything.
- What baseline behavior are we comparing against?
- What changed in prompt, model, retrieval, or tooling?
- Which task slices are most vulnerable to drift?
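For the last question, a small slice-level sketch, assuming per-slice pass rates against a frozen baseline; the slice names and the -0.05 alert threshold are illustrative:

```python
# Compare per-slice pass rates against a frozen baseline, worst first,
# so slice-specific drift is visible even when the aggregate looks fine.
baseline = {"summarization": 0.90, "code_gen": 0.85, "multilingual": 0.80}
current = {"summarization": 0.91, "code_gen": 0.84, "multilingual": 0.62}

deltas = {name: current[name] - baseline[name] for name in baseline}
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    flag = "  <-- investigate" if delta < -0.05 else ""
    print(f"{name:15s} {delta:+.2f}{flag}")
# The per-slice view surfaces the multilingual collapse that a single
# aggregate number would smooth over.
```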
Progression
Under the signal
Where this pattern tends to come from, what's holding it up, and where it goes if nothing changes.
Leading indicators
What tends to show up first.
- user anecdotes drive concern before metrics do
- qualitative complaints recur without quantified evidence
- provider or prompt changes are poorly tracked
Common root causes
What is usually sitting under the signal.
- weak evaluation harness
- poor version tracking
- insufficient monitoring
Likely consequences
What happens if nothing changes.
- trust erosion
- slow issue response
- benchmark mirage
Look-alikes
Not what it looks like
Patterns that can be mistaken for this signal, and 'fix' attempts that make it worse.
- "we would know if the model meaningfully changed"
Anti-patterns when responding
Responses that feel sensible and usually make the underlying pattern worse.
- relying on anecdote without building drift measurement
- treating provider updates as invisible infrastructure changes
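A counter-sketch to the second anti-pattern, with hypothetical field names and model identifiers: persist enough metadata per model call that provider updates become explicit, diffable change events.

```python
# Log what was requested, what was served, and a hash of the prompt on
# every call. Field names and model identifiers here are hypothetical.
import hashlib
import json
from datetime import datetime, timezone

PROMPT_TEMPLATE = "You are a support assistant. Answer: {question}"


def call_metadata(requested_model, served_model):
    """Metadata to log alongside every model call."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "requested_model": requested_model,   # what we asked the API for
        "served_model": served_model,         # what the response says ran
        "prompt_sha256": hashlib.sha256(
            PROMPT_TEMPLATE.encode()).hexdigest()[:12],
    }


# A diff of yesterday's vs today's logs now shows provider-side model swaps
# and silent prompt edits as dated events instead of invisible changes.
print(json.dumps(call_metadata("vendor-model-2024-05", "vendor-model-2024-08"),
                 indent=2))
```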
Context
Context and ownership
Where this signal surfaces, who sees it first, who can actually act, and how much runway there usually is before escalation.
Where it shows up
- vendor model APIs
- RAG systems
- prompt-evolving copilots
Who sees it first
Before it escalates.
- users
- support teams
- AI evaluators
Who can move on it
Not always the same as who notices it.
- AI engineer
- evaluation owner
- product owner
Time horizon
How much runway there usually is before the signal hardens into the underlying pattern.
- near-term
AI impact
AI effects on this signal
How AI-assisted and AI-driven workflows tend to amplify or hide this signal.
AI amplifies
Ways AI tooling tends to make this signal louder or more common.
- Drift is an inherently AI-specific operating risk: provider model updates, retrieval changes, and evolving prompts shift behavior even when your own code has not changed.
AI masks
Ways AI tooling tends to hide this signal, so it keeps growing under the surface.
- General system success can hide slice-specific degradation until users complain loudly enough.
AI synthesis
Provider or pipeline changes alter behavior while the team lacks a living baseline.
Relationships
Connected signals
Related failure modes, decisions behind the signal, response playbooks, and neighboring red flags.