The Hard Parts.dev
RF-42 · AI Quality · Red Flags

Model drift is noticed informally, not measured

People sense that model or system behavior changed, but the organization lacks reliable measurement, alerts, or structured comparison.

Severity: high
Frequency: increasing
First noticed by: users · support teams · AI evaluators
Detectability: subtle
Confidence: high
At a glance · RF-42
Where you see this

vendor model APIs · RAG systems · prompt-evolving copilots

Not necessarily a problem when
a temporary exploratory prototype is explicitly not treated as stable
Often mistaken for
"we would know if the model meaningfully changed"
Time horizon
near-term
Best placed to act

AI engineer · evaluation owner · product owner

The signal

What you would actually notice


Field observation

Users say the system feels different or worse, but the team cannot quantify when, how, or why behavior shifted.

Also observed

  • "It feels worse lately, but we do not know why."
  • "Users noticed the change before our monitoring did."

Primary reading

What it usually indicates

Most likely underlying patterns when this signal shows up. Not a diagnosis, a starting hypothesis.

Usually indicates

  • weak eval monitoring
  • provider drift exposure
  • poor baseline comparison

Stakes

Why it matters

Unmeasured drift weakens trust and delays corrective action.

Inspection

What to check next

Deliberate steps to confirm or disconfirm the primary reading above. Not a checklist. An order of inspection.

  1. eval time series (see the drift-check sketch after this list)
  2. provider change logs
  3. prompt and workflow version history
  4. user complaint clusters
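
To make step 1 concrete, here is a minimal sketch, assuming a hypothetical series of daily eval pass rates: it compares a recent window against a baseline window and flags a drop. The window sizes and the 0.05 threshold are illustrative defaults, not recommendations.

```python
# Minimal drift check over an eval time series (step 1 above).
# `daily_scores` is a hypothetical list of daily pass rates in [0, 1].
from statistics import mean

def detect_drift(daily_scores, baseline_days=14, recent_days=7,
                 drop_threshold=0.05):
    """Flag drift when the recent mean falls below the baseline mean
    by more than drop_threshold (in absolute score units)."""
    if len(daily_scores) < baseline_days + recent_days:
        return None  # not enough history to compare anything yet
    baseline = daily_scores[-(baseline_days + recent_days):-recent_days]
    recent = daily_scores[-recent_days:]
    drop = mean(baseline) - mean(recent)
    return {
        "baseline_mean": round(mean(baseline), 3),
        "recent_mean": round(mean(recent), 3),
        "drop": round(drop, 3),
        "drifted": drop > drop_threshold,
    }

# Hypothetical history: stable for two weeks, then a visible dip.
scores = [0.91, 0.90, 0.92, 0.91, 0.90, 0.91, 0.92,
          0.90, 0.91, 0.92, 0.90, 0.91, 0.90, 0.91,
          0.86, 0.85, 0.84, 0.86, 0.85, 0.84, 0.85]
print(detect_drift(scores))  # {..., 'drifted': True}
```

The point is less the statistics than having any automated comparison at all: even a crude check like this would page someone before the complaints do.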

Diagnostic questions

Questions to ask the team, or yourself, before concluding anything. A sketch of pinning down the first two follows the list.

  1. What baseline behavior are we comparing against?
  2. What changed in prompt, model, retrieval, or tooling?
  3. Which task slices are most vulnerable to drift?
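
The first two questions only have crisp answers if the moving parts are recorded somewhere. A minimal sketch, with entirely hypothetical field names and values, of snapshotting the knobs that drive behavior so "what changed?" becomes a diff rather than a guess:

```python
# Snapshot the behavior-driving configuration so changes are diffable.
# All identifiers below (model ids, index names, tools) are hypothetical.
import hashlib
from datetime import datetime, timezone

def config_snapshot(model_id, prompt_template, retrieval_index, tools):
    return {
        "taken_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        # Hash the prompt so edits are detectable without storing full text
        "prompt_sha256": hashlib.sha256(prompt_template.encode()).hexdigest()[:12],
        "retrieval_index": retrieval_index,
        "tools": sorted(tools),
    }

def diff_snapshots(old, new):
    """Return the fields that differ between two snapshots."""
    return {k: (old[k], new[k])
            for k in old if k != "taken_at" and old[k] != new[k]}

before = config_snapshot("vendor-model-v1", "You are a careful assistant.",
                         "docs-index-2024-05", ["search", "calculator"])
after = config_snapshot("vendor-model-v2", "You are a careful assistant.",
                        "docs-index-2024-05", ["search", "calculator"])
print(diff_snapshots(before, after))
# {'model_id': ('vendor-model-v1', 'vendor-model-v2')}
```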

Progression

Under the signal

Where this pattern tends to come from, what's holding it up, and where it goes if nothing changes.

Leading indicators

What tends to show up first.

  • user anecdotes drive concern before metrics do
  • qualitative complaints recur without quantified evidence
  • provider or prompt changes are poorly tracked

Common root causes

What is usually sitting under the signal.

  • weak evaluation harness
  • poor version tracking
  • insufficient monitoring

Likely consequences

What happens if nothing changes.

  • trust erosion
  • slow issue response
  • benchmark mirage

Look-alikes

Not what it looks like

Patterns that can be mistaken for this signal, and 'fix' attempts that make it worse.

False friends

Things the signal is often confused with, but isn't.

  • "we would know if the model meaningfully changed"

Anti-patterns when responding

Responses that feel sensible and usually make the underlying pattern worse.

  • relying on anecdote without building drift measurement
  • treating provider updates as invisible infrastructure changes (a counter-sketch follows this list)
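
A counter-sketch for the second anti-pattern: record whatever version metadata the provider returns with each response, and warn when it changes. The response shape below is hypothetical; real vendors expose different fields (model name, fingerprint, and so on), so adapt the keys to your provider.

```python
# Make provider-side updates visible by logging the version metadata
# each response carries. The response dict and its keys are hypothetical.
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("provider_versions")

_last_seen = {}  # last observed value per metadata field

def record_provider_version(response):
    """Warn when a version-bearing field differs from the last response."""
    for key in ("model", "fingerprint"):
        if key not in response:
            continue
        value = response[key]
        if _last_seen.get(key) not in (None, value):
            log.warning("provider %s changed: %s -> %s",
                        key, _last_seen[key], value)
        _last_seen[key] = value

record_provider_version({"model": "vendor-model-v1", "fingerprint": "fp_abc"})
record_provider_version({"model": "vendor-model-v1", "fingerprint": "fp_xyz"})
# second call warns: provider fingerprint changed: fp_abc -> fp_xyz
```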

Context

Context and ownership

Where this signal surfaces, who sees it first, who can actually act, and how much runway there usually is before escalation.

Common contexts

Where it shows up

  • vendor model APIs
  • RAG systems
  • prompt-evolving copilots
Most likely to notice

Who sees it first

Before it escalates.

  • users
  • support teams
  • AI evaluators
Best placed to act

Who can move on it

Not always the same as who notices it.

  • AI engineer
  • evaluation owner
  • product owner
Time horizon

near-term

How much runway there usually is before the signal hardens into the underlying pattern.

AI impact

AI effects on this signal

How AI-assisted and AI-driven workflows tend to amplify or hide this signal.

AI amplifies

Ways AI tooling tends to make this signal louder or more common.

  • This is an AI-specific operating risk.

AI masks

Ways AI tooling tends to hide this signal, so it keeps growing under the surface.

  • General system success can hide slice-specific degradation until users complain loudly enough (see the sketch below).
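
A minimal sketch of the slice-level check that closes this gap, with hypothetical slice names and an illustrative 0.05 threshold: the aggregate pass rate can hold steady while one slice quietly degrades.

```python
# Per-slice pass rates from an eval run, compared against a baseline;
# slice names, results, and the threshold are all hypothetical.
from collections import defaultdict
from statistics import mean

def slice_pass_rates(results):
    """results: iterable of (slice_name, passed) pairs from one eval run."""
    by_slice = defaultdict(list)
    for slice_name, passed in results:
        by_slice[slice_name].append(1.0 if passed else 0.0)
    return {name: mean(vals) for name, vals in by_slice.items()}

def degraded_slices(baseline, current, drop_threshold=0.05):
    """Slices whose pass rate fell more than drop_threshold vs. baseline."""
    return {name: (baseline[name], round(rate, 2))
            for name, rate in current.items()
            if name in baseline and baseline[name] - rate > drop_threshold}

# Hypothetical run: overall looks fine, but 'extract' has slipped.
run = ([("summarize", True)] * 18 + [("summarize", False)] * 2
       + [("extract", True)] * 15 + [("extract", False)] * 5
       + [("translate", True)] * 19 + [("translate", False)] * 1)
baseline = {"summarize": 0.90, "extract": 0.92, "translate": 0.93}
print(degraded_slices(baseline, slice_pass_rates(run)))
# {'extract': (0.92, 0.75)}
```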

Relationships

Connected signals

Related failure modes, decisions behind the signal, response playbooks, and neighboring red flags.