The Hard Parts.dev

Benchmarks are discussed more than real user outcomes

Teams spend more time on benchmark scores and synthetic eval wins than on whether the system helps real users in real tasks.

Severity
high
Frequency
increasing
First noticed by
AI product lead · evaluators · careful users
Detectability
visible-if-you-look
Confidence
high
At a glance · RF-39
Where you see this

LLM product teams · model selection · research-to-product transitions

Not necessarily a problem when
the team is still pre-launch and explicitly using benchmarks as temporary screening proxies
Often mistaken for
if the benchmark is strong, the product must be strong
Time horizon
near-term
Best placed to act

AI lead · product lead

The signal

What you would actually notice

Proxy success can become a substitute for product truth.

Field observation

Model or system discussions center on scores, leaderboards, or eval deltas more than on real-world workflow quality.

Also observed

  • "The benchmark is up 4 points, so we are clearly better."
  • "We do not yet have real user data, but the evals look strong."

Primary reading

What it usually indicates

Most likely underlying patterns when this signal shows up. Not a diagnosis, a starting hypothesis.

Usually indicates

  • benchmark mirage
  • evaluation convenience bias
  • weak real-world measurement

Stakes

Why it matters

Proxy success can become a substitute for product truth.

Inspection

What to check next

Deliberate steps to confirm or disconfirm the primary reading above. Not a checklist. An order of inspection.

  1. real task evaluations
  2. user complaints
  3. shadow-mode results (see the sketch after this list)
  4. failure clusters
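
A minimal sketch of the shadow-mode check in step 3, assuming you can replay a sample of recent production tasks through both the incumbent system and the benchmark-favored candidate, then score each outcome with a task-level judge. Every name here (tasks, incumbent, candidate, judge) is hypothetical.

    import random

    def shadow_mode_compare(tasks, incumbent, candidate, judge, sample_size=200):
        """Replay real production tasks through both systems and compare
        task-level success, independent of any benchmark score."""
        sample = random.sample(tasks, min(sample_size, len(tasks)))
        wins = {"incumbent": 0, "candidate": 0, "tie": 0}
        for task in sample:
            old = judge(task, incumbent(task))  # 1.0 = user goal met, 0.0 = not
            new = judge(task, candidate(task))
            if new > old:
                wins["candidate"] += 1
            elif old > new:
                wins["incumbent"] += 1
            else:
                wins["tie"] += 1
        return wins

    # A benchmark delta only matters if the candidate also wins here:
    # results = shadow_mode_compare(recent_tasks, current_system, new_model, task_success_judge)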

Diagnostic questions

Questions to ask the team, or yourself, before concluding anything.

  1. What user task got better?
  2. Which benchmark result actually predicts user value here?
  3. Where do our benchmarks diverge from production? (see the sketch after this list)
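
One way to make the third question concrete: if you keep benchmark scores and a production outcome metric for past model versions, a rank correlation shows how well the benchmark has actually predicted user value. A minimal sketch with invented illustrative numbers; it assumes scipy for the Spearman correlation, and the field names are hypothetical.

    from scipy.stats import spearmanr

    # Hypothetical history: one entry per shipped model version.
    versions = [
        {"name": "v1", "benchmark": 58.0, "task_success_rate": 0.70},
        {"name": "v2", "benchmark": 61.2, "task_success_rate": 0.74},
        {"name": "v3", "benchmark": 65.0, "task_success_rate": 0.75},
        {"name": "v4", "benchmark": 69.4, "task_success_rate": 0.73},
        {"name": "v5", "benchmark": 72.1, "task_success_rate": 0.72},
    ]

    benchmark_scores = [v["benchmark"] for v in versions]
    production_success = [v["task_success_rate"] for v in versions]

    rho, p_value = spearmanr(benchmark_scores, production_success)
    print(f"rank correlation, benchmark vs. production: {rho:.2f} (p={p_value:.2f})")
    # A weak or negative correlation is exactly the divergence the question probes.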

Progression

Under the signal

Where this pattern tends to come from, what's holding it up, and where it goes if nothing changes.

Leading indicators

What tends to show up first.

  • user outcomes are harder to state than eval scores
  • decisions are made from benchmark deltas alone
  • real-world incidents do not change benchmark enthusiasm

Common root causes

What is usually sitting under the signal.

  • measurement convenience
  • research culture carryover
  • missing product-grounded eval design

Likely consequences

What happens if nothing changes.

  • benchmark mirage
  • misallocated effort
  • user disappointment

Look-alikes

Not what it looks like

Patterns that can be mistaken for this signal, and 'fix' attempts that make it worse.

False friends

Things the signal is often confused with, but isn't.
  • if the benchmark is strong, the product must be strong

Anti-patterns when responding

Responses that feel sensible and usually make the underlying pattern worse.

  • shipping because the benchmark improved
  • tuning to evaluation sets rather than user tasks (see the holdout sketch after this list)
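
A common guard against the second anti-pattern is a frozen holdout of real user tasks that is scored but never tuned against. A minimal sketch of that split, assuming each task carries a stable id; the names are hypothetical.

    import hashlib

    def split_real_tasks(tasks, holdout_pct=20):
        """Deterministically split real user tasks into a tuning set and a
        frozen holdout. Tune and iterate on the first; only score the second."""
        tuning, holdout = [], []
        for task in tasks:
            digest = hashlib.sha256(task["id"].encode()).hexdigest()
            bucket = int(digest, 16) % 100  # stable bucket per task id
            (holdout if bucket < holdout_pct else tuning).append(task)
        return tuning, holdout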

Context

Context and ownership

Where this signal surfaces, who sees it first, who can actually act, and how much runway there usually is before escalation.

Common contexts

Where it shows up

  • LLM product teams
  • model selection
  • research-to-product transitions

Most likely to notice

Who sees it first

Before it escalates.

  • AI product lead
  • evaluators
  • careful users

Best placed to act

Who can move on it

Not always the same as who notices it.

  • AI lead
  • product lead

Time horizon

near-term

How much runway there usually is before the signal hardens into the underlying pattern.

AI impact

AI effects on this signal

How AI-assisted and AI-driven workflows tend to amplify or hide this signal.

AI amplifies

Ways AI tooling tends to make this signal louder or more common.

  • AI product culture is especially vulnerable to over-focusing on proxy metrics.

AI masks

Ways AI tooling tends to hide this signal, so it keeps growing under the surface.

  • Quantitative eval gains can feel more solid than messy real-world evidence.

Relationships

Connected signals

Related failure modes, decisions behind the signal, response playbooks, and neighboring red flags.