
Eval Goodhart

Internal evaluation sets become optimization targets rather than honest capability measures, producing models or prompts that score well but behave poorly in production.

At a glance (FM-27)

Severity: high
Frequency: increasing
Lifecycle: build · delivery
Recovery: medium
Confidence: high

Also known as

eval overfitting · benchmark poisoning · internal benchmark trap · metric-shaped evaluation

First noticed by

ai engineer · staff engineer · product lead

Mistaken for
rigorous internal evaluation
Often mistaken for
effective AI quality management

Why it looks healthy

Concrete external tells that make the pattern read as responsible behavior.

  • Evaluation scores improve steadily over model or prompt versions
  • The eval pipeline is automated and polished
  • Leadership sees consistent "quality improvements"
  • Other teams cite the eval harness as best practice

Definition

What it is

Blast radius: product · trust · business

A team builds an evaluation harness to measure AI system quality, and then optimizes the system to perform on that harness rather than on real user tasks.
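
A minimal sketch of the mechanism, with hypothetical names throughout (EVAL_CASES, run_model, harness_score): once selection pressure points at a frozen case list, the accepted change is whatever moves that number.

```python
# Hypothetical harness: a frozen case list becomes the optimization target.
EVAL_CASES = [
    {"input": "What is the refund window?", "expected": "30 days"},
    {"input": "How do I reset my password?", "expected": "via settings"},
]  # written once, rarely refreshed

def run_model(prompt: str, case: dict) -> str:
    """Stand-in for the real model call."""
    return case["expected"] if "helpful" in prompt else "unsure"

def harness_score(prompt: str) -> float:
    """Fraction of the frozen cases the prompt passes."""
    hits = sum(run_model(prompt, c) == c["expected"] for c in EVAL_CASES)
    return hits / len(EVAL_CASES)

# The accepted change is whichever candidate moved this number,
# not whichever one helped users.
candidates = ["You are a helpful assistant.", "Answer briefly."]
best = max(candidates, key=harness_score)
print(f"picked: {best!r} (harness score {harness_score(best):.2f})")
```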

How it unfolds

The arc of the pattern

  1. Starts

    A team builds an eval suite to measure quality and sets improvement targets.

  2. Feels reasonable because

    Evals are a legitimate and important quality signal.

  3. Escalates

    The eval suite becomes the primary measure. Changes are made that improve eval scores without improving real behavior.

  4. Ends

    The system scores well internally but users report deteriorating quality. The eval set has been implicitly overfitted.

Recognition

Warning signs by stage

Observable signals as the pattern progresses.

Early

  • Eval scores are the primary success metric for AI quality.
  • Eval set is not regularly refreshed with new real-world cases.
  • Changes that hurt eval scores are rejected regardless of user signal.

Mid

  • Eval scores improve but user complaints do not decrease.
  • The eval set has not changed significantly in months (a quick staleness check is sketched after this list).
  • Production behavior diverges from eval behavior in specific scenarios.
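
One cheap check for the staleness sign above, assuming each eval case records the date it was added (the field name and path are hypothetical):

```python
# Flag an eval set whose newest case is older than a freshness budget.
# Assumes each JSONL case carries an "added" ISO date; adjust to your schema.
import json
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)  # freshness budget; tune per team

def newest_case_age(path: str) -> timedelta:
    with open(path) as f:
        added = [date.fromisoformat(json.loads(line)["added"]) for line in f]
    return date.today() - max(added)

age = newest_case_age("evals/cases.jsonl")
if age > MAX_AGE:
    print(f"eval set is stale: newest case is {age.days} days old")
```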

Late

  • A production failure occurs in a scenario the eval does not cover.
  • The team realizes the eval set reflects old use cases, not current ones.
  • Trust in the eval as a quality signal collapses internally.

Root causes

Why it happens

  • Evals are treated as targets rather than signals
  • Eval sets are not maintained alongside production behavior
  • Goodhart's Law applies to internal evals as much as external benchmarks
  • Production feedback loop is weak or slow

Response

What to do

Immediate triage first, then structural fixes.

First move

Swap in a fresh eval set drawn from last week's production cases and re-score the current system: if the score regresses, the previous number was overfit.
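
A sketch of that triage step, assuming recent production transcripts have already been collected and graded into the same format as the frozen set; the paths and the grader here are stand-ins for your own harness:

```python
# Re-score against a fresh set drawn from recent production traffic and
# compare with the frozen set's number.
import json

def load_cases(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def grade(case: dict) -> bool:
    """Stand-in grader; in practice, your existing eval grader."""
    return case["output"].strip() == case["expected"].strip()

def mean_score(cases: list[dict]) -> float:
    return sum(grade(c) for c in cases) / len(cases)

frozen = mean_score(load_cases("evals/frozen_set.jsonl"))
fresh = mean_score(load_cases("evals/last_week_prod.jsonl"))

# A large gap is the tell: the frozen number measured the eval set,
# not the product.
print(f"frozen={frozen:.2f} fresh={fresh:.2f} gap={frozen - fresh:+.2f}")
```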

Hard trade-off

Accept "worse" scores on a harder eval set, or accept scores that stopped reflecting production behavior some time ago.

Recovery trap

Expanding the existing eval set with more cases of the same kind, which doesn't help if optimization pressure stays pointed at it.

Immediate actions

  • Refresh the eval set with recent production cases
  • Add a production behavior signal alongside eval scores (a divergence check is sketched after this list)
  • Review cases where eval score and user signal diverge
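
One shape the second action can take, a minimal sketch with hypothetical field names: report the eval score next to a production signal and alert when they move in opposite directions.

```python
# Report the internal eval score next to a production signal and alert
# when they diverge over a window of releases.
from dataclasses import dataclass

@dataclass
class QualitySnapshot:
    eval_score: float      # internal harness, 0..1
    thumbs_up_rate: float  # production signal, 0..1

def diverging(history: list[QualitySnapshot], window: int = 4) -> bool:
    """True when eval scores rose over the window while the user signal fell."""
    recent = history[-window:]
    if len(recent) < window:
        return False
    return (recent[-1].eval_score > recent[0].eval_score
            and recent[-1].thumbs_up_rate < recent[0].thumbs_up_rate)

history = [QualitySnapshot(0.78, 0.61), QualitySnapshot(0.81, 0.60),
           QualitySnapshot(0.85, 0.57), QualitySnapshot(0.88, 0.55)]
if diverging(history):
    print("eval score rising while user signal falls: suspect overfitting")
```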

Structural fixes

  • Treat eval sets as living documents tied to current use cases
  • Pair eval scores with human review and production sampling
  • Separate the eval set from the optimization target explicitly (one way to carve out a sealed holdout is sketched below)
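
A minimal sketch of that separation, assuming all cases currently live in one pool: carve out a sealed holdout that is scored only at release gates and never used to choose between changes.

```python
# Split the case pool: the dev set absorbs optimization pressure, the
# sealed holdout is scored only at release gates.
import random

def split_cases(cases: list[dict], holdout_frac: float = 0.3, seed: int = 0):
    """Deterministic split into a dev set and a sealed holdout."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    return shuffled[cut:], shuffled[:cut]  # dev, holdout

cases = [{"id": i} for i in range(100)]
dev, holdout = split_cases(cases)

# Iterate daily against dev; never use holdout to choose between
# changes, and rotate fresh production cases into it over time.
print(f"{len(dev)} dev cases, {len(holdout)} holdout cases")
```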

What not to do

  • Do not abandon evals because they can be gamed
  • Do not treat eval score improvement as equivalent to user experience improvement

AI impact

How AI distorts this pattern

Where AI-assisted workflows accelerate, hide, or help with this failure mode.

AI can help with

  • AI can help generate diverse and adversarial eval cases that are harder to overfit and better represent real production edge cases (a hedged sketch follows).
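
A hedged sketch of what that generation can look like; complete() is a stand-in for whatever model client you use, and the prompt and JSON shape are illustrative rather than a recipe:

```python
# Hypothetical adversarial case generation. complete() stands in for a
# real LLM call; swap in your provider's SDK.
import json

def complete(prompt: str) -> str:
    raise NotImplementedError("wire up your model client here")

def generate_adversarial_cases(seed_case: dict, n: int = 5) -> list[dict]:
    """Ask a model for harder variants of one real production case."""
    prompt = (
        f"Rewrite this case into {n} harder variants (ambiguous wording, "
        "edge conditions, adversarial phrasing). Reply as a JSON list of "
        'objects with "input" and "expected" keys.\n' + json.dumps(seed_case)
    )
    return json.loads(complete(prompt))

# Generated cases inherit the generator's blind spots: human-review
# them before they enter the eval set.
```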

AI can make worse by

  • AI can optimize prompts and configurations to score well on known eval cases, accelerating overfitting without surfacing production risk.

Relationships

Connected patterns

Causal flows inside Failure Modes, and related entries across the site.

Easy to confuse with

Nearby patterns and how this one differs.

  • Benchmark mirage is trusting someone else's score. Eval Goodhart is trusting your own score as it loses meaning.

  • Metric myopia is the general pattern. Eval Goodhart is the AI-specific version for behavioral evaluation.

  • Drift is model behavior moving without you. Eval Goodhart is your eval staying still while you optimize past its signal.

Heard in the wild

What it sounds like

The phrase that signals the pattern is about to start, and who tends to say it.

"Our eval scores are up 8% this sprint."

Said by: ai engineer in a team review

Notes from practice

What experienced people notice

Annotations from engineers who have worked this pattern before.

Best moment (when intervention actually changes the trajectory)
When eval scores improve faster than production quality does.

Counter move (the specific action that breaks the pattern)
If the eval set has not changed, the score improvement may be overfitting, not progress.

False positive (when this pattern is actually the correct call)
Evals are essential. The failure mode begins when the eval becomes the target rather than the signal.