
Eval Goodhart

Internal evaluation sets become optimization targets rather than honest capability measures, producing models or prompts that score well but behave poorly in production.

At a glance (FM-27)

Severity: high
Frequency: increasing
Lifecycle: build · delivery
Recovery: medium
Confidence: high

Also known as

eval overfitting · benchmark poisoning · internal benchmark trap · metric-shaped evaluation

First noticed by

ai engineer · staff engineer · product lead

Mistaken for
rigorous internal evaluation
Often mistaken for
effective AI quality management

Why it looks healthy

Concrete external tells that make the pattern read as responsible behavior.

  • Evaluation scores improve steadily over model or prompt versions
  • The eval pipeline is automated and polished
  • Leadership sees consistent "quality improvements"
  • Other teams cite the eval harness as best practice

Definition

What it is

Blast radius: product · trust · business

A team builds an evaluation harness to measure AI system quality, and then optimizes the system to perform on that harness rather than on real user tasks.
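
A minimal sketch of the mechanism, with hypothetical names throughout (EVAL_CASES, run_model, harness_score): once selection pressure points at a frozen case list, the accepted change is whatever moves that number.

```python
# Hypothetical harness: a frozen case list becomes the optimization target.
EVAL_CASES = [
    {"input": "What is the refund window?", "expected": "30 days"},
    {"input": "How do I reset my password?", "expected": "via settings"},
]  # written once, rarely refreshed

def run_model(prompt: str, case: dict) -> str:
    """Stand-in for the real model call."""
    return case["expected"] if "helpful" in prompt else "unsure"

def harness_score(prompt: str) -> float:
    """Fraction of the frozen cases the prompt passes."""
    hits = sum(run_model(prompt, c) == c["expected"] for c in EVAL_CASES)
    return hits / len(EVAL_CASES)

# The accepted change is whichever candidate moved this number,
# not whichever one helped users.
candidates = ["You are a helpful assistant.", "Answer briefly."]
best = max(candidates, key=harness_score)
print(f"picked: {best!r} (harness score {harness_score(best):.2f})")
```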

How it unfolds

The arc of the pattern

  1. Starts

    A team builds an eval suite to measure quality and sets improvement targets.

  2. Feels reasonable because

    Evals are a legitimate and important quality signal.

  3. Escalates

    The eval suite becomes the primary measure. Changes are made that improve eval scores without improving real behavior.

  4. Ends

    The system scores well internally but users report deteriorating quality. The eval set has been implicitly overfitted.

Recognition

Warning signs by stage

Observable signals as the pattern progresses.

Early

  • Eval scores are the primary success metric for AI quality.
  • Eval set is not regularly refreshed with new real-world cases.
  • Changes that hurt eval scores are rejected regardless of user signal.

Mid

  • Eval scores improve but user complaints do not decrease.
  • The eval set has not changed significantly in months (a quick staleness check is sketched after this list).
  • Production behavior diverges from eval behavior in specific scenarios.
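
One cheap check for the staleness sign above, assuming each eval case records the date it was added (the field name and path are hypothetical):

```python
# Flag an eval set whose newest case is older than a freshness budget.
# Assumes each JSONL case carries an "added" ISO date; adjust to your schema.
import json
from datetime import date, timedelta

MAX_AGE = timedelta(days=30)  # freshness budget; tune per team

def newest_case_age(path: str) -> timedelta:
    with open(path) as f:
        added = [date.fromisoformat(json.loads(line)["added"]) for line in f]
    return date.today() - max(added)

age = newest_case_age("evals/cases.jsonl")
if age > MAX_AGE:
    print(f"eval set is stale: newest case is {age.days} days old")
```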

Late

  • A production failure occurs in a scenario the eval does not cover.
  • The team realizes the eval set reflects old use cases, not current ones.
  • Trust in the eval as a quality signal collapses internally.

Root causes

Why it happens

  • Evals are treated as targets rather than signals
  • Eval sets are not maintained alongside production behavior
  • Goodhart's Law applies to internal evals as much as external benchmarks
  • Production feedback loop is weak or slow

Response

What to do

Immediate triage first, then structural fixes.

First move

Swap in a fresh eval set drawn from last week's production cases and re-score the current system: if the score regresses, the previous number was overfit.
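
A sketch of that triage step, assuming recent production transcripts have already been collected and graded into the same format as the frozen set; the paths and the grader here are stand-ins for your own harness:

```python
# Re-score against a fresh set drawn from recent production traffic and
# compare with the frozen set's number.
import json

def load_cases(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def grade(case: dict) -> bool:
    """Stand-in grader; in practice, your existing eval grader."""
    return case["output"].strip() == case["expected"].strip()

def mean_score(cases: list[dict]) -> float:
    return sum(grade(c) for c in cases) / len(cases)

frozen = mean_score(load_cases("evals/frozen_set.jsonl"))
fresh = mean_score(load_cases("evals/last_week_prod.jsonl"))

# A large gap is the tell: the frozen number measured the eval set,
# not the product.
print(f"frozen={frozen:.2f} fresh={fresh:.2f} gap={frozen - fresh:+.2f}")
```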

Hard trade-off

Accept "worse" scores on a harder eval set, or accept scores that stopped reflecting production behavior some time ago.

Recovery trap

Expanding the existing eval set with more cases of the same kind, which doesn't help if optimization pressure stays pointed at it.

Immediate actions

  • Refresh the eval set with recent production cases
  • Add a production behavior signal alongside eval scores (a divergence check is sketched after this list)
  • Review cases where eval score and user signal diverge
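
One shape the second action can take, a minimal sketch with hypothetical field names: report the eval score next to a production signal and alert when they move in opposite directions.

```python
# Report the internal eval score next to a production signal and alert
# when they diverge over a window of releases.
from dataclasses import dataclass

@dataclass
class QualitySnapshot:
    eval_score: float      # internal harness, 0..1
    thumbs_up_rate: float  # production signal, 0..1

def diverging(history: list[QualitySnapshot], window: int = 4) -> bool:
    """True when eval scores rose over the window while the user signal fell."""
    recent = history[-window:]
    if len(recent) < window:
        return False
    return (recent[-1].eval_score > recent[0].eval_score
            and recent[-1].thumbs_up_rate < recent[0].thumbs_up_rate)

history = [QualitySnapshot(0.78, 0.61), QualitySnapshot(0.81, 0.60),
           QualitySnapshot(0.85, 0.57), QualitySnapshot(0.88, 0.55)]
if diverging(history):
    print("eval score rising while user signal falls: suspect overfitting")
```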

Structural fixes

  • Treat eval sets as living documents tied to current use cases
  • Pair eval scores with human review and production sampling
  • Separate the eval set from the optimization target explicitly (one way to carve out a sealed holdout is sketched below)
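
A minimal sketch of that separation, assuming all cases currently live in one pool: carve out a sealed holdout that is scored only at release gates and never used to choose between changes.

```python
# Split the case pool: the dev set absorbs optimization pressure, the
# sealed holdout is scored only at release gates.
import random

def split_cases(cases: list[dict], holdout_frac: float = 0.3, seed: int = 0):
    """Deterministic split into a dev set and a sealed holdout."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_frac)
    return shuffled[cut:], shuffled[:cut]  # dev, holdout

cases = [{"id": i} for i in range(100)]
dev, holdout = split_cases(cases)

# Iterate daily against dev; never use holdout to choose between
# changes, and rotate fresh production cases into it over time.
print(f"{len(dev)} dev cases, {len(holdout)} holdout cases")
```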

What not to do

  • Do not abandon evals because they can be gamed
  • Do not treat eval score improvement as equivalent to user experience improvement

AI impact

How AI distorts this pattern

Where AI-assisted workflows accelerate, hide, or help with this failure mode.

AI can help with

  • AI can help generate diverse and adversarial eval cases that are harder to overfit and better represent real production edge cases (a hedged sketch follows).
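
A hedged sketch of what that generation can look like; complete() is a stand-in for whatever model client you use, and the prompt and JSON shape are illustrative rather than a recipe:

```python
# Hypothetical adversarial case generation. complete() stands in for a
# real LLM call; swap in your provider's SDK.
import json

def complete(prompt: str) -> str:
    raise NotImplementedError("wire up your model client here")

def generate_adversarial_cases(seed_case: dict, n: int = 5) -> list[dict]:
    """Ask a model for harder variants of one real production case."""
    prompt = (
        f"Rewrite this case into {n} harder variants (ambiguous wording, "
        "edge conditions, adversarial phrasing). Reply as a JSON list of "
        'objects with "input" and "expected" keys.\n' + json.dumps(seed_case)
    )
    return json.loads(complete(prompt))

# Generated cases inherit the generator's blind spots: human-review
# them before they enter the eval set.
```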

AI can make worse by

  • AI can optimize prompts and configurations to score well on known eval cases, accelerating overfitting without surfacing production risk.

Relationships

Connected patterns

Causal flows inside Failure Modes, and related entries across the site.

Easy to confuse with

Nearby patterns and how this one differs.

  • Benchmark mirage is trusting someone else's score. Eval Goodhart is trusting your own score as it loses meaning.

  • Metric myopia is the general pattern. Eval Goodhart is the AI-specific version for behavioral evaluation.

  • Drift is model behavior moving without you. Eval Goodhart is your eval staying still while you optimize past its signal.

Heard in the wild

What it sounds like

The phrase that signals the pattern is about to start, and who tends to say it.

"Our eval scores are up 8% this sprint."

Said by: ai engineer in a team review

Notes from practice

What experienced people notice

Annotations from engineers who have worked this pattern before.

Best moment (when intervention actually changes the trajectory)
When eval scores improve faster than production quality does.

Counter move (the specific action that breaks the pattern)
If the eval set has not changed, the score improvement may be overfitting, not progress.

False positive (when this pattern is actually the correct call)
Evals are essential. The failure mode begins when the eval becomes the target rather than the signal.