Eval Goodhart
Internal evaluation sets become optimization targets rather than honest capability measures, producing models or prompts that score well but behave poorly in production.
- Also known as: eval overfitting, benchmark poisoning, internal benchmark trap, metric-shaped evaluation
- First noticed by: ai engineer, staff engineer, product lead
- Mistaken for: rigorous internal evaluation
- Often mistaken for: effective AI quality management
Why it looks healthy
Concrete external tells that make the pattern read as responsible behavior.
- Evaluation scores improve steadily over model or prompt versions
- The eval pipeline is automated and polished
- Leadership sees consistent "quality improvements"
- Other teams cite the eval harness as best practice
Definition
What it is
Blast radius: product, trust, business
A team builds an evaluation harness to measure AI system quality, and then optimizes the system to perform on that harness rather than on real user tasks.
How it unfolds
The arc of the pattern
- Starts: A team builds an eval suite to measure quality and sets improvement targets.
- Feels reasonable because: Evals are a legitimate and important quality signal.
- Escalates: The eval suite becomes the primary measure. Changes are made that improve eval scores without improving real behavior.
- Ends: The system scores well internally but users report deteriorating quality. The eval set has been implicitly overfitted.
Recognition
Warning signs by stage
Observable signals as the pattern progresses.
Early
- Eval scores are the primary success metric for AI quality.
- Eval set is not regularly refreshed with new real-world cases.
- Changes that hurt eval scores are rejected regardless of user signal.
Mid
- Eval scores improve but user complaints do not decrease.
- The eval set has not changed significantly in months.
- Production behavior diverges from eval behavior in specific scenarios.
Late
- A production failure occurs in a scenario the eval does not cover.
- The team realizes the eval set reflects old use cases, not current ones.
- Trust in the eval as a quality signal collapses internally.
Root causes
Why it happens
- Evals are treated as targets rather than signals
- Eval sets are not maintained alongside production behavior
- Goodhart's Law applies to internal evals as much as external benchmarks
- Production feedback loop is weak or slow
Response
What to do
Immediate triage first, then structural fixes.
First move
Swap in a fresh eval set drawn from last week's production cases and re-score the current system - if the score regresses, the previous number was overfit.
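The first move above can be sketched as a simple re-score comparison. This is an illustrative sketch, not a real harness API: `run_eval`, the case dictionary format, and the regression tolerance are all assumptions.

```python
# Sketch: re-score the current system on a fresh eval set drawn from
# recent production cases, and compare against the stale score.
# All names and the case format here are hypothetical.

def run_eval(system, cases):
    """Score a system as the fraction of cases it answers correctly."""
    passed = sum(1 for case in cases if system(case["input"]) == case["expected"])
    return passed / len(cases)

def check_for_overfit(system, stale_cases, fresh_cases, tolerance=0.05):
    """Compare the stale-set score with a fresh production-drawn score.

    A drop larger than `tolerance` on the fresh set suggests the stale
    number was overfit rather than a real capability gain.
    """
    stale_score = run_eval(system, stale_cases)
    fresh_score = run_eval(system, fresh_cases)
    overfit = (stale_score - fresh_score) > tolerance
    return stale_score, fresh_score, overfit
```

If the fresh score holds up, the old number was probably honest; if it regresses, treat the old eval set as compromised rather than the fresh one as "too hard".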
Hard trade-off
Accept "worse" scores on a harder eval set, or accept scores that stopped reflecting production behavior some time ago.
Recovery trap
Expanding the existing eval set with more cases of the same kind, which doesn't help if optimization pressure stays pointed at it.
Immediate actions
- Refresh the eval set with recent production cases
- Add a production behavior signal alongside eval scores
- Review cases where eval score and user signal diverge
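The divergence review above can be sketched as a scan over version history, flagging releases where the eval score rose but the production signal did not improve. The field names (`eval_score`, `complaint_rate`) and thresholds are hypothetical.

```python
# Sketch: flag versions whose eval score improved while a production
# signal (here, complaint rate) stayed flat or worsened.
# Field names and thresholds are assumptions for illustration.

def diverging_versions(history, eval_gain=0.02, complaint_drop=0.0):
    """Return versions where eval score rose but complaints did not fall.

    `history` is an ordered list of dicts, one per released version.
    """
    flagged = []
    for prev, curr in zip(history, history[1:]):
        score_up = curr["eval_score"] - prev["eval_score"] >= eval_gain
        complaints_flat = prev["complaint_rate"] - curr["complaint_rate"] <= complaint_drop
        if score_up and complaints_flat:
            flagged.append(curr["version"])
    return flagged
```

A run of flagged versions is the mid-stage warning sign in code: the score keeps rising while the signal it was supposed to track stands still.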
Structural fixes
- Treat eval sets as living documents tied to current use cases
- Pair eval scores with human review and production sampling
- Separate the eval set from the optimization target explicitly
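One way to make the last fix concrete is a visible/hidden split of eval cases: the optimization loop only ever sees the visible set, while the hidden holdout is scored but never tuned against, and is rotated as new production cases arrive. All names here are illustrative, not a known tool.

```python
# Sketch: separate the optimization target (visible set) from the
# quality signal (hidden holdout), and keep the holdout fresh.
# Function names and sizes are assumptions for illustration.

import random

def split_eval_sets(cases, holdout_fraction=0.3, seed=0):
    """Deterministically split cases into (visible, hidden) sets.

    The hidden holdout must never be exposed to prompt/model tuning.
    """
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * holdout_fraction)
    return shuffled[cut:], shuffled[:cut]  # (visible, hidden)

def refresh_holdout(hidden, new_production_cases, max_size=200):
    """Fold recent production cases into the holdout, keeping the newest."""
    combined = hidden + new_production_cases
    return combined[-max_size:]
```

The design choice is the point: a holdout that is regularly refreshed from production can still surprise you, which is exactly what the overfitted primary set no longer does.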
What not to do
- Do not abandon evals because they can be gamed
- Do not treat eval score improvement as equivalent to user experience improvement
AI impact
How AI distorts this pattern
Where AI-assisted workflows accelerate, hide, or help with this failure mode.
AI can help with
- AI can help generate diverse and adversarial eval cases that are harder to overfit and better represent real production edge cases.
AI can make it worse by
- AI can optimize prompts and configurations to score well on known eval cases, accelerating overfitting without surfacing production risk.
AI false confidence
AI systems can be tuned to hit known eval cases almost perfectly while behaving worse on everything else - the score keeps rising as the signal keeps shrinking.
AI synthesis
An eval that cannot surprise you is not testing anything you do not already know.
Relationships
Connected patterns
Causal flows inside Failure Modes, and related entries across the site.
Easy to confuse with
Nearby patterns and how this one differs.
- Benchmark mirage is trusting someone else's score. Eval Goodhart is trusting your own score as it loses meaning.
- Metric myopia is the general pattern. Eval Goodhart is the AI-specific version for behavioral evaluation.
- Drift is model behavior moving without you. Eval Goodhart is your eval staying still while you optimize past its signal.
Heard in the wild
What it sounds like
The phrase that signals the pattern is about to start, and who tends to say it.
"Our eval scores are up 8% this sprint."
Said by: ai engineer, in a team review
Notes from practice
What experienced people notice
Annotations from engineers who have worked this pattern before.
- Best moment (when intervention actually changes the trajectory): When eval scores improve faster than production quality does.
- Counter move (the specific action that breaks the pattern): If the eval set has not changed, the score improvement may be overfitting, not progress.
- False positive (when this pattern is actually the correct call): Evals are essential. The failure mode begins when the eval becomes the target rather than the signal.