The Hard Parts.dev
FM-17 · AI Failure Modes

RAG Without Ground Truth

A retrieval-augmented system is built and deployed before source quality, citation reliability, and answer validation are established.

Severity: high
Frequency: increasing
Lifecycle: build · operate
Recovery: hard
Confidence: high
At a glance (FM-17)
Also known as

  • retrieval theater
  • confident hallucination pipeline
  • citation placebo
  • knowledge assistant without knowledge

First noticed by

  • AI engineer
  • support lead
  • domain expert

Mistaken for: knowledge assistant progress
Often mistaken as: working knowledge retrieval

Why it looks healthy

Concrete external tells that make the pattern read as responsible behavior.

  • Answers include citations and sound authoritative
  • Demos impress stakeholders and pass without pushback
  • Retrieval returns plausible-looking results every time
  • The system never says "I don't know"

Definition

What it is

Blast radius: product · trust · business

A RAG pipeline retrieves from untrusted or unvalidated sources and presents confident answers without ground-truth verification.

How it unfolds

The arc of the pattern

  1. Starts

    A team builds a retrieval system over a corpus of documents and it produces fluent, plausible answers.

  2. Feels reasonable because

    Answers look good and demos are impressive.

  3. Escalates

    Answers are confidently wrong. Sources are stale, out of scope, or misinterpreted. Users trust the system more than they should.

  4. Ends

    A significant error surfaces publicly or operationally. The system loses credibility that is hard to recover.

Recognition

Warning signs by stage

Observable signals as the pattern progresses.

Early

  • Source trust and freshness are vague.
  • Evaluation is based on demos rather than systematic testing.
  • No human expert has reviewed representative outputs.

Mid

  • Users encounter confident wrong answers.
  • Citations point to sources that do not support the claim.
  • Edge cases and out-of-scope questions are handled poorly.

Late

  • User distrust becomes explicit.
  • Support receives escalations about false information.
  • The team cannot explain which sources are authoritative.
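The mid-stage signal above, citations that do not support the claim, can be spot-checked mechanically before a full expert audit. A minimal sketch using a crude lexical-overlap heuristic; the function name, stopword list, and threshold are illustrative assumptions, not part of any specific tool:

```python
# Crude spot check: does the cited passage actually contain the key
# terms of the claim it is supposed to support? Failures go to a human
# reviewer; passes are NOT proof of grounding, only a cheap filter.

def supports_claim(claim: str, cited_passage: str, min_overlap: float = 0.5) -> bool:
    """Fraction of the claim's content words that appear in the passage."""
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or"}
    claim_terms = {w for w in claim.lower().split() if w not in stopwords}
    passage_words = set(cited_passage.lower().split())
    if not claim_terms:
        return False
    overlap = len(claim_terms & passage_words) / len(claim_terms)
    return overlap >= min_overlap

# Flag any answer whose citation fails the check for expert review.
print(supports_claim("refunds are issued within 30 days",
                     "Refunds are issued within 30 days of purchase."))  # True
```

A lexical check like this misses paraphrase, so treat a failure as a review trigger rather than a verdict.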

Root causes

Why it happens

  • Source quality is ignored in favor of fluency
  • Evaluation is weak or demo-based
  • No ground-truth corpus was built before deployment
  • Hallucination risk is underestimated

Response

What to do

Immediate triage first, then structural fixes.

First move

Define what the system should refuse to answer, and test that behavior before you test what it can answer.
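One way to make the first move concrete is a refusal eval that runs before any accuracy eval. A minimal sketch: `answer_fn` stands in for the deployed pipeline, and the refusal markers and out-of-scope questions are illustrative placeholders, not from any specific framework.

```python
# "Test refusal first": assert the system declines questions it has no
# trusted source for, before measuring what it answers well.

REFUSAL_MARKERS = ("i don't know", "out of scope", "cannot answer")

def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it contains a known refusal phrase."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

# Questions with no trusted source behind them; the system must refuse all.
OUT_OF_SCOPE = [
    "What will our stock price be next quarter?",
    "What is in our competitor's internal roadmap?",
]

def refusal_rate(answer_fn, questions=OUT_OF_SCOPE) -> float:
    """Fraction of out-of-scope questions the system correctly refuses."""
    return sum(is_refusal(answer_fn(q)) for q in questions) / len(questions)
```

Gating deployment on a refusal rate of 1.0 over the out-of-scope set enforces the "what it should refuse" behavior before any in-scope accuracy claims are made.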

Hard trade-off

Accept a narrower scope (fewer answers, more refusals) so the answers given are defensible.

Recovery trap

Adding more sources to the index, which increases surface area for plausible-sounding but ungrounded output.

Immediate actions

  • Define and publish the trusted source list
  • Add citation requirements and source verification
  • Build a hallucination test set from known facts in the domain
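The last immediate action, a hallucination test set built from known facts, can be sketched as a small assertion suite. The facts, field names, and `answer_fn` below are placeholders; a real set is drawn from the domain's ground-truth corpus.

```python
# Each case pairs a question with a fact the answer must contain.
# Any omission is a candidate hallucination (or an unwanted refusal).

KNOWN_FACTS = [
    {"question": "What is the standard warranty period?",
     "must_contain": "12 months"},
    {"question": "Which plan includes priority support?",
     "must_contain": "enterprise"},
]

def run_hallucination_suite(answer_fn, facts=KNOWN_FACTS):
    """Return the questions whose answers omit the known fact."""
    failures = []
    for case in facts:
        response = answer_fn(case["question"])
        if case["must_contain"].lower() not in response.lower():
            failures.append(case["question"])
    return failures  # empty list: every known fact was answered correctly
```

Running this suite on every index or prompt change turns "the demo looks good" into a measurable regression check.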

Structural fixes

  • Maintain a curated, versioned ground-truth corpus
  • Run regular answer quality audits with domain experts
  • Build fallback responses for out-of-scope queries
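One way to make the "curated, versioned ground-truth corpus" fix concrete is a manifest that records each trusted source's version and last expert review, with a staleness check run before (re)indexing. The file names, field names, and 180-day threshold are illustrative assumptions.

```python
from datetime import date, timedelta

# Manifest of trusted sources: what is indexed, which version, and when
# a domain expert last reviewed it.
CORPUS_MANIFEST = [
    {"source": "pricing-policy.md", "version": "2024-03",
     "last_reviewed": date(2024, 3, 1)},
    {"source": "support-faq.md", "version": "2023-01",
     "last_reviewed": date(2023, 1, 15)},
]

def stale_sources(manifest, max_age_days=180, today=None):
    """List sources whose last expert review is older than the threshold."""
    today = today or date.today()
    cutoff = timedelta(days=max_age_days)
    return [entry["source"] for entry in manifest
            if today - entry["last_reviewed"] > cutoff]
```

Refusing to index anything absent from the manifest, and blocking deployment while `stale_sources` is non-empty, is what makes the source list a control rather than documentation.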

What not to do

  • Do not treat fluency as accuracy
  • Do not deploy without domain expert sign-off on representative outputs

AI impact

How AI distorts this pattern

Where AI-assisted workflows accelerate, hide, or help with this failure mode.

AI can help with

  • AI can improve discoverability and summarization if retrieval and validation are disciplined.

AI can make worse by

  • AI makes plausible-sounding nonsense feel authoritative and confident, which is the core risk of this failure mode.

Relationships

Connected patterns

Causal flows inside Failure Modes, and related entries across the site.

Easy to confuse with

Nearby patterns and how this one differs.

  • Drift is change over time. This one is launched broken.

  • Context hoarding adds content hoping quality improves. RAG without ground truth is the same impulse at retrieval time.

  • Adjacent concept: Legitimate RAG deployment

    Real RAG has validated sources, adversarial evals, and explicit don't-know paths. Everything else is a demo in production.

Heard in the wild

What it sounds like

The phrase that signals the pattern is about to start, and who tends to say it.

"The demo looks great. It's pulling from our docs."

Said by: product manager or AI engineer, after a demo

Notes from practice

What experienced people notice

Annotations from engineers who have worked this pattern before.

Best moment (when intervention actually changes the trajectory):
Before the system is presented to users as authoritative.

Counter move (the specific action that breaks the pattern):
Ask what happens when it does not know the answer before asking what happens when it does.

False positive (when this pattern is actually the correct call):
RAG is a powerful pattern. The failure mode is deploying it without ground truth, not using it.