The Hard Parts.dev
FM-17 · AI Failure Modes

RAG Without Ground Truth

A retrieval-augmented system is built and deployed before source quality, citation reliability, and answer validation are established.

Severity: high
Frequency: increasing
Lifecycle: build · operate
Recovery: hard
Confidence: high
At a glance (FM-17)
Also known as

  • retrieval theater
  • confident hallucination pipeline
  • citation placebo
  • knowledge assistant without knowledge

First noticed by

  • AI engineer
  • support lead
  • domain expert

Mistaken for: knowledge assistant progress
Often mistaken as: working knowledge retrieval

Why it looks healthy

Concrete external tells that make the pattern read as responsible behavior.

  • Answers include citations and sound authoritative
  • Demos impress stakeholders and pass without pushback
  • Retrieval returns plausible-looking results every time
  • The system never says "I don't know"

Definition

What it is

Blast radius: product · trust · business

A RAG pipeline retrieves from untrusted or unvalidated sources and presents confident answers without ground-truth verification.

How it unfolds

The arc of the pattern

  1. Starts

    A team builds a retrieval system over a corpus of documents and it produces fluent, plausible answers.

  2. Feels reasonable because

    Answers look good and demos are impressive.

  3. Escalates

    Answers are confidently wrong. Sources are stale, out of scope, or misinterpreted. Users trust the system more than they should.

  4. Ends

    A significant error surfaces publicly or operationally. The system loses credibility that is hard to recover.

Recognition

Warning signs by stage

Observable signals as the pattern progresses.

Early

  • Source trust and freshness are vague.
  • Evaluation is based on demos rather than systematic testing.
  • No human expert has reviewed representative outputs.

Mid

  • Users encounter confident wrong answers.
  • Citations point to sources that do not support the claim.
  • Edge cases and out-of-scope questions are handled poorly.

Late

  • User distrust becomes explicit.
  • Support receives escalations about false information.
  • The team cannot explain which sources are authoritative.
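The mid-stage signal above, citations that do not support the claim, can be spot-checked mechanically before a full expert audit. A minimal sketch using a crude lexical-overlap heuristic; the function name, stopword list, and threshold are illustrative assumptions, not part of any specific tool:

```python
# Crude spot check: does the cited passage actually contain the key
# terms of the claim it is supposed to support? Failures go to a human
# reviewer; passes are NOT proof of grounding, only a cheap filter.

def supports_claim(claim: str, cited_passage: str, min_overlap: float = 0.5) -> bool:
    """Fraction of the claim's content words that appear in the passage."""
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "or"}
    claim_terms = {w for w in claim.lower().split() if w not in stopwords}
    passage_words = set(cited_passage.lower().split())
    if not claim_terms:
        return False
    overlap = len(claim_terms & passage_words) / len(claim_terms)
    return overlap >= min_overlap

# Flag any answer whose citation fails the check for expert review.
print(supports_claim("refunds are issued within 30 days",
                     "Refunds are issued within 30 days of purchase."))  # True
```

A lexical check like this misses paraphrase, so treat a failure as a review trigger rather than a verdict.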

Root causes

Why it happens

  • Source quality is ignored in favor of fluency
  • Evaluation is weak or demo-based
  • No ground-truth corpus was built before deployment
  • Hallucination risk is underestimated

Response

What to do

Immediate triage first, then structural fixes.

First move

Define what the system should refuse to answer, and test that behavior before you test what it can answer.
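One way to make the first move concrete is a refusal eval that runs before any accuracy eval. A minimal sketch: `answer_fn` stands in for the deployed pipeline, and the refusal markers and out-of-scope questions are illustrative placeholders, not from any specific framework.

```python
# "Test refusal first": assert the system declines questions it has no
# trusted source for, before measuring what it answers well.

REFUSAL_MARKERS = ("i don't know", "out of scope", "cannot answer")

def is_refusal(response: str) -> bool:
    """Treat a response as a refusal if it contains a known refusal phrase."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

# Questions with no trusted source behind them; the system must refuse all.
OUT_OF_SCOPE = [
    "What will our stock price be next quarter?",
    "What is in our competitor's internal roadmap?",
]

def refusal_rate(answer_fn, questions=OUT_OF_SCOPE) -> float:
    """Fraction of out-of-scope questions the system correctly refuses."""
    return sum(is_refusal(answer_fn(q)) for q in questions) / len(questions)
```

Gating deployment on a refusal rate of 1.0 over the out-of-scope set enforces the "what it should refuse" behavior before any in-scope accuracy claims are made.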

Hard trade-off

Accept a narrower scope (fewer answers, more refusals) so the answers given are defensible.

Recovery trap

Adding more sources to the index, which increases surface area for plausible-sounding but ungrounded output.

Immediate actions

  • Define and publish the trusted source list
  • Add citation requirements and source verification
  • Build a hallucination test set from known facts in the domain
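The last immediate action, a hallucination test set built from known facts, can be sketched as a small assertion suite. The facts, field names, and `answer_fn` below are placeholders; a real set is drawn from the domain's ground-truth corpus.

```python
# Each case pairs a question with a fact the answer must contain.
# Any omission is a candidate hallucination (or an unwanted refusal).

KNOWN_FACTS = [
    {"question": "What is the standard warranty period?",
     "must_contain": "12 months"},
    {"question": "Which plan includes priority support?",
     "must_contain": "enterprise"},
]

def run_hallucination_suite(answer_fn, facts=KNOWN_FACTS):
    """Return the questions whose answers omit the known fact."""
    failures = []
    for case in facts:
        response = answer_fn(case["question"])
        if case["must_contain"].lower() not in response.lower():
            failures.append(case["question"])
    return failures  # empty list: every known fact was answered correctly
```

Running this suite on every index or prompt change turns "the demo looks good" into a measurable regression check.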

Structural fixes

  • Maintain a curated, versioned ground-truth corpus
  • Run regular answer quality audits with domain experts
  • Build fallback responses for out-of-scope queries
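One way to make the "curated, versioned ground-truth corpus" fix concrete is a manifest that records each trusted source's version and last expert review, with a staleness check run before (re)indexing. The file names, field names, and 180-day threshold are illustrative assumptions.

```python
from datetime import date, timedelta

# Manifest of trusted sources: what is indexed, which version, and when
# a domain expert last reviewed it.
CORPUS_MANIFEST = [
    {"source": "pricing-policy.md", "version": "2024-03",
     "last_reviewed": date(2024, 3, 1)},
    {"source": "support-faq.md", "version": "2023-01",
     "last_reviewed": date(2023, 1, 15)},
]

def stale_sources(manifest, max_age_days=180, today=None):
    """List sources whose last expert review is older than the threshold."""
    today = today or date.today()
    cutoff = timedelta(days=max_age_days)
    return [entry["source"] for entry in manifest
            if today - entry["last_reviewed"] > cutoff]
```

Refusing to index anything absent from the manifest, and blocking deployment while `stale_sources` is non-empty, is what makes the source list a control rather than documentation.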

What not to do

  • Do not treat fluency as accuracy
  • Do not deploy without domain expert sign-off on representative outputs

AI impact

How AI distorts this pattern

Where AI-assisted workflows accelerate, hide, or help with this failure mode.

AI can help with

  • AI can improve discoverability and summarization if retrieval and validation are disciplined.

AI can make worse by

  • AI makes plausible-sounding nonsense feel authoritative and confident, which is the core risk of this failure mode.

Relationships

Connected patterns

Causal flows inside Failure Modes, and related entries across the site.

Easy to confuse with

Nearby patterns and how this one differs.

  • Drift is change over time. This one is launched broken.

  • Context hoarding adds content hoping quality improves. RAG without ground truth is the same impulse at retrieval time.

  • Adjacent concept: Legitimate RAG deployment

    Real RAG has validated sources, adversarial evals, and explicit don't-know paths. Everything else is a demo in production.

Heard in the wild

What it sounds like

The phrase that signals the pattern is about to start, and who tends to say it.

"The demo looks great. It's pulling from our docs."

Said by: product manager or AI engineer, after a demo

Notes from practice

What experienced people notice

Annotations from engineers who have worked this pattern before.

Best moment (when intervention actually changes the trajectory):
Before the system is presented to users as authoritative.

Counter move (the specific action that breaks the pattern):
Ask what happens when it does not know the answer before asking what happens when it does.

False positive (when this pattern is actually the correct call):
RAG is a powerful pattern. The failure mode is deploying it without ground truth, not using it.