
Build a grounded RAG system

Design RAG around source trust, retrieval quality, task fit, and answer behavior so that citation and grounding mean something more than 'the model found text nearby'.

Difficulty: high
Time horizon: weeks to months
Primary owner: AI engineer
Confidence: high
At a glance · EP-04
Situation: A team wants retrieval-augmented generation that is genuinely grounded in trustworthy sources.
Goal: Build a RAG system that produces answers users can inspect, trust, and challenge because the retrieval and source model are explicit and meaningful.
Do not use when: the task is not really knowledge retrieval and needs a different architecture.
Primary owner: AI engineer
Roles involved

AI engineer · product owner · domain expert · knowledge-source owners · evaluation owner

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • The team is building a knowledge assistant, support assistant, or internal search-answer system
  • Users need source-backed answers rather than purely generative ones
  • The corpus contains mixed trust levels or mixed freshness levels
  • Hallucination or ungrounded synthesis would be costly

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Many RAG systems are only cosmetically grounded: they cite text, but the text they cite is not authoritative. Grounding is only useful when the system knows which sources count, how much freshness matters, and when the model should refuse or narrow its claims.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs of the playbook done well

  • The source corpus is curated by trust and freshness, not only indexed by convenience
  • Retrieval matches the real task and answer form
  • Answers reflect source quality and uncertainty honestly
  • Users can inspect why the answer should be believed or doubted
  • Source citations help verification instead of creating fake authority

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Target user tasks
  • Candidate source corpus
  • Source trust model
  • Freshness requirements
  • Expected answer shapes
  • Evaluation scenarios

Prerequisites

Conditions that should be true for this to work.

  • The team can describe what users actually need to know or do
  • There is a way to distinguish authoritative from merely available sources
  • Some evaluation set or real task sample exists

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre. Illustrative code sketches for each step follow the procedure.

  1. Define what grounded must mean here

    Avoid cargo-cult RAG.

    Actions

    • State what kinds of claims require source support
    • Decide how freshness, authority, and completeness affect answer trust
    • Clarify when the system should answer, hedge, narrow, or refuse

    Outputs

    • Grounding definition
  2. Curate the corpus by trust and task relevance

    Prevent low-quality retrieval from becoming polished misinformation.

    Actions

    • Classify sources by authority, ownership, freshness, and use-case fit
    • Exclude or downgrade sources that humans would not trust in serious work
    • Prefer smaller high-trust corpora over broad low-trust indexing at first

    Outputs

    • Source trust model
    • Curated corpus
  3. Design retrieval for the user’s real task

    Make retrieval serve the job, not just semantic similarity.

    Actions

    • Choose chunking, indexing, and ranking strategies that preserve useful context
    • Handle structured, semi-structured, and long-form sources differently where needed
    • Test retrieval on real task queries, not only synthetic ones

    Outputs

    • Retrieval design
    • Task-based retrieval cases
  4. Constrain answer behavior

    Keep the model from overstating weak evidence.

    Actions

    • Require attribution patterns appropriate to the task
    • Shape the model to answer within supported evidence
    • Surface uncertainty when source support is partial or conflicting

    Outputs

    • Answer policy
  5. Evaluate retrieval, grounding, and usefulness together

    Avoid scoring the wrong thing.

    Actions

    • Measure whether the right sources were retrieved
    • Check whether the answer respected the evidence
    • Test whether users could verify or act from the answer safely

    Outputs

    • RAG evaluation pack
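
To make step 1 concrete, here is a minimal sketch of a machine-readable grounding definition. It assumes a single evidence-strength score in [0, 1]; the claim categories, field names, and thresholds are illustrative assumptions, not prescriptions.

```python
# Sketch of a grounding policy: which claims need support, and how
# evidence strength maps to answer behavior. Thresholds are assumed.
from dataclasses import dataclass
from enum import Enum


class AnswerMode(Enum):
    ANSWER = "answer"   # full answer, cite supporting sources
    HEDGE = "hedge"     # answer with explicit uncertainty language
    NARROW = "narrow"   # answer only the well-supported sub-claim
    REFUSE = "refuse"   # decline: evidence is absent or conflicting


@dataclass(frozen=True)
class GroundingPolicy:
    claims_requiring_support: tuple[str, ...]  # e.g. factual, numeric
    max_source_age_days: int                   # freshness budget
    min_support_to_answer: float               # score needed to ANSWER
    min_support_to_hedge: float                # below this, NARROW/REFUSE

    def decide(self, support: float, conflicting: bool) -> AnswerMode:
        """Map evidence strength to required answer behavior."""
        if conflicting:
            return AnswerMode.REFUSE
        if support >= self.min_support_to_answer:
            return AnswerMode.ANSWER
        if support >= self.min_support_to_hedge:
            return AnswerMode.HEDGE
        return AnswerMode.NARROW if support > 0 else AnswerMode.REFUSE


policy = GroundingPolicy(
    claims_requiring_support=("factual", "numeric", "procedural"),
    max_source_age_days=90,
    min_support_to_answer=0.8,
    min_support_to_hedge=0.5,
)
print(policy.decide(support=0.6, conflicting=False))  # AnswerMode.HEDGE
```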
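
For step 2, a sketch of a source trust model, assuming every source record carries an accountable owner, an authority tier, and a last-reviewed date. The tier names, weights, and downgrade rule are illustrative.

```python
# Sketch: score a source for inclusion and ranking. A score of 0.0
# means "exclude from the index"; stale sources are downgraded.
from dataclasses import dataclass
from datetime import date, timedelta

TIER_WEIGHT = {"authoritative": 1.0, "reviewed": 0.7, "informal": 0.3}


@dataclass
class Source:
    uri: str
    owner: str                 # accountable team or person
    tier: str                  # authoritative | reviewed | informal
    last_reviewed: date
    use_cases: frozenset[str]  # tasks this source is fit for


def trust_score(src: Source, task: str, max_age: timedelta) -> float:
    if task not in src.use_cases:
        return 0.0                                # wrong task: exclude
    weight = TIER_WEIGHT.get(src.tier, 0.0)       # unknown tier: exclude
    if date.today() - src.last_reviewed > max_age:
        weight *= 0.5                             # stale: downgrade
    return weight


wiki = Source("wiki/runbook", "sre-team", "reviewed",
              date(2024, 1, 10), frozenset({"ops-support"}))
print(trust_score(wiki, "ops-support", timedelta(days=180)))
```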
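
For step 3, a sketch of task-aware ranking that blends relevance with trust, so that "similar but untrustworthy" loses to "slightly less similar but authoritative". The `embed_similarity` function is a crude stand-in for a real embedding model, and each chunk keeps its parent-section heading so retrieved text stays interpretable; the blend weights are assumptions.

```python
# Sketch: rank chunks by similarity weighted by source trust (step 2).
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    section: str      # parent heading kept with the chunk for context
    source_uri: str
    trust: float      # from the source trust model


def embed_similarity(query: str, text: str) -> float:
    """Stand-in for real embedding similarity: crude token overlap."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)


def rank(query: str, chunks: list[Chunk], k: int = 5) -> list[Chunk]:
    # Blend relevance with trust; drop chunks with no signal at all.
    scored = [(embed_similarity(query, c.text) * (0.5 + 0.5 * c.trust), c)
              for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


chunks = [Chunk("Restart the ingest worker to clear the queue.",
                "Runbook / Ingest", "wiki/runbook", trust=0.7),
          Chunk("someone said restarting might help??",
                "Chat export", "slack/dump", trust=0.2)]
print(rank("how do I clear the ingest queue", chunks, k=1)[0].source_uri)
```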
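
For step 4, a sketch of enforcing an answer policy after generation. It assumes the model is asked to emit claims tagged with the ids of the chunks it relied on; that claim format and the flag wording are assumptions for illustration.

```python
# Sketch: flag claims whose citations are missing or fabricated instead
# of letting articulate prose pass as grounded.
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    cited_chunk_ids: list[str]


def enforce_policy(claims: list[Claim], retrieved_ids: set[str]) -> list[str]:
    checked = []
    for claim in claims:
        real = [c for c in claim.cited_chunk_ids if c in retrieved_ids]
        if not claim.cited_chunk_ids:
            checked.append(f"[UNSUPPORTED, verify before use] {claim.text}")
        elif len(real) < len(claim.cited_chunk_ids):
            # Cites something that was never retrieved: fabricated authority.
            checked.append(f"[CITATION MISMATCH] {claim.text}")
        else:
            checked.append(f"{claim.text} [{', '.join(real)}]")
    return checked


retrieved = {"doc1#3", "doc2#1"}
claims = [Claim("Restarts clear the cache.", ["doc1#3"]),
          Claim("The limit is 500 rps.", ["doc9#9"])]
print("\n".join(enforce_policy(claims, retrieved)))
```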
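
For step 5, a sketch of an evaluation record that scores retrieval, grounding, and usefulness together rather than in isolation. The field names and the rubric judgments (evidence respected, user could verify) are assumptions about how the evaluation pack is recorded.

```python
# Sketch: one record per test case; a system must do well on all three
# scores, since any one alone can mask the other failures.
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    gold_source_ids: set[str]   # sources a human would consider correct
    retrieved_ids: set[str]
    answer_respected_evidence: bool
    user_could_verify: bool


def score(cases: list[EvalCase]) -> dict[str, float]:
    n = max(len(cases), 1)
    recall = sum(len(c.gold_source_ids & c.retrieved_ids)
                 / max(len(c.gold_source_ids), 1) for c in cases) / n
    return {
        "retrieval_recall": recall,
        "evidence_alignment":
            sum(c.answer_respected_evidence for c in cases) / n,
        "verification_success":
            sum(c.user_could_verify for c in cases) / n,
    }


cases = [EvalCase("reset limits?", {"kb/limits"}, {"kb/limits", "blog/old"},
                  answer_respected_evidence=True, user_could_verify=True)]
print(score(cases))
```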

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • What sources are truly authoritative for this task?
  • How much freshness matters to the answer?
  • When should the system cite, abstain, or narrow the claim?
  • What retrieval failure matters most for users: missing key context, wrong source ranking, or stale truth?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What counts as an authoritative source for this task?
  • Would a human trust the same sources the RAG system is citing?
  • When should the assistant refuse to answer because evidence is weak or conflicting?

Common mistakes

Patterns that surface across teams running this playbook.

  • Indexing everything because more data feels safer
  • Confusing accessible sources with authoritative sources
  • Treating citation presence as proof of grounding
  • Evaluating only answer fluency rather than evidence quality

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • Users say the answer cited sources but still felt untrustworthy
  • The system retrieves documents humans would never rely on directly
  • Evaluation looks good on retrieval recall but fails in real workflow use
  • The model sounds more certain than the source support allows

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Grounding definition
  • Source trust model
  • Curated corpus
  • Retrieval design
  • Answer policy
  • RAG evaluation pack

Success signals

Observable changes that mean the playbook landed.

  • Users can trace answers back to trusted evidence
  • Low-trust source leakage declines
  • Real task usefulness improves without fake confidence increasing
  • Source and answer failures become easier to diagnose separately

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Review source trust and freshness continuously
  • Add task-specific retrieval cases as real user needs evolve
  • Connect repeated retrieval failures to corpus and ranking redesign

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding. A sketch of computing one of these from answer logs follows the list.

  • High-trust source retrieval rate
  • Answer-evidence alignment rate
  • User verification success rate
  • Low-trust citation frequency
  • Freshness failure incidents
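
As one example of operationalizing these signals, here is a sketch of computing low-trust citation frequency from answer logs. It assumes each log row lists the trust tier of every source the answer cited; the log shape and tier name are assumptions.

```python
# Sketch: fraction of answers that cited at least one low-trust source.
def low_trust_citation_frequency(logs: list[list[str]]) -> float:
    flagged = sum(1 for cited_tiers in logs if "informal" in cited_tiers)
    return flagged / max(len(logs), 1)


logs = [["authoritative"], ["informal", "reviewed"], ["reviewed"]]
print(low_trust_citation_frequency(logs))  # 0.333...
```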

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Proposing chunking candidates and retrieval structures
  • Summarizing source differences and trust annotations
  • Drafting evaluation sets from real documents and user tasks

AI can make worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Sounding more grounded than the sources justify
  • Masking weak retrieval with articulate synthesis
  • Encouraging broader corpus inclusion without trust discipline

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.