
Build a grounded RAG system

Design RAG around source trust, retrieval quality, task fit, and answer behavior so that citation and grounding mean something more than 'the model found text nearby'.

Difficulty: high
Time horizon: weeks to months
Primary owner: AI engineer
Confidence: high
At a glance · EP-04
Situation: A team wants retrieval-augmented generation that is genuinely grounded in trustworthy sources.
Goal: Build a RAG system that produces answers users can inspect, trust, and challenge because the retrieval and source model are explicit and meaningful.
Do not use when: the task is not really knowledge retrieval and needs a different architecture.
Primary owner: AI engineer
Roles involved

AI engineer · product owner · domain expert · knowledge-source owners · evaluation owner

Context

The situation

Deciding whether to reach for this playbook: when it fits, and when it doesn't.

Use when

Conditions where this playbook is the right tool.

  • The team is building a knowledge assistant, support assistant, or internal search-answer system
  • Users need source-backed answers rather than purely generative ones
  • The corpus contains mixed trust levels or mixed freshness levels
  • Hallucination or ungrounded synthesis would be costly

Stakes

Why this matters

What this playbook protects against, and why skipping or half-running it tends to be expensive.

Many RAG systems are only cosmetically grounded: they cite text, but the text they cite is not authoritative. Grounding is only useful when the system knows which sources count, how much freshness matters, and when the model should refuse or narrow its claims.

Quality bar

What good looks like

The observable qualities of a team or system that is actually doing this well. Not just going through the motions.

Signs of the playbook done well

  • The source corpus is curated by trust and freshness, not only indexed by convenience
  • Retrieval matches the real task and answer form
  • Answers reflect source quality and uncertainty honestly
  • Users can inspect why the answer should be believed or doubted
  • Source citations help verification instead of creating fake authority

Preparation

Before you start

What you need available and true before running the procedure. Skipping this is the most common reason playbooks fail.

Inputs

Material you'll want to gather first.

  • Target user tasks
  • Candidate source corpus
  • Source trust model
  • Freshness requirements
  • Expected answer shapes
  • Evaluation scenarios

Prerequisites

Conditions that should be true for this to work.

  • The team can describe what users actually need to know or do
  • There is a way to distinguish authoritative from merely available sources
  • Some evaluation set or real task sample exists

Procedure

The procedure

Each step carries its purpose (why it exists), its actions (what you do), and its outputs (what you produce). Read the purpose. It's what keeps the step from degenerating into checklist theatre. Illustrative code sketches for each step follow the procedure.

  1. Define what grounded must mean here

    Avoid cargo-cult RAG.

    Actions

    • State what kinds of claims require source support
    • Decide how freshness, authority, and completeness affect answer trust
    • Clarify when the system should answer, hedge, narrow, or refuse

    Outputs

    • Grounding definition
  2. Curate the corpus by trust and task relevance

    Prevent low-quality retrieval from becoming polished misinformation.

    Actions

    • Classify sources by authority, ownership, freshness, and use-case fit
    • Exclude or downgrade sources that humans would not trust in serious work
    • Prefer smaller high-trust corpora over broad low-trust indexing at first

    Outputs

    • Source trust model
    • Curated corpus
  3. Design retrieval for the user’s real task

    Make retrieval serve the job, not just semantic similarity.

    Actions

    • Choose chunking, indexing, and ranking strategies that preserve useful context
    • Handle structured, semi-structured, and long-form sources differently where needed
    • Test retrieval on real task queries, not only synthetic ones

    Outputs

    • Retrieval design
    • Task-based retrieval cases
  4. Constrain answer behavior

    Keep the model from overstating weak evidence.

    Actions

    • Require attribution patterns appropriate to the task
    • Shape the model to answer within supported evidence
    • Surface uncertainty when source support is partial or conflicting

    Outputs

    • Answer policy
  5. Evaluate retrieval, grounding, and usefulness together

    Avoid scoring the wrong thing.

    Actions

    • Measure whether the right sources were retrieved
    • Check whether the answer respected the evidence
    • Test whether users could verify or act from the answer safely

    Outputs

    • RAG evaluation pack
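
To make step 1 concrete, here is a minimal sketch of a machine-readable grounding definition. It assumes a single evidence-strength score in [0, 1]; the claim categories, field names, and thresholds are illustrative assumptions, not prescriptions.

```python
# Sketch of a grounding policy: which claims need support, and how
# evidence strength maps to answer behavior. Thresholds are assumed.
from dataclasses import dataclass
from enum import Enum


class AnswerMode(Enum):
    ANSWER = "answer"   # full answer, cite supporting sources
    HEDGE = "hedge"     # answer with explicit uncertainty language
    NARROW = "narrow"   # answer only the well-supported sub-claim
    REFUSE = "refuse"   # decline: evidence is absent or conflicting


@dataclass(frozen=True)
class GroundingPolicy:
    claims_requiring_support: tuple[str, ...]  # e.g. factual, numeric
    max_source_age_days: int                   # freshness budget
    min_support_to_answer: float               # score needed to ANSWER
    min_support_to_hedge: float                # below this, NARROW/REFUSE

    def decide(self, support: float, conflicting: bool) -> AnswerMode:
        """Map evidence strength to required answer behavior."""
        if conflicting:
            return AnswerMode.REFUSE
        if support >= self.min_support_to_answer:
            return AnswerMode.ANSWER
        if support >= self.min_support_to_hedge:
            return AnswerMode.HEDGE
        return AnswerMode.NARROW if support > 0 else AnswerMode.REFUSE


policy = GroundingPolicy(
    claims_requiring_support=("factual", "numeric", "procedural"),
    max_source_age_days=90,
    min_support_to_answer=0.8,
    min_support_to_hedge=0.5,
)
print(policy.decide(support=0.6, conflicting=False))  # AnswerMode.HEDGE
```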
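
For step 2, a sketch of a source trust model, assuming every source record carries an accountable owner, an authority tier, and a last-reviewed date. The tier names, weights, and downgrade rule are illustrative.

```python
# Sketch: score a source for inclusion and ranking. A score of 0.0
# means "exclude from the index"; stale sources are downgraded.
from dataclasses import dataclass
from datetime import date, timedelta

TIER_WEIGHT = {"authoritative": 1.0, "reviewed": 0.7, "informal": 0.3}


@dataclass
class Source:
    uri: str
    owner: str                 # accountable team or person
    tier: str                  # authoritative | reviewed | informal
    last_reviewed: date
    use_cases: frozenset[str]  # tasks this source is fit for


def trust_score(src: Source, task: str, max_age: timedelta) -> float:
    if task not in src.use_cases:
        return 0.0                                # wrong task: exclude
    weight = TIER_WEIGHT.get(src.tier, 0.0)       # unknown tier: exclude
    if date.today() - src.last_reviewed > max_age:
        weight *= 0.5                             # stale: downgrade
    return weight


wiki = Source("wiki/runbook", "sre-team", "reviewed",
              date(2024, 1, 10), frozenset({"ops-support"}))
print(trust_score(wiki, "ops-support", timedelta(days=180)))
```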
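
For step 3, a sketch of task-aware ranking that blends relevance with trust, so that "similar but untrustworthy" loses to "slightly less similar but authoritative". The `embed_similarity` function is a crude stand-in for a real embedding model, and each chunk keeps its parent-section heading so retrieved text stays interpretable; the blend weights are assumptions.

```python
# Sketch: rank chunks by similarity weighted by source trust (step 2).
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    section: str      # parent heading kept with the chunk for context
    source_uri: str
    trust: float      # from the source trust model


def embed_similarity(query: str, text: str) -> float:
    """Stand-in for real embedding similarity: crude token overlap."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)


def rank(query: str, chunks: list[Chunk], k: int = 5) -> list[Chunk]:
    # Blend relevance with trust; drop chunks with no signal at all.
    scored = [(embed_similarity(query, c.text) * (0.5 + 0.5 * c.trust), c)
              for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


chunks = [Chunk("Restart the ingest worker to clear the queue.",
                "Runbook / Ingest", "wiki/runbook", trust=0.7),
          Chunk("someone said restarting might help??",
                "Chat export", "slack/dump", trust=0.2)]
print(rank("how do I clear the ingest queue", chunks, k=1)[0].source_uri)
```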
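
For step 4, a sketch of enforcing an answer policy after generation. It assumes the model is asked to emit claims tagged with the ids of the chunks it relied on; that claim format and the flag wording are assumptions for illustration.

```python
# Sketch: flag claims whose citations are missing or fabricated instead
# of letting articulate prose pass as grounded.
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    cited_chunk_ids: list[str]


def enforce_policy(claims: list[Claim], retrieved_ids: set[str]) -> list[str]:
    checked = []
    for claim in claims:
        real = [c for c in claim.cited_chunk_ids if c in retrieved_ids]
        if not claim.cited_chunk_ids:
            checked.append(f"[UNSUPPORTED, verify before use] {claim.text}")
        elif len(real) < len(claim.cited_chunk_ids):
            # Cites something that was never retrieved: fabricated authority.
            checked.append(f"[CITATION MISMATCH] {claim.text}")
        else:
            checked.append(f"{claim.text} [{', '.join(real)}]")
    return checked


retrieved = {"doc1#3", "doc2#1"}
claims = [Claim("Restarts clear the cache.", ["doc1#3"]),
          Claim("The limit is 500 rps.", ["doc9#9"])]
print("\n".join(enforce_policy(claims, retrieved)))
```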
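
For step 5, a sketch of an evaluation record that scores retrieval, grounding, and usefulness together rather than in isolation. The field names and the rubric judgments (evidence respected, user could verify) are assumptions about how the evaluation pack is recorded.

```python
# Sketch: one record per test case; a system must do well on all three
# scores, since any one alone can mask the other failures.
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    gold_source_ids: set[str]   # sources a human would consider correct
    retrieved_ids: set[str]
    answer_respected_evidence: bool
    user_could_verify: bool


def score(cases: list[EvalCase]) -> dict[str, float]:
    n = max(len(cases), 1)
    recall = sum(len(c.gold_source_ids & c.retrieved_ids)
                 / max(len(c.gold_source_ids), 1) for c in cases) / n
    return {
        "retrieval_recall": recall,
        "evidence_alignment":
            sum(c.answer_respected_evidence for c in cases) / n,
        "verification_success":
            sum(c.user_could_verify for c in cases) / n,
    }


cases = [EvalCase("reset limits?", {"kb/limits"}, {"kb/limits", "blog/old"},
                  answer_respected_evidence=True, user_could_verify=True)]
print(score(cases))
```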

Judgment

Judgment calls and pitfalls

The places where execution actually diverges: decisions that need thought, questions worth asking, and mistakes that recur regardless of good intent.

Decision points

Moments where judgment and trade-offs matter more than procedure.

  • What sources are truly authoritative for this task?
  • How much freshness matters to the answer?
  • When should the system cite, abstain, or narrow the claim?
  • What retrieval failure matters most for users: missing key context, wrong source ranking, or stale truth?

Questions worth asking

Prompts to use on yourself, the team, or an AI assistant while running the procedure.

  • What counts as an authoritative source for this task?
  • Would a human trust the same sources the RAG system is citing?
  • When should the assistant refuse to answer because evidence is weak or conflicting?

Common mistakes

Patterns that surface across teams running this playbook.

  • Indexing everything because more data feels safer
  • Confusing accessible sources with authoritative sources
  • Treating citation presence as proof of grounding
  • Evaluating only answer fluency rather than evidence quality

Warning signs you are doing it wrong

Signals that the playbook is being executed but not landing.

  • Users say the answer cited sources but still felt untrustworthy
  • The system retrieves documents humans would never rely on directly
  • Evaluation looks good on retrieval recall but fails in real workflow use
  • The model sounds more certain than the source support allows

Outcomes

Outcomes and signals

What should exist after the playbook runs, how you'll know it worked, and what to watch for over time.

Artifacts to produce

Durable outputs the playbook should leave behind.

  • Grounding definition
  • Source trust model
  • Curated corpus
  • Retrieval design
  • Answer policy
  • RAG evaluation pack

Success signals

Observable changes that mean the playbook landed.

  • Users can trace answers back to trusted evidence
  • Low-trust source leakage declines
  • Real task usefulness improves without fake confidence increasing
  • Source and answer failures become easier to diagnose separately

Follow-up actions

Moves that keep the playbook's effects compounding after it finishes.

  • Review source trust and freshness continuously
  • Add task-specific retrieval cases as real user needs evolve
  • Connect repeated retrieval failures to corpus and ranking redesign

Metrics or signals to watch

Longer-horizon indicators that the underlying problem is receding. A sketch of computing one of these from answer logs follows the list.

  • High-trust source retrieval rate
  • Answer-evidence alignment rate
  • User verification success rate
  • Low-trust citation frequency
  • Freshness failure incidents
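
As one example of operationalizing these signals, here is a sketch of computing low-trust citation frequency from answer logs. It assumes each log row lists the trust tier of every source the answer cited; the log shape and tier name are assumptions.

```python
# Sketch: fraction of answers that cited at least one low-trust source.
def low_trust_citation_frequency(logs: list[list[str]]) -> float:
    flagged = sum(1 for cited_tiers in logs if "informal" in cited_tiers)
    return flagged / max(len(logs), 1)


logs = [["authoritative"], ["informal", "reviewed"], ["reviewed"]]
print(low_trust_citation_frequency(logs))  # 0.333...
```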

AI impact

AI effects on this playbook

How AI-assisted and AI-driven workflows help execution, and the ways they can make it worse.

AI can help with

Where AI tooling genuinely reduces the cost of running this playbook well.

  • Proposing chunking candidates and retrieval structures
  • Summarizing source differences and trust annotations
  • Drafting evaluation sets from real documents and user tasks

AI can make worse by

Distortions AI introduces that make the underlying problem harder to see.

  • Sounding more grounded than the sources justify
  • Masking weak retrieval with articulate synthesis
  • Encouraging broader corpus inclusion without trust discipline

Relationships

Connected playbooks

Failure modes this playbook tends to address, decisions behind the situation, red flags that motivate running it, and neighboring playbooks.