The Hard Parts.dev

RAG vs Fine-Tuning

Usually a knowledge-grounding vs behavior-shaping decision.

Severity if wrong
high
Frequency
increasing
Audiences
AI engineers · ML engineers · AI product teams · architects
Reversibility
moderate
Confidence
high
At a glance · TD-33
Really about
Where truth lives, what needs to change, and whether the core problem is missing knowledge or missing learned behavior.
Not actually about
Which technique is more advanced.
Why it feels hard
Both are often discussed as generic upgrades, but they solve different classes of problems.

The decision

Should this AI capability rely on retrieval-augmented generation or model fine-tuning?

Usually a knowledge-grounding vs behavior-shaping decision.

Default stance

Where to start before any evidence arrives.

Prefer RAG for changing knowledge and grounding; use fine-tuning when the problem is behavior, not missing context.

Options on the table

Two poles of the trade-off

Neither is the right answer by default. Each option's conditions, strengths, costs, hidden costs, and failure modes when misused are laid out in parallel so you can read across facets.

Option A

RAG

Best when

Conditions where this option is a natural fit.

  • knowledge changes frequently
  • source grounding matters
  • answers should reference trusted content
  • behavior is acceptable but context is missing

Real-world fits

Concrete environments where this option has worked.

  • knowledge assistants
  • internal enterprise search and answer surfaces
  • support systems grounded in living documentation

Strengths

What this option does well on its own terms.

  • freshness of knowledge
  • better grounding
  • less need to retrain for content change

Costs

What you accept up front to get those strengths.

  • retrieval quality dependency
  • citation and grounding complexity
  • context assembly overhead
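
Context assembly is where much of that overhead lives. A minimal sketch, assuming a toy corpus and a crude lexical-overlap retriever standing in for a real embedding index; the character budget stands in for a token limit:

```python
from collections import Counter

def score(query, doc):
    """Crude lexical overlap; a real system would use embeddings."""
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def assemble_context(query, corpus, k=2, budget=200):
    """Rank documents and pack the top-k that fit the budget into one context block."""
    ranked = sorted(corpus, key=lambda d: score(query, d), reverse=True)
    context, used = [], 0
    for doc in ranked[:k]:
        if used + len(doc) > budget:
            break
        context.append(doc)
        used += len(doc)
    return "\n---\n".join(context)

corpus = [
    "The refund window is 30 days from delivery.",
    "Shipping to EU countries takes 5-7 business days.",
    "Refunds are issued to the original payment method.",
]
print(assemble_context("how long is the refund window", corpus))
```

Every cost in the list above shows up here: a bad `score` function surfaces the wrong documents, and the budget forces trade-offs about what the model never sees.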

Hidden costs

Costs that surface later than expected — the main thing novices miss.

  • weak corpus quality poisons the whole system
  • retrieval confidence can look better than truth quality

Failure modes when misused

How this option breaks when applied to the wrong context.

  • Leads to RAG without ground truth: retrieval that confidently cites an untrusted corpus.

Option B

Fine-Tuning

Best when

Conditions where this option is a natural fit.

  • behavior style or patterning must change
  • task shape is stable
  • domain behavior matters more than changing knowledge

Real-world fits

Concrete environments where this option has worked.

  • stable classification or transformation tasks
  • style and format specialization
  • well-bounded domain behavior adaptation

Strengths

What this option does well on its own terms.

  • behavior adaptation
  • potential task specialization
  • less runtime retrieval complexity

Costs

What you accept up front to get those strengths.

  • training effort
  • evaluation burden
  • knowledge freshness is not automatic
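
Much of the training effort is in curating supervised examples of the desired behavior. A minimal sketch of data preparation, using the widely used chat-messages JSONL shape (the ticket examples and the schema choice are illustrative assumptions, not a specific vendor's requirement):

```python
import json

# Hypothetical examples pairing inputs with the desired *behavior*
# (label format), not new facts -- fine-tuning shapes patterns.
examples = [
    {"ticket": "App crashes on login", "label": "bug"},
    {"ticket": "Please add dark mode", "label": "feature-request"},
]

def to_record(ex):
    """One supervised example in chat-messages form."""
    return {
        "messages": [
            {"role": "system", "content": "Classify the support ticket."},
            {"role": "user", "content": ex["ticket"]},
            {"role": "assistant", "content": ex["label"]},
        ]
    }

# One JSON object per line: the JSONL file a training job consumes.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(to_record(ex)) + "\n")
```

Note that nothing in these records updates what the model knows; refreshing knowledge means regenerating the dataset and retraining, which is the freshness cost named above.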

Hidden costs

Costs that surface later than expected — the main thing novices miss.

  • teams may try to tune behavior to compensate for bad data or weak grounding
  • evaluation complexity rises quickly

Failure modes when misused

How this option breaks when applied to the wrong context.

  • Creates specialized behavior without trustworthy truth sources or robust evaluation.

Cost, time, and reversibility

Who pays, how it ages, and what undoing it costs

Trade-offs are rarely zero-sum and rarely static. Someone pays, the payoff curve shifts with the horizon, and the decision has an undo cost.

Cost bearer

Option A · RAG

Who absorbs the cost

  • Retrieval and data quality owners

Option B · Fine-Tuning

Who absorbs the cost

  • ML and evaluation teams
  • Product teams if tuning cycles are slow
Time horizon

Option A · RAG

Wins when source truth keeps changing and grounding matters.

Option B · Fine-Tuning

Wins when task behavior is stable enough to justify training investment.

Reversibility

What undoing costs

Moderate

What should force a re-look

Trigger conditions that mean the answer may have changed.

  • Knowledge volatility changes
  • Task shape stabilizes
  • Evaluation maturity improves

How to decide

The work you still have to do

The reference can frame the trade-off; only you can weight the factors against your context.

Questions to ask

Open these in the room. Answering them is most of the decision.

  • Is the problem missing knowledge or missing behavior?
  • How often does the source truth change?
  • Do we need citations and source trust?
  • Can we evaluate behavior quality well enough to justify tuning?

Key factors

The variables that actually move the answer.

  • Knowledge volatility
  • Grounding requirements
  • Behavior specialization needs
  • Evaluation maturity
  • Source trust
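
These factors can be tallied into a rough lean. A sketch only: the factor encoding and weights are illustrative assumptions, not a calibrated model, and a tie should send you back to evidence-gathering rather than a coin flip:

```python
def lean(factors):
    """Tally key factors into a rough lean toward RAG or fine-tuning.
    Weights are illustrative, not calibrated -- adjust for your context."""
    rag, ft = 0, 0
    if factors["knowledge_volatility"] == "high":
        rag += 2
    if factors["needs_citations"]:
        rag += 1
    if factors["behavior_gap"] == "large":
        ft += 2
    if factors["eval_maturity"] == "high":
        ft += 1  # tuning is only safe with strong evaluation
    if rag == ft:
        return "unclear -- gather more evidence"
    return "RAG" if rag > ft else "fine-tuning"

print(lean({
    "knowledge_volatility": "high",
    "needs_citations": True,
    "behavior_gap": "small",
    "eval_maturity": "low",
}))
```

The useful part is not the score but the forcing function: it makes the room state each factor explicitly before arguing about the answer.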

Evidence needed

What to gather before committing. Not after.

  • Knowledge volatility assessment
  • Source trust model
  • Task evaluation framework
  • Behavior gap analysis
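
A knowledge volatility assessment can start as something very simple: the share of corpus documents touched within a recent window. A sketch, assuming you can get a last-updated date per document; the 90-day window is an arbitrary example:

```python
from datetime import date, timedelta

def volatility(last_updated, window_days=90, today=None):
    """Fraction of documents updated within the window.
    A crude proxy: high churn favors RAG over periodic retraining."""
    today = today or date.today()
    cutoff = today - timedelta(days=window_days)
    recent = sum(1 for d in last_updated if d >= cutoff)
    return recent / len(last_updated)

updates = [date(2024, 1, 5), date(2024, 3, 1), date(2023, 6, 10), date(2024, 2, 20)]
print(volatility(updates, window_days=90, today=date(2024, 3, 15)))
```

If most of the corpus churns inside one retraining cycle, fine-tuned knowledge is stale by the time it ships.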

Signals from the ground

What's usually pushing the call, and what should push it instead

On the left, pressures to recognize and discount. On the right, signals that genuinely point toward one option or the other.

What's usually pushing the call

Pressures to recognize and discount.

Common bad reasons

Reasoning that feels convincing in the moment but doesn't hold up.

  • Fine-tuning is more advanced
  • RAG is enough for everything
  • One approach should solve both knowledge and behavior problems

Anti-patterns

Shapes of reasoning to recognize and set aside.

  • Using fine-tuning to hide bad grounding
  • Using RAG to solve a behavior problem

What should push the call

Concrete signals that genuinely point to one pole.

For · RAG

Observations that genuinely point to Option A.

  • Trusted corpus exists
  • Freshness matters
  • Citability matters

For · Fine-Tuning

Observations that genuinely point to Option B.

  • Behavior shift matters more than knowledge freshness
  • Task is stable and measurable

AI impact

How AI bends this decision

Where AI accelerates the call, where it introduces new distortions, and anything else worth knowing.

AI can help with

Where AI genuinely reduces the cost of making the call.

  • AI can help synthesize eval sets and compare failure patterns across both approaches.

AI can make worse

Distortions AI introduces that didn't exist before.

  • This is an AI-native decision; hype around both options distorts judgment fast.

Relationships

Connected decisions

Nearby decisions this is sometimes confused with, adjacent decisions that are often entangled with this one, related failure modes, red flags, and playbooks to reach for.

Easy to confuse with

Nearby decisions and how this one differs.

  • That decision is about where the model runs. This one is about how to shape the knowledge the model brings to the task.

  • That decision is about how to verify the answer. This one is the architectural choice whose quality that evaluation is trying to measure.

  • Adjacent concept · A prompt-engineering decision

    Prompt engineering is how you ask a model for an answer. RAG vs fine-tuning is how the relevant knowledge gets to the model in the first place.