In the naïve RAG loop, which step is responsible for actually placing retrieved text where the LLM can see it?

Augment — inserting the retrieved chunks into the prompt before generation.

Embed — converting the query into a vector.

Top-k retrieval — ranking chunks by similarity.

Generate — the LLM producing the final text.

An answer is fluent and wrong. Retrieval logs show the correct chunk *was* retrieved. Where is the bug?

In generation/augmentation — the model ignored or wasn't constrained to the context; tighten the grounding instruction.

In retrieval — increase top_k.

The naïve RAG loop

Overview

Query → embed → top-k → stuff into prompt → generate — the five-step pattern behind every RAG tutorial.

Why it matters

Strip away the tooling and every basic RAG system runs the same five-step loop:

Query — the user asks a question in natural language.
Embed — the query is converted into the same vector space as the stored chunks, using the same embedding model that indexed the documents (mixing models here silently breaks retrieval).
Top-k retrieval — the vector store returns the k chunks whose vectors are closest to the query vector (k is often 3-10; too small and you miss context, too large and you drown the LLM in noise and cost).
Augment — this is the step people underestimate: the retrieved chunks are literally pasted into the prompt, usually with instructions like 'answer only using the context below.' The LLM never touches your database directly — it only ever sees whatever text got stuffed into this prompt.
Generate — the LLM produces an answer conditioned on that augmented prompt.
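The five steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a placeholder for a real LLM call, and every function name and sample document here is hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=3):
    """Step 3: return the k chunks whose vectors are closest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["chunk"] for item in ranked[:k]]

def augment(question, chunks):
    """Step 4: paste the retrieved chunks into the prompt."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return f"Answer using ONLY the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

def generate(prompt):
    """Step 5 placeholder: a real system would call an LLM here."""
    return f"<LLM answer conditioned on {len(prompt)} prompt characters>"

def naive_rag(question, index, k=3):
    query_vec = embed(question)             # step 2: same embed() that built the index
    chunks = retrieve(query_vec, index, k)  # step 3: single, flat, one-shot search
    prompt = augment(question, chunks)      # step 4
    return generate(prompt), chunks, prompt

docs = [
    {"id": "c1", "text": "Refunds are issued within 14 days of purchase."},
    {"id": "c2", "text": "Shipping takes 3 to 5 business days."},
    {"id": "c3", "text": "Support is available by email on weekdays."},
]
index = [{"chunk": d, "vec": embed(d["text"])} for d in docs]
answer, chunks, prompt = naive_rag("How many days do refunds take?", index, k=2)
```

Note that `generate` only ever receives `prompt`: nothing in the loop gives the model access to `index`, which is the point made in the augment step.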

This loop is called naïve precisely because retrieval is a single, flat, one-shot vector search: there's no re-ranking, no query rewriting, no traversal of relationships between chunks, and no verification that the top-k chunks actually contain the answer. Every advanced pattern you'll see later — hybrid retrieval, re-ranking, GraphRAG — is a targeted upgrade to exactly one of these five steps, most often step 3.

How it actually works

Naïve RAG is the five-step loop above: take the query → embed it → retrieve top-k → augment the prompt → generate. The step most people under-design is augment: how retrieved text is placed into the prompt often matters more for grounding than the choice of model. A minimal prompt builder looks like this:

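def build_prompt(question, chunks):
    # chunks: list of dicts with 'id' and 'text' keys, as returned by the retriever.
    # Prefix each chunk with its id so the model can cite it and answers stay traceable.
    context = '\n'.join(f"[{c['id']}] {c['text']}" for c in chunks)
    # Grounding instruction and refusal clause first, then the context, then the question.
    return (
        'Answer using ONLY the context below. If the answer is not in the '
        'context, say you do not know.\n\n'
        f'Context:\n{context}\n\nQuestion: {question}'
    )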

Two contracts, not one. Retrieval and generation are separate subsystems with separate failure modes and separate evaluation. Retrieval can be perfect while generation hallucinates; generation can be faithful while retrieval missed the key chunk. Treat them as two boxes you can test and roll back independently.
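One way to test the two boxes independently is to score retrieval on whether the known-good chunk shows up in the top-k, and to score generation with the known-good chunk handed to it directly. The sketch below assumes a small hand-labelled set of cases; the field names (`gold_chunk_id`, `gold_chunk`, `expected`), the stub functions, and the substring match used as a correctness proxy are all illustrative assumptions, not a standard evaluation API.

```python
def retrieval_hit_rate(cases, retrieve_fn, k=5):
    """Fraction of cases where the gold chunk id appears in the top-k results."""
    hits = 0
    for case in cases:
        retrieved_ids = [c["id"] for c in retrieve_fn(case["question"], k)]
        hits += case["gold_chunk_id"] in retrieved_ids
    return hits / len(cases) if cases else 0.0

def generation_accuracy(cases, generate_fn):
    """Fraction of cases answered correctly when the gold chunk is supplied directly.

    Substring matching is a crude proxy for correctness, used here only for brevity.
    """
    correct = 0
    for case in cases:
        answer = generate_fn(case["question"], [case["gold_chunk"]])
        correct += case["expected"].lower() in answer.lower()
    return correct / len(cases) if cases else 0.0

cases = [
    {"question": "refund window?", "gold_chunk_id": "c1",
     "gold_chunk": {"id": "c1", "text": "Refunds are issued within 14 days."},
     "expected": "14 days"},
    {"question": "shipping time?", "gold_chunk_id": "c2",
     "gold_chunk": {"id": "c2", "text": "Shipping takes 3 to 5 business days."},
     "expected": "3 to 5"},
]

def fake_retrieve(question, k):
    # Hypothetical retriever that always returns c1, so it misses the shipping case.
    return [{"id": "c1", "text": "Refunds are issued within 14 days."}][:k]

def fake_generate(question, chunks):
    # Hypothetical generator that simply quotes the first chunk it is given.
    return chunks[0]["text"]

retrieval_score = retrieval_hit_rate(cases, fake_retrieve, k=3)
generation_score = generation_accuracy(cases, fake_generate)
```

With these stubs the retrieval score drops while the generation score stays perfect, which is the kind of split an end-to-end metric would hide.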

The grounding instruction is load-bearing. 'Answer using only this context; if it's absent, say you don't know' steers the model away from confident fabrication and toward an honest 'I don't have that', although it reduces hallucination rather than eliminating it. Pair it with stable chunk ids in the prompt so the model can cite, and so you can trace which retrieved chunk produced which sentence.
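If the prompt tags chunks as `[id]` and the model is asked to cite in the same form, tracing can be a simple post-processing pass. The sketch below assumes that bracketed-id citation format and uses a naive sentence splitter; `trace_citations` and the sample answer are hypothetical.

```python
import re

# Matches a bracketed id such as [c1]; assumes ids contain no square brackets.
CITATION = re.compile(r"\[([^\[\]]+)\]")

def trace_citations(answer, chunks):
    """Map each sentence of the answer to the chunk ids it cites."""
    known_ids = {c["id"] for c in chunks}
    report = []
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = CITATION.findall(sentence)
        report.append({
            "sentence": sentence,
            "cited": [i for i in cited if i in known_ids],        # ids that were retrieved
            "unknown": [i for i in cited if i not in known_ids],  # ids that were not in the prompt
        })
    return report

chunks = [
    {"id": "c1", "text": "Refunds are issued within 14 days of purchase."},
    {"id": "c2", "text": "Shipping takes 3 to 5 business days."},
]
answer = "Refunds are issued within 14 days [c1]. Shipping is free [c9]. Contact support anytime."
report = trace_citations(answer, chunks)
```

Sentences with an empty `cited` list, or with entries in `unknown`, are the ones worth reviewing first, since nothing retrieved backs them.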

Knowing this loop cold is what lets you see exactly where re-ranking, query rewriting, GraphRAG and guardrails plug in: each is a targeted upgrade to one of these five steps.

Analogy

Retrieve-then-generate is an open-book exam. Retrieval is choosing which pages to keep open; generation is writing the answer. A brilliant student (the LLM) still fails if you opened the wrong pages — and a careless one invents quotes even with the right pages open. You grade the page-picking and the writing separately.

Pitfalls & how to avoid them

No 'only use this context' instruction. Symptom: confident hallucination. Fix: explicit grounding + refusal clause.
Dumping chunks with no ids. Symptom: uncitable answers. Fix: tag each chunk so the model can reference it.
Evaluating end-to-end only. Symptom: you can't tell if retrieval or generation failed. Fix: measure each stage.
Stuffing 30 chunks 'to be safe'. Symptom: lost-in-the-middle, higher cost. Fix: fewer, better, re-ranked chunks.
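For the last pitfall, a simple guard is to cap what reaches the prompt. The sketch below assumes each candidate already carries a relevance `score` from the retriever or a re-ranker; it only enforces a chunk count, a character budget, and a minimum score, and does not itself re-rank anything. The function name, parameters, and default values are illustrative.

```python
def select_chunks(scored_chunks, max_chunks=5, max_chars=2000, min_score=0.0):
    """Keep the highest-scoring chunks that fit a chunk-count and character budget."""
    selected, used = [], 0
    for chunk in sorted(scored_chunks, key=lambda c: c["score"], reverse=True):
        # Stop once the count limit is hit or the remaining chunks score too low.
        if len(selected) >= max_chunks or chunk["score"] < min_score:
            break
        # Skip a chunk that would overflow the budget; a smaller one may still fit.
        if used + len(chunk["text"]) > max_chars:
            continue
        selected.append(chunk)
        used += len(chunk["text"])
    return selected

candidates = [
    {"id": "a", "score": 0.9, "text": "x" * 800},
    {"id": "b", "score": 0.8, "text": "x" * 1500},
    {"id": "c", "score": 0.7, "text": "x" * 900},
    {"id": "d", "score": 0.1, "text": "x" * 100},
]
kept = select_chunks(candidates, max_chunks=3, max_chars=2000, min_score=0.3)
kept_ids = [c["id"] for c in kept]
```

A character budget is used here for simplicity; a token budget measured with the model's own tokenizer would be a closer fit to real context limits.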

Apply it to your system

Look at your own prompt-assembly code.

Does your prompt explicitly forbid answering when the context lacks the answer?
Can you trace each answer sentence back to a specific retrieved chunk id?
Could you roll back a generation change without touching retrieval, and vice versa?
