What does recall@k measure?

The fraction of all relevant documents that appear in the top-k results.

The fraction of the top-k results that are relevant.

The fraction of all relevant documents that appear in the top-k results.

The cosine similarity of the top-1 result.

The latency of the retrieval call.

Retrieval metrics — recall@k, MRR, nDCG — Semantic Web Academy

Overview

Evaluating upstream vector search performance with classical information retrieval (IR) metrics.

Why it matters

Optimizing a RAG system requires isolating retrieval mechanics from generation behavior. Three metrics give you reproducible baselines: recall@k gauges how many of the relevant chunks are retrieved, Mean Reciprocal Rank (MRR) rewards placing the first relevant chunk near the top, and nDCG scores a ranking against graded, multi-level relevance. With those baselines you can benchmark chunking strategies against each other.

How it actually works

Retrieval metrics grade the search stage in isolation — before generation can mask or compound its errors. They are computed against explicit relevance labels (which docs should be retrieved for a query), not against similarity scores.

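relevant = {"doc-2", "doc-5", "doc-8"}
ranked = ["doc-7", "doc-2", "doc-9", "doc-5", "doc-1"]
# recall@3 = |top-3 ∩ relevant| / |relevant|: coverage of the right docs (here 1/3)
recall_at_3 = len(set(ranked[:3]) & relevant) / len(relevant)
# Reciprocal rank = 1 / rank of the first relevant doc (here 1/2); MRR is its mean over all queries
first_hit = next(i for i, doc in enumerate(ranked, start=1) if doc in relevant)
reciprocal_rank = 1 / first_hit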
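
The single-query example above extends naturally to a whole evaluation set. The following is a minimal sketch, not a reference implementation: the function names, the second query `q2`, and the choice to score queries with no labelled docs as 0 are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0  # assumption: a query with no labelled docs scores 0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]],
                         labels: Dict[str, Set[str]]) -> float:
    """Average the per-query reciprocal rank over every labelled query."""
    if not labels:
        return 0.0
    total = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in labels.items())
    return total / len(labels)


# Hypothetical labels and retrieval runs; q1 reuses the example above.
labels = {
    "q1": {"doc-2", "doc-5", "doc-8"},
    "q2": {"doc-3"},
}
runs = {
    "q1": ["doc-7", "doc-2", "doc-9", "doc-5", "doc-1"],
    "q2": ["doc-3", "doc-4", "doc-6"],
}

recall_q1 = recall_at_k(runs["q1"], labels["q1"], k=3)  # 1 of 3 relevant docs in the top 3
mrr = mean_reciprocal_rank(runs, labels)                # (1/2 + 1/1) / 2
print(recall_q1, mrr)
```

Note that both functions read only the labels and the ranked IDs; similarity scores never enter the calculation.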

Metric	Question it answers	Watch when
recall@k	Did the right docs make the top-k at all?	The answer needs all the evidence
MRR	How early did the first relevant doc appear?	One good hit is enough
nDCG	Are highly-relevant docs ranked above marginal ones?	Relevance is graded, not binary
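
nDCG is the one metric in the table that needs graded labels, so a small worked sketch may help. It assumes a linear gain (the grade itself) and a log2 rank discount; some implementations use an exponential gain of 2^grade - 1 instead, so check which variant your tooling reports. The grades below are hypothetical.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked: List[str], grades: Dict[str, float], k: int) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ordering."""
    gains = [grades.get(doc_id, 0.0) for doc_id in ranked[:k]]  # unlabelled docs gain 0
    ideal_dcg = dcg(sorted(grades.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded labels: 3 = highly relevant, 1 = marginal.
grades = {"doc-2": 3.0, "doc-8": 2.0, "doc-5": 1.0}
ranked = ["doc-7", "doc-2", "doc-9", "doc-5", "doc-1"]

observed = ndcg_at_k(ranked, grades, k=5)
perfect = ndcg_at_k(["doc-2", "doc-8", "doc-5"], grades, k=3)
print(observed, perfect)
```

A ranking that places docs in descending grade order scores 1.0; the observed ranking scores lower because it leads with an unlabelled doc and misses `doc-8` entirely.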

Recall versus precision is a real trade-off. Raising k can only hold or raise recall, but it usually lowers precision and feeds the generator more distractors (hurting faithfulness downstream). The right k is the one that maximises answer quality, found by sweeping k against your golden set.
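
A sweep over k might be sketched as follows. The `toy_quality` function is a stand-in assumption: in a real pipeline it would be replaced by whatever answer-level score you compute on the golden set. The labels, the run, and the 0.5 distractor penalty are hypothetical.

```python
from typing import Callable, Dict, List, Set

# Hypothetical golden set with one query and a six-result ranking.
labels: Dict[str, Set[str]] = {"q1": {"doc-2", "doc-5", "doc-8"}}
runs: Dict[str, List[str]] = {
    "q1": ["doc-2", "doc-5", "doc-7", "doc-9", "doc-1", "doc-8"],
}


def toy_quality(query_id: str, context: List[str]) -> float:
    """Placeholder for an answer-quality score: rewards hits, penalises distractors."""
    hits = sum(1 for doc_id in context if doc_id in labels[query_id])
    return hits - 0.5 * (len(context) - hits)


def sweep_k(ks: List[int], quality: Callable[[str, List[str]], float]) -> List[dict]:
    """Report mean recall@k, precision@k and answer quality for each k."""
    rows = []
    for k in ks:
        recall = precision = score = 0.0
        for query_id, relevant in labels.items():
            top_k = runs[query_id][:k]
            hits = len(set(top_k) & relevant)
            recall += hits / len(relevant)
            precision += hits / k  # standard precision@k divides by k
            score += quality(query_id, top_k)
        n = len(labels)
        rows.append({"k": k, "recall": recall / n,
                     "precision": precision / n, "quality": score / n})
    return rows


rows = sweep_k([2, 4, 6], toy_quality)
best = max(rows, key=lambda row: row["quality"])
for row in rows:
    print(row)
print("best k:", best["k"])
```

In this toy data recall only reaches 1.0 at k=6, yet the quality score peaks at k=2, which is the pattern the sweep is meant to expose.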

Treat relevance labels as versioned test assets. Stale or ambiguous labels make the metric lie. Version them with the same rigor as code, because a metric computed on drifting labels is not reproducible — and a non-reproducible metric can't gate a release.
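
One lightweight way to make label drift visible is to store a content fingerprint of the label set next to every metric you report. This is a sketch under assumptions: the function name and the canonical-JSON scheme are illustrative choices, and keeping the label file in version control serves the same purpose.

```python
import hashlib
import json
from typing import Dict, Set


def label_fingerprint(labels: Dict[str, Set[str]]) -> str:
    """SHA-256 over a canonical JSON form, so ordering never changes the hash."""
    canonical = json.dumps(
        {query_id: sorted(docs) for query_id, docs in labels.items()},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical label sets: v1 and the reordered copy are the same labels.
labels_v1 = {"q1": {"doc-2", "doc-5", "doc-8"}, "q2": {"doc-3"}}
labels_v1_reordered = {"q2": {"doc-3"}, "q1": {"doc-8", "doc-2", "doc-5"}}
labels_v2 = {"q1": {"doc-2", "doc-5"}, "q2": {"doc-3"}}  # doc-8 was dropped

fp_v1 = label_fingerprint(labels_v1)
fp_v1_reordered = label_fingerprint(labels_v1_reordered)
fp_v2 = label_fingerprint(labels_v2)
print(fp_v1 == fp_v1_reordered, fp_v1 == fp_v2)
```

Two metric values are comparable only when their fingerprints match; a mismatch means the labels changed between runs, not necessarily the retriever.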

Analogy

Retrieval metrics grade a search party's map-reading, not its first aid. You measure whether the party reached the right locations (recall) and how quickly it reached the first one (MRR), separately from whether it treated the patient well (generation). Mix the two and you can't tell which skill failed.

Pitfalls & how to avoid them

Confusing similarity score with relevance. Fix: label relevance explicitly; compute metrics on labels.
Optimising recall blindly. Symptom: more distractors, worse answers. Fix: sweep k against answer quality.
Stale labels. Symptom: metric lies. Fix: version relevance sets like code.
One metric only. Fix: pair recall@k with MRR/nDCG for the full picture.

Apply it to your system

Look at your evaluation set.

› Are your relevance labels versioned, or do they drift silently?
› Which query segment has the worst recall@k, and why?
› How will you detect when an embedding change quietly hurts ranking?
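
The last two questions can be approached with a per-segment breakdown compared across two retrieval runs. The sketch below assumes each query carries a segment tag and that a drop larger than a chosen tolerance counts as a regression; the segment names, runs, and the 0.02 tolerance are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def recall_by_segment(runs: Dict[str, List[str]], labels: Dict[str, Set[str]],
                      segments: Dict[str, str], k: int) -> Dict[str, float]:
    """Mean recall@k for each query segment."""
    per_segment = defaultdict(list)
    for query_id, relevant in labels.items():
        hits = len(set(runs[query_id][:k]) & relevant)
        per_segment[segments[query_id]].append(hits / len(relevant))
    return {seg: sum(vals) / len(vals) for seg, vals in per_segment.items()}


def regressed_segments(baseline: Dict[str, float], candidate: Dict[str, float],
                       tolerance: float = 0.02) -> List[str]:
    """Segments whose recall fell by more than the tolerance."""
    return sorted(seg for seg, base in baseline.items()
                  if candidate.get(seg, 0.0) < base - tolerance)


# Hypothetical evaluation set, tagged by segment.
labels = {"q1": {"doc-1"}, "q2": {"doc-2", "doc-3"}}
segments = {"q1": "faq", "q2": "troubleshooting"}
old_runs = {"q1": ["doc-1", "doc-9"], "q2": ["doc-2", "doc-3", "doc-8"]}
new_runs = {"q1": ["doc-1", "doc-9"], "q2": ["doc-8", "doc-2", "doc-7"]}  # after an embedding change

baseline = recall_by_segment(old_runs, labels, segments, k=2)
candidate = recall_by_segment(new_runs, labels, segments, k=2)
flagged = regressed_segments(baseline, candidate)
print(baseline, candidate, flagged)
```

Run against the same versioned labels before and after an embedding change, a check like this surfaces a regression in one segment even when the overall average barely moves.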
