Overview
Hand-curated Q&A pairs + LLM-generated test sets — and how to tell them apart.
Why it matters
A golden set is the ground truth your CI runs against. Build it once, version it like code, and every model change becomes measurable.
Where this sits in the stack
Golden datasets and synthetic eval is one of the load-bearing decisions in a KG/RAG/agent system: choices made here propagate to retrieval quality, agent reliability, cost per query, and the on-call burden of whoever ships it. Teams that name this trade-off explicitly ship faster than teams that leave it implicit.