Why pair human-curated golden examples with LLM-generated synthetic ones?

Humans give precision on tricky cases; synthetics give coverage cheaply.

Synthetic examples are always more reliable.

Human examples are required by the LangGraph spec.

Synthetic examples don't need labels.

Golden datasets and synthetic eval — Semantic Web Academy

Overview

Curating immutable evaluation baselines and scaling verification via programmatic test generation.

Why it matters

A reliable evaluation suite depends on rigorous data. We balance manually curated, edge-case-heavy 'golden datasets' built by domain experts with high-volume, synthetically generated question-context pairs (e.g., via 'evolutions' that rewrite seed questions into harder variants) to maintain cheap, broad coverage across large document corpora.

How it actually works

A golden set is your versioned, reproducible test suite for retrieval+generation. The best items specify more than the expected answer text — they pin the evidence that must be retrieved.

{
  "id": "refund-enterprise-001",
  "question": "Can an enterprise customer get a $12k refund?",
  "must_retrieve": ["refund-policy", "enterprise-exception"],
  "expected_answer_facts": ["30-day window", "over $10k needs legal approval"],
  "forbidden_claims": ["all refunds are automatic"]
}

Why must_retrieve matters. An item that only checks the answer text can pass by luck even when retrieval silently regressed — the model guessed right this time. Pinning the required evidence ids turns a flaky end-to-end check into a precise one that catches retrieval drift before it becomes a wrong answer.
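To make the difference concrete, here is a minimal sketch of a checker for one golden item. The names check_item and ItemResult are hypothetical, and the case-insensitive substring matching is an assumption made for brevity; a real suite might use an LLM judge or fuzzy matching for the fact checks.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ItemResult:
    missing_evidence: list[str]  # must_retrieve ids the retriever did not return
    missing_facts: list[str]     # expected facts absent from the answer
    forbidden_hits: list[str]    # forbidden claims found in the answer

    @property
    def passed(self) -> bool:
        return not (self.missing_evidence or self.missing_facts or self.forbidden_hits)


def check_item(item: dict, retrieved_ids: list[str], answer: str) -> ItemResult:
    """Score one golden item against a single pipeline run (hypothetical helper)."""
    retrieved = set(retrieved_ids)
    text = answer.lower()  # assumption: naive case-insensitive substring matching
    return ItemResult(
        missing_evidence=[d for d in item.get("must_retrieve", []) if d not in retrieved],
        missing_facts=[f for f in item.get("expected_answer_facts", []) if f.lower() not in text],
        forbidden_hits=[c for c in item.get("forbidden_claims", []) if c.lower() in text],
    )


item = {
    "id": "refund-enterprise-001",
    "question": "Can an enterprise customer get a $12k refund?",
    "must_retrieve": ["refund-policy", "enterprise-exception"],
    "expected_answer_facts": ["30-day window", "over $10k needs legal approval"],
    "forbidden_claims": ["all refunds are automatic"],
}

# Retrieval silently dropped one document, yet the answer text still looks right.
result = check_item(
    item,
    retrieved_ids=["refund-policy"],
    answer="Refunds are allowed within a 30-day window; over $10k needs legal approval.",
)
print(result.passed, result.missing_evidence)
```

An answer-only check would have passed this run. With the evidence pinned, the item fails and names the document that went missing.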

Blend human + synthetic. Hand-curated items (built by domain experts) cover the tricky edge cases that matter; LLM-generated synthetic items give cheap breadth across the corpus. Humans buy precision on the cases you fear; synthesis buys coverage you couldn't afford manually.
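One possible way to blend the two sources is sketched below. The helpers blend and pass_rate_by_source and the source field are assumptions introduced for illustration, not part of any library. The sketch lets a human-curated item win when ids collide, and reports pass rates per source so a strong synthetic score cannot hide a weak human-curated one.

```python
from collections import Counter


def blend(human_items: list[dict], synthetic_items: list[dict]) -> list[dict]:
    """Merge both sources; on an id collision the human-curated item wins."""
    merged: dict[str, dict] = {}
    # Synthetic items go in first so that human items overwrite them.
    for source, items in (("synthetic", synthetic_items), ("human", human_items)):
        for item in items:
            merged[item["id"]] = {**item, "source": source}
    return list(merged.values())


def pass_rate_by_source(items: list[dict], passed_ids: set[str]) -> dict[str, float]:
    """Fraction of passing items, reported separately for each source."""
    total, passed = Counter(), Counter()
    for item in items:
        total[item["source"]] += 1
        if item["id"] in passed_ids:
            passed[item["source"]] += 1
    return {source: passed[source] / total[source] for source in total}


human = [{"id": "refund-enterprise-001", "question": "Can an enterprise customer get a $12k refund?"}]
synthetic = [
    {"id": "syn-001", "question": "What is the standard refund window?"},
    {"id": "refund-enterprise-001", "question": "Synthetic duplicate of a curated item"},
]

golden = blend(human, synthetic)
print(pass_rate_by_source(golden, passed_ids={"syn-001"}))
```

Splitting the report by source keeps the two roles visible: the human-curated rate tracks the cases you fear, the synthetic rate tracks breadth.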

Include adversarial 'should refuse' items. A golden set that only contains answerable questions never tests your refusal behaviour. Add questions with insufficient context whose correct answer is 'I don't know' — that's how you keep honesty from regressing.
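A refusal item could be scored along the following lines. The expect_refusal field, the marker phrases, and both helper names are assumptions for this sketch; phrase matching is deliberately crude and a production suite would likely use a classifier or judge model instead.

```python
# Assumption: refusals are detected by simple phrase matching.
REFUSAL_MARKERS = ("i don't know", "i do not know", "not enough information")


def is_refusal(answer: str) -> bool:
    """True if the answer contains one of the refusal phrases."""
    text = answer.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def check_refusal_item(item: dict, answer: str) -> bool:
    """Pass only when the system refuses exactly where the item says it should."""
    return is_refusal(answer) == item.get("expect_refusal", False)


unanswerable = {
    "id": "refund-unknown-region-001",  # hypothetical item
    "question": "What is the refund policy for customers on Mars?",
    "must_retrieve": [],
    "expect_refusal": True,
}

print(check_refusal_item(unanswerable, "I don't know; the documents do not cover that."))
print(check_refusal_item(unanswerable, "Mars customers get refunds within 90 days."))
```

The same check also fails an answerable item that gets a refusal, so over-cautious behaviour shows up alongside hallucinated answers.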

Analogy

A golden set is a flight simulator's scenario library. It's not enough to practise smooth landings — you load engine failures and storms (adversarial cases) on purpose, and you check the pilot used the right instruments (must_retrieve), not just that the plane happened to land.

Pitfalls & how to avoid them

Answer-only items. Symptom: pass by luck despite retrieval drift. Fix: pin must_retrieve evidence.
No adversarial/refusal cases. Symptom: honesty regresses unseen. Fix: add 'should refuse' items.
Unversioned golden set. Symptom: irreproducible runs. Fix: version it like code.
Synthetic-only. Symptom: misses real edge cases. Fix: blend with human-curated items.
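For the versioning pitfall, committing the golden file to version control is the obvious step. A complementary sketch, assuming a hypothetical helper named golden_set_fingerprint, stamps each eval run with a content hash so two reports can be compared only when they used the same items.

```python
import hashlib
import json


def golden_set_fingerprint(items: list[dict]) -> str:
    """Short content hash that ignores item order and key order."""
    canonical = json.dumps(
        sorted(items, key=lambda item: item["id"]),
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


items = [
    {"id": "refund-enterprise-001", "must_retrieve": ["refund-policy", "enterprise-exception"]},
    {"id": "refund-basic-001", "must_retrieve": ["refund-policy"]},
]

# Record this value next to every eval report.
print("golden_set_version:", golden_set_fingerprint(items))
```

Reordering items leaves the fingerprint unchanged, while editing any field changes it, which makes silent edits to the baseline visible in run logs.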

Apply it to your system

Review your current test cases.

›Do your items pin the evidence that must be retrieved, or only the answer?
›How many 'correct answer is: I don't know' cases do you have?
›Where would synthetic generation cheaply expand coverage you lack today?
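The first two questions can be answered mechanically. The audit below is a rough sketch; the audit function and the expect_refusal field are assumptions, and it only inspects item structure, not item quality.

```python
def audit(items: list[dict]) -> dict:
    """Count answer-only items and refusal cases in a golden set."""
    refusal_ids = [i["id"] for i in items if i.get("expect_refusal")]
    # Answer-only: pins no evidence and is not a deliberate refusal case.
    answer_only_ids = [
        i["id"] for i in items
        if not i.get("must_retrieve") and not i.get("expect_refusal")
    ]
    return {
        "total": len(items),
        "answer_only": answer_only_ids,
        "refusal_cases": len(refusal_ids),
    }


golden = [
    {"id": "refund-enterprise-001", "must_retrieve": ["refund-policy", "enterprise-exception"]},
    {"id": "refund-basic-001", "expected_answer_facts": ["30-day window"]},
    {"id": "refund-unknown-region-001", "must_retrieve": [], "expect_refusal": True},
]

print(audit(golden))
```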
