Faithfulness is high when the generated answer can be fully supported by the retrieved context.

Generation metrics — faithfulness, answer relevance, context precision — Semantic Web Academy

Overview

Quantifying downstream LLM output quality and hallucination rates using LLM-as-a-judge frameworks.

Why it matters

Free-text answers rarely match a reference string exactly, so exact-match scoring breaks down and we grade outputs with structured metrics instead. Faithfulness catches claims that contradict or go beyond the retrieved context, answer relevance checks that the answer addresses what the user actually asked, and context precision measures how much of the retrieved context placed in the prompt is noise.

How it actually works

Generation metrics grade the answer, given what was retrieved. Because exact match fails for free text, they rely on structured checks or an LLM-as-a-judge. The two that matter most are faithfulness and context precision.

answer: 'Enterprise refunds over $10k require legal approval.'
context_claims: ['Refunds available for 30 days', 'Refunds over $10k require legal approval']
faithfulness:      every answer claim is supported by context   # catches hallucination
context_precision: 2 chunks retrieved, 1 actually useful         # catches retrieval noise
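
Once every answer claim and every retrieved chunk has a verdict, both scores in the example reduce to simple ratios. The sketch below is a minimal illustration, not a reference implementation: the function names are hypothetical, the verdicts are hand-assigned here where a real pipeline would get them from a judge, and context precision is computed as the plain unranked ratio the example implies rather than any rank-weighted variant.

```python
from typing import Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Fraction of answer claims that were judged supported by the context."""
    if not claim_supported:
        raise ValueError("answer has no claims to check")
    return sum(claim_supported) / len(claim_supported)


def context_precision(chunk_useful: Sequence[bool]) -> float:
    """Fraction of retrieved chunks that were judged useful (unranked ratio)."""
    if not chunk_useful:
        raise ValueError("no chunks were retrieved")
    return sum(chunk_useful) / len(chunk_useful)


# Verdicts for the refund example above (assumed, hand-assigned).
claim_verdicts = [True]         # the single answer claim is supported
chunk_verdicts = [False, True]  # the '30 days' chunk was noise, the '$10k' chunk was used

faith = faithfulness(claim_verdicts)           # 1.0
precision = context_precision(chunk_verdicts)  # 0.5
```

Here the answer is fully faithful, but half of what was retrieved did not contribute, which is exactly the retrieval noise that context precision is meant to surface.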

Measure them together, never alone. Faithfulness without context precision hides a system that retrieves garbage but happens to answer correctly (it'll break the moment retrieval shifts). Context precision without faithfulness hides a system with perfect retrieval that still hallucinates. The pair localises which stage failed.
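
Reading the pair together can be sketched as a small decision function. The name `localise_failure` and both thresholds are illustrative assumptions, not values from any particular framework; tune them against your own data.

```python
def localise_failure(
    faithfulness: float,
    context_precision: float,
    min_faithfulness: float = 0.9,  # assumed threshold, for illustration only
    min_precision: float = 0.5,     # assumed threshold, for illustration only
) -> str:
    """Name the pipeline stage that a pair of scores points at."""
    faithful = faithfulness >= min_faithfulness
    precise = context_precision >= min_precision
    if faithful and precise:
        return "ok"
    if faithful:
        return "retrieval"   # answer held up, but the retrieved context is noisy
    if precise:
        return "generation"  # context was good, yet the answer is unsupported
    return "both"


stage = localise_failure(faithfulness=1.0, context_precision=0.2)  # "retrieval"
```

A system that scores 1.0 on faithfulness but 0.2 on context precision is flagged as a retrieval problem even though its answers currently look fine.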

Track unsupported-claim rate separately from style/relevance scores. Bundling them lets a fluent, well-toned answer mask a factual fabrication — the most dangerous failure to hide. A single rising number ('unsupported claims per 100 answers') is your hallucination smoke alarm.
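
The 'unsupported claims per 100 answers' number can be computed directly from per-claim verdicts, kept apart from any style or relevance score. A minimal sketch, assuming each answer has already been split into claims and each claim carries a boolean verdict (the function name and sample batch are hypothetical):

```python
from typing import Sequence


def unsupported_claims_per_100(answers: Sequence[Sequence[bool]]) -> float:
    """Count unsupported claims across a batch, normalised per 100 answers.

    Each inner sequence holds one answer's claim verdicts (True = supported).
    """
    if not answers:
        raise ValueError("need at least one answer")
    unsupported = sum(1 for verdicts in answers for ok in verdicts if not ok)
    return 100 * unsupported / len(answers)


# Four answers with 3 unsupported claims in total (made-up verdicts).
batch = [[True, True], [True, False], [True], [False, False]]
rate = unsupported_claims_per_100(batch)  # 75.0
```

Because the rate counts claims rather than answers, one answer with several fabrications pushes the number up more than an answer with a single slip.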

LLM-judge caveat. Judges are noisy. Calibrate them against human labels on a sample, and add statistical guardrails so a flaky judge score doesn't trip your release gate on noise alone.
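
One common way to calibrate a judge against human labels is to measure chance-corrected agreement on a labelled sample. The sketch below uses Cohen's kappa for binary supported/unsupported verdicts; the choice of kappa and the sample labels are assumptions made for illustration, and other agreement statistics would serve equally well.

```python
from typing import Sequence


def cohen_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Chance-corrected agreement between judge and human binary labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # Agreement expected if both raters labelled independently at their own base rates.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        return 1.0  # both raters gave the same constant label throughout
    return (observed - expected) / (1 - expected)


# Made-up verdicts on the same eight claims.
judge_labels = [True, True, False, False, True, False, True, True]
human_labels = [True, True, False, True, True, False, False, True]
kappa = cohen_kappa(judge_labels, human_labels)
```

Raw agreement in this sample is 6 of 8, but kappa comes out noticeably lower because both raters say 'supported' most of the time, so some of that agreement would occur by chance.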

Analogy

Generation metrics are two different exam graders: one checks every claim against the open book (faithfulness), the other checks whether the pages you opened were even relevant (context precision). A confident essay full of invented quotes fails the first grader no matter how well it reads.

Pitfalls & how to avoid them

Faithfulness without context precision (or vice-versa). Symptom: a failure in one stage goes unnoticed because the other stage's score looks healthy. Fix: measure both.
Bundling hallucination into a 'quality' score. Fix: track unsupported-claim rate on its own.
Trusting an uncalibrated LLM judge. Fix: calibrate against human labels.
No statistical guardrail. Symptom: flaky scores trip CI. Fix: thresholds with variance bounds.
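
The guardrail in the last pitfall can be sketched as a gate that blocks a release only when the score is below the threshold by more than judge noise explains. This is a minimal sketch under stated assumptions: the judge is run several times on the same evaluation set, a normal-approximation interval is used (rough for small samples), and the function name, threshold, and scores are all hypothetical.

```python
import math
import statistics
from typing import Sequence


def gate_passes(scores: Sequence[float], threshold: float, z: float = 1.96) -> bool:
    """Fail the gate only if even the upper confidence bound is below threshold."""
    if len(scores) < 2:
        raise ValueError("need at least two judge runs to estimate variance")
    mean = statistics.fmean(scores)
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    upper_bound = mean + z * stderr
    return upper_bound >= threshold


# Made-up faithfulness scores from five repeated judge runs.
noisy_run = [0.88, 0.91, 0.86, 0.92, 0.89]     # mean just under 0.90, within noise
regressed_run = [0.70, 0.72, 0.69, 0.71, 0.70]  # clearly below 0.90

noisy_ok = gate_passes(noisy_run, threshold=0.90)
regressed_ok = gate_passes(regressed_run, threshold=0.90)
```

The first run dips slightly under the threshold on average but passes because the dip is within run-to-run variance; the second fails because the drop is far larger than the noise.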

Apply it to your system

Take one answer your judge scored highly.

› Was every claim actually supported, or did fluency mask a gap?
› Do you track unsupported-claim rate separately from relevance/style?
› How calibrated is your LLM judge against real human labels?
