Entity F1 for policy_number is 0.93 but Swiss-German WER is 0.28. What's the right read?

This is the ontology working as designed: the transcript is imperfect but claim-critical entities are being canonicalised correctly; keep shipping and keep lowering WER opportunistically.

The pipeline is failing — WER must be under 0.20 or nothing works.

Disable normalization to improve WER.

Route every call to manual review until WER drops.

3 · The evaluation loop — measure accuracy where it pays, improve weekly

Measure WER, entity recall, routing precision and review load; turn review-queue failures into ontology updates that compound accuracy over time.

Theory — KPIs that reflect the business, not the transcript

Measure the right thing

The classic mistake is optimising Word Error Rate (WER) and calling it done. Claims operations don't care whether the transcript is beautiful — they care whether the extracted claim is correct and whether the routing decision was safe. So measure four KPIs from day one:

KPI	Target	Why it matters
WER (Swiss German subset, post-normalisation)	≤ 0.20	Sanity check on the raw signal after the ontology has done its work
Entity F1 (policy_number, amount_chf)	≥ 0.90	The actual product: did we capture the claim correctly?
False auto-process rate	≤ 2%	A bad auto-claim is a compliance event, not just an error
Review-queue SLA	95% within 30 min	The pipeline's job is to shrink human load, measurably

Note that entity F1 can be high even when WER is mediocre — that's the whole point of the ontology. The transcript can mishear filler words as long as it canonicalises the claim-critical tokens. Reporting WER alone would hide your real success.
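
To make the gap between the two numbers concrete, here is a minimal sketch of how WER and entity F1 could be computed for a single call. The function names, the (field, value) entity representation and the English sample sentences are illustrative assumptions, not part of the course pipeline; a production setup would aggregate over the whole golden set.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr[j] = min(prev[j] + 1,      # deletion
                          curr[j - 1] + 1,  # insertion
                          substitution)
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


def entity_f1(gold: set, predicted: set) -> float:
    """F1 over (field, value) pairs; a pair counts only on an exact match."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical call: two filler words are misheard, both entities survive.
reference = "my policy number is 12345 and the damage is 800 francs"
hypothesis = "me policy number is 12345 and a damage is 800 francs"
gold = {("policy_number", "12345"), ("amount_chf", "800")}
predicted = {("policy_number", "12345"), ("amount_chf", "800")}

call_wer = wer(reference, hypothesis)   # 2 substitutions over 11 words
call_f1 = entity_f1(gold, predicted)    # both entities captured exactly
```

In this sample the transcript carries two word errors out of eleven, yet the extracted claim is fully correct, which is the pattern the note above describes.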

Theory — the weekly accuracy flywheel

Close the loop: failures become ontology rows

Accuracy is not a launch number, it's a slope. The review queue is your training data — every repaired call tells you exactly which surface form the ontology was missing. The cadence that makes accuracy climb:

Daily — read the top failed phrases from the review queue. Most are a known concept said a new way (es het mer s'velo gglaut → diebstahl).
Weekly — add those surface forms and number forms to the JSON ontology, and promote any new high-frequency term into the STT initial_prompt hotwords. No model retraining — a data edit, reviewed like code.
Biweekly — re-run the frozen golden set; adjust routing thresholds only with evidence, never on a hunch.
Monthly — count the cases that failed for a structural reason (needed to join policy ↔ claim ↔ workshop) — that count is the leading indicator for the phase-2 KG decision in Lesson 4.

This is why JSON-first is powerful: the people who hear the failures (ops) can fix them the same week, because the ontology is a file they're allowed to edit.
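
As a rough illustration of the weekly step, the sketch below shows a failed phrase turning into an ontology row. The JSON schema (a `surface_forms` list per concept), the substring matching and the helper names are assumptions made for this example; the real ontology file may be structured differently.

```python
import json

# Hypothetical ontology file content: one concept with its known surface forms.
ONTOLOGY_JSON = '{"diebstahl": {"surface_forms": ["diebstahl"]}}'


def canonicalise(transcript: str, ontology: dict) -> set:
    """Return every concept whose surface form occurs in the transcript."""
    text = transcript.lower()
    return {
        concept
        for concept, entry in ontology.items()
        if any(form in text for form in entry["surface_forms"])
    }


def add_surface_form(ontology: dict, concept: str, form: str) -> None:
    """The weekly data edit: register a new way of saying a known concept."""
    forms = ontology.setdefault(concept, {"surface_forms": []})["surface_forms"]
    if form.lower() not in forms:
        forms.append(form.lower())


ontology = json.loads(ONTOLOGY_JSON)
failed_phrase = "es het mer s'velo gglaut"  # phrase taken from the review queue

before = canonicalise(failed_phrase, ontology)     # concept is missed
add_surface_form(ontology, "diebstahl", "gglaut")  # reviewed like code
after = canonicalise(failed_phrase, ontology)      # concept is now found

# Serialise the edited ontology so the change can be diffed and reviewed.
updated_json = json.dumps(ontology, ensure_ascii=False, indent=2)
```

Because the change is a one-line diff in a JSON file, it can go through the same review process as code without touching the STT model.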

Reflect

Most STT programmes optimise transcript beauty; you'll optimise decision quality and queue load.

›Which two failure tags must be mandatory in your review-queue UI so daily triage turns directly into ontology rows?
›At what false-auto-process rate would compliance force every call into manual review — and do you alert before you hit it?
›What's your accuracy slope per week, and which artefact (lexicon vs thresholds vs prompt) is moving it most?
