Theory — why the signal is hard
The company and the concrete pain
AlpineAssure (fictional but realistic) is a Swiss insurer with call centres in Zürich, Bern and St. Gallen. Claims intake relies on speech-to-text (STT) to pre-fill a claim before an agent confirms it. The pipeline works fine in demos and falls apart on real calls. Symptoms:
- dialect segments transcribed with low confidence or quietly wrong,
- policy numbers and amounts missed or malformed,
- agents spend 6–9 minutes repairing each transcript before claim creation,
- escalations spike on hailstorm days when call volume surges.
Why Swiss German specifically breaks STT
This is not 'the model is bad'. Swiss German (Schweizerdeutsch) is a structural worst case for STT, for reasons worth understanding because each one points at a fix:
- No standard orthography. Swiss German is primarily a spoken language; when it is written down, there is no agreed spelling. The same word (Schadensnummer) surfaces as schadensnume, schadensnummeri, schadesnummere. A model trained on written Standard German has no stable target to map these to.
- For STT purposes it's a separate spoken language, not an accent. es het inegloffe ('water leaked in') shares little surface form with Standard German es ist eingelaufen, so the acoustic model often picks the wrong Standard German word.
- Constant code-switching. Callers mix Swiss German, Standard German, French and English (franchise, claim, police) inside one sentence.
- Out-of-vocabulary domain terms. Selbstbehalt, Leitungswasserschaden, Elementarschaden are rare in general training data, so they're transcribed phonetically into nonsense.
- Numbers are spoken differently from Standard German. zwöitusig = 2000 (Standard German zweitausend), föifhundert = 500 (fünfhundert). Generic STT mangles exactly the tokens a claim depends on.
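The number problem in the last bullet lends itself to a small rule-based pass over the transcript. The following is a minimal sketch, not AlpineAssure's implementation: the lexicon UNITS and the function names parse_number and normalize_numbers are hypothetical, the spellings are one plausible set among many dialect variants, and only compounds of the form [unit]tusig[unit]hundert[unit] are handled.

```python
from typing import Optional

# Illustrative, incomplete lexicon (assumption): real spellings vary by
# dialect and speaker, so a production list would be much longer.
UNITS = {
    "eis": 1, "zwöi": 2, "drü": 3, "vier": 4, "föif": 5,
    "sächs": 6, "sibe": 7, "acht": 8, "nüün": 9,
}


def parse_number(word: str) -> Optional[int]:
    """Parse a compound like 'zwöitusigföifhundert'; return None if unknown."""
    rest = word.lower().strip()
    total = 0
    # Peel off the thousands part first, then the hundreds part.
    for marker, factor in (("tusig", 1000), ("hundert", 100)):
        if marker in rest:
            head, rest = rest.split(marker, 1)
            if head == "":
                multiplier = 1          # bare 'tusig' means 1000
            elif head in UNITS:
                multiplier = UNITS[head]
            else:
                return None             # unknown prefix: do not guess
            total += multiplier * factor
    # Whatever remains must be a known unit word, or nothing at all.
    if rest:
        if rest not in UNITS:
            return None
        total += UNITS[rest]
    return total if total > 0 else None


def normalize_numbers(text: str) -> str:
    """Replace every whitespace-separated token that parses as a number."""
    out = []
    for token in text.split():
        value = parse_number(token)
        out.append(str(value) if value is not None else token)
    return " ".join(out)


print(normalize_numbers("Schade vo zwöitusig Franke"))
```

Returning None for anything outside the lexicon is deliberate: for amounts and policy numbers, leaving a token untouched for the agent to check is safer than guessing a digit.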
The lesson: the failures are systematic and domain-shaped, which is precisely what a domain model can correct, without retraining the acoustic model at all.
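To make "correct without retraining" concrete, here is a hedged sketch of a post-processing step that maps spelling variants back to a canonical domain term. It is not the domain model itself, only an illustration of the idea using fuzzy string matching from Python's standard-library difflib. The term list DOMAIN_TERMS, the function name canonicalize and the 0.85 cutoff are assumptions chosen for this example.

```python
import difflib
from typing import Optional

# Hypothetical domain lexicon, taken from the terms mentioned above.
DOMAIN_TERMS = [
    "Schadensnummer",
    "Selbstbehalt",
    "Leitungswasserschaden",
    "Elementarschaden",
]
# Match case-insensitively, but return the canonical spelling.
_BY_LOWER = {term.lower(): term for term in DOMAIN_TERMS}


def canonicalize(token: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the closest domain term, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(
        token.lower(), list(_BY_LOWER), n=1, cutoff=cutoff
    )
    return _BY_LOWER[matches[0]] if matches else None


for variant in ("schadensnume", "schadensnummeri", "schadesnummere", "wasser"):
    print(variant, "->", canonicalize(variant))
```

The three variants from the orthography bullet all resolve to Schadensnummer, while an ordinary word such as wasser stays unmatched. The cutoff is the main design choice: set it too low and everyday words get rewritten into insurance terms, too high and genuine variants slip through.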