1 · The Brief — why Swiss German breaks STT, and what an ontology fixes

Understand *why* Swiss German transcription fails, what a lightweight ontology actually does to the signal, and decide the MVP architecture.


Theory — why the signal is hard

The company and the concrete pain

AlpineAssure (fictional but realistic) is a Swiss insurer with call centres in Zürich, Bern and St. Gallen. Claims intake relies on speech-to-text (STT) to pre-fill a claim before an agent confirms it. The pipeline works fine in demos and falls apart on real calls. Symptoms:

  • dialect segments transcribed with low confidence or quietly wrong,
  • policy numbers and amounts missed or malformed,
  • agents spend 6–9 minutes repairing each transcript before claim creation,
  • escalations spike on hailstorm days when call volume surges.

Why Swiss German specifically breaks STT

The problem is not simply that 'the model is bad'. Swiss German (Schweizerdeutsch) is a structural worst case for STT, and each reason below is worth understanding because it points at a fix:

  1. No standard orthography. Swiss German is spoken, not written — there is no agreed spelling. The same word (Schadensnummer) surfaces as schadensnume, schadensnummeri, schadesnummere. A model trained on written German has no stable target to map these to.
  2. For STT it behaves like a separate spoken language, not an accent. es het inegloffe ('water leaked in') shares little surface form with Standard German es ist eingelaufen, so the acoustic model often picks the wrong Standard-German word.
  3. Constant code-switching. Callers mix Swiss German, Standard German, French and English (franchise, claim, police) inside one sentence.
  4. Out-of-vocabulary domain terms. Selbstbehalt, Leitungswasserschaden, Elementarschaden are rare in general training data, so they're transcribed phonetically into nonsense.
  5. Numbers are spoken differently. zwöitusig = 2000, föifhundert = 500. Generic STT mangles exactly the tokens a claim depends on.
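Point 5 is the easiest failure to make concrete. Below is a minimal sketch of a post-STT number normaliser. It assumes the engine emits a dialect number word roughly phonetically as a single token. The sketch splits the word into known morphemes by greedy longest match, then accumulates hundreds and thousands. The lexicon and the function names (`segment`, `parse_swiss_number`) are hypothetical and deliberately incomplete, because real spellings vary by region.

```python
import unicodedata

# Hypothetical, deliberately incomplete lexicon of dialect number morphemes.
# Spellings vary by region; extend it from your own call transcripts.
_RAW = {
    "eis": 1, "eins": 1, "zwöi": 2, "zwei": 2, "drü": 3, "vier": 4,
    "föif": 5, "füf": 5, "sächs": 6, "sibe": 7, "acht": 8, "nün": 9, "nüün": 9,
    "zäh": 10, "zwänzg": 20, "drissg": 30, "vierzg": 40, "füfzg": 50,
    "hundert": 100, "tusig": 1000,
}
# NFC-normalise keys so an "ö" typed as o + combining diaeresis still matches.
LEXICON = {unicodedata.normalize("NFC", k): v for k, v in _RAW.items()}
# Longest keys first, so "vierzg" wins over "vier" during segmentation.
KEYS = sorted(LEXICON, key=len, reverse=True)


def segment(word):
    """Split a compound number word into known morphemes (greedy, no backtracking)."""
    word = unicodedata.normalize("NFC", word.strip().lower())
    parts, i = [], 0
    while i < len(word):
        key = next((k for k in KEYS if word.startswith(k, i)), None)
        if key is None:
            return None  # unknown material: not a number word we understand
        parts.append(key)
        i += len(key)
    return parts or None


def parse_swiss_number(word):
    """Convert e.g. 'zwöitusigföifhundert' to 2500; None if not a number word."""
    parts = segment(word)
    if parts is None:
        return None
    total, current = 0, 0
    for part in parts:
        value = LEXICON[part]
        if value == 1000:
            total += (current or 1) * 1000  # close off the thousands group
            current = 0
        elif value == 100:
            current = (current or 1) * 100
        else:
            current += value
    return total + current
```

For example, `parse_swiss_number("zwöitusig")` gives 2000 and `parse_swiss_number("föifhundert")` gives 500. Compound tens with a unit prefix (the dialect forms for numbers like 25) are not covered, and the greedy segmentation never backtracks. Add both only once real transcripts show which variants callers actually use.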

The lesson: the failures are systematic and domain-shaped, which is precisely what a domain model can correct — without retraining the acoustic model at all.

Theory — the ontology is the accuracy lever

What a lightweight ontology actually does here

An ontology, stripped to its essence, does four jobs: it names the concepts of a domain, fixes a canonical label for each, lists the surface forms that mean the same thing, and states constraints on valid values. You do not need OWL or RDF to do those four jobs — a disciplined JSON file does them too. That is what we mean by a lightweight JSON ontology.
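Here is a minimal sketch of such a file and a loader that makes the four jobs concrete. The layout is our own invention. The keys `concepts`, `label`, `surface_forms` and `constraints`, and the helper `load_ontology`, are hypothetical and not a standard. Inside a JSON string the regex backslash is doubled (`\\d`), so the parsed pattern is `\d`.

```python
import json

# Hypothetical layout (key names are illustrative, not a standard):
#   concept ids       -> the named concepts
#   "label"           -> the canonical label
#   "surface_forms"   -> phrasings that mean the same thing
#   "constraints"     -> rules on valid values
ONTOLOGY_JSON = r"""
{
  "concepts": {
    "leitungswasserschaden": {
      "label": "Leitungswasserschaden",
      "surface_forms": ["leitigswasser", "wasserschade", "es het inegloffe"]
    },
    "selbstbehalt": {
      "label": "Selbstbehalt",
      "surface_forms": ["franchise", "selbschtbehalt"]
    }
  },
  "constraints": {
    "policy_number": {"pattern": "^[A-Z]{2}-\\d{6,8}$"},
    "incident_type": {"one_of": ["leitungswasserschaden"]}
  }
}
"""


def load_ontology(text):
    """Parse the ontology and build a surface-form -> concept-id index,
    refusing any form that would map to two different concepts."""
    onto = json.loads(text)
    index = {}
    for cid, concept in onto["concepts"].items():
        for form in [concept["label"], *concept["surface_forms"]]:
            key = form.casefold()
            if index.get(key, cid) != cid:
                raise ValueError(f"{form!r} maps to both {index[key]!r} and {cid!r}")
            index[key] = cid
    # Every allowed incident type must be a declared concept.
    unknown = set(onto["constraints"]["incident_type"]["one_of"]) - set(onto["concepts"])
    if unknown:
        raise ValueError(f"incident types without a concept: {sorted(unknown)}")
    return onto, index
```

The collision check is what makes the file 'disciplined'. If two concepts claimed the same surface form, canonicalisation would depend on lookup order, so the loader rejects the file instead.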

Why it lifts accuracy, mechanically:

  • Domain biasing (before STT). The concept labels become hotwords, or an initial_prompt in Whisper-style engines, that push the decoder toward Leitungswasserschaden instead of a phonetic guess. This raises confidence on exactly the rare terms that matter.
  • Canonicalisation (after STT). A surface-form map collapses leitigswasser, wasserschade, es het inegloffe → the single concept leitungswasserschaden. The transcript can stay messy; the extracted meaning becomes stable.
  • Constraint as a safety net. A rule such as 'a claim needs a policy number matching ^[A-Z]{2}-\d{6,8}$ and an incident type from a closed list' lets you reject or route a bad extraction instead of silently writing garbage to the CRM.
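The three mechanisms can be sketched against one tiny in-memory ontology. This is an illustration, not a production design. The concept entries, the 0.85 fuzzy cutoff and the function names (`domain_prompt`, `canonicalise`, `check_claim`) are assumptions. Fuzzy matching with Python's `difflib.get_close_matches` is one simple way to absorb spelling variants, and there are many others.

```python
import difflib
import re

# Hypothetical mini-ontology; ids, labels and surface forms are illustrative.
CONCEPTS = {
    "leitungswasserschaden": {
        "label": "Leitungswasserschaden",
        "surface_forms": ["leitigswasser", "wasserschade", "es het inegloffe"],
    },
    "elementarschaden": {
        "label": "Elementarschaden",
        "surface_forms": ["elementarschade", "hagelschade"],
    },
}
POLICY_RE = re.compile(r"^[A-Z]{2}-\d{6,8}$")


def domain_prompt(concepts):
    """Biasing (before STT): canonical labels joined into a hotword/prompt string."""
    return ", ".join(c["label"] for c in concepts.values())


def canonicalise(transcript, concepts, cutoff=0.85):
    """Canonicalisation (after STT): map surface forms in a transcript to concept ids."""
    text = " ".join(re.findall(r"\w+", transcript.lower()))
    index = {}
    for cid, concept in concepts.items():
        for form in [concept["label"].lower(), *concept["surface_forms"]]:
            index[form] = cid
    # Exact matches, including multi-word forms such as "es het inegloffe".
    found = {cid for form, cid in index.items()
             if re.search(rf"\b{re.escape(form)}\b", text)}
    # Fuzzy fallback on single words absorbs unstable dialect spellings.
    single_words = [form for form in index if " " not in form]
    for token in text.split():
        for match in difflib.get_close_matches(token, single_words, n=1, cutoff=cutoff):
            found.add(index[match])
    return sorted(found)


def check_claim(policy_number, incident_type, concepts):
    """Constraint (safety net): list violations instead of writing garbage."""
    errors = []
    if not POLICY_RE.fullmatch(policy_number or ""):
        errors.append("policy_number")
    if incident_type not in concepts:
        errors.append("incident_type")
    return errors
```

`domain_prompt` only builds the string. How it reaches the decoder depends on the engine. For example, openai-whisper's `transcribe` accepts an `initial_prompt` argument. Using `fullmatch` rather than `match` also rejects a value with a trailing newline, which `$` alone would let through.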

So the ontology is not decoration on top of STT; it is the component that converts an unreliable acoustic guess into a trustworthy domain fact. Much of the accuracy the acoustic model cannot deliver on domain terms, the ontology can recover.

Architecture map — JSON MVP + optional KG


MVP-first architecture with an optional KG branch

The main path solves today's operational problem: STT → normalise against the lexicon/ontology → extract constrained entities → route by confidence. The KG node is intentionally optional in phase 1 — it earns its place only when cross-entity reasoning becomes a hard requirement (Lesson 4).
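The main path can be sketched as four small stages. This is a skeleton under stated assumptions. A hypothetical `Transcript` object stands in for the STT engine and carries text plus a 0..1 confidence score. The route names and the 0.8 threshold are illustrative. The policy number is searched inside free text, so the sketch uses word boundaries instead of the `^...$` anchors used for whole-field validation.

```python
import re
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical stand-in for whatever the STT engine returns."""
    text: str
    confidence: float  # assumed 0..1 score reported by the engine


# Illustrative entries; a real system would load these from the JSON ontology.
SURFACE_FORMS = {"leitigswasser": "leitungswasserschaden",
                 "wasserschade": "leitungswasserschaden"}
INCIDENT_TYPES = {"leitungswasserschaden", "elementarschaden"}
# Searching free text, so word boundaries replace the ^...$ anchors.
POLICY_IN_TEXT = re.compile(r"\b[A-Z]{2}-\d{6,8}\b")


def normalise(text):
    """Rewrite known surface forms to canonical concept ids."""
    for form, concept in SURFACE_FORMS.items():
        text = re.sub(rf"\b{re.escape(form)}\b", concept, text, flags=re.IGNORECASE)
    return text


def extract(text):
    """Pull constrained entities; None marks a missing or ambiguous value."""
    policy = POLICY_IN_TEXT.search(text)
    incidents = sorted(set(re.findall(r"\w+", text.lower())) & INCIDENT_TYPES)
    return {
        "policy_number": policy.group(0) if policy else None,
        "incident_type": incidents[0] if len(incidents) == 1 else None,
    }


def route(entities, confidence, threshold=0.8):
    """Decide how much agent attention the pre-filled claim needs."""
    if any(value is None for value in entities.values()):
        return "manual_entry"      # constraint failed: do not pre-fill garbage
    if confidence < threshold:
        return "prefill_flagged"   # valid but uncertain: agent double-checks
    return "prefill"               # valid and confident: agent only confirms


def process(transcript):
    """STT output -> normalise -> extract -> route."""
    entities = extract(normalise(transcript.text))
    return entities, route(entities, transcript.confidence)
```

The design point is routing on both signals. A constraint failure never reaches pre-fill, however confident the engine was. Low acoustic confidence only adds an agent check on an otherwise valid extraction.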

Analogy

Think of the ontology as a customs officer with a printed list of declared goods. Travellers (callers) describe the same item in a dozen dialect words; the officer doesn't care how it was phrased — they match it to a line on the official list, reject anything that fits no category, and stamp a clean, standardised entry. The STT engine is just the microphone at the border; the list is what makes the paperwork correct.

Analogy

Or, clinically: this is emergency-room triage. First stabilise the patient (transcription + extraction reliability via the ontology), then plan long-term rehabilitation (a graph-centric semantic architecture). Deploying a knowledge graph before extraction is reliable is operating on a patient who is still bleeding.

Reflect

Before writing any code, name the domain you're actually constraining.

  • List the 10 domain terms your callers say most that generic STT is most likely to mangle — these are your first hotwords.
  • For your highest-value entity (here, policy_number), what is the strict shape a valid value must have, and what should happen when extraction violates it?
  • Which failures are model failures (need better audio/model) versus mapping failures (need a better ontology)? Be honest — most are the latter.
