Semantic caching is most effective when:

Many distinct queries map to the same answerable intent.

Queries are short and repeat exactly.

The retrieval index changes every minute.

There is no embedding model available.

Caching & cost ceilings — Semantic Web Academy

Overview

Semantic caching, prompt caching, and hard per-tenant cost ceilings.

Why it matters

LLM cost scales with traffic and prompt size. Cache aggressively, cap per-tenant spend, and page on cost anomalies the same way you page on errors.

How it actually works

LLM cost grows with traffic and prompt size, so production systems cache aggressively and cap spend explicitly. Two cache layers and one budget do most of the work.

semantic_cache: { key: 'embedding(normalized_question)', hit_threshold: 0.94,
                  invalidate_on: [policy_version_change, index_rebuild] }
tenant_budget:  { daily_usd: 25, on_exhaustion: 'retrieval-only answer with citation' }
alert: 'cost_per_answer p95 > $0.08 for 15m'

Semantic caching keys on the meaning of a question, not its exact string, so many phrasings of the same intent share one cached answer. It pays off precisely when distinct wordings map to the same answerable intent — a support bot where everyone asks the refund question differently.
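A minimal sketch of the lookup side, to make the 0.94 hit threshold concrete. It assumes cosine similarity as the comparison and a caller-supplied `embed` function; `SemanticCache`, `normalize_question`, and the linear scan over entries are illustrative names and choices, not part of any particular library. A production system would more likely query a vector index.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize_question(question: str) -> str:
    """Hypothetical normalisation step: lowercase and collapse whitespace."""
    return " ".join(question.lower().split())


class SemanticCache:
    """Caches answers keyed on the embedding of the normalised question."""

    def __init__(self, embed: Callable[[str], Vector], hit_threshold: float = 0.94):
        self._embed = embed  # assumption: any function mapping text to a vector
        self._hit_threshold = hit_threshold
        self._entries: list[tuple[Vector, str]] = []

    def put(self, question: str, answer: str) -> None:
        self._entries.append((self._embed(normalize_question(question)), answer))

    def get(self, question: str) -> Optional[str]:
        """Return the most similar cached answer if it clears the threshold."""
        query = self._embed(normalize_question(question))
        best_score, best_answer = 0.0, None
        for vector, answer in self._entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self._hit_threshold else None
```

Raising the threshold trades hit rate for safety: fewer phrasings share an answer, but two different intents are less likely to collide on one cache entry.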

The dangerous bug is stale cache. A cached answer must be invalidated when the underlying policy version changes or the index is rebuilt, or you'll serve confidently wrong cached answers long after the source was fixed. Tie cache invalidation to the same events that drive reindexing.
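One way to wire this up, sketched under assumptions: stamp every cached answer with the policy version and index build it was generated from, and reject the entry on read if either has moved on. `SourceState`, `VersionedAnswerCache`, and the two event handlers are hypothetical names chosen to mirror `policy_version_change` and `index_rebuild` in the config above. Eagerly flushing the cache on each event would work as well.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceState:
    """Identifies the state of the sources an answer was generated from."""
    policy_version: str
    index_build_id: str


class VersionedAnswerCache:
    """Answer cache whose entries expire when the source state changes."""

    def __init__(self, state: SourceState):
        self._state = state
        self._answers: dict[str, tuple[SourceState, str]] = {}

    def put(self, key: str, answer: str) -> None:
        # Stamp the entry with the state that was current when it was written.
        self._answers[key] = (self._state, answer)

    def get(self, key: str) -> Optional[str]:
        entry = self._answers.get(key)
        if entry is None:
            return None
        stamped_state, answer = entry
        if stamped_state != self._state:
            # Written against an older policy or index: drop it, report a miss.
            del self._answers[key]
            return None
        return answer

    def on_policy_version_change(self, new_version: str) -> None:
        self._state = SourceState(new_version, self._state.index_build_id)

    def on_index_rebuild(self, new_build_id: str) -> None:
        self._state = SourceState(self._state.policy_version, new_build_id)
```

Calling the handlers from the same job that triggers reindexing keeps the cache and the index from drifting apart.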

Define the degraded mode before you hit the ceiling. When a tenant's budget is exhausted, decide now what happens — fall back to a retrieval-only answer with citations, queue, or refuse — so cost control is a designed behaviour, not a 2 a.m. surprise on the invoice.
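The sketch below shows one possible shape for that decision, using the `daily_usd: 25` ceiling and the retrieval-only fallback from the config above. `TenantBudget`, `answer`, and the `generate` and `retrieve_only` callables are illustrative assumptions, as is the choice to charge the estimated cost rather than a measured one.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Optional


@dataclass
class TenantBudget:
    """Tracks one tenant's spend against a daily ceiling."""
    daily_usd: float = 25.0
    spent_usd: float = 0.0
    day: date = field(default_factory=date.today)

    def _roll_over(self, today: date) -> None:
        # Reset the counter when the calendar day changes.
        if today != self.day:
            self.day, self.spent_usd = today, 0.0

    def can_afford(self, estimated_cost_usd: float, today: date) -> bool:
        self._roll_over(today)
        return self.spent_usd + estimated_cost_usd <= self.daily_usd

    def record(self, cost_usd: float, today: date) -> None:
        self._roll_over(today)
        self.spent_usd += cost_usd


def answer(
    question: str,
    budget: TenantBudget,
    estimated_cost_usd: float,
    generate: Callable[[str], str],       # hypothetical full LLM path
    retrieve_only: Callable[[str], str],  # hypothetical cited-passage fallback
    today: Optional[date] = None,
) -> tuple[str, str]:
    """Return (mode, text); mode is 'full' or 'degraded'."""
    today = today or date.today()
    if budget.can_afford(estimated_cost_usd, today):
        text = generate(question)
        budget.record(estimated_cost_usd, today)
        return "full", text
    # Budget exhausted: serve the designed fallback instead of failing.
    return "degraded", retrieve_only(question)
```

Returning the mode alongside the text lets the caller tell the user that the answer is a degraded one, and lets dashboards count how often tenants hit the ceiling.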

Analogy

Caching and cost ceilings are a prepaid utility meter. The meter (budget) cuts to a low-power mode when credit runs out instead of sending a shock bill, and you don't keep serving last year's tariff (stale cache) after prices changed.

Pitfalls & how to avoid them

Exact-string cache only. Symptom: low hit rate. Fix: semantic cache on normalised intent.
No cache invalidation. Symptom: stale answers served. Fix: invalidate on policy/index change.
No degraded mode. Symptom: hard failure at budget exhaustion. Fix: define the fallback up front.
No cost alarm. Symptom: a spend spike goes unnoticed until the invoice arrives. Fix: page on cost-per-answer anomalies like you page on errors.
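As a rough illustration of the last fix, here is one way to evaluate the rule `cost_per_answer p95 > $0.08 for 15m`. It assumes costs are bucketed into one-minute windows and that "for 15m" means fifteen consecutive windows over the threshold; `CostAnomalyAlert`, `observe_minute`, and the nearest-rank percentile are choices made for this sketch, and a real monitoring system will have its own evaluation semantics.

```python
from collections import deque
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


class CostAnomalyAlert:
    """Fires when per-minute p95 cost stays above a threshold for N minutes."""

    def __init__(self, threshold_usd: float = 0.08, sustained_minutes: int = 15):
        self._threshold = threshold_usd
        # One boolean per minute: was that minute's p95 above the threshold?
        self._recent: deque[bool] = deque(maxlen=sustained_minutes)

    def observe_minute(self, costs_usd: Sequence[float]) -> bool:
        """Record one minute of per-answer costs; return True if the alert fires."""
        if costs_usd:
            self._recent.append(p95(costs_usd) > self._threshold)
        else:
            # Assumption: a minute with no traffic counts as healthy.
            self._recent.append(False)
        return len(self._recent) == self._recent.maxlen and all(self._recent)
```

Requiring the breach to persist avoids paging on a single expensive answer while still catching a sustained regression, such as a prompt that suddenly doubled in size.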

Apply it to your system

Estimate your cost surface.

›Which user intents are asked many different ways and would cache well?
›What events must invalidate a cached answer in your domain?
›What is the right degraded behaviour when a tenant hits their budget?
