Why measure p95/p99 latency rather than the mean?

Tail latency is what real users experience; a good mean can still hide painful slow requests

Serving Infrastructure & Cost — Semantic Web Academy

A deployment pattern (Level 4) is only half the story; the infra underneath sets whether you hit SLOs and what it costs:

Latency budget: a 200 ms end-to-end SLO must be split across network, feature lookup, and model compute. Measure p95/p99, not the mean — tail latency is what users feel.
Autoscaling & cold starts: scaling to zero saves money but a cold container reloading a 2 GB model adds seconds. Keep a warm floor for latency-critical paths.
Dynamic batching: GPUs are throughput machines. Batching requests for a few milliseconds can 10× throughput at a tiny latency cost — the core trick in Triton / KServe.
Right-sizing: most online models do not need a GPU. CPU with ONNX/quantization is often cheaper and fast enough.

Track cost per 1,000 predictions as a first-class metric. A model that's 0.5% more accurate but 4× the cost rarely wins.