SLOs and the bill are design inputs
The decisions that set your latency and your bill
A deployment pattern (Level 4) is only half the story; the infrastructure underneath determines whether you hit your SLOs and what that costs:
- Latency budget: a 200 ms end-to-end SLO must be split across network, feature lookup, and model compute. Measure p95/p99, not the mean — tail latency is what users feel.
- Autoscaling & cold starts: scaling to zero saves money, but a cold container that has to reload a 2 GB model adds seconds to the first request. Keep a warm floor (a minimum number of always-running replicas) for latency-critical paths.
- Dynamic batching: GPUs are throughput machines. Holding requests for a few milliseconds to form a batch can raise throughput by as much as 10× at a small latency cost. This is the core trick behind request batching in Triton / KServe.
- Right-sizing: most online models do not need a GPU. A CPU serving an ONNX-exported or quantized model is often cheaper and fast enough.
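To make the latency-budget point concrete, here is a minimal sketch in plain Python. It assumes a 40/60/100 ms split of the 200 ms SLO and uses made-up latency samples; the stage names, the split, and the helper functions `percentile` and `over_budget` are illustrative, not taken from any particular serving stack. It shows how a stage can look healthy on the mean and at p95 while still breaking its budget at p99.

```python
import math

# Hypothetical split of a 200 ms end-to-end SLO across stages (milliseconds).
SLO_MS = 200
BUDGET_MS = {"network": 40, "feature_lookup": 60, "model_compute": 100}


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def over_budget(measured, q=99):
    """Return {stage: q-th percentile latency} for stages that exceed their budget."""
    result = {}
    for stage, samples in measured.items():
        value = percentile(samples, q)
        if value > BUDGET_MS[stage]:
            result[stage] = value
    return result


# The per-stage budgets must fit inside the SLO.
assert sum(BUDGET_MS.values()) <= SLO_MS

# Made-up samples: feature lookup is fast 98 times out of 100 and slow twice.
measured = {
    "network": [20] * 100,
    "feature_lookup": [30] * 98 + [250] * 2,
    "model_compute": [80] * 100,
}

lookup = measured["feature_lookup"]
print("mean:", sum(lookup) / len(lookup))
print("p95: ", percentile(lookup, 95))
print("p99: ", percentile(lookup, 99))
print("over budget at p99:", over_budget(measured, q=99))
```

With these samples the feature-lookup mean and p95 both sit well inside the 60 ms budget, and only the p99 check flags the stage. In practice the samples would come from your tracing or metrics system rather than a hard-coded list.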
Track cost per 1,000 predictions as a first-class metric. A model that's 0.5% more accurate but 4× the cost rarely wins.
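As a rough illustration of that metric, the sketch below derives cost per 1,000 predictions from an instance's hourly price and its sustained throughput. The function name `cost_per_1k`, the prices, the throughput figures, and the utilization factor are all hypothetical placeholders; substitute numbers from your own billing and load tests.

```python
def cost_per_1k(hourly_cost_usd, predictions_per_second, utilization=1.0):
    """USD cost of 1,000 predictions on one instance.

    utilization is the fraction of each hour the instance actually serves
    traffic at the given rate (idle time still costs money).
    """
    if predictions_per_second <= 0:
        raise ValueError("predictions_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    predictions_per_hour = predictions_per_second * 3600 * utilization
    return hourly_cost_usd / predictions_per_hour * 1000


# Hypothetical instances, both half idle on average.
cpu = cost_per_1k(hourly_cost_usd=0.20, predictions_per_second=50, utilization=0.5)
gpu = cost_per_1k(hourly_cost_usd=1.20, predictions_per_second=75, utilization=0.5)

print(f"CPU: ${cpu:.4f} per 1,000 predictions")
print(f"GPU: ${gpu:.4f} per 1,000 predictions")
print(f"GPU / CPU cost ratio: {gpu / cpu:.1f}x")
```

Under these assumed numbers the GPU instance costs about four times as much per prediction even though it is faster, because its higher hourly price outweighs the throughput gain. Including utilization matters: an instance that sits idle half the time doubles its effective cost per prediction.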