Serving Infrastructure & Cost

Latency budgets, autoscaling cold starts, GPU batching and the bill at the bottom.


SLOs and the bill are design inputs

The decisions that set your latency and your bill

A deployment pattern (Level 4) is only half the story. The infrastructure underneath determines whether you hit your SLOs and what hitting them costs:

  • Latency budget: a 200 ms end-to-end SLO must be split across network, feature lookup, and model compute. Measure p95/p99, not the mean — tail latency is what users feel.
  • Autoscaling & cold starts: scaling to zero saves money, but a cold container that has to reload a 2 GB model adds seconds to the first request it serves. Keep a warm floor (a minimum number of always-running replicas) for latency-critical paths.
  • Dynamic batching: GPUs are throughput machines. Holding requests for a few milliseconds so they can run as one batch can raise throughput roughly 10× at a small latency cost. This is the core trick behind batching in Triton and KServe.
  • Right-sizing: most online models do not need a GPU. CPU with ONNX/quantization is often cheaper and fast enough.
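
The dynamic batching idea from the list above can be sketched in a few lines. This is a simplified, offline illustration of the grouping rule only, not how Triton or KServe implement it: a batch closes when it is full or when its oldest request has waited past a short window. The names `Request`, `form_batches`, `max_batch_size` and `max_wait_ms` are hypothetical, chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival_ms: float  # when the request reached the server
    payload: str       # stand-in for the model input


def form_batches(requests, max_batch_size=8, max_wait_ms=5.0):
    """Group requests into batches.

    A batch is closed when it reaches max_batch_size, or when the next
    request arrives more than max_wait_ms after the batch's first request.
    """
    batches = []
    current = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # The oldest request in the open batch has waited too long: flush it.
        if current and req.arrival_ms - current[0].arrival_ms > max_wait_ms:
            batches.append(current)
            current = []
        current.append(req)
        # The batch is full: flush it immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Four requests in a burst, then two more after a quiet gap.
arrivals = [0, 1, 2, 3, 20, 21]
demo = [Request(arrival_ms=t, payload=f"req-{t}") for t in arrivals]
sizes = [len(b) for b in form_batches(demo, max_batch_size=3, max_wait_ms=5.0)]
print(sizes)  # [3, 1, 2]
```

The two knobs express the trade-off directly: a larger `max_wait_ms` tends to produce fuller batches (more throughput) but adds up to that much waiting time to the first request in each batch.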

Track cost per 1,000 predictions as a first-class metric. A model that's 0.5% more accurate but costs 4× as much to serve rarely wins.
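
As a rough sketch of how that metric can be computed for a single instance, the function below divides the hourly price by the predictions actually served in that hour. The function name, the `utilization` parameter, and every price and throughput figure in the example are made-up assumptions for illustration, not quotes from any cloud provider.

```python
def cost_per_1k(hourly_cost_usd, requests_per_second, utilization=1.0):
    """Cost in USD of 1,000 predictions on one instance.

    utilization is the fraction of the hour the instance spends serving
    at requests_per_second (idle time is still paid for).
    """
    if hourly_cost_usd < 0:
        raise ValueError("hourly_cost_usd must be non-negative")
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    predictions_per_hour = requests_per_second * 3600 * utilization
    return hourly_cost_usd / predictions_per_hour * 1000


# Hypothetical instances: a CPU box kept busy vs. a GPU box that mostly idles.
cpu = cost_per_1k(hourly_cost_usd=0.20, requests_per_second=50, utilization=0.6)
gpu = cost_per_1k(hourly_cost_usd=1.00, requests_per_second=400, utilization=0.1)
print(f"CPU: ${cpu:.5f} per 1k   GPU: ${gpu:.5f} per 1k")
```

With these invented numbers the GPU is cheaper per prediction at full load but more expensive once its idle time is counted, which is why utilization belongs in the calculation.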

Analogy

Serving infra is restaurant kitchen design. Dynamic batching is cooking ten identical orders in one pan; scale-to-zero is sending the chef home between rushes (and the wait when a late guest arrives); right-sizing is not buying a pizza oven to make toast.

Reflect

Put a number on your serving economics.

  • What is your latency budget, and how is it split across the request path?
  • Does this model actually need a GPU, or is that a habit?
  • What's your cost per 1,000 predictions — and who sees that number?
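
One way to start answering the first question is to compare tail latency per stage against a budget split. The sketch below is a minimal illustration: the 40/40/120 ms split of the 200 ms SLO, the stage names, and the sample values are all assumptions for this example, and `percentile` uses the simple nearest-rank definition.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_budget(stage_samples_ms, stage_budget_ms, p=99):
    """Return {stage: (observed_percentile_ms, budget_ms, within_budget)}."""
    report = {}
    for stage, budget in stage_budget_ms.items():
        observed = percentile(stage_samples_ms[stage], p)
        report[stage] = (observed, budget, observed <= budget)
    return report


# Hypothetical split of a 200 ms end-to-end SLO across the request path.
budget_ms = {"network": 40, "feature_lookup": 40, "model": 120}
samples_ms = {
    "network": [10, 12, 15, 38],
    "feature_lookup": [5, 6, 7, 55],  # one slow lookup hides behind a low mean
    "model": [60, 70, 80, 110],
}

report = check_budget(samples_ms, budget_ms, p=99)
for stage, (observed, budget, ok) in report.items():
    status = "OK" if ok else "OVER"
    print(f"{stage}: p99={observed} ms, mean={mean(samples_ms[stage]):.1f} ms, "
          f"budget={budget} ms -> {status}")
```

In this toy data the feature lookup looks healthy by its mean but breaks its budget at the tail. Per-stage percentiles do not simply add up to the end-to-end percentile, so the end-to-end p95/p99 should still be measured directly.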
