Match the tool to the on-call
A pragmatic comparison
| Tool | Strength | Watch-out |
|---|---|---|
| KServe | Kubernetes-native, model-mesh for many small models, standard InferenceService CRD. | Steep curve; needs a working K8s. |
| BentoML | Python-first DX, batteries-included REST/gRPC, easy to package. | Less rich on multi-model serving. |
| Seldon Core | Inference graphs, A/B routing built-in, mature for finance/health. | More complex CRDs. |
| NVIDIA Triton | Top GPU throughput, batching, multi-framework. | Operationally heavier; best ROI on GPU. |
| Vendor (SageMaker / Vertex / Azure ML) | Managed, autoscaling, monitoring built in. | Vendor lock-in; cost at scale. |
Don't pick the cool tool — pick the one your platform team can support on-call at 3 AM.