Utilisation is a discipline
Three knobs
- Right-size the request — asking for 4 GPUs when you use 1 wastes 75% capacity. Profile first.
- Use a scheduler with bin-packing — Kubernetes + KubeFlow, Ray, Slurm. Don't 'hand out' GPUs by Slack DM.
- Time-share with MIG / fractional GPUs — A100/H100 can be sliced; many inference workloads fit in 1/7th of a GPU.
Cheap wins
- Spot/preemptible instances for non-urgent training (50–80% saving).
- Checkpoint every N steps — losing 12 hours to a preemption hurts only once.
- Profile before buying bigger hardware.