Batch & Streaming Inference at Scale

Run the same logged model as a Spark UDF over a billion rows, or against a Kafka stream.


Pick the shape, reuse the artifact

One model, three serving shapes

A registry model isn't only a REST endpoint. The same artifact drives three deployment shapes, picked by latency need:

  • Online (REST): mlflow models serve, per-request, millisecond latency.
  • Batch (Spark): score a billion rows nightly with a pandas/Spark UDF; throughput over latency.
  • Streaming: load the pyfunc once per consumer and score each Kafka record as it arrives.
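For the online shape, a client posts rows to the served model's scoring endpoint. The sketch below only builds the request and does not send it. It assumes a local server started with mlflow models serve on port 5000 and the MLflow 2.x JSON input format (dataframe_split on /invocations). The helper name build_scoring_request and the feature names are hypothetical.

```python
import json
import urllib.request

# Assumption: a model is being served locally, e.g. with
#   mlflow models serve -m models:/credit-scoring/Production -p 5000
# and accepts the MLflow 2.x JSON input format on /invocations.
ENDPOINT = "http://127.0.0.1:5000/invocations"


def build_scoring_request(columns, rows, endpoint=ENDPOINT):
    """Build (but do not send) a POST request that scores `rows` online."""
    payload = {"dataframe_split": {"columns": columns, "data": rows}}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature names and values, for illustration only.
request = build_scoring_request(["income", "age"], [[5000, 42]])

# With a server running, the request could be sent like this:
#   with urllib.request.urlopen(request) as resp:
#       predictions = json.loads(resp.read())
```

Each call pays a network round trip and scores only the rows in its payload, which is the per-request cost that the batch and streaming shapes trade away.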

MLflow makes batch trivial with spark_udf:

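import mlflow.pyfunc
# Wrap the registered model as a Spark UDF (assumes an active SparkSession named spark).
predict = mlflow.pyfunc.spark_udf(spark, 'models:/credit-scoring/Production')
# Pass only the model's input columns; here every column of df is assumed to be a feature.
scored = df.withColumn('pred', predict(*df.columns))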

MLflow ships the model files to the executors, and each Python worker there loads the model once rather than once per row, so a billion-row scoring job runs the exact artifact your REST endpoint serves. The same inputs and dependencies give the same predictions, in a totally different scale envelope.
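The streaming shape follows the same load-once idea, with a consumer loop in place of Spark. The sketch below is a minimal illustration, not a production consumer: score_stream and load_stub_model are hypothetical names, the stub stands in for the predict method of a model loaded with mlflow.pyfunc.load_model, and a plain list of JSON-encoded bytes stands in for the values read from a Kafka topic.

```python
import json
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]
PredictFn = Callable[[List[Record]], List[float]]


def score_stream(
    load_model: Callable[[], PredictFn],
    messages: Iterable[bytes],
) -> Iterator[Record]:
    """Load the model once, then score each message as it arrives."""
    predict = load_model()  # paid once per consumer, not once per record
    for raw in messages:
        features = json.loads(raw)
        pred = predict([features])[0]  # score a single-row batch
        yield {**features, "pred": pred}


# Hypothetical stand-in for a loaded pyfunc model, so the sketch runs anywhere.
load_calls = 0


def load_stub_model() -> PredictFn:
    global load_calls
    load_calls += 1
    return lambda rows: [r["income"] / 10000 for r in rows]


# In a real consumer, these would be the message values from a Kafka topic.
messages = [b'{"income": 5000}', b'{"income": 8000}']
results = list(score_stream(load_stub_model, messages))
```

Because the loader runs outside the loop, model startup cost is paid once per consumer and each record only pays for the predict call.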

Analogy

The logged model is a recipe. Online serving is cooking one plate to order; batch with Spark is the same recipe run through an industrial kitchen for 10,000 covers; streaming is the line cook plating each ticket as it prints. One recipe, three kitchens.

Reflect

Match your models to serving shapes.

  • Which of your models is secretly a batch job dressed up as a REST service?
  • Where would streaming inference cut a multi-hour batch latency to seconds?
  • Is the per-request cost of online serving justified, or would nightly batch do?
