Training/Serving Skew

The number-one silent killer: features computed one way in training, another at serving.


Compute the feature once

When offline ≠ online

Training/serving skew is any difference between the features a model saw in training and the features it sees in production. A model can score 0.94 offline and still disappoint live, not because it's bad, but because it's being fed subtly different numbers.

Three classic sources:

  1. Code skew — training computes avg_spend in pandas; serving reimplements it in Java. The two rounding behaviours diverge.
  2. Data skew — training reads a daily batch table; serving reads a real-time stream with different null handling.
  3. Time-travel skew — a feature uses information that wasn't actually available at decision time (label leakage's cousin).
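Time-travel skew is the easiest of the three to show in code. The sketch below is a minimal illustration, not a production join: the event log, the spend_as_of helper, and the timestamps are all hypothetical. It totals a customer's spend using only events strictly before the decision time, and contrasts that with a leaky version that sums the whole table.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical purchase log for one customer, sorted by event time.
EVENTS = [
    (datetime(2024, 1, 1), 20.0),
    (datetime(2024, 1, 5), 30.0),
    (datetime(2024, 1, 9), 50.0),
]


def spend_as_of(events, decision_time):
    """Total spend from events strictly before decision_time.

    Assumes `events` is sorted by timestamp.
    """
    timestamps = [ts for ts, _ in events]
    # bisect_left returns the index of the first event at or after
    # decision_time, so the slice keeps only earlier events.
    cutoff = bisect_left(timestamps, decision_time)
    return sum(amount for _, amount in events[:cutoff])


def spend_leaky(events):
    """Total spend over the whole table, ignoring when the decision was made."""
    return sum(amount for _, amount in events)


decision_time = datetime(2024, 1, 6)
point_in_time = spend_as_of(EVENTS, decision_time)  # Jan 1 and Jan 5 only
leaky = spend_leaky(EVENTS)                          # also counts Jan 9
print(point_in_time, leaky)
```

A training row labelled on 6 January should carry the point-in-time value. The leaky value includes a purchase from 9 January, which serving could never have seen at that moment.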

The structural fix is a feature store (Level 1) that computes each feature once and serves the identical transformation to both training and inference. Where that isn't possible, log the features actually served in production and diff them against the offline distribution.
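As a rough sketch of both ideas, assume a single hypothetical avg_spend function that training and serving would both import, plus a diff_logged_features helper that compares values logged at serving time with values recomputed offline for the same entities. The names, the tolerance, and the sample data are illustrative only. The divergent serving path is simulated here with half-up rounding, standing in for a port that rounds differently. This is a per-entity comparison; comparing summary statistics of the two distributions is an alternative when rows cannot be matched.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def avg_spend(total, n_orders, rounding=ROUND_HALF_EVEN):
    """Single shared definition of the feature, rounded to cents."""
    if n_orders == 0:
        return Decimal("0.00")
    mean = Decimal(total) / Decimal(n_orders)
    return mean.quantize(Decimal("0.01"), rounding=rounding)


def diff_logged_features(offline, logged, tolerance=Decimal("0")):
    """Return {entity_id: (offline_value, logged_value)} for every mismatch.

    An entity missing from the serving log is reported with None.
    """
    mismatches = {}
    for entity_id, expected in offline.items():
        served = logged.get(entity_id)
        if served is None or abs(served - expected) > tolerance:
            mismatches[entity_id] = (expected, served)
    return mismatches


# Hypothetical raw inputs: entity -> (total spend as a string, order count).
raw = {"u1": ("10.05", 2), "u2": ("30.00", 3), "u3": ("7.50", 0)}

# Offline path uses the shared function with its defaults.
offline = {k: avg_spend(t, n) for k, (t, n) in raw.items()}

# Simulated serving reimplementation that rounds half up instead.
logged = {k: avg_spend(t, n, rounding=ROUND_HALF_UP) for k, (t, n) in raw.items()}

mismatches = diff_logged_features(offline, logged)
print(mismatches)
```

Only u1 lands exactly on a rounding boundary (5.025), so it is the only entity the two paths disagree on. Had serving called avg_spend with the same defaults, the diff would come back empty.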

Analogy

Training/serving skew is rehearsing a play on one stage and performing on another with the furniture moved. The actors (weights) are flawless; they just keep walking into a table that wasn't there in rehearsal.

Reflect

Trace one feature end to end.

  • Is it computed by the same code in training and serving?
  • If not, how would you detect a divergence before users do?
  • Would a feature store eliminate the risk, or just move it?
