Compute the feature once
When offline ≠ online
Training/serving skew is any difference between the features a model saw in training and the features it sees in production. The model scored 0.94 offline and disappoints live — not because it's bad, but because it's being fed subtly different numbers.
Three classic sources:
- Code skew: training computes avg_spend in pandas; serving reimplements it in Java. The two rounding behaviours diverge.
- Data skew: training reads a daily batch table; serving reads a real-time stream with different null handling.
- Time-travel skew: a feature uses information that wasn't actually available at decision time (label leakage's cousin).
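To make the code-skew case concrete, here is a minimal sketch of how two reasonable-looking implementations of the same feature can disagree. The function names are hypothetical, and the text does not say which rounding either side actually uses. The sketch assumes the training path relies on Python's built-in round(), which rounds exact halves to the nearest even integer, and that the serving path behaves like Java's Math.round, which rounds halves up.

```python
import math


def avg_spend_training(amounts):
    # Hypothetical training-side feature: Python's built-in round()
    # sends exact halves to the nearest even integer.
    return round(sum(amounts) / len(amounts))


def avg_spend_serving(amounts):
    # Hypothetical serving-side reimplementation: emulates Java's
    # Math.round as floor(x + 0.5), so exact halves round up.
    return math.floor(sum(amounts) / len(amounts) + 0.5)


amounts = [2, 3]  # the mean is exactly 2.5
print(avg_spend_training(amounts), avg_spend_serving(amounts))
```

Both functions agree on most inputs, which is why this kind of skew tends to survive casual spot checks: it only shows up on the values that land exactly on a tie.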
The structural fix is a feature store (Level 1) that computes each feature once and serves the identical transformation to both training and inference. Where you can't share a single implementation, log the features served in production and diff them against the offline distribution.
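Both ideas can be sketched in a few lines. This is an illustration under stated assumptions, not a feature-store implementation: avg_spend, mean_shift, is_skewed, the empty-history default of 0.0, and the 0.5 threshold are all hypothetical choices. The check compares only the means, scaled by the offline standard deviation; a real comparison would likely also look at null rates, quantiles, and per-segment behaviour.

```python
from statistics import mean, pstdev


def avg_spend(amounts):
    # Hypothetical single definition of the feature. The training job and
    # the serving path both import this function instead of reimplementing it.
    # Returning 0.0 for an empty history is an assumed convention.
    return sum(amounts) / len(amounts) if amounts else 0.0


def mean_shift(offline_values, live_values):
    # Gap between the live and offline means, measured in offline
    # standard deviations.
    offline_std = pstdev(offline_values)
    gap = abs(mean(live_values) - mean(offline_values))
    if offline_std == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / offline_std


def is_skewed(offline_values, live_values, threshold=0.5):
    # The threshold is illustrative, not a recommended value.
    return mean_shift(offline_values, live_values) > threshold


# Offline feature values, and two sets of logged live values.
offline = [avg_spend(h) for h in ([10, 20], [30, 50], [20, 20], [40, 60])]
live_ok = [avg_spend(h) for h in ([12, 20], [35, 45], [22, 20], [45, 55])]
live_bad = [v * 100 for v in live_ok]  # made-up unit mismatch, e.g. cents vs dollars

print(is_skewed(offline, live_ok), is_skewed(offline, live_bad))
```

The shared function removes code skew by construction. The diff is the fallback for paths where sharing is not possible, and it only helps if the live values are logged at the point where the model actually consumes them.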