Measure, slice, and gate
Evaluation as a first-class step
mlflow.evaluate() runs a logged model against a labelled dataset and automatically logs metrics, diagnostic plots (ROC, confusion matrix, calibration), and SHAP-based explanations, all attached to the active run:
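import mlflow

result = mlflow.evaluate(
    model='models:/credit-scoring/1',  # registered model, version 1
    data=eval_df,                      # evaluation data, including the label column
    targets='defaulted',               # name of the label column
    model_type='classifier',
    evaluators='default',
)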
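The returned object exposes the logged metrics as a plain dict through result.metrics, so a CI job can print them without opening the tracking UI. The following is a minimal sketch, assuming a hypothetical helper named format_metrics and invented metric values standing in for a real run:

```python
from types import SimpleNamespace


def format_metrics(metrics, digits=4):
    """Return one 'name = value' line per metric, sorted by metric name."""
    lines = []
    for name in sorted(metrics):
        value = metrics[name]
        if isinstance(value, float):
            # Round floats so CI logs stay readable.
            lines.append(f"{name} = {value:.{digits}f}")
        else:
            lines.append(f"{name} = {value}")
    return lines


# Stand-in for the object returned by mlflow.evaluate(); the values are invented.
fake_result = SimpleNamespace(
    metrics={"roc_auc": 0.91234, "accuracy_score": 0.875, "example_count": 400}
)

for line in format_metrics(fake_result.metrics):
    print(line)
```

In a real pipeline the same helper would be called as format_metrics(result.metrics) on the result of the call above.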
Validation thresholds catch bad models before promotion
Pair mlflow.evaluate() with validation_thresholds so that a model falling below the bar raises an exception instead of shipping silently:
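import mlflow
from mlflow.models import MetricThreshold

mlflow.evaluate(
    ...,
    validation_thresholds={
        # validation fails if the candidate's ROC AUC is below 0.9
        'roc_auc': MetricThreshold(threshold=0.9, greater_is_better=True),
    },
    # the model currently in the Production stage
    baseline_model='models:/credit-scoring/Production',
)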
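Now CI evaluates the candidate alongside the live model and fails the build when the gate is not met. On its own, threshold only enforces an absolute floor of 0.9; to fail on regression relative to the live model, also set min_absolute_change or min_relative_change on the MetricThreshold. This is the quality gate from the last lesson, but with batteries included.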
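To make the gate's logic concrete, here is a simplified sketch of the checks a metric threshold can express: an absolute floor, plus an optional minimum absolute or relative improvement over a baseline. Gate and gate_failures are hypothetical names used for illustration, not MLflow's API, and MLflow's own handling of boundary cases may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gate:
    """Hypothetical stand-in for a metric threshold; not MLflow's class."""
    threshold: Optional[float] = None
    min_absolute_change: Optional[float] = None
    min_relative_change: Optional[float] = None
    greater_is_better: bool = True


def gate_failures(name, candidate, baseline, gate):
    """Return the reasons the candidate fails the gate (empty list = pass)."""
    # Flip signs so "larger is better" holds for every metric.
    sign = 1.0 if gate.greater_is_better else -1.0
    cand, base = sign * candidate, sign * baseline
    failures = []

    # Absolute floor: independent of the baseline.
    if gate.threshold is not None and cand < sign * gate.threshold:
        failures.append(f"{name}: {candidate} misses threshold {gate.threshold}")

    # Required improvement over the baseline, in absolute terms.
    if gate.min_absolute_change is not None:
        if cand - base < gate.min_absolute_change:
            failures.append(f"{name}: absolute change vs baseline too small")

    # Required improvement over the baseline, as a fraction of the baseline.
    if gate.min_relative_change is not None and base != 0:
        if (cand - base) / abs(base) < gate.min_relative_change:
            failures.append(f"{name}: relative change vs baseline too small")

    return failures


# A candidate slightly worse than the baseline passes a floor-only gate...
print(gate_failures("roc_auc", 0.92, 0.93, Gate(threshold=0.9)))
# ...but fails once an improvement over the baseline is required.
print(gate_failures("roc_auc", 0.92, 0.93,
                    Gate(threshold=0.9, min_absolute_change=0.01)))
```

The two calls at the end show why the floor alone does not catch a regression: the baseline comparison only takes effect when a minimum change is configured.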