Model Evaluation in MLOps
Introduction
Model evaluation is the quality gate between training and deployment. It should be automated, repeatable, and tied to the real cost of wrong predictions.
Why This Matters
Accuracy alone can be misleading, particularly when classes are imbalanced. Many systems care more about precision, recall, F1, the decision threshold, per-segment performance, or business cost.
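To make the point concrete, here is a minimal sketch in plain Python. The dataset (95 negatives, 5 positives) and the always-negative "model" are hypothetical, chosen only to show how accuracy can look healthy while recall is zero.

```python
# Hypothetical imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that always predicts the majority class.
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy={accuracy(y_true, y_pred):.2f}")  # accuracy=0.95
print(f"recall={recall(y_true, y_pred):.2f}")      # recall=0.00
```

A model like this would pass an accuracy-only gate while missing every positive case.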
Core Concepts
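The core evaluation concepts are: a validation set for tuning and model selection, a held-out test set for the final estimate, the confusion matrix (counts of true and false positives and negatives), precision, recall, F1, threshold selection, and comparison against the current production model.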
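The sketch below shows one way these concepts connect: confusion-matrix counts at a given threshold, the metrics derived from them, and a threshold chosen by cost on validation data. The labels, scores, candidate thresholds, and the `fp_cost` / `fn_cost` values are all hypothetical, and minimizing a linear cost is only one possible selection rule.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn), predicting positive when score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def pick_threshold(y_true, scores, candidates, fp_cost, fn_cost):
    """Pick the candidate threshold with the lowest total error cost."""
    def cost(threshold):
        _, fp, fn, _ = confusion_counts(y_true, scores, threshold)
        return fp * fp_cost + fn * fn_cost
    return min(candidates, key=cost)


# Hypothetical validation labels and model scores.
val_labels = [0, 0, 0, 1, 0, 1, 1, 1]
val_scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
candidates = [0.3, 0.5, 0.7]

# Assume a missed positive costs five times as much as a false alarm.
best = pick_threshold(val_labels, val_scores, candidates, fp_cost=1.0, fn_cost=5.0)
tp, fp, fn, tn = confusion_counts(val_labels, val_scores, best)
print(best, precision_recall_f1(tp, fp, fn))
```

The threshold is chosen on the validation set; the test set is kept for the final, unbiased estimate of the chosen configuration.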
Practical Example
A script should emit machine-readable metrics:
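import json

from sklearn.metrics import classification_report, confusion_matrix

# Toy labels and predictions; a real script would load held-out test data.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# output_dict=True returns nested dicts keyed by class label (as strings).
report = classification_report(y_true, y_pred, output_dict=True)
# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred).tolist()

print(json.dumps({"f1": report["1"]["f1-score"], "confusion_matrix": matrix}, indent=2))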
Output:
{
  "f1": 0.8,
  "confusion_matrix": [
    [
      2,
      0
    ],
    [
      1,
      2
    ]
  ]
}
How This Fits in a Production Workflow
CI/CD can block deployment if metrics fall below the approved threshold. The registry should store the evaluation report with the artifact.
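A deployment gate of this kind can be sketched as a small check over two metrics reports. The function name `check_gate`, the `min_f1` and `max_regression` parameters, and the metric values are assumptions for illustration; in a real pipeline the two reports would be loaded from files produced by the evaluation step and the registry.

```python
import json


def check_gate(candidate, production, min_f1, max_regression=0.0):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    if candidate["f1"] < min_f1:
        failures.append(
            f"f1 {candidate['f1']:.3f} is below the approved minimum {min_f1:.3f}"
        )
    if candidate["f1"] < production["f1"] - max_regression:
        failures.append(
            f"f1 {candidate['f1']:.3f} regresses from production {production['f1']:.3f}"
        )
    return failures


# Hypothetical reports, parsed the same way a JSON metrics file would be.
candidate_metrics = json.loads('{"f1": 0.80}')
production_metrics = json.loads('{"f1": 0.84}')

failures = check_gate(
    candidate_metrics, production_metrics, min_f1=0.75, max_regression=0.02
)
for message in failures:
    print(f"GATE FAILED: {message}")

# A CI job would finish with sys.exit(exit_code) so a failure blocks the deploy.
exit_code = 1 if failures else 0
```

Here the candidate clears the absolute minimum but still fails, because it is more than 0.02 below the production model. Checking both conditions keeps a weak baseline from letting regressions through.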
Common Mistakes
- Evaluating on training data.
- Optimizing a metric that does not match operational cost.
- Changing thresholds manually after deployment without tracking.
- Ignoring latency and resource cost during evaluation.
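For the last point, latency can be recorded in the same evaluation run as the quality metrics. This is a rough sketch using only the standard library; `fake_predict` is a hypothetical stand-in for a real model call, and the nearest-rank p95 is one of several reasonable percentile definitions.

```python
import math
import statistics
import time


def measure_latency(predict, inputs, warmup=3):
    """Time each predict(x) call and return latency statistics in milliseconds."""
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls are not timed

    timings_ms = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    timings_ms.sort()
    # Nearest-rank 95th percentile on the sorted timings.
    p95_index = max(0, math.ceil(0.95 * len(timings_ms)) - 1)
    return {
        "p50_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[p95_index],
        "max_ms": timings_ms[-1],
    }


def fake_predict(x):
    """Hypothetical model call; replace with the real inference function."""
    return sum(i * x for i in range(1000))


latency = measure_latency(fake_predict, list(range(20)))
print(latency)
```

The resulting dictionary can be merged into the metrics report, so latency is reviewed alongside precision, recall, and F1.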
Quick Checklist
- Is the test set separate from training?
- Are precision, recall, and F1 recorded?
- Is the production threshold documented?
- Are metrics compared to the current model?
- Does CI fail when metrics regress?
Related Guides
- How a Model Training Pipeline Works
- Model Registry Explained
- CI/CD Pipeline for Machine Learning Projects
Summary
Model evaluation gates deployment. Automate it, record metrics that reflect the real cost of wrong predictions, and compare every candidate against the current production model before release.