Data Drift
Intro
Data drift occurs when the statistical properties of your input data change over time compared to the data your model was trained on. It matters because ML models assume training and serving data come from the same distribution — when that stops being true, predictions can become less reliable without any obvious error. A fraud model trained on last year's purchase behavior silently degrades when new payment methods emerge; a vision model deployed to a new camera produces worse results due to different lighting.
Types of Drift
| Type | What changes | Example |
|---|---|---|
| Data drift (feature drift) | P(X) — input distribution | Users start asking questions in a new language |
| Label drift (prior probability shift) | P(Y) — label distribution | Fraud rate increases from 1% to 5% |
| Concept drift | P(Y|X) — the relationship between inputs and labels | "Spam" patterns change as spammers adapt |
| Covariate shift | P(X) changes but P(Y|X) stays the same | New user segment with different demographics |
Concept drift is the most dangerous — the model's learned relationship is no longer valid, so retraining on new data is the only fix. Data drift may be benign if the model generalizes well to the new distribution.
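The distinction can be made concrete with a toy sketch (synthetic data; a fixed threshold rule stands in for a trained model): under covariate shift the learned rule stays correct on the shifted inputs, while under concept drift it fails on part of the input space even though the inputs look unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline: x ~ N(0, 1), labels generated by the rule y = 1 if x > 0.5.
model = lambda x: (x > 0.5).astype(int)  # stands in for a fitted model

# Covariate shift: P(X) moves, but the same rule still produces the labels.
x_shift = rng.normal(1.0, 1.0, 10_000)
y_shift = (x_shift > 0.5).astype(int)
acc_covariate = (model(x_shift) == y_shift).mean()  # still perfect

# Concept drift: P(X) is unchanged, but the labeling rule itself moves.
x_same = rng.normal(0.0, 1.0, 10_000)
y_new_rule = (x_same > -0.5).astype(int)
acc_concept = (model(x_same) == y_new_rule).mean()  # wrong on (-0.5, 0.5]
```

The model remains perfectly accurate on the shifted inputs but loses all the points in the band between the old and new decision boundaries once the rule changes — which is why concept drift forces a retrain and covariate shift may not.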
Detection Methods
Population Stability Index (PSI) — measures how much a feature's distribution has shifted. Commonly used in credit scoring and finance.
```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Compute PSI between baseline and current feature distributions."""
    quantiles = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges = np.unique(quantiles)
    if len(edges) < 3:
        return 0.0  # too few distinct values to bin meaningfully
    # Clip current values into the baseline range so out-of-range values
    # land in the outer bins instead of being silently dropped.
    clipped = np.clip(actual, edges[0], edges[-1])
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(clipped, bins=edges)
    exp_p = np.maximum(exp_counts / max(exp_counts.sum(), 1), eps)
    act_p = np.maximum(act_counts / max(act_counts.sum(), 1), eps)
    return float(np.sum((act_p - exp_p) * np.log(act_p / exp_p)))
```

Interpretation thresholds:
- PSI < 0.1: no significant drift
- PSI 0.1–0.2: moderate drift, investigate
- PSI > 0.2: significant drift, action required
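As a sanity check on those thresholds, here is a standalone sketch on synthetic data (binned PSI inlined): a 0.5-sigma mean shift on a normal feature lands above the 0.2 action threshold, while a fresh sample of the same distribution stays well below 0.1.

```python
import numpy as np

rng = np.random.default_rng(42)
baseline = rng.normal(0.0, 1.0, 50_000)
same_dist = rng.normal(0.0, 1.0, 50_000)  # fresh sample, no drift
shifted = rng.normal(0.5, 1.0, 50_000)    # mean shifted by 0.5 sigma

edges = np.quantile(baseline, np.linspace(0, 1, 11))  # 10 equal-mass bins

def binned_psi(current, eps=1e-6):
    exp_p = np.maximum(np.histogram(baseline, bins=edges)[0] / baseline.size, eps)
    cur = np.clip(current, edges[0], edges[-1])
    act_p = np.maximum(np.histogram(cur, bins=edges)[0] / current.size, eps)
    return float(np.sum((act_p - exp_p) * np.log(act_p / exp_p)))

psi_stable = binned_psi(same_dist)  # sampling noise only: well below 0.1
psi_shifted = binned_psi(shifted)   # above the 0.2 action threshold
```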
Kolmogorov-Smirnov (KS) test — non-parametric test for numeric features. Tests whether two samples come from the same distribution.
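A minimal sketch with `scipy.stats.ks_2samp` (synthetic data): the test rejects "same distribution" for a modest mean shift. Note that at large sample sizes KS flags even tiny shifts, so it is worth checking the KS statistic (effect size) alongside the p-value.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(7)
baseline = rng.normal(0.0, 1.0, 5_000)  # training-time feature sample
current = rng.normal(0.3, 1.0, 5_000)   # serving window with a 0.3-sigma shift

stat, p_value = ks_2samp(baseline, current)
drifted = p_value < 0.05  # reject "same distribution" at the 5% level
```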
Chi-square test — for categorical features. Tests whether observed frequencies match expected frequencies.
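A sketch with `scipy.stats.chi2_contingency` on hypothetical category counts for a payment-method feature: the baseline and current windows form a 2×k contingency table.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts per category:   card  bank  wallet
baseline_counts = [700, 250, 50]
current_counts = [550, 250, 200]  # wallet share grew sharply

table = np.array([baseline_counts, current_counts])
stat, p_value, dof, expected = chi2_contingency(table)
drifted = p_value < 0.05  # category frequencies differ between windows
```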
Jensen-Shannon divergence — symmetric measure of distribution distance. Bounded [0, 1], easier to interpret than KL divergence.
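One wrinkle worth knowing when using SciPy for this: `scipy.spatial.distance.jensenshannon` returns the JS *distance* (the square root of the divergence), so square it to get the divergence. A sketch on hypothetical binned probabilities:

```python
import numpy as np
from scipy.spatial import distance

# Binned probabilities for one feature in two windows (hypothetical values)
baseline_p = np.array([0.5, 0.3, 0.2])
current_p = np.array([0.2, 0.3, 0.5])

# Square the returned distance to get the divergence, which is
# bounded to [0, 1] when using base-2 logarithms.
js_div = distance.jensenshannon(baseline_p, current_p, base=2) ** 2
```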
Monitoring Workflow
1. Define baseline
└── Training data distribution OR last 30 days of stable serving
2. Compute drift metrics per feature
└── PSI for numeric, chi-square for categorical
└── Run daily or per batch
3. Segment monitoring
└── Break down by region, device, user tier
└── Averages hide drift in subpopulations
4. Alert on threshold breach
└── PSI > 0.2, KS p-value < 0.05
5. Investigate
└── Rule out pipeline issues first (schema changes, ETL bugs, encoding changes)
└── Check model performance if labels are available
6. Respond
└── Retrain on recent data
└── Update feature engineering
└── Adjust decision thresholds
└── Route to manual review for high-risk cases
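Steps 2 and 4 of the workflow above can be sketched as a small per-batch job (feature names, windows, and the alert list are hypothetical; a real job would also loop over segments and route alerts to an on-call channel):

```python
import numpy as np
from scipy.stats import ks_2samp

def daily_drift_check(baseline: dict, current: dict, alpha: float = 0.05) -> list:
    """Run a two-sample KS test per numeric feature; return breached features."""
    alerts = []
    for feature, base_values in baseline.items():
        stat, p_value = ks_2samp(base_values, current[feature])
        if p_value < alpha:
            alerts.append(feature)
    return alerts

rng = np.random.default_rng(1)
baseline = {"amount": rng.normal(50, 10, 5_000), "age": rng.normal(35, 8, 5_000)}
current = {"amount": rng.normal(65, 10, 5_000), "age": rng.normal(35, 8, 5_000)}
alerts = daily_drift_check(baseline, current)  # "amount" should be flagged
```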
Pitfalls
Drift without performance drop
Drift in a feature the model does not rely on heavily may not affect predictions. Always check model performance metrics (if labels are available) before triggering a retrain. Unnecessary retraining wastes resources and can introduce instability.
Averages hiding drift
A global PSI of 0.05 (no drift) can coexist with PSI of 0.4 for a specific user segment. Always monitor drift per segment.
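This failure mode is easy to reproduce. In the sketch below (synthetic data, binned PSI inlined), 10% of traffic drifts sharply, yet the global PSI stays in the "no drift" band because the stable 90% dilutes it:

```python
import numpy as np

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 50_000)  # all segments identical at training time
edges = np.quantile(baseline, np.linspace(0, 1, 11))

def binned_psi(current, eps=1e-6):
    exp_p = np.maximum(np.histogram(baseline, bins=edges)[0] / baseline.size, eps)
    cur = np.clip(current, edges[0], edges[-1])
    act_p = np.maximum(np.histogram(cur, bins=edges)[0] / current.size, eps)
    return float(np.sum((act_p - exp_p) * np.log(act_p / exp_p)))

segment_a = rng.normal(0.0, 1.0, 45_000)  # 90% of traffic: stable
segment_b = rng.normal(2.0, 1.0, 5_000)   # 10% of traffic: sharply drifted

global_psi = binned_psi(np.concatenate([segment_a, segment_b]))  # under 0.1
segment_psi = binned_psi(segment_b)                              # far above 0.2
```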
Delayed labels
For many production systems, ground truth labels arrive days or weeks after prediction (e.g., fraud confirmed after investigation). Use proxy metrics (escalation rate, user complaints) for early drift detection while waiting for labels.
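One cheap proxy is the model's own score distribution, which is available immediately. A sketch with hypothetical fraud-score distributions: a KS test flags the current window against a stable baseline week before any confirmed labels arrive.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
# Predicted fraud scores: labels are weeks away, but scores are available now.
baseline_scores = rng.beta(2, 8, 20_000)  # stable week: mostly low scores
current_scores = rng.beta(2, 5, 20_000)   # current day: scores creeping upward

stat, p_value = ks_2samp(baseline_scores, current_scores)
early_warning = p_value < 0.05  # investigate before ground truth lands
```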
Treating all drift as concept drift
Data drift (P(X) changes) does not always require retraining — the model may generalize. Concept drift (P(Y|X) changes) always requires retraining. Distinguish between them before deciding on a response.
Tradeoffs
Detection Method Selection
| Method | Feature type | Sensitivity | Interpretability | Use when |
|---|---|---|---|---|
| PSI | Numeric | Medium | High (thresholds: 0.1, 0.2) | Credit scoring, finance; well-understood thresholds |
| KS test | Numeric | High | Medium (p-value) | General numeric features; sensitive to small shifts |
| Chi-square | Categorical | Medium | Medium | Categorical features with stable cardinality |
| Jensen-Shannon divergence | Any | High | Low (0–1 scale) | Comparing distributions symmetrically; bounded output |
| Model performance metrics | Any | Highest | High | When labels are available; most direct signal |
Decision rule: use PSI for numeric features in regulated domains (finance, healthcare) where thresholds are well-established. Use KS test for general numeric monitoring. Use model performance metrics when labels are available — they are the most direct signal. Use proxy metrics (escalation rate, confidence distributions) when labels are delayed.
Retraining Strategy
| Strategy | Trigger | Cost | Risk | Use when |
|---|---|---|---|---|
| Scheduled retraining | Time-based (weekly, monthly) | Predictable | May retrain unnecessarily | Stable domains with predictable drift cycles |
| Drift-triggered retraining | PSI/KS threshold breach | Variable | May miss slow drift | Domains with irregular drift patterns |
| Continuous learning | Every new batch | High | Catastrophic forgetting | High-velocity data streams with fast-changing patterns |
| Manual review + retrain | Human decision | Low (infrequent) | Slow response | Low-volume, high-stakes models where retraining is expensive |
Decision rule: start with scheduled retraining (weekly or monthly) for most models. Add drift-triggered alerts as a safety net. Move to drift-triggered retraining only when scheduled retraining is too slow to respond to real-world changes.
Questions
How does data drift differ from concept drift, and what does confusing them cost?
Data drift: the input distribution P(X) changes (users ask different questions, new product categories appear). The model's learned relationship P(Y|X) may still be valid.
Concept drift: the relationship P(Y|X) changes (what constitutes spam evolves, fraud patterns shift). The model's learned relationship is no longer valid — retraining is required.
Cost of confusing them: retraining on new data for concept drift is necessary; retraining for benign data drift wastes resources and may reduce performance on the original distribution.
How do you detect drift when ground-truth labels are delayed?
Use proxy metrics: escalation rate, user complaints, re-contact rate, or model confidence distributions. Monitor input feature distributions (PSI, KS test) as an early warning signal. When labels arrive, compute actual performance metrics and compare to baseline.
References
- Data drift in machine learning models (Evidently AI) — practitioner guide to drift types, detection methods, and monitoring workflows with Python examples.
- Population Stability Index (PSI) explained — detailed explanation of PSI calculation, interpretation thresholds, and use in credit scoring.
- Monitoring ML models in production (Google MLOps) — Google's MLOps guide covering data validation, model monitoring, and retraining triggers.
- Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift (Rabanser et al., 2019) — empirical comparison of drift detection methods across different shift types and dataset sizes.