
No Bad Questions About ML
Definition of data drift
What is data drift?
Data drift happens when the data a machine learning model sees in the real world starts to look different from the data it was trained on. When this shift happens, the model's predictions can become less accurate.
For example, a model trained on customer behavior from one region may not work well in another region, or a model trained on last year's data may perform poorly if patterns change this year.
To deal with data drift, models need to be monitored regularly and updated with new data. Sometimes this means retraining the model, and sometimes it means using techniques that let the model adapt as the data changes.
Concept drift vs. model drift vs. data drift: what is the difference?
Data drift, concept drift, and model drift all describe ways a machine learning model's performance can decline, but they differ in what exactly is changing:
DATA DRIFT
- What it is: A change in the input data's distribution. Essentially, the input features the model sees in production are distributed differently from the data it was trained on.
- Why it matters: Because the model was built on past patterns, it may not perform well when those patterns shift.
CONCEPT DRIFT
- What it is: A change in the relationship between input features and the target output. In other words, even if the input data distribution hasn't changed, what you're predicting has (or how inputs map to outputs has evolved).
- Why it matters: The model's learned logic becomes outdated, focusing on patterns that no longer apply.
MODEL DRIFT
- What it is: A general term describing any decline in a model's performance over time. It's an umbrella that can include both data drift and concept drift.
- Why it matters: Model drift signals that the model's predictions are becoming unreliable due to changing data or changing input-output dynamics.
In summary:
Data drift is about shifted data. Concept drift is about shifting meaning. And model drift is about degrading performance (caused by either).
How to detect data drift?
Detecting data drift means comparing the data your model is currently seeing with the data it was originally trained on. The goal is to spot changes in distributions before they harm model accuracy. Different types of drift require different detection methods:
1. COVARIATE DRIFT
What changes: The distribution of input features.
Example: Customer income ranges in the new data are higher than those in the training data.
How to detect:
- Compare training vs. production feature distributions using statistical tests (for example, Kolmogorov–Smirnov, Chi-square) or metrics like KL-divergence and PSI.
- Monitor summary statistics such as mean, median, or variance of key features.
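To make this concrete, here is a minimal sketch (assuming NumPy and SciPy, with synthetic income data standing in for a real feature) of a two-sample Kolmogorov–Smirnov test and a simple PSI calculation comparing a training sample with a production sample:

```python
import numpy as np
from scipy import stats

def psi(reference, production, bins=10):
    """Population Stability Index between two samples of a single feature."""
    # Bin edges come from the reference (training) distribution.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    prod_counts, _ = np.histogram(production, bins=edges)
    # Convert counts to proportions; clip to avoid log(0) and division by zero.
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    prod_pct = np.clip(prod_counts / prod_counts.sum(), 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

# Synthetic example: production incomes have shifted upward vs. training.
rng = np.random.default_rng(0)
train_income = rng.normal(50_000, 10_000, 5_000)
prod_income = rng.normal(58_000, 12_000, 5_000)

ks = stats.ks_2samp(train_income, prod_income)
print(f"KS statistic={ks.statistic:.3f}, p-value={ks.pvalue:.3g}")
print(f"PSI={psi(train_income, prod_income):.3f}")
```

A common rule of thumb reads a PSI above 0.25 as a major shift and 0.1–0.25 as a moderate one, but thresholds should be tuned per feature rather than taken as universal.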
2. PRIOR PROBABILITY DRIFT (LABEL DRIFT)
What changes: The distribution of output labels.
Example: A fraud detection model trained on 2% fraud cases suddenly sees 10% fraud in new data.
How to detect:
- Track class frequencies in predictions vs. training data.
- Use Chi-square tests or PSI to check if label proportions shift over time.
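As an illustration, here is a minimal sketch using SciPy's chi-square test, with made-up class counts matching the fraud example above:

```python
from scipy import stats

# Hypothetical class counts: ~2% fraud in training labels, ~10% fraud in recent predictions.
train_counts = {"legit": 9_800, "fraud": 200}
recent_counts = {"legit": 4_500, "fraud": 500}

classes = list(train_counts)
observed = [recent_counts[c] for c in classes]
total_recent = sum(observed)

# Expected counts if recent data followed the training label proportions.
train_total = sum(train_counts.values())
expected = [train_counts[c] / train_total * total_recent for c in classes]

result = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square={result.statistic:.1f}, p-value={result.pvalue:.3g}")
# A very small p-value indicates the label mix has shifted away from the training distribution.
```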
3. CONCEPT DRIFT
What changes: The relationship between inputs and outputs.
Example: The same email patterns no longer indicate spam because spammers have changed their tactics.
How to detect:
- Monitor model performance metrics (accuracy, precision, recall, AUC). A sudden drop often signals concept drift.
- Use a "shadow model" trained on recent data and compare its performance to the original model.
How to handle data drift?
Once you detect drift, the goal is to keep your model aligned with the changing data. There are three main ways to do this:
1. Retrain the model
Update the model with fresh data so it learns the new patterns.
- Rolling window retraining: Use only the most recent data, discarding old information.
- Scheduled retraining: Refresh the model at fixed intervals (weekly, monthly, etc.).
- Online learning: Continuously update the model as new data arrives.
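As a sketch, rolling window retraining might look like the following, assuming a pandas DataFrame with hypothetical "timestamp" and "label" columns and scikit-learn as the model library:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def retrain_on_recent_window(df: pd.DataFrame, window_days: int = 90):
    """Refit the model on only the most recent `window_days` of data."""
    cutoff = df["timestamp"].max() - pd.Timedelta(days=window_days)
    recent = df[df["timestamp"] >= cutoff]           # keep the rolling window, drop older rows
    X = recent.drop(columns=["timestamp", "label"])
    y = recent["label"]
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X, y)
    return model
```

For the online-learning option, estimators that support incremental updates, such as scikit-learn's SGDClassifier via its partial_fit method, can absorb new batches as they arrive instead of refitting from scratch.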
2. Adapt the model
Instead of starting over, adjust how the model uses data. Here are three methods:
- Ensembles: Keep multiple models and switch to the one that performs best on current data.
- Weighted sampling: Give more importance to recent data during training.
- Calibration/correction: Adjust predictions or features to match new distributions.
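For example, weighted sampling can be implemented by passing recency-based sample weights to the model's fit call. Here is a minimal sketch, assuming the same hypothetical "timestamp" and "label" columns as above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_with_recency_weights(df: pd.DataFrame, half_life_days: float = 30.0):
    """Fit a model where each row's weight halves for every `half_life_days` of age."""
    age_days = (df["timestamp"].max() - df["timestamp"]).dt.days
    weights = 0.5 ** (age_days / half_life_days)     # newest rows get weight ~1.0, old rows approach 0
    X = df.drop(columns=["timestamp", "label"])
    y = df["label"]
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y, sample_weight=weights)
    return model
```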
3. Monitor continuously
Handling drift isn't one-and-done — you need ongoing checks.
- Automated alerts when drift is detected.
- Dashboards to track feature distributions and performance.
- Confidence monitoring to spot when predictions become less certain.
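Confidence monitoring can be as simple as tracking the average top-class probability per batch and alerting when it dips. A minimal sketch (the threshold and the alerting mechanism are placeholders; a production system would report to a monitoring service rather than print):

```python
import numpy as np

def confidence_alert(proba: np.ndarray, threshold: float = 0.75) -> bool:
    """Return True (and alert) when mean top-class probability falls below `threshold`."""
    top_class_confidence = proba.max(axis=1)         # confidence of the predicted class per row
    mean_confidence = float(top_class_confidence.mean())
    if mean_confidence < threshold:
        print(f"ALERT: mean prediction confidence {mean_confidence:.2f} < {threshold}")
        return True
    return False

# Usage: confidence_alert(model.predict_proba(latest_batch))
```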
What are the key challenges of avoiding data drift?
Avoiding data drift sounds simple, but in practice, it's one of the hardest parts of running machine learning in production. The main challenges include:
- Constantly changing real-world data — User behavior, markets, environments, and populations evolve all the time. Even small shifts in demographics, seasonality, or external events can cause drift that's difficult to anticipate.
- Limited visibility into data pipelines — Data often comes from multiple sources, and changes in collection methods, sensors, or upstream systems may go unnoticed until model performance drops.
- Detecting drift early — Not all drift is obvious. Some shifts are subtle and gradual, making it hard to catch them before they harm performance. Choosing the right monitoring metrics and thresholds is tricky.
- Lack of labeled data for monitoring — Many real-world systems don't generate immediate ground-truth labels (for example, fraud detection may take weeks to confirm). Without labels, it's harder to tell if drift is hurting results.
- Deciding when and how to retrain — Retraining too often wastes resources, while retraining too late hurts performance. Striking the right balance and knowing which data to include is a major challenge.
- Resource and operational costs — Continuous monitoring, retraining, and validation require infrastructure, data storage, and skilled teams, which not all organizations can afford.
Key Takeaways
- Data drift occurs when real-world data no longer matches the data a model was trained on, which can quickly reduce accuracy if not addressed.
- It differs from concept drift, where the relationship between inputs and outputs changes, and from model drift, which refers to overall performance decline caused by either.
- Detecting drift requires comparing training data with live data through statistical tests, monitoring label distributions, and tracking performance metrics.
- Handling drift usually involves retraining models with fresh data, adapting them to emphasize recent patterns, and setting up continuous monitoring to catch issues early.
- The main challenges lie in the unpredictability of real-world data, limited visibility into data pipelines, delays in obtaining labels, and the costs of frequent retraining and monitoring.