Data Drift Detection, a part of Continuous Model Monitoring
Co-variate drift detection (often called Data Drift Detection) plays an important role in determining whether the features used for model training have changed in production due to some external force that may affect the model’s performance. If drift detection gives a “red signal ~ Drift Detected”, it may be the right time to investigate whether the model needs to be re-trained or not. One possible approach is to ask auditors to provide synthetic ground truths to confirm the analysis.
As discussed, it’s suggested to compare the production data with the training data. This comparison can be done in multiple ways; below we discuss the drift detection methods commonly used in industry and compare them to find the best method to use —
1. Detecting change in Co-variate’s attribution —
The idea here is to detect a change in a feature’s importance. In machine learning, a co-variate is also known as a feature. If a feature’s importance in production data shifts from its importance in training data, we call it Co-variate attribution drift.
Feature importance ~ It is a score between 0 and 1 assigned to each column, indicating how much of the variability explained by the model is attributable to that specific feature. The higher the value, the greater the effect of the feature on the model.
There are many ways to compute a feature’s importance. Common techniques include “Random Forest’s feature importance” and “Shapley values”.
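As an illustration, here is a minimal sketch of the first technique using scikit-learn’s Random Forest (the dataset here is synthetic and purely hypothetical; comparing these scores between training-time and production-time models is what attribution-drift monitoring would do):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy dataset: 4 features, binary target.
X, y = make_classification(n_samples=500, n_features=4,
                           n_informative=2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: non-negative scores that sum to 1.
importances = model.feature_importances_
for i, score in enumerate(importances):
    print(f"feature {i}: importance = {score:.3f}")
```

In a monitoring setup, one would periodically re-fit (or re-explain) the model on recent production data and compare these scores against the training-time baseline.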
Note: It’s statistically established that a change in a feature’s distribution increases the error rate of the model, but a change in a feature’s attribution does not necessarily do so. So feature attribution can’t reliably detect co-variate drift, as it doesn’t work on the principle of finding a change in the feature’s distribution.
2. Detecting change in Data Statistics —
The word “statistic” can be understood as a single value calculated over a sample which describes the data. Some examples of statistics are the mean, mode, median, minimum, maximum, etc.
It’s a simple and easy technique: we keep track of various statistics of each co-variate to ensure they don’t deviate much from their training-data values.
“Everything comes at a cost”: even though the technique is pretty simple, it comes with several disadvantages. Let’s discuss some of them briefly —
a) Keeping track of all these statistics is hectic and cumbersome. Suppose a single co-variate has 5–6 statistics; with a large set of co-variates in the model, it becomes next to impossible to keep an eye on all of them and conclude anything.
b) If one of the statistics deviates, say the mean of co-variate C1 is 76 in production but was 70 at training time, how do we decide whether that difference/deviation is significant enough to call it drift?
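To make disadvantage (b) concrete, here is a minimal sketch (with synthetic data chosen to match the C1 example above) of tracking summary statistics per co-variate; the raw difference is easy to compute, but nothing in it tells us where the drift threshold should sit:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=70, scale=10, size=1_000)  # training-time values of C1
prod = rng.normal(loc=76, scale=10, size=1_000)   # production-time values of C1

def summarize(x):
    """Track a few summary statistics for one co-variate."""
    return {"mean": x.mean(), "std": x.std(),
            "min": x.min(), "max": x.max()}

train_stats, prod_stats = summarize(train), summarize(prod)
mean_shift = prod_stats["mean"] - train_stats["mean"]
print(f"mean shift for C1: {mean_shift:.2f}")
# A ~6-point shift in the mean, but this raw difference alone gives
# no principled criterion for declaring it "significant" drift.
```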
3. Detecting Drift in co-variate’s data distribution —
Data Distribution ~ In simple words, it describes how the data is distributed: which values are frequent and which are rare. In the graph below, we can see the distribution of test scores, where the score “1150” is the most frequent and forms the peak. This bell-shaped distribution is also known as the Normal distribution.
These distributions contain more or less all the information about the data. So, in order to check whether features have changed from training to production, it’s a good idea to compare their distributions. If a feature’s training data distribution is significantly different from its production distribution, we call it Co-variate Drift.
Here the word “significant” plays an important role: the difference between the distributions can be large or small, but when do we call it significant? This is where Statistical Hypothesis Tests come in.
Statistical Hypothesis Tests ~ A part of Statistical Inference, where we use sample data to test a particular hypothesis and draw conclusions about the population. We define these statistical hypotheses as —
(i). The “Null hypothesis”, which proposes that no statistically significant difference exists between the two sets of observations (in our case, the two distributions); we test its credibility using sample data.
(ii). The “Alternate hypothesis”, which proposes the opposite of the Null hypothesis.
There are many statistical tests that can be used to detect drift between two distributions, namely the KS test, Chi-Square test, PSI, JS-Divergence, etc.
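As a preview, here is a minimal sketch of one of these, the two-sample Kolmogorov–Smirnov (KS) test via SciPy, on synthetic data (the 0.05 significance threshold is a conventional choice, not something mandated by the test itself):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
train = rng.normal(0, 1, 2_000)          # training-time feature values
prod_drift = rng.normal(0.5, 1, 2_000)   # production values with a mean shift

# Null hypothesis: both samples come from the same distribution.
stat, p_value = ks_2samp(train, prod_drift)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")

# A small p-value (e.g. < 0.05) rejects the null hypothesis,
# so we would flag this feature as drifted.
drift_detected = p_value < 0.05
```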
In the next article, we will talk about the optimum way to detect Co-variate drift using Statistical test and discuss the above mentioned tests in detail.