Machine learning concept drift detection method (Apria)

2022-07-04 18:16:00 【Li Guodong】

At present, many techniques can be used to detect concept drift in machine learning. Familiarity with these detection methods is key to choosing the right metric for each type of drift and each model.

This article reviews four types of detection methods: statistical methods, statistical process control, time-window-based methods, and contextual approaches.

If you are looking for an introduction to concept drift, I suggest you read the article "Concept drift in machine learning".

Statistical methods

Statistical methods are used to compare the differences between distributions.

In some cases we can use a divergence, which is a measure of distance between distributions. In other cases we run a statistical test to obtain a score.

Kullback-Leibler divergence

Kullback-Leibler (KL) divergence is sometimes called relative entropy.

KL divergence attempts to quantify how different one probability distribution is from another. So, if we have distributions Q and P, where Q is the distribution of the old data and P is the distribution of the new data, we calculate:

$KL(P||Q) = -\displaystyle\sum_x P(x)\log\left(\frac{Q(x)}{P(x)}\right) = \displaystyle\sum_x P(x)\log\left(\frac{P(x)}{Q(x)}\right)$

where "||" denotes divergence.

We can see that:

If P(x) is high and Q(x) is low, the divergence will be very high.
If P(x) is low and Q(x) is high, the divergence will also be high, but not as high.
If P(x) and Q(x) are similar, the divergence will be very low.
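
The three cases above can be checked numerically. The following is a minimal sketch, assuming both distributions are given as histograms over the same bins. The function name `kl_divergence`, the `eps` floor for empty bins in Q, and the sample histograms are illustrative assumptions, not part of the original article.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two discrete distributions over the same bins.

    p: histogram of the new data, q: histogram of the old data.
    eps is an assumed floor that avoids dividing by an empty bin in q.
    """
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x > 0:  # bins with P(x) = 0 contribute nothing
            total += p_x * math.log(p_x / max(q_x, eps))
    return total

old = [0.5, 0.3, 0.2]             # Q: old data (made-up values)
new_similar = [0.48, 0.32, 0.20]  # P close to Q
new_shifted = [0.1, 0.2, 0.7]     # P far from Q

print(kl_divergence(new_similar, old))  # small value
print(kl_divergence(new_shifted, old))  # much larger value
```

Swapping the arguments gives a different number, which is why the order of P and Q matters when a threshold is set on this score.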

Jensen-Shannon (JS) divergence

Jensen-Shannon divergence is built on the KL divergence:

$JS(Q||P) = \frac{1}{2}(KL(Q||M) +KL(P||M))$

where $M = \frac{Q+P}{2}$ is the average of P and Q.

The main difference between JS divergence and KL divergence is that JS is symmetric and always has a finite value.
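
Both properties can be seen in a small sketch. It assumes discrete distributions over the same bins and uses the natural logarithm, so the value is bounded by ln 2. The helper names `kl` and `js_divergence` and the sample histograms are illustrative.

```python
import math

def kl(p, q):
    """KL(P || Q); assumes q[x] > 0 wherever p[x] > 0."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(p_x + q_x) / 2 for p_x, q_x in zip(p, q)]
    # M is positive wherever P or Q is, so both KL terms stay finite.
    return 0.5 * (kl(q, m) + kl(p, m))

p = [0.1, 0.2, 0.7]  # new data (made-up values)
q = [0.5, 0.3, 0.2]  # old data (made-up values)

print(js_divergence(p, q), js_divergence(q, p))  # same value both ways
print(js_divergence([1.0, 0.0], [0.0, 1.0]))     # disjoint supports: still finite
```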

Kolmogorov-Smirnov test (K-S test)

The two-sample KS test is a useful and general nonparametric method for comparing two samples. In the KS test, we compute:
$D_{n,m}=\sup_{x}|F_{1,n}(x) - F_{2,m}(x)|$

where $F_{1,n}(x)$ is the empirical distribution function of the previous data with $n$ samples, $F_{2,m}(x)$ is the empirical distribution function of the new data with $m$ samples, the empirical distribution function is $F_{n}(x) = \frac{1}{n} \displaystyle\sum_{i=1}^n I_{(-\infty,x]}(X_{i})$, and $\sup_{x}$ is the supremum over all $x$ of $|F_{1,n}(x) - F_{2,m}(x)|$, that is, the largest gap between the two functions.

The KS test is sensitive to differences in both the location and the shape of the empirical cumulative distribution functions of the two samples. It is well suited to numerical data.
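
The statistic $D_{n,m}$ can be computed directly from the definition. The sketch below is a plain-Python illustration with hypothetical function names. It returns only the statistic, not a p-value.

```python
import bisect

def ecdf(sorted_sample, x):
    """Empirical distribution function: fraction of sample values <= x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

def ks_statistic(old, new):
    """Two-sample KS statistic D = sup_x |F_old(x) - F_new(x)|."""
    a, b = sorted(old), sorted(new)
    # Both ECDFs are step functions that only change at observed values,
    # so the supremum is reached at one of the observed points.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

In practice a library routine such as `scipy.stats.ks_2samp` is usually preferable, because it also reports a p-value for the statistic.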

When to use statistical methods

The idea behind the statistical methods is to evaluate the difference between the distributions of two data sets.

We can use these tools to find differences between data from different time ranges and to measure how the behavior of the data changes over time.

These methods need no labels and no additional memory, and they quickly give us an indication of changes in the model's input features or outputs. This helps us start investigating even before there is any potential decline in the model's performance metrics. On the other hand, if not handled correctly, the lack of labels and the disregard for the memory of past events and other features may lead to false positives.

Statistical process control

The idea of statistical process control is to verify that the error of our model stays under control. This is especially important in production, because performance changes over time. We therefore want a system that sends an alert when the model reaches a certain error rate. Note that some models have a "traffic light" system that also includes a warning level.

Drift detection method / Early drift detection method (DDM/EDDM)

The idea is to model the error as a binomial variable. This means that we can calculate the expected value of the errors. Because we are working with a binomial distribution, we can write the expected number of errors as $\mu = n p_{t}$ and the standard deviation of the error rate as $\sigma = \sqrt{\frac{p_{t}(1-p_{t})}{n}}$.

DDM

Here we can propose the following rules, where $p_{t}$ is the error rate observed so far, $\sigma_{t}$ is its standard deviation, and $p_{min}$ and $\sigma_{min}$ are the values recorded when $p_{t}+\sigma_{t}$ was at its minimum:

Raise a warning when $p_{t}+\sigma_{t}\ge p_{min} +2\sigma_{min}$
Raise an alarm when $p_{t}+\sigma_{t}\ge p_{min} +3\sigma_{min}$

Pros: DDM performs well at detecting gradual changes (provided they are not very slow) and abrupt changes (incremental and sudden drift).

Cons: when the change is slow, DDM has difficulty detecting drift. Many samples may be stored for a long time before the drift level is triggered, which risks overflowing the sample storage.
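
The two DDM rules can be sketched as a small streaming class. This is a minimal illustration rather than a reference implementation. The class name, the 30-sample warm-up, the reset after an alarm, and the synthetic error stream are all assumptions made for the example.

```python
import math

class DDM:
    """Minimal DDM sketch: feed 1 for a wrong prediction, 0 for a correct one."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples  # assumed warm-up before any check
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        self.n += 1
        self.errors += error
        p = self.errors / self.n             # running error rate p_t
        s = math.sqrt(p * (1 - p) / self.n)  # its standard deviation sigma_t
        if self.n < self.min_samples:
            return "ok"
        if p + s <= self.p_min + self.s_min:
            # New best point: remember it as (p_min, sigma_min).
            self.p_min, self.s_min = p, s
            return "ok"
        if p + s >= self.p_min + 3 * self.s_min:
            self.reset()                     # assumed: start over after an alarm
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "ok"

# Synthetic stream: 10% errors for 200 samples, then 50% errors.
stream = [int(i % 10 == 0) for i in range(1, 201)]
stream += [int(i % 2 == 0) for i in range(201, 401)]

detector = DDM()
statuses = [detector.update(e) for e in stream]
print("first drift at index", statuses.index("drift"))
```

Only counters are kept in this sketch. The sample storage mentioned above would come on top when samples are buffered from the warning level onward for retraining.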

EDDM

Here, by measuring the distance between 2 consecutive errors, we can propose the following rules, where $p_{t}$ now denotes the mean distance between consecutive errors, $\sigma_{t}$ is its standard deviation, and $p_{max}$ and $\sigma_{max}$ are the values recorded when $p_{t}+2\sigma_{t}$ was at its maximum:

Raise a warning when $\frac{p_{t}+2\sigma_{t}}{p_{max}+2\sigma_{max}}<\alpha$
Raise an alarm when $\frac{p_{t}+2\sigma_{t}}{p_{max}+2\sigma_{max}}<\beta$, where $\beta$ is usually 0.9

EDDM is a modified version of DDM that focuses on identifying gradual drift.
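
A sketch of the EDDM ratio test follows. It assumes $\alpha = 0.95$ (a commonly used value, not stated in this article), $\beta = 0.9$, and a warm-up of 30 errors. The class name and the synthetic stream are also illustrative assumptions.

```python
import math

class EDDM:
    """Minimal EDDM sketch based on the distance between consecutive errors."""

    def __init__(self, alpha=0.95, beta=0.90, min_errors=30):
        self.alpha, self.beta = alpha, beta  # alpha is an assumed value
        self.min_errors = min_errors         # assumed warm-up, counted in errors
        self.reset()

    def reset(self):
        self.n = 0           # samples seen
        self.last_error = 0  # position of the previous error
        self.k = 0           # number of distances observed
        self.mean = 0.0      # running mean distance (p_t)
        self.m2 = 0.0        # running sum of squared deviations
        self.best = 0.0      # largest p_t + 2 * sigma_t seen so far

    def update(self, error):
        self.n += 1
        if not error:
            return "ok"
        distance = self.n - self.last_error
        self.last_error = self.n
        # Welford update of the mean and variance of the distances.
        self.k += 1
        delta = distance - self.mean
        self.mean += delta / self.k
        self.m2 += delta * (distance - self.mean)
        score = self.mean + 2 * math.sqrt(self.m2 / self.k)
        if score > self.best:
            self.best = score
            return "ok"
        if self.k < self.min_errors:
            return "ok"
        ratio = score / self.best
        if ratio < self.beta:
            self.reset()
            return "drift"
        if ratio < self.alpha:
            return "warning"
        return "ok"

# Synthetic stream: one error every 10 samples, then one every 2 samples.
stream = [int(i % 10 == 0) for i in range(1, 401)]
stream += [int(i % 2 == 0) for i in range(401, 601)]

detector = EDDM()
statuses = [detector.update(e) for e in stream]
print("first drift at index", statuses.index("drift"))
```

Errors arriving closer together shrink the mean distance, so the ratio falls below the thresholds even though no single window of samples is compared explicitly.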

CUSUM and Page-Hinckley (PH)

CUSUM and its variant Page-Hinckley (PH) are among the earliest methods developed in the community. The idea is to provide a sequential analysis technique, which is typically used to monitor changes in the mean of a Gaussian signal.

CUSUM and Page-Hinckley (PH) detect concept drift by computing the difference between the observed values and the mean, and raise a drift alarm when this value exceeds a user-defined threshold. These algorithms are sensitive to their parameter values, which leads to a trade-off between false alarms and the detection of true drift.

Because CUSUM and Page-Hinckley (PH) are used to process data streams, each event is used to compute the next result:

CUSUM:

$g_{0}=0, g_{t}= \max(0, g_{t-1}+\varepsilon_{t}-v)$, where $\varepsilon_{t}$ is the observed event or, for drift purposes, the model input/output
When $g_{t}>h$, raise an alarm and set $g_{t}=0$
$h$ and $v$ are tunable parameters

Note: CUSUM is memoryless and one-sided (asymmetric), so it can only detect an increase in the value.
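
The CUSUM recursion translates almost line for line into code. In this sketch the function name, the parameter values `v=0.5` and `h=2.0`, and the two synthetic streams are illustrative assumptions.

```python
def cusum(stream, v, h):
    """Return the indices at which the CUSUM statistic g_t exceeds h."""
    g = 0.0
    alarms = []
    for t, eps in enumerate(stream):
        g = max(0.0, g + eps - v)  # g_t = max(0, g_{t-1} + eps_t - v)
        if g > h:
            alarms.append(t)
            g = 0.0                # reset after the alarm
    return alarms

rising = [0.1] * 50 + [1.0] * 50   # the monitored value increases
falling = [0.5] * 50 + [0.0] * 50  # the monitored value decreases

print(cusum(rising, v=0.5, h=2.0))   # alarms start shortly after index 50
print(cusum(falling, v=0.5, h=2.0))  # no alarms: the test is one-sided
```

A larger `h` delays detection but reduces false alarms, which is the trade-off described earlier.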

Page-Hinckley (PH):

$g_{0}=0, g_{t}= g_{t-1}+(\varepsilon_{t}-v)$
$G_{t}=\min(g_{t},G_{t-1})$
When $g_{t}-G_{t}>h$, raise an alarm and set $g_{t}=0$.
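
A matching sketch for Page-Hinckley is shown below. It assumes $G_{0}=0$ and also resets the running minimum after an alarm. Neither choice is specified in the text above, and the function name and parameter values are illustrative.

```python
def page_hinckley(stream, v, h):
    """Return the indices at which g_t - G_t exceeds h."""
    g = 0.0      # cumulative deviation g_t
    g_min = 0.0  # running minimum G_t (assumed to start at 0)
    alarms = []
    for t, eps in enumerate(stream):
        g += eps - v
        g_min = min(g, g_min)
        if g - g_min > h:
            alarms.append(t)
            g = 0.0
            g_min = 0.0  # assumed: restart the minimum together with g_t
    return alarms

stream = [0.0] * 50 + [1.0] * 50
print(page_hinckley(stream, v=0.5, h=2.0))
```

Unlike CUSUM, the sum here is allowed to go negative, and the alarm depends on how far it has climbed back from its lowest point.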

When to use statistical process control methods

To use the statistical process control methods introduced here, we need the labels of the samples. In many cases this can be a challenge, because the latency may be high and the labels are hard to extract, especially in large organizations. On the other hand, once this data is available, we get a relatively fast system that covers three types of drift: sudden drift, gradual drift, and incremental drift.

The system also lets us track degradation (if any) through its division into warning and alarm levels.

Time window distribution

Time window distribution models focus on the timestamps and the occurrence of events.

ADWIN

The idea of ADWIN is to start from a time window $W$, dynamically grow $W$ when there is no apparent change in the context, and shrink it when a change is detected. The algorithm tries to find two subwindows of $W$, $w_{0}$ and $w_{1}$, whose averages differ significantly. Such a difference means that the older part of the window, $w_{0}$, is based on a data distribution different from the current one, so it is dropped.
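
The grow-and-shrink behavior can be illustrated with a deliberately simplified sketch. It keeps the whole window in a list, checks every split on each update, and drops all of $w_{0}$ at the first significant split. The full algorithm instead uses a compressed bucket structure. The cut threshold is the Hoeffding-style bound commonly given for ADWIN with values in [0, 1]. The class name, `delta=0.002`, and the synthetic stream are assumptions.

```python
import math

class SimpleAdwin:
    """Simplified ADWIN-style window for values in [0, 1] (O(n) per update)."""

    def __init__(self, delta=0.002):
        self.delta = delta  # assumed confidence parameter
        self.window = []

    def update(self, x):
        """Append x; return True if the window was shrunk."""
        self.window.append(x)
        shrunk = False
        while self._cut_once():
            shrunk = True
        return shrunk

    def _cut_once(self):
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for i in range(1, n):
            head += self.window[i - 1]
            n0, n1 = i, n - i                    # sizes of w0 (old) and w1 (new)
            mean0 = head / n0
            mean1 = (total - head) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps_cut:
                del self.window[:i]              # drop the stale part w0
                return True
        return False

# Synthetic stream: the mean jumps from 0.2 to 0.8 at index 200.
stream = [0.2] * 200 + [0.8] * 200
detector = SimpleAdwin()
shrinks = [t for t, x in enumerate(stream) if detector.update(x)]
print("window shrunk at", shrinks, "final length", len(detector.window))
```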

Paired Learners

Suppose that for a given problem we have a large, stable model trained on a large amount of data. Let's call it model A.

We also design another, more lightweight model, trained on a smaller and more recent set of data (it can be of the same type). We call it model B.

The idea: find the time windows in which model B outperforms model A. Because model A is more stable and encapsulates more data than model B, we expect it to outperform model B. However, if model B outperforms model A, it may indicate concept drift.
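
One way to turn this comparison into a signal is sketched below. It assumes we already have, for each sample, whether model A and model B predicted correctly. The function name, the window of 20 samples, and the threshold of 0.5 are hypothetical choices.

```python
from collections import deque

def paired_learner_alarms(a_correct, b_correct, window=20, threshold=0.5):
    """Flag indices where B beats A on more than `threshold` of a recent window."""
    recent = deque(maxlen=window)
    alarms = []
    for t, (a_ok, b_ok) in enumerate(zip(a_correct, b_correct)):
        # Record a 1 only when the recent model is right and the stable one is wrong.
        recent.append(1 if (b_ok and not a_ok) else 0)
        if len(recent) == window and sum(recent) / window > threshold:
            alarms.append(t)
            recent.clear()  # assumed: start a fresh window after an alarm
    return alarms

# Made-up outcomes: model A starts failing at index 100, model B keeps up.
a_correct = [True] * 100 + [False] * 50
b_correct = [True] * 150

print(paired_learner_alarms(a_correct, b_correct))
```

In a full setup, an alarm would typically be followed by replacing or retraining model A with the more recent data.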

Contextual approaches

The idea of these methods is to evaluate the difference between the training set and the test set. When the difference is significant, it may indicate drift in the data.

Tree features

The idea of tree features is to train a relatively simple tree on the data and add the prediction timestamp as one of the features. Because a tree model also provides feature importance, we can learn how time affects the data and when. In addition, we can look at the splits created on the timestamp and compare the concepts before and after each split.

[Figure: decision tree with the date feature at the root]

In the diagram above, we can see that the date feature is at the root, which means that it has the highest information gain. This suggests that drift may have occurred in the data on July 22.
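
The reasoning about the root split can be reproduced with a one-level tree on the timestamp alone. This is a reduced sketch with binary labels and entropy-based information gain. A real setup would train a full tree on all features plus the timestamp. The function names and the toy data are assumptions.

```python
import math

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_time_split(timestamps, labels):
    """Return (timestamp, gain) of the split with the highest information gain."""
    pairs = sorted(zip(timestamps, labels))
    ys = [y for _, y in pairs]
    base = entropy(ys)
    best_ts, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical timestamps
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = base - weighted
        if gain > best_gain:
            best_ts, best_gain = pairs[i][0], gain
    return best_ts, best_gain

# Toy data: the label flips from 0 to 1 starting at timestamp 20.
timestamps = list(range(30))
labels = [0] * 20 + [1] * 10

print(best_time_split(timestamps, labels))
```

A split with high gain on the timestamp says that knowing when a sample arrived explains much of the target, which is the sign of drift the tree is meant to expose.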