Machine learning concept drift detection methods (Aporia)
2022-07-04 18:16:00 【Li Guodong】
At present, there are many techniques that can be used to detect concept drift in machine learning. Familiarity with these detection methods is key to using the correct metric for each kind of drift and model.
In this article, four types of detection methods are reviewed: statistical methods, statistical process control, time-window-based methods, and contextual approaches.
If you are looking for an introduction to concept drift, I suggest reading the article Concept Drift in Machine Learning.
Statistical methods
Statistical methods are used to compare the differences between distributions.
In some cases we can use a divergence, which is a measure of the distance between distributions. In other cases, a test is run to obtain a score.
Kullback-Leibler divergence
Kullback-Leibler divergence is sometimes called relative entropy.
KL divergence attempts to quantify how different one probability distribution is from another. So if we have distributions Q and P, where Q is the distribution of the old data and P is the distribution of the new data we want to evaluate:
$KL(Q||P) = -\displaystyle\sum_x P(x) \log\left(\frac{Q(x)}{P(x)}\right)$
where "||" denotes divergence.
We can see that:
- If P(x) is high and Q(x) is low, the divergence will be very high.
- If P(x) is low and Q(x) is high, the divergence will also be high, but not as large.
- If P(x) and Q(x) are similar, the divergence will be very low.
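As an illustration, here is a minimal sketch of computing KL divergence between an old and a new sample of a feature using SciPy. The synthetic data, the bin edges, and the epsilon smoothing are assumptions of the example, not part of the method above.

```python
import numpy as np
from scipy.special import rel_entr

# Hypothetical stand-ins for a model feature on old vs. new data.
rng = np.random.default_rng(0)
old_data = rng.normal(0.0, 1.0, 10_000)
new_data = rng.normal(0.5, 1.0, 10_000)

# Build histograms over shared bins, then normalize into probability distributions.
bins = np.linspace(-5, 5, 51)
q_counts, _ = np.histogram(old_data, bins=bins)  # Q: old distribution
p_counts, _ = np.histogram(new_data, bins=bins)  # P: new distribution

# Small epsilon so empty bins do not produce infinite divergence.
eps = 1e-10
q = (q_counts + eps) / (q_counts + eps).sum()
p = (p_counts + eps) / (p_counts + eps).sum()

kl = rel_entr(p, q).sum()  # elementwise p * log(p / q), summed
print(f"KL divergence: {kl:.4f}")
```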
Jensen-Shannon divergence
Jensen-Shannon divergence is built from KL divergence:
$JS(Q||P) = \frac{1}{2}\left(KL(Q||M) + KL(P||M)\right)$
where $M = \frac{Q+P}{2}$ is the average of P and Q.
The main differences between JS divergence and KL divergence are that JS is symmetric and always has a finite value.
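A short sketch of the same comparison with JS divergence, reusing the normalized histograms p and q from the previous snippet. Note that SciPy's jensenshannon returns the JS distance (the square root of the divergence), so it is squared here.

```python
from scipy.spatial.distance import jensenshannon

# jensenshannon returns the JS distance; square it to get the divergence.
js_divergence = jensenshannon(p, q, base=2) ** 2
print(f"JS divergence: {js_divergence:.4f}")  # bounded in [0, 1] with base-2 logs
```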
Kolmogorov-Smirnov test (K-S test)
The two-sample KS test is a useful and general nonparametric method for comparing two samples. In the KS test, we compute:
$D_{n,m} = \sup_x |F_{1,n}(x) - F_{2,m}(x)|$
where $F_{1,n}(x)$ is the empirical distribution function of the old data with $n$ samples, $F_{2,m}(x)$ is the empirical distribution function of the new data with $m$ samples, the empirical distribution function is defined as $F_n(x) = \frac{1}{n}\displaystyle\sum_{i=1}^{n} I_{[-\infty,x]}(X_i)$, and $\sup_x$ denotes the supremum over $x$, i.e. the $x$ that maximizes $|F_{1,n}(x) - F_{2,m}(x)|$.
The KS test is sensitive to differences in both the location and the shape of the empirical cumulative distribution functions of the two samples, and it is well suited to numerical data.
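Since the KS test works directly on the raw samples (no binning needed), a sketch using scipy.stats is even shorter; the 0.05 significance level is a user choice for the example, not part of the test. old_data and new_data are the arrays from the first snippet.

```python
from scipy.stats import ks_2samp

# Two-sample KS test on the raw feature values (no histograms required).
statistic, p_value = ks_2samp(old_data, new_data)
if p_value < 0.05:
    print(f"Possible drift: D = {statistic:.3f}, p-value = {p_value:.2e}")
```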
When to use statistical methods
The idea behind these statistical methods is to compare the distributions of two data sets.
We can use these tools to find differences between data from different time ranges, and to measure how the behavior of the data changes over time.
These methods need no labels and no additional memory, and they quickly give us indicators of changes in the model's input features / outputs. This helps us investigate the situation even before any potential decline shows up in the model's performance metrics. On the other hand, if not handled correctly, the lack of labels and of memory of past events may lead to false positives.
Statistical process control
The idea of statistical process control is to verify that the error of our model stays within an acceptable range. This is especially important when running in production, because performance changes over time. We therefore want a system that sends an alert if the model reaches a certain error rate. Note that some models have a "traffic light" system, with warnings as well as alarms.
Drift detection method / Early drift detection method (DDM/EDDM)
The idea is to model the errors as a binomial variable. This means we can compute the expected error value: with the binomial distribution the mean is $np_t$, and therefore $\sigma = \sqrt{\frac{p_t(1-p_t)}{n}}$.
DDM
Here we can propose:
- Raise a warning when $p_t + \sigma_t \ge p_{min} + 2\sigma_{min}$
- Raise an alarm when $p_t + \sigma_t \ge p_{min} + 3\sigma_{min}$
Advantages: DDM shows good performance when detecting gradual changes (as long as they are not very slow) and abrupt changes (incremental and sudden drift).
Disadvantages: when the change is slow, DDM has difficulty detecting drift. Many samples may be stored for a long time before the drift level is triggered, with a risk of overflowing the sample storage.
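Below is a minimal sketch of the DDM rule written from the formulas above. The streaming 0/1 error input, the 30-sample warm-up, and the reset after a drift alarm are common implementation choices, not dictated by the method itself.

```python
import math

class DDM:
    """Minimal DDM sketch over a stream of 0/1 prediction errors."""

    def __init__(self, warning_level=2.0, drift_level=3.0, min_samples=30):
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.min_samples = min_samples
        self._reset()

    def _reset(self):
        self.n = 0
        self.p = 1.0                    # running error rate
        self.s = 0.0                    # running std of the error rate
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model misclassified this sample, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.n < self.min_samples:
            return "ok"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s >= self.p_min + self.drift_level * self.s_min:
            self._reset()
            return "drift"
        if self.p + self.s >= self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "ok"
```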
EDDM
Here, by measuring the distance between two consecutive errors, we can propose:
- Raise a warning when $\frac{p_t + 2\sigma_t}{p_{max} + 2\sigma_{max}} < \alpha$
- Raise an alarm when $\frac{p_t + 2\sigma_t}{p_{max} + 2\sigma_{max}} < \beta$, where $\beta$ is usually 0.9
The EDDM method is a modified version of DDM whose focus is identifying gradual drift.
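A sketch of the EDDM idea in the same style, tracking the distance (in samples) between consecutive errors: as distances shrink, the ratio falls below the thresholds above. The warning level $\alpha = 0.95$ and the 30-error warm-up are assumptions of this example, not stated in the text.

```python
import math

class EDDM:
    """Minimal EDDM sketch: monitors the distance between consecutive errors."""

    ALPHA = 0.95   # warning threshold (an assumption here)
    BETA = 0.90    # drift threshold, as in the rule above

    def __init__(self, min_errors=30):
        self.min_errors = min_errors
        self.t = 0
        self.n_errors = 0
        self.last_error_at = 0
        self.mean = 0.0
        self.m2 = 0.0          # running variance accumulator (Welford)
        self.max_stat = 0.0

    def update(self, error):
        """error: 1 if the model misclassified this sample, else 0."""
        self.t += 1
        if not error:
            return "ok"
        self.n_errors += 1
        dist = self.t - self.last_error_at
        self.last_error_at = self.t
        # Welford's online update of the mean/std of the inter-error distance.
        delta = dist - self.mean
        self.mean += delta / self.n_errors
        self.m2 += delta * (dist - self.mean)
        if self.n_errors < self.min_errors:
            return "ok"
        std = math.sqrt(self.m2 / self.n_errors)
        stat = self.mean + 2.0 * std
        self.max_stat = max(self.max_stat, stat)
        ratio = stat / self.max_stat
        if ratio < self.BETA:
            return "drift"
        if ratio < self.ALPHA:
            return "warning"
        return "ok"
```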
CUSUM and Page-Hinckley (PH)
CUSUM and its variant Page-Hinckley (PH) are among the first methods developed in this area. The idea is a sequential analysis technique, typically used to monitor changes in the mean value of a Gaussian signal.
CUSUM and Page-Hinckley (PH) detect concept drift by computing the difference between the observed values and their mean, and they raise a drift alarm when this value is greater than a user-defined threshold. These algorithms are sensitive to their parameter values, which leads to a trade-off between false positives and detecting true drift.
Because CUSUM and Page-Hinckley (PH) process data streams, each event is used to compute the next result:
CUSUM:
- $g_0 = 0, \quad g_t = \max(0,\; g_{t-1} + \varepsilon_t - v)$, where $g$ accumulates the events (for drift detection, the model inputs/outputs)
- Raise an alarm when $g_t > h$, and reset $g_t = 0$
- $h, v$ are tunable parameters
Note: CUSUM is memoryless and one-sided (asymmetric), so it can only detect increases in the value.
Page-Hinckley (PH):
- $g_0 = 0, \quad g_t = g_{t-1} + (\varepsilon_t - v)$
- $G_t = \min(g_t, G_{t-1})$
- Raise an alarm when $g_t - G_t > h$, and reset $g_t = 0$.
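Both update rules translate directly into a few lines of Python. In this sketch, the stream is assumed to contain deviations of an observed quantity (e.g. the model's error) from a reference mean; the synthetic data and the values of $v$ and $h$ are assumptions of the example.

```python
import numpy as np

def cusum(stream, v, h):
    """One-sided CUSUM: alarm when the cumulative positive deviation exceeds h."""
    g, alarms = 0.0, []
    for t, eps in enumerate(stream):
        g = max(0.0, g + eps - v)
        if g > h:
            alarms.append(t)
            g = 0.0
    return alarms

def page_hinckley(stream, v, h):
    """Page-Hinckley: alarm when g_t rises more than h above its running minimum."""
    g, g_min, alarms = 0.0, 0.0, []
    for t, eps in enumerate(stream):
        g += eps - v
        g_min = min(g, g_min)
        if g - g_min > h:
            alarms.append(t)
            g, g_min = 0.0, 0.0
    return alarms

# Hypothetical error stream whose mean jumps halfway through.
rng = np.random.default_rng(1)
errors = np.concatenate([rng.normal(0.1, 0.05, 500), rng.normal(0.4, 0.05, 500)])
deviations = errors - errors[:500].mean()   # deviation from the reference mean
print(page_hinckley(deviations, v=0.05, h=5.0))   # alarm index shortly after 500
```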
When to use statistical process control methods
To use the statistical process control methods introduced here, we need labels for the samples. In many cases this can be a challenge, because the label delay may be high and labels can be hard to extract, especially in large organizations. On the other hand, once these data are obtained, we get a relatively fast system that covers three types of drift: sudden drift, gradual drift, and incremental drift.
The system also allows us to track degradation (if any) and to issue warnings and alarms accordingly.
Time window distribution
Time window distribution models monitor timestamps and events.
ADWIN
The idea of ADWIN is to start from a time window $W$, dynamically growing $W$ when the context shows no apparent change, and shrinking it when a change is detected. The algorithm tries to find two sub-windows of $W$, $w_0$ and $w_1$, whose contents differ significantly. When it does, the old part of the window, $w_0$, is based on a data distribution different from the current one, so it is dropped.
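A usage sketch with the scikit-multiflow implementation mentioned at the end of this article; the synthetic stream and the delta value are assumptions of the example.

```python
import numpy as np
from skmultiflow.drift_detection import ADWIN

adwin = ADWIN(delta=0.002)          # delta: confidence parameter for the window cut

# Hypothetical stream whose mean shifts halfway through.
rng = np.random.default_rng(42)
stream = np.concatenate([rng.normal(0.0, 0.1, 1000), rng.normal(1.0, 0.1, 1000)])

for i, value in enumerate(stream):
    adwin.add_element(value)
    if adwin.detected_change():
        print(f"Change detected at index {i}")
```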
Paired Learners
Suppose that for a given problem we have a large, stable model trained on a large amount of data; let's call it model A.
We also design another, more lightweight model, trained on smaller and newer data (it can be of the same type). We call it model B.
Idea: find time windows in which model B performs better than model A. Since model A is stable and encapsulates more data than model B, we expect it to outperform B. However, if model B outperforms model A, this may indicate concept drift. One way to turn this into a signal is sketched below.
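A minimal sketch, assuming hypothetical 0/1 error streams errors_a and errors_b for the two models; the window size and the alert threshold are choices of the example, not part of the method.

```python
import numpy as np

def paired_learner_signal(errors_a, errors_b, window=100):
    """Fraction of recent samples where the stable model A errs while
    the lightweight, recently trained model B is correct.
    A high value suggests concept drift."""
    a = np.asarray(errors_a)[-window:]
    b = np.asarray(errors_b)[-window:]
    return float(np.mean((a == 1) & (b == 0)))

# Hypothetical usage: flag drift when B beats A on enough recent samples.
# if paired_learner_signal(errors_a, errors_b) > 0.2:
#     print("Model B is outperforming model A: possible concept drift")
```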
Contextual Approaches
The idea of these methods is to evaluate the difference between the training set and the test set. A significant difference may indicate drift in the data.
Tree features
The idea of tree features is to train a relatively simple tree on the data, adding the prediction timestamp as one of the features. Because tree models also provide feature importance, we can learn how and when time affects the data. Moreover, by looking at the splits created on the timestamp, we can see the difference between the concepts before and after each split.
In the diagram above, we can see that the date feature is at the root, which means it has the highest information gain; this suggests that the data may have drifted around July 22.
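A self-contained sketch of the idea on synthetic data, where the relationship between the feature and the label flips halfway through the stream; the tree then places the timestamp split at the root, as described above. The data, tree depth, and feature names are assumptions of the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(7)

# Synthetic stream: the concept flips at t = 1000 (x > 0 stops implying y = 1).
n = 2000
timestamp = np.arange(n)
x = rng.normal(size=n)
y = np.where(timestamp < n // 2, x > 0, x < 0).astype(int)

# Train a shallow tree with the prediction timestamp as an extra feature.
X = np.column_stack([x, timestamp])
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

print(tree.feature_importances_)                        # timestamp dominates
print(export_text(tree, feature_names=["x", "timestamp"]))
```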
Drift detection implementations
Implementations that provide drift detection can be found here:
- Java implementation: MOA
- Python implementation: scikit-multiflow
Original article: Concept Drift Detection Methods