Machine Learning Strong Foundation Plan 0-5: Why Is Generalization the Essence of Learning?
2022-07-28 11:20:00 【Mr.Winter`】
0 Preface
The Machine Learning Strong Foundation Plan focuses on both depth and breadth, deepening the understanding and application of machine learning models. "Depth" means deriving in detail the mathematical principles behind each algorithm model; "breadth" means analyzing a range of machine learning models: decision trees, support vector machines, Bayesian and Markov decision models, reinforcement learning, and more.
Details: Machine Learning Strong Foundation Plan
In the earlier article on core dataset knowledge and construction methods, we mentioned that the ability of a model to apply to new samples in the sample space is called generalization. In this section, we focus on why generalization is so important for machine learning models.
1 Fitting problem
A machine learning algorithm can only observe the training error and the test error; its goal is to fit the true underlying law as closely as possible so as to reduce the generalization error. Two main phenomena arise during fitting (a concrete sketch follows this list):
- Underfitting: the learning algorithm fails to capture the regularities in the sampled data, so both the training error and the generalization error are large. Underfitting is easily overcome by strengthening the learner, e.g. more training or a more expressive model;
- Overfitting: the learning algorithm fits the sampled data too closely, mistaking idiosyncratic features of the dataset for general regularities, so the training error is small but the generalization error is large.
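To make the two phenomena concrete, here is a minimal sketch, assuming numpy and scikit-learn are available and using a hypothetical toy problem (noisy samples of a sine curve fitted by polynomials of different degrees); it is an illustration, not an algorithm from this series:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# The "true law" is y = sin(2*pi*x); the dataset only offers noisy samples of it.
x_train = rng.uniform(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 4, 15):  # too simple / reasonable / too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    err_train = mean_squared_error(y_train, model.predict(x_train.reshape(-1, 1)))
    err_test = mean_squared_error(y_test, model.predict(x_test.reshape(-1, 1)))
    print(f"degree={degree:2d}  train MSE={err_train:.3f}  test MSE={err_test:.3f}")
```

With degree 1, both errors are large (underfitting); with degree 15, the training error is driven near zero while the test error blows up (overfitting).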
Overfitting is the "lifelong enemy" of machine learning. Simply put, an overfitted model is like a student who learns by rote: it only memorizes the exercises in the textbook (our given training set) and is lost when it meets new questions in the exam (the test set or new samples). Is such a learner useful? Naturally not. What we need is a learner that uses finite samples to predict as many unknown samples as possible.
Can overfitting be solved? The answer: overfitting cannot be overcome, only alleviated.
The reason is that the problems machine learning faces are typically NP-hard, for which no polynomial-time algorithm is believed to exist. If overfitting could be fully overcome, then minimizing the training error alone would yield the optimal solution of such a problem in polynomial time; in other words, machine learning would have settled the problem of the century by proving P=NP. Since the academic community leans toward the verdict P≠NP, overfitting is considered impossible to overcome.
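Since overfitting can only be alleviated, one standard mitigation (named here as a hedged aside, as this article does not discuss specific remedies) is regularization. Continuing the toy sketch above, ridge regression adds an L2 penalty on the degree-15 polynomial's coefficients:

```python
from sklearn.linear_model import Ridge

# Same degree-15 features, but with an L2 penalty on the weights; the strength
# alpha trades fitting the training set against keeping coefficients small.
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3))
ridge.fit(x_train.reshape(-1, 1), y_train)
print("ridge test MSE:",
      mean_squared_error(y_test, ridge.predict(x_test.reshape(-1, 1))))
```

The penalty does not eliminate overfitting, consistent with the argument above; it only trades a little training accuracy for better generalization.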

2 Generalization ability
The opposite of overfitting is generalization: the ability of a model to apply to new samples in the sample space.
Whether in human learning or machine learning, what is the highest level of learning? Seeing the essence through the phenomena and grasping the underlying law.
An example: the motion of objects is varied and complicated, yet from it we summarized Newton's three laws. They are the model we learned from observing moving objects, and they have very strong predictive power: in the low-speed regime, the motion of any object can be predicted with Newton's laws.
This is generalization: Newton's laws adapt well to new samples. If, instead of summarizing Newton's laws, we had built a dedicated model for every motion of every object, that would be overfitting, because for each object or motion never seen before, a model would have to be summarized all over again.
Therefore, the essence of learning is to summarize regularities, not to copy data. This is the importance of generalization: without such general guidance, letting models fit blindly only produces piles of academic garbage.
3 The bias-variance dilemma
The quantity that measures generalization performance is called the generalization error. It can be decomposed into a combination of bias, variance, and noise, as proved below.
A given training set $X$ produces a model $f_X$. To assess a machine learning algorithm, it is trained many times on different training sets of the same size, and the average prediction is taken:
$$\bar{f}\left( \boldsymbol{x} \right) =\mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) \right]$$
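In practice $\bar{f}$ can be estimated by Monte Carlo: draw many training sets of the same size, train a model on each, and average the predictions. A minimal sketch continuing the hypothetical sine-curve setup from Section 1; the helper names sample_training_set and averaged_prediction are made up for illustration:

```python
def sample_training_set(n=20, noise=0.2):
    """Draw one training set X of size n from the same data distribution."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, noise, n)

def averaged_prediction(x_query, degree=4, n_sets=200):
    """Monte Carlo estimate of f_bar(x) = E_X[f_X(x)] at the query points."""
    preds = []
    for _ in range(n_sets):
        x, y = sample_training_set()
        m = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        m.fit(x.reshape(-1, 1), y)
        preds.append(m.predict(x_query.reshape(-1, 1)))
    return np.mean(preds, axis=0)
```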
Because the generalization error cannot be obtained directly, the test error is generally taken as an approximation of it (the mean squared error is used here):
$$err_g=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -y_X \right) ^2 \right]$$
where $y_X$ is the label of the test sample $\boldsymbol{x}$ in the dataset. Expanding:
$$\begin{aligned} err_g &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) +\bar{f}\left( \boldsymbol{x} \right) -y_X \right) ^2 \right] \\ &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right] +\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y_X \right) ^2 \right] +2\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y_X \right) \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) \right] \end{aligned}$$
in which the cross term vanishes:
$$\begin{aligned} \mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y_X \right) \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) \right] &=\mathbb{E} _X\left[ \bar{f}\left( \boldsymbol{x} \right) f_X\left( \boldsymbol{x} \right) -\bar{f}^2\left( \boldsymbol{x} \right) -f_X\left( \boldsymbol{x} \right) y_X+\bar{f}\left( \boldsymbol{x} \right) y_X \right] \\ &=\bar{f}\left( \boldsymbol{x} \right) \mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) \right] -\bar{f}^2\left( \boldsymbol{x} \right) -\mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) y_X \right] +\bar{f}\left( \boldsymbol{x} \right) \mathbb{E} _X\left[ y_X \right] \\ &=\bar{f}^2\left( \boldsymbol{x} \right) -\bar{f}^2\left( \boldsymbol{x} \right) -\mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) \right] \mathbb{E} _X\left[ y_X \right] +\bar{f}\left( \boldsymbol{x} \right) \mathbb{E} _X\left[ y_X \right] \\ &=0 \end{aligned}$$
where the third equality uses the fact that the test sample label $y_X$ is independent of the trained model $f_X$, so the expectation of their product factorizes.
Now introduce the true label $y$ of the test sample; then
$$\begin{aligned} err_g &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right] +\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y+y-y_X \right) ^2 \right] \\ &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right] +\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) ^2 \right] +\mathbb{E} _X\left[ \left( y-y_X \right) ^2 \right] +2\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) \left( y-y_X \right) \right] \end{aligned}$$
in which the cross term also vanishes:
$$\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) \left( y-y_X \right) \right] =\mathbb{E} _X\left[ \bar{f}\left( \boldsymbol{x} \right) -y \right] \mathbb{E} _X\left[ y-y_X \right] =0$$
by the same independence argument, together with the assumption that the noise $y-y_X$ has zero expectation.
Denote the variance of the model across different training sets by $var\left( \boldsymbol{x} \right) =\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right]$, the squared bias by $bias^2\left( \boldsymbol{x} \right) =\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) ^2 \right]$, and the expected dataset noise by $\varepsilon ^2=\mathbb{E} _X\left[ \left( y-y_X \right) ^2 \right]$. Then
$$err_g=var\left( \boldsymbol{x} \right) +bias^2\left( \boldsymbol{x} \right) +\varepsilon ^2$$
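This decomposition can be checked numerically. The sketch below, still on the hypothetical sine-curve toy problem and reusing the sample_training_set helper from the previous snippet, trains many models on independent training sets, estimates variance, squared bias, and noise on a grid of query points, and compares their sum with the directly measured error; the two agree up to Monte Carlo error:

```python
x_query = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_query)      # true labels y on the query grid
noise_sd = 0.2

all_preds = []                            # f_X(x) for each sampled training set X
for _ in range(300):
    x, y = sample_training_set()
    m = make_pipeline(PolynomialFeatures(4), LinearRegression())
    m.fit(x.reshape(-1, 1), y)
    all_preds.append(m.predict(x_query.reshape(-1, 1)))
all_preds = np.array(all_preds)           # shape (300, 50)

f_bar = all_preds.mean(axis=0)            # estimate of f_bar(x)
var = ((all_preds - f_bar) ** 2).mean()   # E_X[(f_X(x) - f_bar(x))^2]
bias2 = ((f_bar - f_true) ** 2).mean()    # E[(f_bar(x) - y)^2]
eps2 = noise_sd ** 2                      # E[(y - y_X)^2] for Gaussian label noise

# Direct estimate of err_g: squared error against freshly drawn noisy labels y_X
y_noisy = f_true + rng.normal(0, noise_sd, all_preds.shape)
err_g = ((all_preds - y_noisy) ** 2).mean()

print(f"var={var:.4f}  bias^2={bias2:.4f}  eps^2={eps2:.4f}")
print(f"var+bias^2+eps^2={var + bias2 + eps2:.4f}  vs direct err_g={err_g:.4f}")
```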

Intuitively, bias and variance pull against each other, a tension known as the bias-variance dilemma: the optimal model we seek is a compromise between the two.