Machine Learning Strong Foundation Plan 0-5: Why Is Generalization the Essence of Learning?
2022-07-28 11:20:00 【Mr.Winter`】
0 Preface
The Machine Learning Strong Foundation Plan focuses on both depth and breadth, deepening the understanding and application of machine learning models. "Depth" means deriving in detail the mathematical principles behind each algorithm model; "breadth" means analyzing a range of machine learning models: decision trees, support vector machines, Bayesian and Markov decision models, reinforcement learning, and more.
Details: Machine Learning Strong Foundation Plan
In the earlier article on core dataset knowledge and construction methods, we mentioned that the ability of a model to apply to new samples in the sample space is called generalization. In this section, we focus on why generalization is so important for machine learning models.
1 Fitting problem
A machine learning algorithm can only observe the training error and the test error; its goal is to fit the true underlying law as closely as possible so as to reduce the generalization error. Two main phenomena arise during fitting (a concrete sketch follows this list):
- Underfitting: the learning algorithm fails to capture the regularities in the sampled data, so both the training error and the generalization error are large. Underfitting is easily overcome by strengthening the learner, e.g. more training or a more expressive model;
- Overfitting: the learning algorithm fits the sampled data too closely, mistaking idiosyncratic features of the dataset for general regularities, so the training error is small but the generalization error is large.
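To make the two phenomena concrete, here is a minimal sketch, assuming numpy and scikit-learn are available and using a hypothetical toy problem (noisy samples of a sine curve fitted by polynomials of different degrees); it is an illustration, not an algorithm from this series:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# The "true law" is y = sin(2*pi*x); the dataset only offers noisy samples of it.
x_train = rng.uniform(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 4, 15):  # too simple / reasonable / too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train.reshape(-1, 1), y_train)
    err_train = mean_squared_error(y_train, model.predict(x_train.reshape(-1, 1)))
    err_test = mean_squared_error(y_test, model.predict(x_test.reshape(-1, 1)))
    print(f"degree={degree:2d}  train MSE={err_train:.3f}  test MSE={err_test:.3f}")
```

With degree 1, both errors are large (underfitting); with degree 15, the training error is driven near zero while the test error blows up (overfitting).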
Overfitting is the "lifelong enemy" of machine learning. Simply put, an overfitted model is like a student who learns by rote: it only memorizes the exercises in the textbook (our given training set) and is lost when it meets new questions in the exam (the test set or new samples). Is such a learner useful? Naturally not. What we need is a learner that uses finite samples to predict as many unknown samples as possible.
Can overfitting be solved? The answer: overfitting cannot be overcome, only alleviated.
The reason is that the problems machine learning faces are typically NP-hard, for which no polynomial-time algorithm is believed to exist. If overfitting could be fully overcome, then minimizing the training error alone would yield the optimal solution of such a problem in polynomial time; in other words, machine learning would have settled the problem of the century by proving P=NP. Since the academic community leans toward the verdict P≠NP, overfitting is considered impossible to overcome.
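Since overfitting can only be alleviated, one standard mitigation (named here as a hedged aside, as this article does not discuss specific remedies) is regularization. Continuing the toy sketch above, ridge regression adds an L2 penalty on the degree-15 polynomial's coefficients:

```python
from sklearn.linear_model import Ridge

# Same degree-15 features, but with an L2 penalty on the weights; the strength
# alpha trades fitting the training set against keeping coefficients small.
ridge = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1e-3))
ridge.fit(x_train.reshape(-1, 1), y_train)
print("ridge test MSE:",
      mean_squared_error(y_test, ridge.predict(x_test.reshape(-1, 1))))
```

The penalty does not eliminate overfitting, consistent with the argument above; it only trades a little training accuracy for better generalization.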

2 Generalization ability
The opposite of overfitting is generalization: the ability of a model to apply to new samples in the sample space.
Whether in human learning or machine learning, what is the highest level of learning? Seeing the essence through the phenomena and grasping the underlying law.
An example: the motion of objects is varied and complicated, yet from it we summarized Newton's three laws. They are the model we learned from observing moving objects, and they have very strong predictive power: in the low-speed regime, the motion of any object can be predicted with Newton's laws.
This is generalization: Newton's laws adapt well to new samples. If, instead of summarizing Newton's laws, we had built a dedicated model for every motion of every object, that would be overfitting, because for each object or motion never seen before, a model would have to be summarized all over again.
Therefore, the essence of learning is to summarize regularities, not to copy data. This is the importance of generalization: without such general guidance, letting models fit blindly only produces piles of academic garbage.
3 The bias-variance dilemma
The quantity that measures generalization performance is called the generalization error. It can be decomposed into a combination of bias, variance, and noise, as proved below.
A given training set $X$ produces a model $f_X$. To assess a machine learning algorithm, it is trained many times on different training sets of the same size, and the average prediction is taken:
$$\bar{f}\left( \boldsymbol{x} \right) =\mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) \right]$$
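In practice $\bar{f}$ can be estimated by Monte Carlo: draw many training sets of the same size, train a model on each, and average the predictions. A minimal sketch continuing the hypothetical sine-curve setup from Section 1; the helper names sample_training_set and averaged_prediction are made up for illustration:

```python
def sample_training_set(n=20, noise=0.2):
    """Draw one training set X of size n from the same data distribution."""
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, noise, n)

def averaged_prediction(x_query, degree=4, n_sets=200):
    """Monte Carlo estimate of f_bar(x) = E_X[f_X(x)] at the query points."""
    preds = []
    for _ in range(n_sets):
        x, y = sample_training_set()
        m = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        m.fit(x.reshape(-1, 1), y)
        preds.append(m.predict(x_query.reshape(-1, 1)))
    return np.mean(preds, axis=0)
```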
Because the generalization error cannot be obtained directly, the test error is generally taken as an approximation of it (the mean squared error is used here):
$$err_g=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -y_X \right) ^2 \right]$$
where $y_X$ is the label of the test sample $\boldsymbol{x}$ in the dataset. Expanding:
$$\begin{aligned} err_g &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) +\bar{f}\left( \boldsymbol{x} \right) -y_X \right) ^2 \right] \\ &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right] +\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y_X \right) ^2 \right] +2\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y_X \right) \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) \right] \end{aligned}$$
in which the cross term vanishes:
$$\begin{aligned} \mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y_X \right) \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) \right] &=\mathbb{E} _X\left[ \bar{f}\left( \boldsymbol{x} \right) f_X\left( \boldsymbol{x} \right) -\bar{f}^2\left( \boldsymbol{x} \right) -f_X\left( \boldsymbol{x} \right) y_X+\bar{f}\left( \boldsymbol{x} \right) y_X \right] \\ &=\bar{f}\left( \boldsymbol{x} \right) \mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) \right] -\bar{f}^2\left( \boldsymbol{x} \right) -\mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) y_X \right] +\bar{f}\left( \boldsymbol{x} \right) \mathbb{E} _X\left[ y_X \right] \\ &=\bar{f}^2\left( \boldsymbol{x} \right) -\bar{f}^2\left( \boldsymbol{x} \right) -\mathbb{E} _X\left[ f_X\left( \boldsymbol{x} \right) \right] \mathbb{E} _X\left[ y_X \right] +\bar{f}\left( \boldsymbol{x} \right) \mathbb{E} _X\left[ y_X \right] \\ &=0 \end{aligned}$$
where the third equality uses the fact that the test sample label $y_X$ is independent of the trained model $f_X$, so the expectation of their product factorizes.
Now introduce the true label $y$ of the test sample; then
$$\begin{aligned} err_g &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right] +\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y+y-y_X \right) ^2 \right] \\ &=\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right] +\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) ^2 \right] +\mathbb{E} _X\left[ \left( y-y_X \right) ^2 \right] +2\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) \left( y-y_X \right) \right] \end{aligned}$$
in which the cross term also vanishes:
$$\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) \left( y-y_X \right) \right] =\mathbb{E} _X\left[ \bar{f}\left( \boldsymbol{x} \right) -y \right] \mathbb{E} _X\left[ y-y_X \right] =0$$
by the same independence argument, together with the assumption that the noise $y-y_X$ has zero expectation.
Denote the variance of the model across different training sets by $var\left( \boldsymbol{x} \right) =\mathbb{E} _X\left[ \left( f_X\left( \boldsymbol{x} \right) -\bar{f}\left( \boldsymbol{x} \right) \right) ^2 \right]$, the squared bias by $bias^2\left( \boldsymbol{x} \right) =\mathbb{E} _X\left[ \left( \bar{f}\left( \boldsymbol{x} \right) -y \right) ^2 \right]$, and the expected dataset noise by $\varepsilon ^2=\mathbb{E} _X\left[ \left( y-y_X \right) ^2 \right]$. Then
$$err_g=var\left( \boldsymbol{x} \right) +bias^2\left( \boldsymbol{x} \right) +\varepsilon ^2$$
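This decomposition can be checked numerically. The sketch below, still on the hypothetical sine-curve toy problem and reusing the sample_training_set helper from the previous snippet, trains many models on independent training sets, estimates variance, squared bias, and noise on a grid of query points, and compares their sum with the directly measured error; the two agree up to Monte Carlo error:

```python
x_query = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_query)      # true labels y on the query grid
noise_sd = 0.2

all_preds = []                            # f_X(x) for each sampled training set X
for _ in range(300):
    x, y = sample_training_set()
    m = make_pipeline(PolynomialFeatures(4), LinearRegression())
    m.fit(x.reshape(-1, 1), y)
    all_preds.append(m.predict(x_query.reshape(-1, 1)))
all_preds = np.array(all_preds)           # shape (300, 50)

f_bar = all_preds.mean(axis=0)            # estimate of f_bar(x)
var = ((all_preds - f_bar) ** 2).mean()   # E_X[(f_X(x) - f_bar(x))^2]
bias2 = ((f_bar - f_true) ** 2).mean()    # E[(f_bar(x) - y)^2]
eps2 = noise_sd ** 2                      # E[(y - y_X)^2] for Gaussian label noise

# Direct estimate of err_g: squared error against freshly drawn noisy labels y_X
y_noisy = f_true + rng.normal(0, noise_sd, all_preds.shape)
err_g = ((all_preds - y_noisy) ** 2).mean()

print(f"var={var:.4f}  bias^2={bias2:.4f}  eps^2={eps2:.4f}")
print(f"var+bias^2+eps^2={var + bias2 + eps2:.4f}  vs direct err_g={err_g:.4f}")
```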

Intuitively, bias and variance pull against each other, a tension known as the bias-variance dilemma: the optimal model we seek is a compromise between the two.