[R Language Data Science]: Cross-Validation Revisited
2022-07-04 14:03:00 【JOJO's data analysis Adventure】
- Personal homepage: JoJo's Data Analysis Adventure
- Personal introduction: I am a senior statistics student and have been recommended for postgraduate study in statistics at a top-3 statistics program.
- If this helps you, you are welcome to follow, like, save, and subscribe to the column.

This article belongs to the 【R Language Data Science】 series, which covers applications of R in data science: R programming fundamentals, data visualization, data manipulation, modeling, machine-learning algorithm implementations, and statistical theory and methods. The series will continue to be updated (roughly weekly), so follows, likes, and subscriptions are very welcome. Let's learn together!

Preface
Cross-validation can be used to estimate the test error associated with a given statistical learning method, either to evaluate its performance or to choose an appropriate level of flexibility, i.e., to tune hyperparameters. The process of assessing a model's performance is called model evaluation, and using that assessment to choose the right level of flexibility is called model selection. In the last chapter we discussed the training set and the test set. We usually care more about the model's error on the test set (the test error), which is the average error a statistical learning method makes when predicting on new data. For a given data set, if a particular statistical learning method achieves a low test error, the model performs well. By comparison, the training error is easy to obtain and control, but it usually differs substantially from the test error and typically underestimates it. When no test set is available to estimate the test error directly, various mathematical techniques can be used to adjust the training error so as to estimate it; we will discuss these in detail later.

In this chapter we consider how to partition the data set. We need to split it into a training set, a validation set, and a test set (sometimes the validation set and the test set are combined). There are two common cross-validation methods:
- K-fold cross-validation
- Leave-one-out cross-validation (LOOCV)
1. K-fold Cross-Validation
We first discuss K-fold cross-validation. Generally speaking, a machine learning task starts with a data set (blue in the figure below). We use this data set to build an algorithm that will eventually be applied to a completely independent data set (yellow).

But in reality, the yellow data set is not available to us.

Therefore, to simulate this, we set aside part of the data and pretend it is an independent data set. As shown in the figure below, we split the data into a training set (blue) and a test set (red). We train our algorithm only on the training set and use the test set exclusively for evaluation.

We usually keep the test set small so that as much data as possible is available for training, but it still needs to be large enough to give a stable estimate of the loss. A typical choice is to use 20%-30% of the data as the test set; of course, with data sets of tens of millions of observations, a much smaller proportion may be used.
![Splitting the data into a training set (blue) and a test set (red)](/img/85/03960fd63da4798abd2a6f0e74d3fc.png)
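As a minimal sketch of this step in R (assuming a generic data frame named df and an 80/20 split; both the name and the ratio are purely illustrative):

```r
# Hypothetical 80/20 random train/test split of a data frame `df`
set.seed(1)
n <- nrow(df)
train_idx <- sample(n, size = floor(0.8 * n))  # rows used for training
train_set <- df[train_idx, ]                   # ~80% of the data
test_set  <- df[-train_idx, ]                  # ~20% held out, used only for evaluation
```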
Now a new question arises. For most machine learning algorithms we need to choose hyperparameters, for example k in the KNN algorithm. These parameters are not learned by the model during training; they are the hyperparameters we tune. We must optimize them without using the test set, because, as we know, optimizing and evaluating on the same data easily leads to overfitting. This is where cross-validation is most useful: for each setting of the algorithm's parameters we need an estimate of the MSE, and we then choose the hyperparameter $\lambda$ with the smallest MSE.
$$MSE(\lambda) = \frac{1}{k}\sum_{j=1}^{k}\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i(\lambda)-y_i\right)^2$$
where $k$ is the number of cross-validation folds and $n$ is the sample size.
Now we can further split the training set into a (smaller) training set and a validation set, fit the model on the training part, and then compute the MSE on the validation part, as shown in the figure below.

Compute the MSE on Validate 1:
$$MSE(\lambda) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i(\lambda)-y_i\right)^2$$
where $m$ is the number of samples in each validation set.
Note: so far we have computed the MSE on only a single validation set. To get a more stable estimate, we set up k validation sets: the full training set is first split into k disjoint subsets, and each time one subset serves as the validation set while the other k-1 serve as the training set. This yields $MSE_1,\dots,MSE_k$, which we then average:
$$MSE(\lambda) = \frac{1}{k}\sum_{i=1}^{k}MSE_i(\lambda)$$
Our goal is to choose the hyperparameter that minimizes this average MSE.
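To make this loop concrete, here is a minimal hand-rolled sketch in R that treats the polynomial degree as the tuning parameter, using the Auto data from ISLR2 (the same data used in the code section below). In practice we will simply call cv.glm(), but the logic it carries out is essentially the following:

```r
library(ISLR2)   # provides the Auto data set

set.seed(1)
k <- 10
n <- nrow(Auto)
folds <- sample(rep(1:k, length.out = n))   # randomly assign each row to one of k folds

degrees <- 1:5                              # candidate values of the tuning parameter
cv_mse  <- numeric(length(degrees))

for (d in degrees) {
  fold_mse <- numeric(k)
  for (j in 1:k) {
    train <- Auto[folds != j, ]             # k-1 folds for training
    valid <- Auto[folds == j, ]             # the held-out fold for validation
    fit   <- lm(mpg ~ poly(horsepower, d), data = train)
    pred  <- predict(fit, newdata = valid)
    fold_mse[j] <- mean((valid$mpg - pred)^2)   # MSE_j on the held-out fold
  }
  cv_mse[d] <- mean(fold_mse)               # average MSE over the k folds
}
cv_mse   # pick the degree with the smallest average MSE
```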

We have now described how to use cross-validation to tune parameters and select a model. However, the optimization above happens entirely on the training data, so we still need to evaluate the final algorithm on data that was not used for that selection. This is where the test set we separated earlier comes in:
![Final evaluation on the held-out test set](/img/ea/4a523dec9f706842318c56f3259c33.png)
We can then cross-validate again:
![Cross-validation within the training data before the final test-set evaluation](/img/ba/6eba2e176f43213bced56d4828fb3d.png)
Once the test set has been set aside, we can use it for evaluation. The idea of K-fold cross-validation is to divide the data into K groups: the first group serves as the held-out validation group and the remaining k-1 groups as the training set, and $MSE_1$ is the average error on that held-out group from the first fit. Repeating this k times gives the k-fold cross-validation estimate of the test error:
$$CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i.$$
This gives our final estimate of the expected loss. Note, however, that it also multiplies the total computation time by k, since we fit the model many times, so we need to keep the computational cost in mind. For the final assessment we usually use only a single test set.
Once the final model has been chosen by cross-validation, we can refit it on the whole data set, keeping the tuned parameters fixed. This way we train on more data, which is particularly helpful when the data set is small.
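For instance, if cross-validation selects a quadratic polynomial in the Auto example used later in this post, the refit is a single call on the full data set (a sketch; the degree-2 choice is only illustrative):

```r
library(ISLR2)

# Refit the CV-selected model (here, degree 2) on the entire Auto data set,
# keeping the tuned degree fixed
final_fit <- glm(mpg ~ poly(horsepower, 2), data = Auto)
summary(final_fit)
```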

How should K be chosen? Based on extensive experimentation and experience, the common choices are 10 or 5, which are also the default fold numbers in common R and Python implementations.
2. K-fold Cross-Validation: Code Implementation
The cv.glm() function can perform k-fold cross-validation. Here we use k = 10, one of the most commonly used values. We reset the random seed and fit models on the Auto data set, using polynomial regressions of degree one through ten.
library(ISLR2)   # provides the Auto data set
library(boot)    # provides cv.glm()

# 7 is my lucky number, chosen arbitrarily
set.seed(7)

# Vector to store the 10-fold CV error for polynomial degrees 1 to 10
cv_error_10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv_error_10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv_error_10
We obtain the error for each model:
- 24.1463716629577
- 19.3130825829741
- 19.434897545051
- 19.5493689322887
- 19.0736379228708
- 18.7058531603005
- 19.2522869995751
- 18.8552270777634
- 18.9304332711781
- 20.4425474405408
Visualize the results
library(ggplot2)

x <- seq(1, 10)                        # polynomial degrees
kcv <- data.frame(x, cv = cv_error_10)
ggplot(kcv, aes(x, cv)) + geom_point() + geom_line(lwd = 1, col = 'blue')

We can see that the quadratic fit improves accuracy considerably, while higher-degree fits do not significantly improve the model.
3. Leave-One-Out Cross-Validation (LOOCV)
Leave-one-out cross-validation can be seen as a variant of the methods above. The data set is again divided into two parts, a validation set and a training set; the difference is that the validation set is not a sizeable subset but a single observation $(x_1, y_1)$, with the remaining $n-1$ observations $\{(x_2,y_2),\dots,(x_n,y_n)\}$ used as the training set to fit the model.
This amounts to training the model n times, and the average of the n test errors is used to estimate the test error of the method. The test error from the first fit is $MSE_1=(y_1-\hat{y}_1)^2$. Repeating this n times yields $MSE_2,\dots,MSE_n$, and the LOOCV estimate of the test MSE is their average:
$$CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}MSE_i.$$
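As a sketch of what this formula means in code, LOOCV can also be written as an explicit loop (here a simple linear regression of mpg on horsepower on the Auto data; cv.glm() from the boot package, used in the code section below, automates exactly this):

```r
library(ISLR2)   # Auto data set

n <- nrow(Auto)
loocv_mse <- numeric(n)

for (i in 1:n) {
  fit  <- lm(mpg ~ horsepower, data = Auto[-i, ])          # train on the other n-1 rows
  pred <- predict(fit, newdata = Auto[i, , drop = FALSE])  # predict the held-out observation
  loocv_mse[i] <- (Auto$mpg[i] - pred)^2                   # squared error on that observation
}
mean(loocv_mse)   # CV_(n): the LOOCV estimate of the test MSE
```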
LOOCV has two main advantages:
- 1. Less bias, because each fit uses almost all of the data for training.
- 2. It is not affected by the randomness of how the folds are sampled.
LOOCV is a very general method; it can be used, for example, with logistic regression or naive Bayes.
One disadvantage of leave-one-out cross-validation is its computational cost, because we have to fit n models. In many modern problems the data is on the order of gigabytes or larger, and fitting that many models takes far too long, which is a clear disadvantage compared with k-fold cross-validation.
4. LOOCV: Code Implementation
The LOOCV results can be obtained directly with the glm() and cv.glm() functions. We previously used glm() to fit logistic regression, remember, by setting the argument family = binomial; if no family is specified, glm() defaults to linear regression, the same as lm(). Let's use glm() and cv.glm() to implement LOOCV. First, load the boot package.
library(boot)

# Linear regression of mpg on horsepower; cv.glm() without the K argument performs LOOCV
glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta   # raw and bias-adjusted LOOCV estimates of the test MSE
- 24.2315135179292
- 24.2311440937562
The two values of delta are the raw and the bias-adjusted cross-validation estimates of the MSE; for LOOCV the adjustment is negligible, so they are almost identical. Note that when using cv.glm() this way there is no need to specify a training split: it automatically carries out the LOOCV procedure. Now let's look at the results for polynomials of different degrees.
# LOOCV error for polynomial regressions of degree 1 to 10
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
- 24.2315135179293
- 19.2482131244897
- 19.334984064029
- 19.4244303104303
- 19.0332138547041
- 18.9786436582254
- 18.8330450653183
- 18.9611507120531
- 19.0686299814599
- 19.490932299334
The results show that the error drops sharply when moving to a quadratic fit, consistent with what we saw for K-fold cross-validation.
Now let's visualize the results
library(ggplot2)

loocv <- data.frame(x, cv = cv.error)
ggplot(loocv, aes(x, cv)) + geom_point() + geom_line(lwd = 1, col = 'blue')

5. Summary
We said that K-fold cross-validation is computationally more practical than leave-one-out cross-validation. But beyond the amount of computation, another important advantage is that k-fold cross-validation often gives a more accurate estimate of the test error than LOOCV.
We mentioned earlier that LOOCV is nearly unbiased, since each fit uses n-1 observations for training, whereas k-fold cross-validation with k = 5 or k = 10 introduces some bias. From the bias perspective alone, LOOCV seems to perform better. But we also have to consider variance. In LOOCV we average the outputs of n fitted models, each trained on almost the same set of observations, so these outputs are highly (positively) correlated with each other. By comparison, when we run k-fold CV with k < n, we average the outputs of k fitted models that are much less correlated with one another, because their training sets overlap less. As a result, LOOCV tends to have higher variance. One way to reduce the variance of the estimate is to use more resamples: instead of splitting the training data into non-overlapping folds, we can repeatedly draw K random subsets of the same size.
Overall, k = 5 or k = 10 is a good choice for K-fold cross-validation, keeping both the variance and the bias small.
That's it for this chapter. If it helped you, please like, save, comment, and follow for support!