[R Language Data Science]: Cross-Validation and Review
2022-07-04 14:03:00 【JOJO's data analysis Adventure】
- Personal homepage: JoJo's Data Analysis Adventure
- About me: I am a senior statistics major, recommended for postgraduate admission to continue studying statistics at a top-3 statistics institution
- If this article helps you, welcome to follow, like, bookmark, and subscribe to the column

This article is part of the 【R Language Data Science】 series, which introduces applications of R in data science, including: R programming fundamentals, visualization, data manipulation, modeling, machine learning algorithm implementation, and statistical theory and methods. The series is ongoing and updated roughly weekly; your follows, likes, and support are appreciated. Welcome to subscribe and study together!

Preface
Cross-validation can be used to estimate the test error associated with a given statistical learning method, either to evaluate its performance or to choose an appropriate level of flexibility, i.e., to tune hyperparameters. The process of assessing a model's performance is called model evaluation, and using that assessment to choose the right level of flexibility is called model selection. In the last chapter we discussed training and test sets. We usually care most about a model's error on the test set (the test error): the average error a statistical learning method makes when predicting on new data. Given a data set, if a particular method achieves a low test error, the model performs well. By comparison, the training error is easy to obtain and control, but it often differs substantially from the test error; typically the training error is smaller, so it underestimates the test error. When no test set is available to estimate the test error directly, various mathematical techniques can be used to adjust the training error into an estimate of the test error; we go into detail below. In this chapter we consider how to partition a data set into a training set, a validation set, and a test set (sometimes the validation and test sets are combined). There are two common cross-validation methods:
- K-fold cross-validation
- Leave-one-out cross-validation (LOOCV)
1. K-Fold Cross-Validation
We first discuss K-fold cross-validation. Generally speaking, a machine learning task begins with a data set (blue in the figure below). We need to use this data set to build an algorithm that will ultimately be applied to a completely independent data set (yellow).

In reality, however, we never get to see the yellow data set.
To simulate it, we therefore split off part of our data and pretend it is an independent data set. As shown in the figure below, we divide the data into a training set (blue) and a test set (red). We train our algorithm exclusively on the training set and use the test set only for evaluation.
We usually try to keep the test set fairly small, so that we can train on as much data as possible. At the same time, the test set must be large enough to give a stable estimate of the loss without having to fit an unrealistic number of models. A typical choice is 20%-30% of the data as the test set; with today's data sets of tens of millions of observations, a smaller proportion may be used.
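As a minimal sketch of such a split (using the Auto data set from the ISLR2 package, which the code sections of this article also use; the 25% ratio is just an illustrative choice):

```r
library(ISLR2)  # provides the Auto data set

set.seed(7)
n <- nrow(Auto)
# Hold out roughly 25% of the observations as a test set
test_idx <- sample(n, size = round(0.25 * n))
auto_train <- Auto[-test_idx, ]
auto_test  <- Auto[test_idx, ]

# Fit on the training set only, evaluate on the test set only
fit <- lm(mpg ~ horsepower, data = auto_train)
test_mse <- mean((auto_test$mpg - predict(fit, newdata = auto_test))^2)
test_mse
```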
This raises a new question: most machine learning algorithms have hyperparameters to choose, for example $k$ in the KNN algorithm. These parameters are not learned by the model during training; they are the "tuning" parameters, and we must optimize them without using our test set. If we optimize and evaluate on the same data set, it is easy to overfit. This is where cross-validation is most useful. For each setting of the hyperparameter $\lambda$, we need an estimate of the MSE, and we then choose the $\lambda$ with the smallest MSE.
$$MSE(\lambda) = \frac{1}{k}\sum_{j=1}^{k}\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i(\lambda)-y_i\right)^2$$
where $k$ is the number of cross-validation folds and $n$ is the number of samples in each fold.
We can now further divide the training set into a (smaller) training set and a validation set. We fit the model on the training set and then compute the MSE on the validation set, as shown in the figure below.

Compute the MSE on Validate 1:
$$MSE(\lambda) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}_i(\lambda)-y_i\right)^2$$
where $m$ is the number of samples in the validation set.
Note that so far we have computed the MSE on only one validation set. To improve generalization, we set up $k$ validation sets: first divide the full training set into $k$ disjoint subsets, then in turn take each subset as the validation set and the remaining $k-1$ subsets as the training set. This yields $MSE_1,\dots,MSE_k$, and we then compute their average:
$$MSE(\lambda) = \frac{1}{k}\sum_{i=1}^{k}MSE_i(\lambda)$$
Our goal is to minimize this average MSE.
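As a rough sketch of this procedure (again on the Auto data set; the cv.glm() function used later does all of this for us), a hand-rolled k-fold loop might look like:

```r
library(ISLR2)

set.seed(7)
k <- 5
n <- nrow(Auto)
# Randomly assign each observation to one of k disjoint folds
fold <- sample(rep(1:k, length.out = n))

fold_mse <- numeric(k)
for (j in 1:k) {
  train <- Auto[fold != j, ]  # k-1 folds for training
  valid <- Auto[fold == j, ]  # 1 fold for validation
  fit <- lm(mpg ~ horsepower, data = train)
  fold_mse[j] <- mean((valid$mpg - predict(fit, newdata = valid))^2)
}
mean(fold_mse)  # average MSE across the k folds
```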

We have now described how to use cross-validation to tune parameters and select a model. However, this optimization happened entirely on the training data, so we must estimate the performance of the final algorithm on data that played no role in the selection. This is where the test set we separated at the start comes in:
We can then carry out cross-validation again:
Once we have a test set, we can evaluate on it. The idea of K-fold cross-validation is to divide the data set into K groups: one group serves as the held-out test group and the remaining $k-1$ groups as the training set. $MSE_1$ can be seen as the average error on the held-out group for the first fit. Repeating this $k$ times gives the k-fold cross-validation estimate of the test error:
$$CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i$$
This gives the final estimate of our expected loss. Note, however, that it multiplies the total computation time by $k$, since we perform many costly fits, so we may need to look for ways to reduce the cost. For the final assessment, we usually use just one test set.
Once we have determined the final model through cross-validation, we can refit it on the whole data set, keeping the optimized hyperparameters fixed. This way we use more training data, which is effective especially when the data set is small.
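For instance (a hypothetical continuation, assuming cross-validation selected the quadratic polynomial as in the code sections below), the final refit is simply:

```r
# Refit the selected model on ALL of the data,
# with the cross-validated choice (degree 2) held fixed
final_fit <- glm(mpg ~ poly(horsepower, 2), data = Auto)
summary(final_fit)
```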

How should we choose K? Based on extensive experimentation and experience, the commonly used choices are K = 10 or K = 5, which are also typical default fold counts in R and Python implementations.
2. K-Fold Cross-Validation: Code Implementation
The cv.glm() function implements k-fold cross-validation. We use k = 10, a commonly used value. We set the random seed, then fit polynomial regressions of degree one through ten on the Auto data set.
```r
library(ISLR2)
library(boot)

# 7 is my lucky number; any seed would do
set.seed(7)

# Vector to hold the 10-fold CV error for each polynomial degree
cv_error_10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv_error_10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv_error_10
```
This gives the error for each model:
- 24.1463716629577
- 19.3130825829741
- 19.434897545051
- 19.5493689322887
- 19.0736379228708
- 18.7058531603005
- 19.2522869995751
- 18.8552270777634
- 18.9304332711781
- 20.4425474405408
Visualize the results:

```r
library(ggplot2)

x <- seq(1, 10)
kcv <- data.frame(x, cv = cv_error_10)
ggplot(kcv, aes(x, cv)) + geom_point() + geom_line(lwd = 1, col = 'blue')
```

We can see that the quadratic fit improves accuracy substantially, while higher-degree fits do not significantly improve the model.
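To read off the degree with the smallest CV error programmatically (a small convenience sketch):

```r
# Index (= polynomial degree) with the lowest 10-fold CV error
which.min(cv_error_10)
```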
3. Leave-One-Out Cross-Validation (LOOCV)
Leave-one-out cross-validation can be seen as a variant of the methods above. The data set is again divided into two parts, a validation set and a training set. The difference is that the validation set is not a sizeable subset but a single observation $(x_1, y_1)$. The remaining $n-1$ observations $(x_2,y_2),\dots,(x_n,y_n)$ form the training set on which we fit the model.
This amounts to training the model $n$ times, and then using the average test error of the $n$ fits to estimate the test error of the method. The test error from the first fit is $MSE_1=(y_1-\hat{y}_1)^2$. Repeating $n$ times gives $MSE_2,\dots,MSE_n$. Finally, we take the average as the LOOCV estimate of the test MSE:
$$CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}MSE_i$$
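As a sketch of this definition (the cv.glm() call in the next section automates it), a hand-rolled LOOCV loop on Auto might be:

```r
library(ISLR2)

n <- nrow(Auto)
mse <- numeric(n)
for (i in 1:n) {
  # Train on everything except observation i
  fit <- lm(mpg ~ horsepower, data = Auto[-i, ])
  # Test on the single held-out observation
  mse[i] <- (Auto$mpg[i] - predict(fit, newdata = Auto[i, ]))^2
}
mean(mse)  # LOOCV estimate of the test MSE
```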
LOOCV has two main advantages:
- 1. Lower bias, because each fit trains on nearly the entire data set
- 2. No randomness from the train/validation split: the result is deterministic
LOOCV is also a very general method: it can be used with, for example, logistic regression or naive Bayes.
One disadvantage of LOOCV is its computational cost, since we must fit $n$ models. In today's problems the data often run to gigabytes or more, and fitting that many models takes far too long — a clear disadvantage compared with k-fold cross-validation.
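One mitigation worth noting (a known identity for least-squares fits, not something the cv.glm() call below relies on): for linear regression, the LOOCV error can be computed from a single fit using the leverage values, for example:

```r
# LOOCV for least squares from ONE fit, via the leverage identity:
# CV(n) = mean( ((y_i - yhat_i) / (1 - h_i))^2 )
fit <- lm(mpg ~ horsepower, data = Auto)
h <- hatvalues(fit)           # leverage of each observation
mean((residuals(fit) / (1 - h))^2)
```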
4. LOOCV: Code Implementation
The LOOCV results can be obtained directly with the glm() and cv.glm() functions. We previously used glm() to fit logistic regression, passing family = binomial; recall that if no family value is specified, glm() performs ordinary linear regression, just like lm(). Let's use glm() and cv.glm() to run LOOCV. First, load the boot package.
```r
library(boot)

glm.fit <- glm(mpg ~ horsepower, data = Auto)
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
```
- 24.2315135179292
- 24.2311440937562
You can see that the two MSE values above are almost identical: the first component of delta is the raw cross-validation estimate and the second is a bias-corrected version, which for LOOCV essentially coincide. Note that when using cv.glm() there is no need to specify a training set; LOOCV is performed automatically when K is omitted. Now let's look at polynomial fits of increasing degree:
```r
cv.error <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
```
- 24.2315135179293
- 19.2482131244897
- 19.334984064029
- 19.4244303104303
- 19.0332138547041
- 18.9786436582254
- 18.8330450653183
- 18.9611507120531
- 19.0686299814599
- 19.490932299334
The results show a sharp drop in error when moving from the linear to the quadratic fit, consistent with what we saw for K-fold cross-validation.
Now let's visualize the results:

```r
library(ggplot2)

loocv <- data.frame(x, cv = cv.error)
ggplot(loocv, aes(x, cv)) + geom_point() + geom_line(lwd = 1, col = 'blue')
```

5. Summary
We said that K-fold cross-validation is computationally cheaper than LOOCV. But beyond the amount of computation, another important advantage is that k-fold cross-validation usually estimates the test error more accurately than LOOCV does.
As discussed earlier, LOOCV is nearly unbiased, because each fit trains on $n-1$ observations, whereas k-fold cross-validation with k = 5 or k = 10 introduces some bias. From the bias perspective alone, LOOCV seems to perform better. But we must also consider variance. In LOOCV we average the outputs of $n$ fitted models, each trained on an almost identical set of observations, so those outputs are highly (positively) correlated with one another. By comparison, when we run K-fold CV with k < n, we average the outputs of k fitted models that are much less correlated, since their training sets overlap less. Consequently LOOCV tends to have higher variance. One way to reduce the variance of the estimate is to use more splits: instead of partitioning the data into non-overlapping folds, we can repeatedly draw K random subsets of the same size.
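A minimal sketch of this repeated-random-subsampling idea (sometimes called Monte Carlo cross-validation; the 20% split ratio and 20 repeats here are illustrative choices, not prescribed by the text):

```r
library(ISLR2)

set.seed(7)
B <- 20          # number of random splits (illustrative)
n <- nrow(Auto)
mse <- numeric(B)
for (b in 1:B) {
  # Draw a fresh random validation set each time;
  # unlike k-fold, the sets may overlap across repeats
  valid_idx <- sample(n, size = round(0.2 * n))
  fit <- lm(mpg ~ horsepower, data = Auto[-valid_idx, ])
  mse[b] <- mean((Auto$mpg[valid_idx] -
                  predict(fit, newdata = Auto[valid_idx, ]))^2)
}
mean(mse)
```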
Overall, K-fold cross-validation with k = 5 or k = 10 is a good compromise: both the variance and the bias are small.
That's all for this chapter. If it helped you, please like, bookmark, comment, and follow to show your support!