[Machine Learning Q&A] Data sampling and model validation methods, hyperparameter optimization, and overfitting/underfitting
2022-06-30 01:25:00 【Sickle leek】
Data sampling and model validation methods, hyperparameter optimization, and overfitting/underfitting
- Data sampling and model validation methods
- Question 1: In model evaluation, what are the main validation methods, and what are their advantages and disadvantages?
- Question 2: In bootstrap sampling, if n bootstrap draws are made from a set of n samples, roughly how much of the data is never selected as n goes to infinity?
- Hyperparameter tuning
- Overfitting and underfitting
- References
Data sampling and model validation methods
In machine learning, samples are usually split into a training set and a test set: the training set is used to train the model, and the test set is used to evaluate it. Different sampling and validation schemes can be used when splitting samples and validating models.
Question 1: In model evaluation, what are the main validation methods, and what are their advantages and disadvantages?
(1) Holdout validation: the simplest and most direct validation method. It randomly splits the original sample set into two parts, a training set and a validation set.
For example, for a click-through-rate prediction model, we might split the samples 70%/30%: 70% for model training and 30% for model validation, e.g. plotting the ROC curve and computing precision and recall to evaluate model performance.
The drawback of holdout validation is obvious: the evaluation metric computed on the validation set depends heavily on how the original random split was made. To eliminate this randomness, researchers introduced cross-validation.
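As a minimal sketch of the holdout split described above (plain Python; the helper name is my own, not from any library):

```python
import random

def holdout_split(samples, train_ratio=0.7, seed=0):
    """Randomly split samples into a training set and a validation set."""
    rng = random.Random(seed)
    shuffled = samples[:]          # copy so the original list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train, val = holdout_split(list(range(10)))
print(len(train), len(val))  # 7 3
```

In practice a library routine such as scikit-learn's `train_test_split` does the same job; the point here is only that the split is a single random partition, which is why the resulting metric is split-dependent.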
(2) Cross-validation
k-fold cross-validation: first, divide all samples into $k$ equally sized subsets. Then iterate over these $k$ subsets: each time, take the current subset as the validation set and all remaining subsets as the training set, and train and evaluate the model. Finally, average the $k$ evaluation scores to obtain the final metric. In practice, $k$ is often set to 10.
Leave-one-out validation: each time, hold out 1 sample as the validation set and use all remaining samples as the training set. With $n$ samples in total, this traverses all $n$ samples, performs $n$ rounds of validation, and then averages the evaluation metrics to obtain the final score.
When the sample size is large, leave-one-out validation is very time-consuming. In fact, leave-one-out is a special case of leave-$p$-out validation, which holds out $p$ samples as the validation set each time. Choosing $p$ elements out of $n$ gives $C_n^p$ possibilities, so its time cost is far higher than leave-one-out, and it is rarely used.
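The k-fold procedure above can be sketched in plain Python (the helper names below are my own, purely for illustration):

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k nearly equal contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, evaluate):
    """evaluate(train_idx, val_idx) -> score; returns the mean of the k scores."""
    folds = kfold_indices(n, k)
    scores = []
    for i, val_idx in enumerate(folds):
        # training set = all folds except the current one
        train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        scores.append(evaluate(train_idx, val_idx))
    return sum(scores) / k

# each "evaluation" here just reports the validation-fold size
print(cross_validate(10, 5, lambda tr, va: len(va)))  # 2.0
```

Setting `k = n` turns this into leave-one-out validation, which is why leave-one-out is usually described as the extreme case of k-fold.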
(3) Bootstrap method
Both holdout validation and cross-validation evaluate the model by splitting the data into training and test sets. However, when the sample size is small, splitting the sample set further shrinks the training set, which may hurt model training. Is there a validation method that preserves the sample size of the training set? The bootstrap method addresses this problem.
The bootstrap method is a validation scheme based on bootstrap sampling. Given a sample set of size $n$, draw $n$ samples with replacement to obtain a training set of size $n$. During these $n$ draws, some samples are drawn repeatedly while others are never drawn; the samples that were never drawn form the validation set used for model validation. This is the bootstrap validation procedure.
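The bootstrap split above can be sketched as follows (a minimal illustration in plain Python; the function name is hypothetical):

```python
import random

def bootstrap_split(samples, seed=0):
    """Draw len(samples) items with replacement; unseen items form the validation set."""
    rng = random.Random(seed)
    n = len(samples)
    train = [samples[rng.randrange(n)] for _ in range(n)]
    seen = set(train)
    out_of_bag = [s for s in samples if s not in seen]  # the "never drawn" samples
    return train, out_of_bag

train, val = bootstrap_split(list(range(20)))
print(len(train))                        # 20: training set keeps the original size
print(all(v not in train for v in val))  # True: validation set is out-of-bag only
```

Note that the training set keeps the original sample size, which is exactly the property holdout and cross-validation lack.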
Question 2: In bootstrap sampling, if n bootstrap draws are made from a set of n samples, roughly how much of the data is never selected as n goes to infinity?
The probability that a given sample is not selected in one draw is $1-\frac{1}{n}$, so the probability that it is never selected in all $n$ draws is $(1-\frac{1}{n})^n$. As $n$ goes to infinity, we need $\lim_{n \to \infty} (1-\frac{1}{n})^n$. Using the well-known limit $\lim_{n\to \infty}(1+\frac{1}{n})^n=e$, we have

$$\lim_{n \to \infty} \left(1-\frac{1}{n}\right)^n=\lim_{n \to \infty} \left(\frac{n-1}{n}\right)^n=\lim_{n\to \infty}\frac{1}{\left(1+\frac{1}{n-1}\right)^n}=\frac{1}{\lim_{n\to \infty} \left(1+\frac{1}{n-1}\right)^{n-1}}\cdot \frac{1}{\lim_{n\to \infty}\left(1+\frac{1}{n-1}\right)}=\frac{1}{e}\approx 0.368$$
Therefore, when the number of samples is large, about 36.8% of the samples are never selected and can serve as the validation set.
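The limit can also be checked numerically; for even moderate n, (1 − 1/n)^n is already close to 1/e:

```python
import math

# (1 - 1/n)^n approaches 1/e as n grows
for n in (10, 100, 10000):
    print(n, (1 - 1 / n) ** n)

print(round(1 / math.e, 4))  # 0.3679
```

This matches the analytic result: roughly a 1/e ≈ 36.8% fraction of the data ends up out-of-bag.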
Hyperparameter tuning
Question 1: What methods are there for tuning hyperparameters?
For hyperparameter tuning, we usually use algorithms such as grid search, random search, and Bayesian optimization. Before introducing these algorithms, it helps to clarify the elements a hyperparameter search algorithm generally includes:
- the objective function, i.e. the quantity the algorithm aims to maximize or minimize;
- the search range, usually specified by upper and lower bounds;
- other algorithm parameters, such as the search step size.
(1) Grid search
Grid search is probably the simplest and most widely used hyperparameter search algorithm. It finds the optimum by evaluating every point in the search range. With a large search range and a small step size, grid search has a high probability of finding the global optimum. However, this scheme consumes a lot of computing resources and time, especially when many hyperparameters need tuning. Therefore, in practice, grid search typically starts with a wide search range and a large step size to locate the likely region of the global optimum, then gradually narrows the range and shrinks the step size to find a more accurate optimum. This reduces the required time and computation, but because the objective function is generally non-convex, it may still miss the global optimum.
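A minimal grid-search sketch over a toy two-parameter objective (everything below is illustrative, not a specific library's API):

```python
import itertools

def grid_search(objective, grid):
    """Evaluate objective at every point of the Cartesian grid; return the best point."""
    best_params, best_score = None, float("inf")
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# toy objective with its minimum at x=2, y=-1
obj = lambda p: (p["x"] - 2) ** 2 + (p["y"] + 1) ** 2
best, score = grid_search(obj, {"x": [0, 1, 2, 3], "y": [-2, -1, 0]})
print(best, score)  # {'x': 2, 'y': -1} 0
```

Note the cost: the number of evaluations is the product of all grid sizes, which is why the coarse-to-fine strategy described above is needed when there are many hyperparameters.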
(2) Random search
The idea of random search is similar to grid search, except that instead of testing all values between the upper and lower bounds, it randomly samples points from the search range. Its theoretical basis: if the set of sampled points is large enough, random sampling will with high probability find the global optimum or a good approximation of it. Random search is generally faster than grid search, but, like the coarse-to-fine variant of grid search, its result is not guaranteed.
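Random search can be sketched similarly, drawing uniform samples from each parameter's range instead of enumerating a grid (again an illustrative sketch with hypothetical names):

```python
import random

def random_search(objective, bounds, n_iter=200, seed=0):
    """Sample n_iter random points within bounds; return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# same toy objective: minimum at x=2, y=-1
obj = lambda p: (p["x"] - 2) ** 2 + (p["y"] + 1) ** 2
best, score = random_search(obj, {"x": (0, 4), "y": (-3, 1)})
print(score)  # close to 0, but not guaranteed to be exactly the optimum
```

The result gets close to the optimum with enough samples, but, as the text notes, nothing guarantees it hits the optimum exactly.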
(3) Bayesian optimization
When searching for the optimal hyperparameters, Bayesian optimization works in a fundamentally different way from grid search and random search. Grid search and random search ignore the information from previously evaluated points when testing a new one, whereas Bayesian optimization makes full use of it: it learns the shape of the objective function and finds the hyperparameters that push the objective toward the global optimum.
Specifically, it learns the shape of the objective function as follows:
- first, based on a prior distribution, assume a surrogate function for the objective;
- then, each time the objective function is evaluated at a new sample point, use this information to update the prior distribution of the objective function;
- finally, test the point where the posterior distribution indicates the global optimum is most likely to lie.
One caveat with Bayesian optimization: once a local optimum is found, it tends to keep sampling in that neighborhood, so it can easily get stuck in a local optimum. To compensate for this defect, Bayesian optimization strikes a balance between exploration and exploitation: "exploration" means sampling in regions that have not yet been sampled, while "exploitation" means sampling, according to the posterior distribution, in the region where the global optimum is most likely to lie.
Overfitting and underfitting
During model evaluation and tuning, we often encounter "overfitting" and "underfitting".
Question 1: In model evaluation, what phenomena do overfitting and underfitting refer to?
Overfitting means the model fits the training data too closely. In terms of evaluation metrics, the model performs well on the training set but poorly on the test set and on new data.
Underfitting means the model performs poorly in both training and prediction: its evaluation metrics are poor on the training set as well as on the test set.
Question 2: What can be done to reduce the risk of overfitting and underfitting?
(1) Ways to reduce the risk of overfitting
- Start with the data: obtain more data. Using more training data is the most effective way to combat overfitting, because more samples let the model learn more effective features and reduce the influence of noise. Of course, directly collecting more experimental data is usually difficult, but training data can be expanded by rules. For image classification, for instance, data can be augmented by translating, rotating, and scaling images; going further, generative adversarial networks can synthesize large amounts of new training data.
- Reduce model complexity. When data is scarce, an overly complex model is the main cause of overfitting. Reducing model complexity keeps the model from fitting sampling noise, e.g. reducing the number of layers and neurons in a neural network, or limiting tree depth and pruning in a decision tree.
- Regularization. Add regularization constraints on the model parameters, for example by adding the norm of the weights to the loss function. Taking L2 regularization as an example:
$$C=C_0+\frac{\lambda}{2n} \sum_{i}w_i^2$$
This way, while optimizing the original objective $C_0$, we also avoid the overfitting risk caused by overly large weights.
- Ensemble learning methods. Ensemble learning combines multiple models to reduce the overfitting risk of any single model, e.g. the Bagging method.
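The L2 penalty above can be illustrated in a few lines (a toy computation, not tied to any particular framework; the function name is my own):

```python
def l2_regularized_loss(base_loss, weights, lam, n):
    """C = C0 + (lam / 2n) * sum(w_i^2): penalize large weights on top of the base loss."""
    return base_loss + lam / (2 * n) * sum(w * w for w in weights)

# for the same base loss, larger weights pay a larger penalty
small = l2_regularized_loss(1.0, [0.1, -0.2], lam=0.5, n=10)
large = l2_regularized_loss(1.0, [3.0, -4.0], lam=0.5, n=10)
print(small < large)  # True
```

This is why L2 regularization discourages overly large weights: the optimizer trades a small increase in the base loss for a large reduction in the penalty term.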
(2) Ways to reduce the risk of underfitting
- Add new features. When features are insufficient, or the existing features correlate weakly with the sample labels, the model tends to underfit. Mining new features such as contextual features, ID features, and feature combinations often yields better results. In the era of deep learning, many models can help with feature engineering, e.g. factorization machines, gradient-boosted decision trees, and Deep-crossing can all be used to enrich features.
- Increase model complexity. A simple model has weak learning capacity; increasing model complexity gives it stronger fitting ability, e.g. adding higher-order terms to a linear model, or increasing the number of layers or neurons in a neural network.
- Reduce the regularization coefficient. Regularization is used to prevent overfitting, so when the model underfits, the regularization coefficient should be reduced accordingly.
References
[1] 《Baimian machine learning》 (百面机器学习), Chapter 2: Model Evaluation