[Machine Learning Q&A] Data Sampling and Model Validation Methods, Hyperparameter Optimization, Overfitting and Underfitting
2022-06-30 01:25:00 【Sickle leek】
Data Sampling and Model Validation Methods, Hyperparameter Optimization, Overfitting and Underfitting
- Data sampling and model validation methods
  - Question 1: What are the main validation methods used in model evaluation, and what are their advantages and disadvantages?
  - Question 2: In bootstrap sampling, if we draw n samples with replacement n times, how much of the data is never selected as n approaches infinity?
- Hyperparameter tuning
- Overfitting and underfitting
- References
Data sampling and model validation methods
In machine learning, the samples are usually divided into a training set and a test set: the training set is used to train the model, and the test set is used to evaluate it. Different sampling and validation methods can be used when splitting the samples and validating the model.
Question 1: What are the main validation methods used in model evaluation, and what are their advantages and disadvantages?
(1) Holdout validation
Holdout validation is the simplest and most direct validation method: it randomly divides the original sample set into two parts, a training set and a validation set.
For example, for a click-through-rate prediction model, we can split the samples in a 70%/30% ratio: 70% of the samples are used for model training and 30% for model validation, including plotting the ROC curve and computing accuracy and recall to evaluate model performance.
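The 70/30 split above can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data (the data set, model, and split ratio are assumptions for demonstration, not from the original example):

```python
# Minimal holdout-validation sketch: random 70%/30% split, then evaluate
# the model on the held-out 30% with ROC AUC. Data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Randomly hold out 30% of the samples for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
```

Note that a different `random_state` in the split would give a different `auc`, which is exactly the randomness the next paragraph criticizes.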
The drawback of holdout validation is obvious: the final evaluation metric computed on the validation set depends heavily on how the original samples happened to be split. To eliminate this randomness, researchers introduced "cross-validation".
(2) Cross-validation
k-fold cross-validation: first divide all the samples into $k$ equal-sized subsets. Traverse the $k$ subsets one by one, each time using the current subset as the validation set and all the other subsets as the training set, then train and evaluate the model. Finally, take the average of the $k$ evaluation metrics as the final evaluation metric. In practical experiments, $k$ is frequently set to 10.
Leave-one-out validation: each round, leave out 1 sample as the validation set and use all the other samples as the training set. With $n$ samples in total, traverse all $n$ samples, performing $n$ rounds of validation, and then average the evaluation metrics to obtain the final metric.
When the number of samples is large, leave-one-out validation is very time-consuming. In fact, leave-one-out is a special case of leave-$p$-out validation, which leaves out $p$ samples as the validation set each time. Choosing $p$ elements out of $n$ gives $C_n^p$ possibilities, so its time cost is far higher than leave-one-out, and it is therefore rarely used.
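The k-fold procedure described above (with the common value $k = 10$) can be sketched with scikit-learn; the data set and model here are illustrative assumptions:

```python
# 10-fold cross-validation sketch: split into 10 equal subsets, validate on
# each subset in turn, and average the 10 scores as the final metric.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)

# The final evaluation metric is the mean over the 10 folds.
mean_score = scores.mean()
```

Leave-one-out is the same loop with `n_splits` equal to the sample count (scikit-learn also provides `LeaveOneOut` for this), which is why it becomes so expensive for large $n$.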
(3) Bootstrap
Both holdout validation and cross-validation evaluate the model by splitting the samples into a training set and a test set. However, when the sample size is small, splitting the sample set further shrinks the training set, which may hurt model training. Is there a validation method that keeps the training set at the full sample size? The bootstrap method addresses this problem.
The bootstrap is a validation method based on bootstrap sampling. For a sample set of size $n$, perform $n$ random draws with replacement to obtain a training set of size $n$. During the $n$ draws, some samples are drawn repeatedly while others are never drawn; the samples that were never drawn are used as the validation set for model validation. This is the bootstrap validation process.
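One round of the bootstrap procedure can be sketched directly with NumPy; the sample size here is an arbitrary illustration:

```python
# One round of bootstrap sampling: draw n samples with replacement as the
# training set, and use the never-drawn ("out-of-bag") samples as validation.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

boot_idx = rng.choice(indices, size=n, replace=True)  # training-set indices
oob_mask = ~np.isin(indices, boot_idx)                # samples never drawn
oob_idx = indices[oob_mask]                           # validation-set indices

oob_fraction = oob_idx.size / n
```

Note the training set has size $n$ (with duplicates), so no data is "lost" to the split; the validation set is whatever was never drawn.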
Question 2: In bootstrap sampling, if we draw $n$ samples with replacement $n$ times, how much of the data is never selected as $n$ approaches infinity?
The probability that a given sample is not selected in one draw is $1-\frac{1}{n}$, so the probability that it is never selected in any of the $n$ draws is $(1-\frac{1}{n})^n$. As $n$ approaches infinity, this probability tends to $\lim_{n \to \infty} (1-\frac{1}{n})^n$. By the well-known limit $\lim_{n\to \infty}(1+\frac{1}{n})^n=e$, we have

$$\lim_{n \to \infty} \left(1-\frac{1}{n}\right)^n=\lim_{n \to \infty} \left(\frac{n-1}{n}\right)^n=\lim_{n\to \infty}\left(\frac{1}{\frac{n}{n-1}}\right)^n=\lim_{n \to \infty}\frac{1}{\left(1+\frac{1}{n-1}\right)^n}=\frac{1}{\lim_{n\to \infty} \left(1+\frac{1}{n-1}\right)^{n-1}}\cdot \frac{1}{\lim_{n\to \infty}\left(1+\frac{1}{n-1}\right)}=\frac{1}{e}\approx 0.368$$
Therefore, when the number of samples is large, about 36.8% of the samples are never selected and can be used as the validation set.
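The limit above can be checked numerically: repeating the bootstrap experiment and averaging the never-selected fraction should give a value close to $1/e \approx 0.368$ (the sample size and trial count below are arbitrary):

```python
# Numerical check of the 1/e limit: average the never-selected fraction
# over many bootstrap experiments.
import math
import numpy as np

rng = np.random.default_rng(1)
n, trials = 2000, 200

fractions = []
for _ in range(trials):
    drawn = rng.integers(0, n, size=n)        # n draws with replacement
    never = n - np.unique(drawn).size         # samples never selected
    fractions.append(never / n)

avg_fraction = float(np.mean(fractions))      # approaches 1/e for large n
```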
Hyperparameter tuning
Question 1: What methods are there for tuning hyperparameters?
For hyperparameter tuning, we usually use algorithms such as grid search, random search, and Bayesian optimization. Before introducing these algorithms, it is worth clarifying the elements a hyperparameter search algorithm generally involves:
- first, the objective function, i.e., the goal the algorithm maximizes or minimizes;
- second, the search range, usually determined by lower and upper bounds;
- third, other parameters of the algorithm, such as the search step size.
(1) Grid search
Grid search is probably the simplest and most widely used hyperparameter search algorithm. It determines the optimal value by evaluating every point in the search range. With a sufficiently wide search range and a small enough step size, grid search has a high probability of finding the global optimum. However, this search scheme consumes a lot of computing resources and time, especially when there are many hyperparameters to tune. Therefore, in practice, grid search is generally run first with a wide search range and a large step size to locate the likely position of the global optimum; the search range and step size are then gradually narrowed to find a more accurate optimum. This procedure reduces the required time and computation, but since the objective function is generally non-convex, it can easily miss the global optimum.
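A single (non-refined) grid search over two SVM hyperparameters can be sketched with scikit-learn's `GridSearchCV`; the model, data, and grid values are illustrative assumptions:

```python
# Grid-search sketch: exhaustively evaluate every point in a fixed grid of
# hyperparameter values, scoring each by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "C": [0.1, 1, 10],        # search range sampled at a fixed step
    "gamma": [0.01, 0.1, 1],
}
search = GridSearchCV(SVC(), param_grid, cv=5)  # tries all 3 x 3 = 9 points
search.fit(X, y)

best_params = search.best_params_
```

The coarse-then-fine strategy from the text would simply rerun this with a narrower `param_grid` centered on `best_params`.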
(2) Random search
The idea of random search is similar to grid search, except that instead of testing all values between the lower and upper bounds, it randomly samples points from the search range. Its theoretical basis is that if the set of sampled points is large enough, random sampling will with high probability find the global optimum or a good approximation of it. Random search is generally faster than grid search, but like the fast, coarse version of grid search, its result is not guaranteed.
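Random search over the same kind of search space can be sketched with `RandomizedSearchCV`, drawing candidate values from distributions rather than a grid (scipy is assumed available; distributions and budget are illustrative):

```python
# Random-search sketch: sample a fixed number of points from the search
# range instead of enumerating a grid.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

param_dist = {
    "C": loguniform(1e-2, 1e2),      # sample C log-uniformly in [0.01, 100]
    "gamma": loguniform(1e-3, 1e1),
}
search = RandomizedSearchCV(SVC(), param_dist, n_iter=20, cv=5,
                            random_state=0)
search.fit(X, y)

best_params = search.best_params_
```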
(3) Bayesian optimization
When searching for the optimal hyperparameters, Bayesian optimization works in a completely different way from grid search and random search. When testing a new point, grid search and random search ignore the information from previously tested points, whereas Bayesian optimization makes full use of it: it learns the shape of the objective function and finds the hyperparameters that push the objective toward the global optimum.
Specifically, it learns the shape of the objective function as follows:
- first, based on a prior distribution, assume a surrogate function;
- then, each time a new sample point is used to test the objective function, use this information to update the prior distribution of the objective function;
- finally, the algorithm tests the point where, according to the posterior distribution, the global optimum is most likely to be.
Bayesian optimization has one caveat: once it finds a local optimum, it keeps sampling in that region, so it can easily get stuck in a local optimum. To compensate for this, Bayesian optimization strikes a balance between exploration and exploitation: "exploration" means sampling in regions that have not yet been sampled, while "exploitation" means sampling, according to the posterior distribution, in the region where the global optimum is most likely to occur.
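The loop described above can be sketched minimally with a Gaussian process as the surrogate and an upper-confidence-bound acquisition as one simple way to trade off exploration and exploitation. The 1-D toy objective, the UCB rule, and all constants below are assumptions for illustration, not the canonical algorithm:

```python
# Bayesian-optimization sketch: fit a GP surrogate to observed points, then
# pick the next point by maximizing mean + 2*std (explore where std is high,
# exploit where the predicted mean is high).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    return -(x - 2.0) ** 2  # toy objective with its maximum at x = 2

rng = np.random.default_rng(0)
grid = np.linspace(0, 5, 200).reshape(-1, 1)

# Start from a few random evaluations of the objective.
X_obs = rng.uniform(0, 5, size=(3, 1))
y_obs = objective(X_obs).ravel()

for _ in range(10):
    gp = GaussianProcessRegressor().fit(X_obs, y_obs)  # update surrogate
    mu, sigma = gp.predict(grid, return_std=True)
    ucb = mu + 2.0 * sigma          # exploitation term + exploration term
    x_next = grid[np.argmax(ucb)].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, objective(x_next).ravel())

best_x = float(X_obs[np.argmax(y_obs)])
```

Production libraries use more principled acquisition functions (e.g., expected improvement), but the structure of the loop is the same.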
Overfitting and underfitting
During model evaluation and tuning, we often encounter "overfitting" and "underfitting".
Question 1: In model evaluation, what phenomena do overfitting and underfitting refer to?
Overfitting means the model fits the training data too closely. Reflected in the evaluation metrics, the model performs well on the training set but poorly on the test set and on new data.
Underfitting means the model performs poorly in both training and prediction: its metrics are poor on the training set and the test set alike.
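Both phenomena can be shown on noisy 1-D data by varying polynomial degree; the data, degrees, and metric below are illustrative assumptions:

```python
# A degree-1 polynomial underfits (both R^2 scores low), while a degree-15
# polynomial overfits (train R^2 high, test R^2 noticeably lower).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.2, 40)
x_train, x_test = x[::2], x[1::2]   # interleaved train/test split
y_train, y_test = y[::2], y[1::2]

def fit_scores(degree):
    feats = PolynomialFeatures(degree)
    model = LinearRegression().fit(feats.fit_transform(x_train), y_train)
    return (r2_score(y_train, model.predict(feats.transform(x_train))),
            r2_score(y_test, model.predict(feats.transform(x_test))))

underfit_train, underfit_test = fit_scores(1)   # both scores poor
overfit_train, overfit_test = fit_scores(15)    # train high, test worse
```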
Question 2: What can be done to reduce the risk of overfitting and underfitting?
(1) Methods to reduce the risk of overfitting
- Start with the data and obtain more of it. Using more training data is the most effective way to combat overfitting, because more samples let the model learn more, and more effective, features and reduce the influence of noise. Of course, directly collecting more experimental data is usually difficult, but the training data can be expanded by certain rules. For example, in image classification, data can be augmented by translating, rotating, and scaling images; going further, generative adversarial networks can be used to synthesize large amounts of new training data.
- Reduce model complexity. When data are scarce, an overly complex model is the main cause of overfitting. Reducing model complexity prevents the model from fitting sampling noise; for example, reduce the number of layers or neurons in a neural network, or limit tree depth and apply pruning in a decision tree.
- Regularization. Add regularization constraints on the model parameters, for example by adding a weight penalty to the loss function. Taking L2 regularization as an example:

$$C=C_0+\frac{\lambda}{2n}\cdot \sum_{i}w_i^2$$

This way, while optimizing the original objective $C_0$, the model also avoids the overfitting risk caused by excessively large weights.
- Ensemble learning. Ensemble methods combine multiple models to reduce the overfitting risk of any single model, e.g., bagging.
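The effect of the squared-weight penalty can be illustrated by comparing an unregularized high-degree polynomial fit with scikit-learn's `Ridge`, whose `alpha` plays the role of the regularization coefficient $\lambda$; the data and degree are illustrative assumptions:

```python
# L2 regularization sketch: Ridge shrinks the weights of a high-degree
# polynomial fit, reducing the wild coefficients an unregularized fit picks up.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 30)

X = PolynomialFeatures(degree=12).fit_transform(x)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# The L2 penalty shrinks the squared weight norm toward zero.
plain_norm = float(np.sum(plain.coef_ ** 2))
ridge_norm = float(np.sum(ridge.coef_ ** 2))
```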
(2) Methods to reduce the risk of underfitting
- Add new features. When the features are insufficient or only weakly correlated with the sample labels, the model is prone to underfitting. Mining new features such as "contextual features", "ID-class features", and "combination features" often brings noticeable improvement. In the deep learning era, many models can help with feature engineering: factorization machines, gradient-boosted decision trees, Deep-crossing, and so on can all be used to enrich features.
- Increase model complexity. A simple model has limited learning capacity; increasing model complexity gives it stronger fitting ability. For example, add higher-order terms to a linear model, or increase the number of layers or neurons in a neural network.
- Reduce the regularization coefficient. Regularization is used to prevent overfitting, so when the model underfits, the regularization coefficient should be reduced.
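The "add new features / increase complexity" remedies can be demonstrated in a few lines: a plain linear model underfits cubic data, and adding polynomial features (one way to increase complexity) fixes it. The data-generating function and degree are illustrative assumptions:

```python
# Underfitting remedy sketch: a linear model underfits cubic data; adding
# polynomial features up to degree 3 gives the model enough capacity.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, (100, 1))
y = x.ravel() ** 3 - 2 * x.ravel() + rng.normal(0, 1, 100)

linear_score = LinearRegression().fit(x, y).score(x, y)          # underfits
X_poly = PolynomialFeatures(degree=3).fit_transform(x)
poly_score = LinearRegression().fit(X_poly, y).score(X_poly, y)  # fits well
```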
References
[1] "Baimian Machine Learning" (《百面机器学习》), Chapter 2, Model Evaluation