mixup_ratio
2022-07-28 06:22:00 【A tavern on the mountain】
mixup_ratio can be understood as mixing two samples in a given ratio to generate a new sample; it is a form of data augmentation.
1. Brief introduction
Large deep neural networks are powerful, but their memorization behavior and their sensitivity to adversarial examples (deliberately perturbed inputs) remain far from ideal. Essentially, mixup trains a neural network on convex combinations of pairs of samples and their labels. By doing so, mixup regularizes the network toward simple linear behavior in between training samples. mixup improves the generalization of state-of-the-art neural network architectures (often by more than one point). It also reduces the memorization of corrupt labels, increases robustness to adversarial examples, and stabilizes the training of generative adversarial networks.
2. Background
Large-scale deep neural networks have achieved great breakthroughs in recent years, but they share two common traits:
First, deep networks are trained to minimize the average error over their training data, a principle known as Empirical Risk Minimization (ERM) (Vapnik, 1998). Second, the size of these state-of-the-art networks scales linearly with the number of training samples: large-scale models are fitted to large-scale datasets.
The conflict is this: classical machine learning theory tells us that as long as the size of the learning machine (e.g., a neural network) does not grow with the number of training samples, the convergence of ERM is guaranteed (a model that is not too complex converges to the data distribution, whereas a highly complex model easily overfits). Here the size of the learning machine is measured by its number of parameters, or by its VC complexity (Harvey et al., 2017). This contradiction challenges the suitability of ERM for training today's neural networks.
On the one hand, even under strong regularization, or in classification problems where labels are assigned at random, ERM allows large neural networks to memorize (rather than generalize from) the training data. (When the model is large, its capacity to fit data is very strong, and overfitting becomes a real problem.)
On the other hand, a neural network trained with ERM changes its predictions drastically when evaluated on samples outside the training distribution (adversarial examples: inputs formed by adding subtle, deliberate perturbations to the data). (After all, train and test then come from different distributions.)
This evidence suggests that when the test distribution differs even slightly from the training distribution, ERM offers neither good explanation nor good generalization (the training data distribution cannot represent the test distribution well). Hence data augmentation (Simard et al., 1998), which trains on similar but different samples, was formalized by the Vicinal Risk Minimization (VRM) principle. In VRM, expert knowledge is needed to describe a vicinity (neighborhood) around each training sample, so that additional virtual samples can be drawn from these vicinities to enlarge the support of the training distribution. Data augmentation can improve generalization, but the procedure is dataset-dependent and requires expert knowledge. Moreover, data augmentation assumes that the samples in a vicinity share the same class, and it does not model the vicinal relations across samples of different classes.
3. Main idea of mixup
Motivated by these issues, the authors propose a simple and data-agnostic data augmentation method called mixup. In short, mixup constructs virtual training samples:
x̃ = λ xi + (1 − λ) xj
ỹ = λ yi + (1 − λ) yj
where (xi, yi) and (xj, yj) are two samples drawn at random from the training data, and λ ∈ [0, 1]. Thus mixup extends the training distribution by incorporating the prior knowledge that linear interpolation of feature vectors should lead to linear interpolation of the associated labels (for example, blending 60% of a cat image with 40% of a dog image into a new image whose label is 60% cat and 40% dog). The weight λ follows a Beta distribution: λ ~ Beta(α, α), α ∈ (0, ∞).

The mixup hyperparameter α controls the strength of the interpolation between feature-target pairs; as α → 0, mixup recovers the ERM principle.
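A minimal PyTorch-style sketch of this construction (not the authors' reference implementation): the batch is mixed with a shuffled copy of itself, and λ is sampled from Beta(α, α). The tensor names `x` (inputs) and `y` (integer class labels) are placeholders for illustration.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=1.0):
    """Build virtual samples x~ = lam * x_i + (1 - lam) * x_j with lam ~ Beta(alpha, alpha).

    Pairs (i, j) come from mixing the batch with a shuffled copy of itself.
    As alpha -> 0, Beta(alpha, alpha) concentrates on {0, 1}, so mixup degenerates to ERM.
    """
    lam = float(np.random.beta(alpha, alpha)) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[index]
    # Return both label sets and lam; the loss is then mixed with the same weights.
    return mixed_x, y, y[index], lam
```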

Figure 1b shows that mixup provides a smoother transition between classes when estimating uncertainty. The decision boundary in the left plot (ERM) is "hard", an all-or-nothing split, while the mixup plot on the right has a soft transition zone. This tolerance for error weakens the memorization of wrong labels. (Somewhat analogous to the soft margin in SVM: a softer penalty instead of a rigid one.)
4. Interpretation
The mixup vicinal distribution can be understood as a form of data augmentation that encourages the model to behave linearly in the regions between training samples. This linear behavior reduces undesirable oscillation when predicting outside the training samples (it increases robustness and adaptability). From the perspective of Occam's razor, linearity is a good inductive bias, because it is one of the simplest possible behaviors.

Figure (a) shows that mixup leads to decision boundaries that transition linearly from one class to another, providing a smoother estimate of uncertainty. Figure (b) shows the average behavior of two neural network models, trained with mixup and with ERM respectively, on the CIFAR-10 dataset. The two models share the same architecture, use the same training procedure, and are evaluated at the same points sampled at random in between training data. The model trained with mixup is more stable when predicting in between training data. (It smooths the discrete sample space toward a continuous one, improving smoothness in each neighborhood.)
5. Discussion
As α increases, the training error on real data increases, while the generalization gap shrinks. This supports the authors' hypothesis that mixup implicitly controls model complexity. However, the authors did not find a good way to locate the best point on the bias-variance trade-off. For example, on CIFAR-10 the training error on real data remains very low even as α → ∞, whereas on ImageNet classification the training error on real data rises significantly as α → ∞. Based on experiments with different network architectures on ImageNet and the Google Commands dataset, the authors found that increasing network capacity makes the training error less sensitive to large α values, which gives mixup an additional advantage.
6. Choice of label when computing the loss
Labels are one-hot encoded vectors, which can be read as giving, for each of the k classes, the probability that the sample belongs to that class. After weighting, the label becomes "two-hot": the sample belongs to both of the pre-mixing classes at once. The pseudocode in the paper computes the loss with this kind of label.
For example, a prediction of [0, 0.55, 0.35, 0.05, 0.05] is evaluated against the mixed label [0, 0.6, 0.4, 0, 0].
Another perspective keeps the labels unmixed: feed the mixed input to the network, compute the cross-entropy loss separately against each of the two original labels, and take the weighted sum of the two losses as the final loss. Because cross-entropy loss is linear in the target, this is equivalent to linearly weighting the labels.
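A sketch of the two equivalent loss formulations, assuming a PyTorch classifier that outputs raw logits; `y_a`/`y_b` are the two original integer labels and `lam` is the mixing weight (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def loss_mixed_targets(logits, y_a, y_b, lam, num_classes):
    """Option 1: mix the one-hot labels into a 'two-hot' soft target."""
    target = (lam * F.one_hot(y_a, num_classes).float()
              + (1.0 - lam) * F.one_hot(y_b, num_classes).float())
    log_probs = F.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()

def loss_weighted_ce(logits, y_a, y_b, lam):
    """Option 2: weight two hard-label cross-entropies; equal to option 1 by linearity."""
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```

Because the cross-entropy is linear in the target distribution, the two functions return the same value up to floating-point error.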
7. Several data augmentation methods

mixup: blend the pixels of two images in a given ratio.
cutout: mask out part of an image's pixels; the benefit is similar to the mask in MAE.
cutmix: cut a patch from one image and paste it onto another. According to the table above, this augmentation works best. (A sketch follows below.)
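For comparison, here is a rough, illustrative sketch of the CutMix mixing step (not the official code), assuming NCHW image batches; the box area fraction is approximately 1 − λ, and λ is recomputed from the actual pasted area:

```python
import numpy as np
import torch

def cutmix_batch(x, y, alpha=1.0):
    """Paste a random box from a shuffled copy of the batch into the original images."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0))
    shuffled = x[index]                      # advanced indexing returns a copy
    _, _, h, w = x.shape
    # Pick a box whose area fraction is roughly (1 - lam).
    cut_ratio = float(np.sqrt(1.0 - lam))
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = x.clone()
    mixed[:, :, y1:y2, x1:x2] = shuffled[:, :, y1:y2, x1:x2]
    # Recompute lam from the exact pasted area so labels mix by true pixel proportion.
    lam = 1.0 - float((y2 - y1) * (x2 - x1)) / (h * w)
    return mixed, y, y[index], lam
```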
8. Discussion of data augmentation
Teacher Zhang Hongyi's view on data augmentation and his explanation of mixup:
Data augmentation cannot simply be understood as increasing the amount of training data, nor simply as controlling model complexity; it combines both effects. Consider the common image-recognition augmentation of changing the aspect ratio (scaling along the height and width dimensions): although the generated images resemble real images, they do not come from the data distribution and are not IID (independent and identically distributed) samples from it. Classical supervised learning and statistical learning theory assume that the training and test sets are IID samples from the same data distribution, so this is not an increase in training data in the classical sense. The role of such synthetic training data is popularly explained as "enhancing the model's invariance (robustness, or generalization) to certain transformations". The flip side of that statement is the phrase often heard in machine learning, "reducing the variance of the estimate", i.e. controlling model complexity. Note that L2 regularization, dropout, and similar techniques also control model complexity, but they ignore the data distribution itself, whereas data augmentation is a more informed way of controlling model complexity, one that takes the sample distribution into account.