mixup_ratio
2022-07-28 06:22:00 【A tavern on the mountain】
mixup_ratio can be understood as mixing two samples in a given ratio to generate a new sample; it is a form of data augmentation.
1. Brief introduction
Large deep neural networks are powerful, but their memorization behavior and their sensitivity to adversarial examples (deliberately perturbed inputs) remain far from ideal. Essentially, mixup trains a neural network on convex combinations of pairs of samples and their labels. By doing so, mixup regularizes the network toward simple linear behavior in between training samples. mixup improves the generalization of state-of-the-art neural network architectures (often by more than one point). It also reduces the memorization of corrupt labels, increases robustness to adversarial examples, and stabilizes the training of generative adversarial networks.
2. Background
Large-scale deep neural networks have achieved great breakthroughs in recent years, but they share two common traits:
First, deep networks are trained to minimize the average error over their training data, a principle known as Empirical Risk Minimization (ERM) (Vapnik, 1998). Second, the size of these state-of-the-art networks scales linearly with the number of training samples: large-scale models are fitted to large-scale datasets.
The conflict is this: classical machine learning theory tells us that as long as the size of the learning machine (e.g., a neural network) does not grow with the number of training samples, the convergence of ERM is guaranteed (a model that is not too complex converges to the data distribution, whereas a highly complex model easily overfits). Here the size of the learning machine is measured by its number of parameters, or by its VC complexity (Harvey et al., 2017). This contradiction challenges the suitability of ERM for training today's neural networks.
On the one hand, even under strong regularization, or in classification problems where labels are assigned at random, ERM allows large neural networks to memorize (rather than generalize from) the training data. (When the model is large, its capacity to fit data is very strong, and overfitting becomes a real problem.)
On the other hand, a neural network trained with ERM changes its predictions drastically when evaluated on samples outside the training distribution (adversarial examples: inputs formed by adding subtle, deliberate perturbations to the data). (After all, train and test then come from different distributions.)
This evidence suggests that when the test distribution differs even slightly from the training distribution, ERM offers neither good explanation nor good generalization (the training data distribution cannot represent the test distribution well). Hence data augmentation (Simard et al., 1998), which trains on similar but different samples, was formalized by the Vicinal Risk Minimization (VRM) principle. In VRM, expert knowledge is needed to describe a vicinity (neighborhood) around each training sample, so that additional virtual samples can be drawn from these vicinities to enlarge the support of the training distribution. Data augmentation can improve generalization, but the procedure is dataset-dependent and requires expert knowledge. Moreover, data augmentation assumes that the samples in a vicinity share the same class, and it does not model the vicinal relations across samples of different classes.
3. Main idea of mixup
Motivated by these issues, the authors propose a simple and data-agnostic data augmentation method called mixup. In short, mixup constructs virtual training samples:
x̃ = λ xi + (1 − λ) xj
ỹ = λ yi + (1 − λ) yj
where (xi, yi) and (xj, yj) are two samples drawn at random from the training data, and λ ∈ [0, 1]. Thus mixup extends the training distribution by incorporating the prior knowledge that linear interpolation of feature vectors should lead to linear interpolation of the associated labels (for example, blending 60% of a cat image with 40% of a dog image into a new image whose label is 60% cat and 40% dog). The weight λ follows a Beta distribution: λ ~ Beta(α, α), α ∈ (0, ∞).

The mixup hyperparameter α controls the strength of the interpolation between feature-target pairs; as α → 0, mixup recovers the ERM principle.
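A minimal PyTorch-style sketch of this construction (not the authors' reference implementation): the batch is mixed with a shuffled copy of itself, and λ is sampled from Beta(α, α). The tensor names `x` (inputs) and `y` (integer class labels) are placeholders for illustration.

```python
import numpy as np
import torch

def mixup_batch(x, y, alpha=1.0):
    """Build virtual samples x~ = lam * x_i + (1 - lam) * x_j with lam ~ Beta(alpha, alpha).

    Pairs (i, j) come from mixing the batch with a shuffled copy of itself.
    As alpha -> 0, Beta(alpha, alpha) concentrates on {0, 1}, so mixup degenerates to ERM.
    """
    lam = float(np.random.beta(alpha, alpha)) if alpha > 0 else 1.0
    index = torch.randperm(x.size(0))
    mixed_x = lam * x + (1.0 - lam) * x[index]
    # Return both label sets and lam; the loss is then mixed with the same weights.
    return mixed_x, y, y[index], lam
```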

Figure 1b shows that mixup provides a smoother transition between classes when estimating uncertainty. The decision boundary in the left plot (ERM) is "hard", an all-or-nothing split, while the mixup plot on the right has a soft transition zone. This tolerance for error weakens the memorization of wrong labels. (Somewhat analogous to the soft margin in SVM: a softer penalty instead of a rigid one.)
4. Interpretation
The mixup vicinal distribution can be understood as a form of data augmentation that encourages the model to behave linearly in the regions between training samples. This linear behavior reduces undesirable oscillation when predicting outside the training samples (it increases robustness and adaptability). From the perspective of Occam's razor, linearity is a good inductive bias, because it is one of the simplest possible behaviors.

Figure (a) shows that mixup leads to decision boundaries that transition linearly from one class to another, providing a smoother estimate of uncertainty. Figure (b) shows the average behavior of two neural network models, trained with mixup and with ERM respectively, on the CIFAR-10 dataset. The two models share the same architecture, use the same training procedure, and are evaluated at the same points sampled at random in between training data. The model trained with mixup is more stable when predicting in between training data. (It smooths the discrete sample space toward a continuous one, improving smoothness in each neighborhood.)
5. Discussion
As α increases, the training error on real data increases, while the generalization gap shrinks. This supports the authors' hypothesis that mixup implicitly controls model complexity. However, the authors did not find a good way to locate the best point on the bias-variance trade-off. For example, on CIFAR-10 the training error on real data remains very low even as α → ∞, whereas on ImageNet classification the training error on real data rises significantly as α → ∞. Based on experiments with different network architectures on ImageNet and the Google Commands dataset, the authors found that increasing network capacity makes the training error less sensitive to large α values, which gives mixup an additional advantage.
6. Choice of label when computing the loss
Labels are one-hot encoded vectors, which can be read as giving, for each of the k classes, the probability that the sample belongs to that class. After weighting, the label becomes "two-hot": the sample belongs to both of the pre-mixing classes at once. The pseudocode in the paper computes the loss with this kind of label.
For example, a prediction of [0, 0.55, 0.35, 0.05, 0.05] is evaluated against the mixed label [0, 0.6, 0.4, 0, 0].
Another perspective keeps the labels unmixed: feed the mixed input to the network, compute the cross-entropy loss separately against each of the two original labels, and take the weighted sum of the two losses as the final loss. Because cross-entropy loss is linear in the target, this is equivalent to linearly weighting the labels.
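A sketch of the two equivalent loss formulations, assuming a PyTorch classifier that outputs raw logits; `y_a`/`y_b` are the two original integer labels and `lam` is the mixing weight (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def loss_mixed_targets(logits, y_a, y_b, lam, num_classes):
    """Option 1: mix the one-hot labels into a 'two-hot' soft target."""
    target = (lam * F.one_hot(y_a, num_classes).float()
              + (1.0 - lam) * F.one_hot(y_b, num_classes).float())
    log_probs = F.log_softmax(logits, dim=1)
    return -(target * log_probs).sum(dim=1).mean()

def loss_weighted_ce(logits, y_a, y_b, lam):
    """Option 2: weight two hard-label cross-entropies; equal to option 1 by linearity."""
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```

Because the cross-entropy is linear in the target distribution, the two functions return the same value up to floating-point error.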
7. Several data augmentation methods

mixup: blend the pixels of two images in a given ratio.
cutout: mask out part of an image's pixels; the benefit is similar to the mask in MAE.
cutmix: cut a patch from one image and paste it onto another. According to the table above, this augmentation works best. (A sketch follows below.)
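For comparison, here is a rough, illustrative sketch of the CutMix mixing step (not the official code), assuming NCHW image batches; the box area fraction is approximately 1 − λ, and λ is recomputed from the actual pasted area:

```python
import numpy as np
import torch

def cutmix_batch(x, y, alpha=1.0):
    """Paste a random box from a shuffled copy of the batch into the original images."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0))
    shuffled = x[index]                      # advanced indexing returns a copy
    _, _, h, w = x.shape
    # Pick a box whose area fraction is roughly (1 - lam).
    cut_ratio = float(np.sqrt(1.0 - lam))
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = x.clone()
    mixed[:, :, y1:y2, x1:x2] = shuffled[:, :, y1:y2, x1:x2]
    # Recompute lam from the exact pasted area so labels mix by true pixel proportion.
    lam = 1.0 - float((y2 - y1) * (x2 - x1)) / (h * w)
    return mixed, y, y[index], lam
```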
8. Discussion of data augmentation
Teacher Zhang Hongyi's view on data augmentation and his explanation of mixup:
Data augmentation cannot simply be understood as increasing the amount of training data, nor simply as controlling model complexity; it combines both effects. Consider the common image-recognition augmentation of changing the aspect ratio (scaling along the height and width dimensions): although the generated images resemble real images, they do not come from the data distribution and are not IID (independent and identically distributed) samples from it. Classical supervised learning and statistical learning theory assume that the training and test sets are IID samples from the same data distribution, so this is not an increase in training data in the classical sense. The role of such synthetic training data is popularly explained as "enhancing the model's invariance (robustness, or generalization) to certain transformations". The flip side of that statement is the phrase often heard in machine learning, "reducing the variance of the estimate", i.e. controlling model complexity. Note that L2 regularization, dropout, and similar techniques also control model complexity, but they ignore the data distribution itself, whereas data augmentation is a more informed way of controlling model complexity, one that takes the sample distribution into account.