
Roundup | ICLR 2022 Transfer Learning and Vision Transformer Paper Summary

2022-06-11 04:54:00 Shenlan Shenyan AI

A look at some of the higher-scoring and more interesting ICLR 2022 papers, with a focus on domain generalization (DG).

01 Oral——A Fine-Grained Analysis on Distribution Shift

Paper: ICLR 2022 Oral, A Fine-Grained Analysis on Distribution Shift.

Robustness to distribution shift is critical when models are deployed, and domain generalization (DG) is the field devoted to this problem. Although a great deal of DG research has appeared, the vast majority of it deploys a proposed method on a few common benchmarks and relies on accuracy to validate the model's generalization ability. How to precisely define distribution shift, and how to systematically measure and improve model robustness, remain open questions. This paper tries to answer these two questions.

The authors mainly test the following six categories of methods:

1. Network architectures: ResNet18, ResNet50, ResNet101, ViT, and MLP.

2. Heuristic data augmentation. Weighted resampling causes samples to be reused many times, which leads to overfitting. To reduce overfitting, heuristic data augmentation can be used to enlarge the training data, for example with color jitter. Methods in this category include AugMix (without JSD), RandAugment, and AutoAugment (see the small sketch after this list).

3. Learned data augmentation: a conditional generative model is trained on the source domains and new samples are obtained by sampling from it. CycleGAN is a representative method.

4. Domain generalization methods: including the common invariant-representation approaches, with representative methods such as IRM, DeepCORAL, DANN, and MixUp.

5. Weighted resampling: each data point in the source domains is given a weight during training. This approach works well for domain generalization and long-tailed distributions. Representative methods are JTT and BN-Adapt.

6. Representation learning: here the authors mainly consider disentanglement methods such as β-VAE.
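
As a concrete illustration of the heuristic augmentation category above, here is a minimal torchvision-based sketch; the specific recipe and hyperparameters are illustrative choices, not the configuration used in the paper.

```python
# Minimal sketch of heuristic data augmentation (illustrative, not the paper's exact recipe).
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomResizedCrop(224),                        # spatial augmentation
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4,      # the "color jitter" mentioned above
                  saturation=0.4, hue=0.1),
    T.RandAugment(num_ops=2, magnitude=9),           # RandAugment policy
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```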

The authors define three kinds of distribution shift, as shown in the figure below:

1. Spurious correlation: certain relations between features and labels exist in the training distribution but not in the test set, e.g. the shapes and colors in panel (a) of the figure above.

2. Low-data drift: the attribute values are unevenly distributed during training but evenly distributed at test time.

3. Unseen data shift: some attribute values never appear during training but do appear at test time.

In addition, there are two dataset-related shifts: label noise and dataset size.

The main conclusions are as follows:

1. Although current methods can improve on ERM, no single method always performs best.

2. Pre-training performs quite well across datasets and distribution shifts, unless the test dataset is very large and domain-specific, such as the CAMELYON17 medical-imaging dataset.

3. Heuristic augmentation is not always good for generalization; it depends on the dataset, and choosing an appropriate augmentation is very important. For the rotated MNIST dataset, for example, image rotation is the best augmentation.

4. Learned data augmentation brings stable performance gains in all cases.

5. Domain generalization methods can bring gains on specific datasets, especially under low-data drift and unseen data shift, but overall the improvement is not as large as that of heuristic data augmentation.

6. The optimal algorithm differs across settings; in other words, there is currently no single most powerful algorithm.

The authors also give some practical tips for improving generalization performance:

1. Heuristic data augmentation is simple; try several different combinations.

2. If heuristic data augmentation does not work, consider techniques such as CycleGAN to learn new data.

3. In general, pre-training has been found useful for learning robust representations.

4. Current domain generalization, disentanglement, and reweighting methods bring limited improvement and do not help on all datasets. The more complex the method, the harder it is to generalize to other datasets.

02 Oral——Fine-Tuning Distorts Pretrained Features and Underperforms Out-of-Distribution

There are generally two ways to adapt a pre-trained model to a downstream task: fine-tuning, which updates all parameters of the model, and linear probing, which updates only the last linear layer. It is commonly believed that fine-tuning leads to better in-distribution accuracy and better generalization. This paper runs experiments on six datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR→STL, CIFAR10.1, FMoW) and finds that, compared with linear probing, fine-tuning gives on average about a 2% in-distribution improvement but a 6% drop in OOD performance.

The paper theoretically analyzes this tradeoff for fine-tuning an over-parameterized two-layer linear network, describing how fine-tuning distorts the high-quality pre-trained features and thereby lowers OOD accuracy. Based on this analysis, the paper proposes the improvement strategy shown in panel (c) of the figure below: a two-stage training strategy that uses linear probing as an initialization stage before fine-tuning. This simple change brings strong gains both ID and OOD (about 1% better ID, 8% better OOD). A small sketch of the two-stage recipe follows.
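
A minimal PyTorch sketch of this linear-probing-then-fine-tuning recipe, assuming a torchvision ResNet backbone, an existing `train_loader`, and placeholder epoch counts and learning rates; none of these settings come from the paper.

```python
# Minimal sketch of linear probing then fine-tuning (LP-FT).
# `train_loader` is assumed to exist; num_classes, epochs, and learning rates are placeholders.
import torch
import torchvision

def run_epochs(model, optimizer, loader, epochs):
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss_fn(model(images), labels).backward()
            optimizer.step()

num_classes = 10  # placeholder
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)   # new task head

# Stage 1: linear probing -- freeze the backbone, train only the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
run_epochs(model, torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9),
           train_loader, epochs=5)

# Stage 2: full fine-tuning, initialized from the probed head.
for p in model.parameters():
    p.requires_grad = True
run_epochs(model, torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9),
           train_loader, epochs=10)
```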

03 Spotlight——Towards a Unified View of Parameter-Efficient Transfer Learning

Fine-tuning large pre-trained language models on downstream tasks has become a common learning paradigm in NLP. The traditional approach of fine-tuning all parameters of the pre-trained model is costly because the number of parameters is so large. Recent work has proposed a variety of parameter-efficient transfer learning methods that adjust only a small number of (extra) parameters to achieve strong performance. They fall into the following categories:

1. Adapter tuning: small neural modules called adapters are inserted into each layer of the pre-trained network, and only the adapters are trained during fine-tuning (panel (a) of the figure above).

2. Prefix tuning and prompt tuning: extra trainable tokens are prepended to the input tokens, and only these tokens are trained during fine-tuning (panel (b)).

3. LoRA: trainable low-rank matrices are injected into the transformer layers to approximate the weight updates (panel (c)). A minimal sketch of this idea is given after the list.
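
To make the LoRA idea concrete, here is a minimal sketch of a linear layer wrapped with a trainable low-rank update; the rank, scaling, and initialization are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of a LoRA-style linear layer (rank, scaling, and init are assumptions).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # the pre-trained weight stays frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen path plus the scaled low-rank update B @ A.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: wrap e.g. the query/value projections of an attention block.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))   # (batch, seq, hidden)
```

Only lora_A and lora_B receive gradients, so the number of trainable parameters scales with the rank r rather than with the full weight matrix.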

These methods achieve performance comparable to full fine-tuning across different task sets, usually while updating less than 1% of the original model parameters. Besides saving parameters, parameter-efficient tuning can quickly adapt to new tasks without catastrophic forgetting, and it often shows excellent robustness in OOD evaluation.

However, little is currently known about the factors behind the success of these parameter-efficient tuning methods, and the connections between them remain unclear. This paper aims to answer three questions: (1) How are these methods related? (2) Do they share design elements that are critical to their effectiveness, and if so, what are they? (3) Can the effective ingredients of each method be transferred to the others to produce more effective variants?

The paper first identifies the connection between adapters and prefix tuning and, on this basis, designs a unified framework that reveals, to some extent, the success factors of parameter-efficient tuning and yields two new variants. Experiments on four NLP benchmarks covering text summarization, machine translation (MT), text classification, and general language understanding show that the proposed approach uses fewer parameters than existing methods while being more effective.

04 Spotlight——How Do Vision Transformers Work?

Multi-head self-attention (MSA) has achieved great success in computer vision, but the reasons for this are not well understood. This paper offers alternative explanations that help us understand the desirable properties of Vision Transformers (ViTs). The properties identified in the paper are summarized as follows:

1. MSAs not only improve accuracy but also improve generalization by flattening the loss landscape.

2. MSAs and convolutions (Convs) exhibit opposite behaviors. For example, MSAs act as low-pass filters, while Convs act as high-pass filters (a small sketch of this frequency-domain view follows the list).

3. Multi-stage neural networks behave like a series of small individual models. In addition, the MSAs in the last stage play a key role in prediction.
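
As a rough illustration of the low-pass versus high-pass comparison, the sketch below measures how much high-frequency energy a feature map retains using a 2D FFT. This is a simplified stand-in for the paper's relative log-amplitude analysis; the radial cutoff and the random input are assumptions.

```python
# Minimal sketch: measure high-frequency energy of a feature map
# (a simplified stand-in for the paper's Fourier analysis; the cutoff is an assumption).
import torch

def high_freq_ratio(feat: torch.Tensor, cutoff: float = 0.5) -> float:
    """feat: (B, C, H, W). Fraction of spectral energy above `cutoff` of the Nyquist radius."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat.float()), dim=(-2, -1)).abs()
    _, _, H, W = spec.shape
    fy = torch.linspace(-1, 1, H).view(H, 1).expand(H, W)
    fx = torch.linspace(-1, 1, W).view(1, W).expand(H, W)
    high = (fy.pow(2) + fx.pow(2)).sqrt() > cutoff          # radial high-frequency mask
    return (spec * high).sum().div(spec.sum()).item()

feat = torch.randn(1, 64, 32, 32)    # stand-in for a feature map entering/leaving a block
# A lower ratio after an MSA block (versus a Conv block) would indicate low-pass behavior.
print(high_freq_ratio(feat))
```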

05 Spotlight——On Predicting Generalization using GANs

Generalization bounds for deep networks aim to predict the test error using only the training dataset and the network parameters. Although generalization bounds can provide a lot of intuition about architecture design, training algorithms, and so on, very few current bounds make good predictions of the actual test error. This paper examines a simple idea: can the test error be predicted using synthetic data, namely data generated by a generative adversarial network (GAN) trained on the same training dataset? After studying several GAN models and architectures, the paper finds that GANs do allow the test error to be predicted, without any additional hyperparameter tuning.

The generalization-error bounds in most current papers can be summarized in the following form:
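
A typical bound of this kind, written in a generic form (this is a reconstruction of the usual shape of such bounds, not a quote of any specific result), is:

$$
\mathrm{TestErr}(f) \;\le\; \mathrm{TrainErr}_S(f) + \sqrt{\frac{C(f)}{|S|}}
$$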

Here S is the training dataset and C is a measure of complexity. Bounds of this type are generally very loose and may bear little relationship to actual generalization, which has motivated more principled empirical studies of how effective generalization bounds really are.

This paper explores a very simple baseline for predicting generalization: train a generative adversarial network (GAN) on the training dataset and use the model's performance on the synthetic data generated by the GAN to predict its generalization. Although GAN generators suffer from mode collapse, that is, the generated distribution covers only a small subset of the real distribution, and there are theoretical and experimental reasons to believe this is hard to avoid, the paper finds that GAN-generated data nevertheless gives a good estimate of the test error (and hence the generalization error).

The algorithm itself is very simple, as described above; a rough code sketch is given below.
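
A minimal sketch of this procedure, assuming a trained classifier and a class-conditional GAN generator; the sampling interface, latent_dim attribute, and sample counts are illustrative assumptions.

```python
# Minimal sketch: predict test error from GAN-generated synthetic data.
# Assumes `classifier` is trained on the real training set and `generator` is a
# class-conditional GAN trained on the same set (interfaces here are assumptions).
import torch

@torch.no_grad()
def predicted_test_error(classifier, generator, num_samples=10_000, batch=256, num_classes=10):
    wrong, total = 0, 0
    while total < num_samples:
        n = min(batch, num_samples - total)
        labels = torch.randint(0, num_classes, (n,))
        fakes = generator(torch.randn(n, generator.latent_dim), labels)  # synthetic images
        preds = classifier(fakes).argmax(dim=1)
        wrong += (preds != labels).sum().item()
        total += n
    return wrong / total  # error on synthetic data ≈ predicted test error
```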

Some experimental results are shown in the figure below; this simple approach achieves very good results on most benchmarks:

However, the data distribution generated by a GAN is not necessarily very diverse, and as mentioned above it is prone to mode collapse, so why it can be used to predict generalization remains theoretically unexplained.

06 Poster——Uncertainty Modeling for Out-of-Distribution Generalization

Domain generalization is currently a very active research topic; it assumes a distribution gap between training and test data. Existing methods typically treat feature statistics as deterministic values and ignore their uncertainty. This paper instead assumes that feature statistics carry latent uncertainty and follow a multivariate Gaussian distribution. Each feature statistic is therefore no longer a deterministic value but a probabilistic one with its own distribution. By modeling feature statistics with uncertainty, the model can be trained to tolerate domain perturbations and become more robust to potential domain shifts. In implementation the algorithm resembles data augmentation: a variance is estimated, the feature statistic itself is taken as the mean, and new statistics are sampled from that range. A small sketch of this perturbation follows.
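
A rough sketch in the spirit of this feature-statistic perturbation; how the uncertainty of the statistics is estimated from the batch and the probability of applying the perturbation are assumptions made here for illustration.

```python
# Sketch of uncertainty-based feature-statistic perturbation (batch-based variance
# estimate and application probability are illustrative assumptions).
import torch
import torch.nn as nn

class StatisticPerturbation(nn.Module):
    def __init__(self, p: float = 0.5, eps: float = 1e-6):
        super().__init__()
        self.p, self.eps = p, eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, H, W)
        if not self.training or torch.rand(1).item() > self.p:
            return x
        mu = x.mean(dim=(2, 3), keepdim=True)                # per-channel mean
        sig = (x.var(dim=(2, 3), keepdim=True) + self.eps).sqrt()
        # Uncertainty of the statistics, estimated from their variation across the batch.
        sig_mu = mu.var(dim=0, keepdim=True).sqrt()
        sig_sig = sig.var(dim=0, keepdim=True).sqrt()
        # Sample new statistics: the original statistic is the mean of the Gaussian.
        new_mu = mu + torch.randn_like(mu) * sig_mu
        new_sig = sig + torch.randn_like(sig) * sig_sig
        return ((x - mu) / sig) * new_sig + new_mu           # re-normalize with sampled stats

x = torch.randn(8, 64, 32, 32)
out = StatisticPerturbation().train()(x)
```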

07 Poster——Gradient Matching for Domain Generalization

This paper proposes a new method for domain generalization. The main idea is to encourage a larger inner product between gradients computed on different domains. Instead of adding an explicit regularizer, the authors propose an optimization algorithm called Fish to achieve this. They further show that the method is competitive on the WILDS and DomainBed benchmarks. A rough sketch of the Fish update is given below.
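
A rough sketch of a Fish-style update: an inner loop of sequential per-domain SGD steps on a clone of the model, followed by a Reptile-like outer step. The learning rates, domain sampling, and loop structure are illustrative assumptions rather than the authors' exact implementation.

```python
# Sketch of a Fish-style gradient-matching update (hyperparameters and loop structure are assumptions).
import copy
import random
import torch
import torch.nn.functional as F

def fish_step(model, domain_loaders, inner_lr=1e-3, meta_lr=0.5):
    clone = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(clone.parameters(), lr=inner_lr)
    # Inner loop: one SGD step per domain, in a random order, on the clone.
    for loader in random.sample(domain_loaders, len(domain_loaders)):
        x, y = next(iter(loader))
        loss = F.cross_entropy(clone(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    # Outer (meta) step: theta <- theta + meta_lr * (theta_clone - theta).
    with torch.no_grad():
        for p, p_clone in zip(model.parameters(), clone.parameters()):
            p.add_(meta_lr * (p_clone - p))
```

Moving the original weights toward the sequentially adapted clone rewards updates that help all domains at once, which is what implicitly increases the inner product between per-domain gradients.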

Author: yearn

