UC Berkeley Proposes SLIP, a Multi-Task Framework
2022-07-03 21:36:00 【Doctor of artificial intelligence】
Reproduced from: Almost Human
To explore whether self-supervised learning from computer vision can benefit language-supervised training, researchers from UC Berkeley and Facebook AI Research have proposed SLIP, a new framework that combines language supervision with image self-supervision.
Recent work has shown that, on challenging visual recognition tasks, self-supervised pre-training can improve over supervised learning. CLIP, a recent approach to learning from language supervision, has shown excellent performance across a wide range of benchmarks.
To explore whether the momentum behind image self-supervised learning carries over to language supervision, the UC Berkeley and Facebook AI Research team investigated whether CLIP-style language supervision also benefits from image self-supervision. The researchers note that it is not obvious whether combining the two training objectives yields a stronger model: the two objectives may require the model to encode different, even conflicting, information about an image, which could cause interference.
Paper: https://arxiv.org/abs/2112.12750v1
Code: https://github.com/facebookresearch/SLIP
To explore these questions, the study proposes SLIP (Self-supervision meets Language-Image Pre-training), a multi-task framework that combines language supervision with image self-supervision. The researchers pre-train SLIP models on a subset of YFCC100M and evaluate representation quality under three settings: zero-shot transfer, linear classification, and end-to-end fine-tuning. In addition to a suite of 25 classification benchmarks, the study also evaluates downstream performance on ImageNet.
Experiments across different model sizes, training schedules, and pre-training datasets further validate the findings. The results show that SLIP significantly improves performance on most of the evaluations, indicating that self-supervision has general utility as a complement to language supervision. The researchers also analyze individual components of the method in more detail, such as the choice of pre-training dataset and data processing, and discuss the limitations of these evaluation protocols.
SLIP framework
The study proposes SLIP, a framework that combines language supervision with image self-supervision to learn visual representations without category labels. During pre-training, separate views of each input image are constructed for the language-supervision branch and the image self-supervision branch, and all views are fed through a shared image encoder. Through training, the image encoder learns to represent visual input in a semantically meaningful way. The quality of these learned representations is then measured by evaluating their usefulness on downstream tasks.
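As a minimal sketch of the view construction described above, the snippet below assumes one standard-augmentation view for the language-supervision (CLIP) branch and two stronger-augmentation views for the SimCLR branch, consistent with the roughly threefold activation cost mentioned later; the exact augmentation parameters are illustrative and not taken from the paper.

```python
from torchvision import transforms

# Illustrative augmentation pipelines; the paper's exact recipe may differ.
clip_view = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.ToTensor(),
])

ssl_view = transforms.Compose([  # stronger, SimCLR-style augmentation
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def make_views(pil_image):
    """One view for the language-supervised branch, two for the self-supervised branch."""
    return clip_view(pil_image), ssl_view(pil_image), ssl_view(pil_image)
```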
Method
Algorithm 1 below gives an overview of SLIP-SimCLR. In each forward pass of SLIP, all image views are fed through the same encoder. The CLIP and SSL objectives are computed on the corresponding embeddings and summed into a single scalar loss, with the SSL objective rescaled to balance the two terms. In what follows, SLIP-SimCLR is referred to simply as SLIP.
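The step below is a hedged sketch of this procedure, assuming a standard CLIP-style symmetric InfoNCE loss and a SimCLR-style NT-Xent loss. Names such as `image_encoder`, `text_encoder`, and `ssl_scale` are illustrative placeholders rather than the authors' code, and the separate projection heads of the two branches are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between image and text embeddings (standard CLIP form)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def simclr_loss(z1, z2, temperature=0.1):
    """NT-Xent loss over two augmented views of the same batch of images."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))  # a view is never its own positive
    n = len(z1)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_step(image_encoder, text_encoder, views, text_tokens, ssl_scale=1.0):
    """One training step: CLIP loss plus rescaled SSL loss, shared image encoder."""
    v_clip, v_ssl_1, v_ssl_2 = views  # views built as in the earlier sketch
    # Every image view passes through the *same* image encoder.
    feats = image_encoder(torch.cat([v_clip, v_ssl_1, v_ssl_2]))
    f_clip, f1, f2 = feats.chunk(3)
    return clip_loss(f_clip, text_encoder(text_tokens)) + ssl_scale * simclr_loss(f1, f2)
```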
SLIP increases the number of image views processed per sample, producing roughly three times more activations; this enlarges the model's memory footprint and slows down the forward pass during training.
Improved training procedure
The original CLIP authors mainly trained on a large private dataset of 400M image-text pairs, which reduces the need for regularization and data augmentation. While reproducing CLIP, the researchers found several simple adjustments, mainly to data augmentation, that significantly improve performance when pre-training on YFCC15M.
With the improved training procedure, a modified ResNet-50 reaches 34.6% zero-shot transfer accuracy on ImageNet, exceeding the original result of 31.3%; by comparison, another CLIP reproduction achieved 32.7% ImageNet accuracy [29]. The experiments in this study focus mainly on the Vision Transformer (ViT) family because of its strong scaling behavior [17], and all ViT models are trained with the improved procedure to set a strong baseline for evaluating the proposed method.
Evaluation experiments
ImageNet classification
The study evaluates ImageNet performance under three settings: zero-shot transfer, linear classification, and end-to-end fine-tuning; a brief code sketch of the first two settings follows the descriptions below.
Zero-shot transfer directly evaluates the pre-trained model on classification benchmarks without updating any model weights. By simply selecting the class whose text embedding is closest to the input image embedding, a model trained with contrastive language supervision can be used directly as an image classifier;
Linear classification, also known as linear probing, is a standard protocol for evaluating unsupervised or self-supervised representations: a randomly initialized final classification layer is trained while all other model weights are frozen;
Finally, another way to assess representation quality is to fine-tune the model end-to-end and check whether the pre-trained weights improve over purely supervised training.
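Below is a hedged sketch of the first two evaluation settings: zero-shot classification from class-name prompts and a linear probe on frozen features. The prompt template, the `tokenize` helper, and the encoder interfaces are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, images, class_names, tokenize):
    """Pick, for each image, the class whose text embedding is closest."""
    prompts = [f"a photo of a {c}" for c in class_names]  # illustrative prompt template
    text_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)
    img_emb = F.normalize(image_encoder(images), dim=-1)
    return (img_emb @ text_emb.t()).argmax(dim=-1)  # index of the predicted class

def linear_probe(image_encoder, feature_dim, num_classes):
    """Freeze the pre-trained encoder; only a randomly initialized linear head is trained."""
    for p in image_encoder.parameters():
        p.requires_grad = False
    return torch.nn.Linear(feature_dim, num_classes)  # train this head alone
```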
A common evaluation setting in self-supervised learning is to train the model and a linear classifier on ImageNet (i.e., ImageNet-1K), which, even without labels, is a highly curated and class-balanced dataset. Table 1 reports ViT-B/16 trained with SimCLR and MoCo v3 on YFCC15M and on ImageNet, evaluated on ImageNet with linear classification and end-to-end fine-tuning. When pre-training on YFCC15M instead of ImageNet, the linear classification accuracy of SimCLR and MoCo v3 drops by more than 10%, a sharp degradation.
Table 2 below reports CLIP, SimCLR, and SLIP results for three Vision Transformer sizes under all three ImageNet evaluation settings. All models are trained on YFCC15M for 25 epochs. The study finds that language supervision and image self-supervision interact constructively in SLIP, improving over either method used alone.
Scaling model size and compute
In this part, the researchers explore how SLIP's performance changes with more compute (longer training) and larger vision models. They note that 100 training epochs on YFCC15M correspond to roughly 1,200 training epochs on ImageNet-1K.
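This correspondence follows from the total number of samples seen, assuming roughly 15M images in YFCC15M and about 1.28M training images in ImageNet-1K:

```python
yfcc15m_images = 15_000_000          # approximate size of YFCC15M
imagenet1k_images = 1_281_167        # ImageNet-1K training set
samples_seen = yfcc15m_images * 100  # 100 epochs on YFCC15M
print(samples_seen / imagenet1k_images)  # ≈ 1171, i.e. roughly 1200 ImageNet-1K epochs
```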
Table 3 below shows that SLIP scales well, whether training is extended or the model is made larger.
Other benchmarks
In Table 4 below, the researchers evaluate zero-shot transfer performance on a suite of downstream image classification tasks. These datasets span many domains, including everyday scenes (such as traffic signs), specialized domains (such as medical and satellite imagery), video frames, and rendered text with or without visual context.
On these datasets, larger models and longer training with SLIP generally improve zero-shot transfer accuracy.
Other pre-training datasets
Besides YFCC15M, the researchers also experimented with two other image-text datasets, CC12M and CC3M. As Table 5 shows, they trained ViT-B/16 with both SLIP and CLIP on CC12M and CC3M and compared the results with those obtained earlier on YFCC15M. Across all ImageNet evaluation settings, SLIP improves over CLIP. Notably, pre-training SLIP on CC12M instead of YFCC15M yields lower zero-shot accuracy but actually brings higher linear-probe and fine-tuning performance, and CLIP shows an even more surprising 1.6% gain in fine-tuning performance.
Other self-supervised frameworks
The authors note that SLIP accommodates many different self-supervised methods. They ran several experiments on ViT-B/16 with alternatives to SimCLR: MoCo v3, BYOL, and BEiT.
Table 6 shows that none of the three alternatives performs as well as SLIP-SimCLR. The most surprising result is that although BEiT is the strongest self-supervised method tested here, SLIP-BEiT performs the worst, likely because of the large difference in inputs between its pre-training and deployment stages. Even so, all of these suboptimal SLIP variants still outperform CLIP.