UC Berkeley Proposes SLIP, a Multi-Task Framework
2022-07-03 21:36:00 【Doctor of artificial intelligence】
Reproduced from: Almost Human
To explore whether self-supervised learning from computer vision can benefit language-supervised training, researchers from UC Berkeley and Facebook AI Research have proposed SLIP, a new framework that combines language supervision with image self-supervision.
Recent work has shown that, on challenging visual recognition tasks, self-supervised pre-training can improve over supervised learning. CLIP, a recent approach to learning from language supervision, has shown excellent performance across a wide range of benchmarks.
To explore whether the momentum behind image self-supervised learning carries over to language supervision, the UC Berkeley and Facebook AI Research team investigated whether CLIP-style language supervision also benefits from image self-supervision. The researchers note that it is not obvious whether combining the two training objectives yields a stronger model: the two objectives may require the model to encode different, even conflicting, information about an image, which could cause interference.
Paper: https://arxiv.org/abs/2112.12750v1
Code: https://github.com/facebookresearch/SLIP
To explore these questions, the study proposes SLIP (Self-supervision meets Language-Image Pre-training), a multi-task framework that combines language supervision with image self-supervision. The researchers pre-train SLIP models on a subset of YFCC100M and evaluate representation quality under three settings: zero-shot transfer, linear classification, and end-to-end fine-tuning. In addition to a suite of 25 classification benchmarks, the study also evaluates downstream performance on ImageNet.
Experiments across different model sizes, training schedules, and pre-training datasets further validate the findings. The results show that SLIP significantly improves performance on most of the evaluations, indicating that self-supervision has general utility as a complement to language supervision. The researchers also analyze individual components of the method in more detail, such as the choice of pre-training dataset and data processing, and discuss the limitations of these evaluation protocols.
SLIP framework
The study proposes SLIP, a framework that combines language supervision with image self-supervision to learn visual representations without category labels. During pre-training, separate views of each input image are constructed for the language-supervision branch and the image self-supervision branch, and all views are fed through a shared image encoder. Through training, the image encoder learns to represent visual input in a semantically meaningful way. The quality of these learned representations is then measured by evaluating their usefulness on downstream tasks.
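As a minimal sketch of the view construction described above, the snippet below assumes one standard-augmentation view for the language-supervision (CLIP) branch and two stronger-augmentation views for the SimCLR branch, consistent with the roughly threefold activation cost mentioned later; the exact augmentation parameters are illustrative and not taken from the paper.

```python
from torchvision import transforms

# Illustrative augmentation pipelines; the paper's exact recipe may differ.
clip_view = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    transforms.ToTensor(),
])

ssl_view = transforms.Compose([  # stronger, SimCLR-style augmentation
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

def make_views(pil_image):
    """One view for the language-supervised branch, two for the self-supervised branch."""
    return clip_view(pil_image), ssl_view(pil_image), ssl_view(pil_image)
```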
Method
Algorithm 1 below gives an overview of SLIP-SimCLR. In each forward pass of SLIP, all image views are fed through the same encoder. The CLIP and SSL objectives are computed on the corresponding embeddings and summed into a single scalar loss, with the SSL objective rescaled to balance the two terms. In what follows, SLIP-SimCLR is referred to simply as SLIP.
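The step below is a hedged sketch of this procedure, assuming a standard CLIP-style symmetric InfoNCE loss and a SimCLR-style NT-Xent loss. Names such as `image_encoder`, `text_encoder`, and `ssl_scale` are illustrative placeholders rather than the authors' code, and the separate projection heads of the two branches are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between image and text embeddings (standard CLIP form)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def simclr_loss(z1, z2, temperature=0.1):
    """NT-Xent loss over two augmented views of the same batch of images."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))  # a view is never its own positive
    n = len(z1)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_step(image_encoder, text_encoder, views, text_tokens, ssl_scale=1.0):
    """One training step: CLIP loss plus rescaled SSL loss, shared image encoder."""
    v_clip, v_ssl_1, v_ssl_2 = views  # views built as in the earlier sketch
    # Every image view passes through the *same* image encoder.
    feats = image_encoder(torch.cat([v_clip, v_ssl_1, v_ssl_2]))
    f_clip, f1, f2 = feats.chunk(3)
    return clip_loss(f_clip, text_encoder(text_tokens)) + ssl_scale * simclr_loss(f1, f2)
```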
SLIP increases the number of image views processed per sample, producing roughly three times more activations; this enlarges the model's memory footprint and slows down the forward pass during training.
Improved training procedure
The original CLIP authors mainly trained on a large private dataset of 400M image-text pairs, which reduces the need for regularization and data augmentation. While reproducing CLIP, the researchers found several simple adjustments, mainly to data augmentation, that significantly improve performance when pre-training on YFCC15M.
With the improved training procedure, a modified ResNet-50 reaches 34.6% zero-shot transfer accuracy on ImageNet, exceeding the original result of 31.3%; by comparison, another CLIP reproduction achieved 32.7% ImageNet accuracy [29]. The experiments in this study focus mainly on the Vision Transformer (ViT) family because of its strong scaling behavior [17], and all ViT models are trained with the improved procedure to set a strong baseline for evaluating the proposed method.
Evaluation experiments
ImageNet classification
The study evaluates ImageNet performance under three settings: zero-shot transfer, linear classification, and end-to-end fine-tuning; a brief code sketch of the first two settings follows the descriptions below.
Zero-shot transfer directly evaluates the pre-trained model on classification benchmarks without updating any model weights. By simply selecting the class whose text embedding is closest to the input image embedding, a model trained with contrastive language supervision can be used directly as an image classifier;
Linear classification, also known as linear probing, is a standard protocol for evaluating unsupervised or self-supervised representations: a randomly initialized final classification layer is trained while all other model weights are frozen;
Finally, another way to assess representation quality is to fine-tune the model end-to-end and check whether the pre-trained weights improve over purely supervised training.
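Below is a hedged sketch of the first two evaluation settings: zero-shot classification from class-name prompts and a linear probe on frozen features. The prompt template, the `tokenize` helper, and the encoder interfaces are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, images, class_names, tokenize):
    """Pick, for each image, the class whose text embedding is closest."""
    prompts = [f"a photo of a {c}" for c in class_names]  # illustrative prompt template
    text_emb = F.normalize(text_encoder(tokenize(prompts)), dim=-1)
    img_emb = F.normalize(image_encoder(images), dim=-1)
    return (img_emb @ text_emb.t()).argmax(dim=-1)  # index of the predicted class

def linear_probe(image_encoder, feature_dim, num_classes):
    """Freeze the pre-trained encoder; only a randomly initialized linear head is trained."""
    for p in image_encoder.parameters():
        p.requires_grad = False
    return torch.nn.Linear(feature_dim, num_classes)  # train this head alone
```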
A common evaluation setting in self-supervised learning is to train the model and a linear classifier on ImageNet (i.e., ImageNet-1K), which, even without labels, is a highly curated and class-balanced dataset. Table 1 reports ViT-B/16 trained with SimCLR and MoCo v3 on YFCC15M and on ImageNet, evaluated on ImageNet with linear classification and end-to-end fine-tuning. When pre-training on YFCC15M instead of ImageNet, the linear classification accuracy of SimCLR and MoCo v3 drops by more than 10%, a sharp degradation.
Table 2 below reports CLIP, SimCLR, and SLIP results for three Vision Transformer sizes under all three ImageNet evaluation settings. All models are trained on YFCC15M for 25 epochs. The study finds that language supervision and image self-supervision interact constructively in SLIP, improving over either method used alone.
Scaling model size and compute
In this part, the researchers explore how SLIP's performance changes with more compute (longer training) and larger vision models. They note that 100 training epochs on YFCC15M correspond to roughly 1,200 training epochs on ImageNet-1K.
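This correspondence follows from the total number of samples seen, assuming roughly 15M images in YFCC15M and about 1.28M training images in ImageNet-1K:

```python
yfcc15m_images = 15_000_000          # approximate size of YFCC15M
imagenet1k_images = 1_281_167        # ImageNet-1K training set
samples_seen = yfcc15m_images * 100  # 100 epochs on YFCC15M
print(samples_seen / imagenet1k_images)  # ≈ 1171, i.e. roughly 1200 ImageNet-1K epochs
```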
Table 3 below shows that SLIP scales well, whether training is extended or the model is made larger.
Other benchmarks
In Table 4 below, the researchers evaluate zero-shot transfer performance on a suite of downstream image classification tasks. These datasets span many domains, including everyday scenes (such as traffic signs), specialized domains (such as medical and satellite imagery), video frames, and rendered text with or without visual context.
On these datasets, larger models and longer training with SLIP generally improve zero-shot transfer accuracy.
Other pre-training datasets
Besides YFCC15M, the researchers also experimented with two other image-text datasets, CC12M and CC3M. As Table 5 shows, they trained ViT-B/16 with both SLIP and CLIP on CC12M and CC3M and compared the results with those obtained earlier on YFCC15M. Across all ImageNet evaluation settings, SLIP improves over CLIP. Notably, pre-training SLIP on CC12M instead of YFCC15M yields lower zero-shot accuracy but actually brings higher linear-probe and fine-tuning performance, and CLIP shows an even more surprising 1.6% gain in fine-tuning performance.
Other self-supervised frameworks
The authors note that SLIP accommodates many different self-supervised methods. They ran several experiments on ViT-B/16 with alternatives to SimCLR: MoCo v3, BYOL, and BEiT.
Table 6 shows that none of the three alternatives performs as well as SLIP-SimCLR. The most surprising result is that although BEiT is the strongest self-supervised method tested here, SLIP-BEiT performs the worst, likely because of the large difference in inputs between its pre-training and deployment stages. Even so, all of these suboptimal SLIP variants still outperform CLIP.