Paper | Jointly Pre-Training Transformers on Unpaired Images and Text
2022-06-11 04:53:00 【Shenlan Shenyan AI】
Recently this column has focused on multimodal machine translation, and work on multimodal joint representation is very hot right now, so I am also preparing to read several of the latest papers with you.
Today's article is a knowledge-distillation work from Google: it distills the abilities of BERT and ViT into a new model that can represent both text and images.
Paper information
Title: Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
Authors: Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown
Institutions: Google, UCLA
Research objective
Build a single model that can represent both text and images in a unified way.
Method introduction
The following describes the model architecture and how it is trained:
1. Model architecture: a Unified Transformer
Drawing on ideas from multi-task learning, the authors want the CV and NLP tasks to share as many parameters as possible. In their Unified Transformer design, CV and NLP use different tokenizers and embeddings only at the input stage and different classifier heads only at the output layer; all other parameters are shared.
Specifically, the image tokenizer and embedding are similar to ViT: an image is split into small patches, and each patch is linearly projected into an image embedding, to which a position embedding is added as location information.

ViT splits an image into small patches, linearly projects each patch into an image embedding, and adds a position embedding as location information.
The text tokenizer and embedding are similar to BERT: words are split into subwords with a BPE vocabulary and converted into token embeddings, to which position embeddings and segment embeddings are added.
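To make the two front ends concrete, here is a minimal PyTorch sketch of a ViT-style image embedder and a BERT-style text embedder as described above. All class names and default sizes (patch size 16, hidden size 768, vocabulary size 30522, and so on) are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ImageEmbedder(nn.Module):
    """ViT-style front end: split an image into patches, linearly project each
    patch, and add learned position embeddings."""
    def __init__(self, image_size=224, patch_size=16, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution both cuts out the patches and projects them.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, images):                   # images: (B, 3, H, W)
        x = self.proj(images)                    # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, dim)
        return x + self.pos_emb

class TextEmbedder(nn.Module):
    """BERT-style front end: subword token embeddings plus position and segment
    embeddings (the BPE tokenization itself happens before this module)."""
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.seg_emb = nn.Embedding(num_segments, dim)

    def forward(self, token_ids, segment_ids):   # both: (B, L)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok_emb(token_ids)
                + self.pos_emb(positions)[None, :, :]
                + self.seg_emb(segment_ids))
```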

After images and text are converted into embeddings, they are fed into the Unified Transformer, which is insensitive to the input's original modality and to the specific task.
Finally, the authors argue that both pre-training and downstream fine-tuning can be viewed as different classification tasks, so they use a separate MLP + Softmax classifier for each task. The figure below shows the overall model:

The model uses a separate tokenize + embedding strategy for each of the two modalities and different output-layer parameters for different tasks; all other model parameters are shared.
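Building on the two embedders sketched above, the overall design could then look roughly like this: one shared Transformer encoder plus a separate MLP + Softmax head per task. The pooling, the depth, and the example task names are my own assumptions for illustration, not the paper's exact configuration.

```python
class UnifiedTransformer(nn.Module):
    """One shared encoder body, modality-specific embedders, per-task heads."""
    def __init__(self, dim=768, depth=12, heads=12, task_num_classes=None):
        super().__init__()
        self.image_embed = ImageEmbedder(dim=dim)
        self.text_embed = TextEmbedder(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # shared by both modalities
        # One MLP classifier head per task, e.g. {"imagenet": 1000, "sst2": 2}.
        self.heads = nn.ModuleDict({
            task: nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, n))
            for task, n in (task_num_classes or {}).items()
        })

    def forward(self, task, images=None, token_ids=None, segment_ids=None):
        if images is not None:
            x = self.image_embed(images)
        else:
            x = self.text_embed(token_ids, segment_ids)
        h = self.encoder(x)              # the body is agnostic to modality and task
        pooled = h.mean(dim=1)           # simple mean pooling; the paper may pool differently
        return self.heads[task](pooled)  # per-task logits; softmax is applied in the loss
```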
2. Model training: knowledge distillation and gradient masking
In the pre-training stage, the authors take BERT and ViT as teacher models and distill knowledge from the teachers' logits into the student model (the authors' model).
The model is required to predict the true label of the input while also keeping its logits as close as possible to the teacher model's logits, which gives a loss of the form

L = α · L_CE(y, p_student) + (1 − α) · KL(p_teacher ‖ p_student)

The first term is the cross entropy with the ground-truth label, the second term is the KL divergence between the teacher's and the student's logits, and the hyperparameter α adjusts the balance between the two.
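A minimal sketch of such a distillation loss is shown below, assuming a simple alpha / (1 − alpha) weighting and a softmax temperature of 1; the exact weighting scheme and temperature used in the paper may differ.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Cross entropy with the ground-truth label plus the KL divergence between
    the teacher's and the student's logits; alpha balances the two terms."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kl
```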
Besides distilling knowledge from BERT and ViT, the authors also propose a gradient-masking strategy to prevent the optimization difficulty caused by conflicting gradients between the CV and NLP tasks. Concretely, a mask matrix M is introduced, and the global gradient is changed from the direct sum of the CV-task and NLP-task gradients to a masked sum of the form

g = g_CV + M ⊙ g_NLP

where g_NLP and g_CV are the gradients of the NLP task and the CV task respectively. The mask matrix M is dynamically updated several times during training (the authors point out that the cost of these updates is small compared with training the main network) so as to keep the most salient gradient entries of the NLP task (since M is the NLP task's mask). The update algorithm follows the ICLR'19 best paper, The Lottery Ticket Hypothesis.
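The sketch below shows one way the masked gradient combination could be implemented, assuming the mask keeps the largest-magnitude entries of the NLP-task gradient; the keep ratio and the per-parameter bookkeeping are illustrative choices, not the paper's exact procedure.

```python
def update_mask(nlp_grads, keep_ratio=0.5):
    """Build a binary mask that keeps the largest-magnitude entries of the
    NLP-task gradient (lottery-ticket-style magnitude selection).
    nlp_grads: dict mapping parameter name -> gradient tensor."""
    masks = {}
    for name, g in nlp_grads.items():
        k = max(1, int(keep_ratio * g.numel()))
        # The k-th largest absolute value serves as the cut-off threshold.
        threshold = g.abs().flatten().kthvalue(g.numel() - k + 1).values
        masks[name] = (g.abs() >= threshold).float()
    return masks

def combine_gradients(cv_grads, nlp_grads, masks):
    """Global gradient = CV-task gradient + mask * NLP-task gradient."""
    return {name: cv_grads[name] + masks[name] * nlp_grads[name]
            for name in cv_grads}
```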
Experimental results
For pre-training, the authors use ILSVRC-2012 ImageNet image classification as the CV pre-training task and masked language modeling on BooksCorpus and English Wikipedia as the NLP pre-training task, then select several CV and NLP tasks for fine-tuning experiments. Fine-tuned BERT and ViT are used as baselines. The experiments are summarized below:

After fine-tuning, the authors' ViT-BERT achieves competitive but improvable performance: its average score on the CV tasks is close to ViT, while its average on the NLP tasks still lags behind BERT (the authors attribute this to the Recognizing Textual Entailment task, i.e. RTE, which drags the score down because it has too little fine-tuning data).
Meanwhile, the two baselines ViT and BERT fine-tune poorly on tasks from the other modality, showing that fine-tuning alone cannot cross the modality gap.
The authors also ran qualitative experiments and found that, after some training, the lower layers respond strongly to only one of the two modalities (image or text), while the higher layers respond strongly to both. This matches the mainstream view in neural-network interpretability: as depth increases, features become more abstract, so higher layers care less about the concrete task format and can handle more abstract semantic information.

Each sub-heat-map shows the activation output of one Transformer block; depth increases from left to right and training steps from top to bottom. Red means a neuron responds strongly to at least one image, green means it responds strongly to at least one text, yellow (red + green) means it responds strongly to both images and texts, and black means it responds to neither.
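As a rough reconstruction of how such a response map could be computed (my own sketch, not the authors' code), one can pass a batch of images and a batch of texts through the model, record each hidden unit's strongest activation at every block, and threshold it. This assumes the UnifiedTransformer sketch above with an nn.TransformerEncoder body; the threshold value is arbitrary.

```python
import torch

@torch.no_grad()
def modality_response(model, image_batch, token_ids, segment_ids, threshold=1.0):
    """For each Transformer block, flag whether each hidden unit fires strongly
    for at least one image and/or at least one text (illustrative analysis)."""
    def block_activations(x):
        acts = []
        for block in model.encoder.layers:          # nn.TransformerEncoder exposes .layers
            x = block(x)
            acts.append(x.abs().amax(dim=(0, 1)))   # strongest response per hidden unit
        return acts
    img_acts = block_activations(model.image_embed(image_batch))
    txt_acts = block_activations(model.text_embed(token_ids, segment_ids))
    # "red" = fires for images, "green" = fires for text, "yellow" = both, "black" = neither
    return [(i > threshold, t > threshold) for i, t in zip(img_acts, txt_acts)]
```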
Brief commentary
This is a very Google-flavored paper. The method is simple and blunt and can be summarized in one sentence: knowledge distillation plus multi-task learning can train a strong model that handles multiple modalities. But beyond brute force, the Google researchers may also be pointing out a path for future AI research, namely that different modalities can share as many parameters as possible for joint representation.
That said, the model is not yet strong enough: it does not fully beat the original baselines on the CV and NLP tasks, suggesting that joint learning of the two modalities currently shows more mutual interference than mutual promotion.
The new model is even weaker on zero-shot downstream tasks. Considering that BERT can produce many surprising zero-shot results, the ViT-BERT approach in this paper is clearly far from the final form of multimodal joint modeling, and its representation of the two modalities can still be improved.
Author: Minimum
| Shenyan Technology |
Shenyan Technology was founded in January 2018. It is a Zhongguancun high-tech enterprise and an AI service provider with world-leading artificial intelligence technology. Built on core technology in computer vision, natural language processing, and data mining, the company has launched four platform products: the Shenyan intelligent data annotation platform, the Shenyan AI development platform, the Shenyan automatic machine learning platform, and the Shenyan AI open platform, providing enterprises with one-stop AI platform services covering data processing, model building and training, privacy computing, and industry algorithms and solutions.