Paper | Jointly Pre-Training Transformers on Unpaired Images and Text
2022-06-11 04:53:00 【Shenlan Shenyan AI】
Recently this column has focused on multimodal machine translation, and work on multimodal joint representation is very hot right now, so I am also preparing to read several of the latest papers with you.
Today's article is a knowledge-distillation work from Google: it distills the abilities of BERT and ViT into a new model that can represent both text and images.
Paper information
Title: Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
Authors: Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown
Institutions: Google, UCLA
Research objective
Build a single model that can represent both text and images in a unified way.
Method introduction
The following describes the model architecture and how it is trained:
1. Model architecture: a Unified Transformer
Drawing on ideas from multi-task learning, the authors want the CV and NLP tasks to share as many parameters as possible. In their Unified Transformer design, CV and NLP use different tokenizers and embeddings only at the input stage and different classifier heads only at the output layer; all other parameters are shared.
Specifically, the image tokenizer and embedding are similar to ViT: an image is split into small patches, and each patch is linearly projected into an image embedding, to which a position embedding is added as location information.

ViT splits an image into small patches, linearly projects each patch into an image embedding, and adds a position embedding as location information.
The text tokenizer and embedding are similar to BERT: words are split into subwords with a BPE vocabulary and converted into token embeddings, to which position embeddings and segment embeddings are added.
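To make the two front ends concrete, here is a minimal PyTorch sketch of a ViT-style image embedder and a BERT-style text embedder as described above. All class names and default sizes (patch size 16, hidden size 768, vocabulary size 30522, and so on) are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ImageEmbedder(nn.Module):
    """ViT-style front end: split an image into patches, linearly project each
    patch, and add learned position embeddings."""
    def __init__(self, image_size=224, patch_size=16, dim=768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A strided convolution both cuts out the patches and projects them.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_emb = nn.Parameter(torch.zeros(1, num_patches, dim))

    def forward(self, images):                   # images: (B, 3, H, W)
        x = self.proj(images)                    # (B, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)         # (B, num_patches, dim)
        return x + self.pos_emb

class TextEmbedder(nn.Module):
    """BERT-style front end: subword token embeddings plus position and segment
    embeddings (the BPE tokenization itself happens before this module)."""
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.seg_emb = nn.Embedding(num_segments, dim)

    def forward(self, token_ids, segment_ids):   # both: (B, L)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.tok_emb(token_ids)
                + self.pos_emb(positions)[None, :, :]
                + self.seg_emb(segment_ids))
```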

After images and text are converted into embeddings, they are fed into the Unified Transformer, which is insensitive to the input's original modality and to the specific task.
Finally, the authors argue that both pre-training and downstream fine-tuning can be viewed as different classification tasks, so they use a separate MLP + Softmax classifier for each task. The figure below shows the overall model:

The model uses a separate tokenize + embedding strategy for each of the two modalities and different output-layer parameters for different tasks; all other model parameters are shared.
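Building on the two embedders sketched above, the overall design could then look roughly like this: one shared Transformer encoder plus a separate MLP + Softmax head per task. The pooling, the depth, and the example task names are my own assumptions for illustration, not the paper's exact configuration.

```python
class UnifiedTransformer(nn.Module):
    """One shared encoder body, modality-specific embedders, per-task heads."""
    def __init__(self, dim=768, depth=12, heads=12, task_num_classes=None):
        super().__init__()
        self.image_embed = ImageEmbedder(dim=dim)
        self.text_embed = TextEmbedder(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # shared by both modalities
        # One MLP classifier head per task, e.g. {"imagenet": 1000, "sst2": 2}.
        self.heads = nn.ModuleDict({
            task: nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, n))
            for task, n in (task_num_classes or {}).items()
        })

    def forward(self, task, images=None, token_ids=None, segment_ids=None):
        if images is not None:
            x = self.image_embed(images)
        else:
            x = self.text_embed(token_ids, segment_ids)
        h = self.encoder(x)              # the body is agnostic to modality and task
        pooled = h.mean(dim=1)           # simple mean pooling; the paper may pool differently
        return self.heads[task](pooled)  # per-task logits; softmax is applied in the loss
```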
2. Model training: knowledge distillation and gradient masking
In the pre-training stage, the authors take BERT and ViT as teacher models and distill knowledge from the teachers' logits into the student model (the authors' model).
The model is required to predict the true label of the input while also keeping its logits as close as possible to the teacher model's logits, which gives a loss of the form

L = α · L_CE(y, p_student) + (1 − α) · KL(p_teacher ‖ p_student)

The first term is the cross entropy with the ground-truth label, the second term is the KL divergence between the teacher's and the student's logits, and the hyperparameter α adjusts the balance between the two.
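A minimal sketch of such a distillation loss is shown below, assuming a simple alpha / (1 − alpha) weighting and a softmax temperature of 1; the exact weighting scheme and temperature used in the paper may differ.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Cross entropy with the ground-truth label plus the KL divergence between
    the teacher's and the student's logits; alpha balances the two terms."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kl
```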
Besides distilling knowledge from BERT and ViT, the authors also propose a gradient-masking strategy to prevent the optimization difficulty caused by conflicting gradients between the CV and NLP tasks. Concretely, a mask matrix M is introduced, and the global gradient is changed from the direct sum of the CV-task and NLP-task gradients to a masked sum of the form

g = g_CV + M ⊙ g_NLP

where g_NLP and g_CV are the gradients of the NLP task and the CV task respectively. The mask matrix M is dynamically updated several times during training (the authors point out that the cost of these updates is small compared with training the main network) so as to keep the most salient gradient entries of the NLP task (since M is the NLP task's mask). The update algorithm follows the ICLR'19 best paper, The Lottery Ticket Hypothesis.
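The sketch below shows one way the masked gradient combination could be implemented, assuming the mask keeps the largest-magnitude entries of the NLP-task gradient; the keep ratio and the per-parameter bookkeeping are illustrative choices, not the paper's exact procedure.

```python
def update_mask(nlp_grads, keep_ratio=0.5):
    """Build a binary mask that keeps the largest-magnitude entries of the
    NLP-task gradient (lottery-ticket-style magnitude selection).
    nlp_grads: dict mapping parameter name -> gradient tensor."""
    masks = {}
    for name, g in nlp_grads.items():
        k = max(1, int(keep_ratio * g.numel()))
        # The k-th largest absolute value serves as the cut-off threshold.
        threshold = g.abs().flatten().kthvalue(g.numel() - k + 1).values
        masks[name] = (g.abs() >= threshold).float()
    return masks

def combine_gradients(cv_grads, nlp_grads, masks):
    """Global gradient = CV-task gradient + mask * NLP-task gradient."""
    return {name: cv_grads[name] + masks[name] * nlp_grads[name]
            for name in cv_grads}
```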
Experimental results
For pre-training, the authors use ILSVRC-2012 ImageNet image classification as the CV pre-training task and masked language modeling on BooksCorpus and English Wikipedia as the NLP pre-training task, then select several CV and NLP tasks for fine-tuning experiments. Fine-tuned BERT and ViT are used as baselines. The experiments are summarized below:

After fine-tuning, the authors' ViT-BERT achieves competitive but improvable performance: its average score on the CV tasks is close to ViT, while its average on the NLP tasks still lags behind BERT (the authors attribute this to the Recognizing Textual Entailment task, i.e. RTE, which drags the score down because it has too little fine-tuning data).
Meanwhile, the two baselines ViT and BERT fine-tune poorly on tasks from the other modality, showing that fine-tuning alone cannot cross the modality gap.
The authors also ran qualitative experiments and found that, after some training, the lower layers respond strongly to only one of the two modalities (image or text), while the higher layers respond strongly to both. This matches the mainstream view in neural-network interpretability: as depth increases, features become more abstract, so higher layers care less about the concrete task format and can handle more abstract semantic information.

Each sub-heat-map shows the activation output of one Transformer block; depth increases from left to right and training steps from top to bottom. Red means a neuron responds strongly to at least one image, green means it responds strongly to at least one text, yellow (red + green) means it responds strongly to both images and texts, and black means it responds to neither.
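As a rough reconstruction of how such a response map could be computed (my own sketch, not the authors' code), one can pass a batch of images and a batch of texts through the model, record each hidden unit's strongest activation at every block, and threshold it. This assumes the UnifiedTransformer sketch above with an nn.TransformerEncoder body; the threshold value is arbitrary.

```python
import torch

@torch.no_grad()
def modality_response(model, image_batch, token_ids, segment_ids, threshold=1.0):
    """For each Transformer block, flag whether each hidden unit fires strongly
    for at least one image and/or at least one text (illustrative analysis)."""
    def block_activations(x):
        acts = []
        for block in model.encoder.layers:          # nn.TransformerEncoder exposes .layers
            x = block(x)
            acts.append(x.abs().amax(dim=(0, 1)))   # strongest response per hidden unit
        return acts
    img_acts = block_activations(model.image_embed(image_batch))
    txt_acts = block_activations(model.text_embed(token_ids, segment_ids))
    # "red" = fires for images, "green" = fires for text, "yellow" = both, "black" = neither
    return [(i > threshold, t > threshold) for i, t in zip(img_acts, txt_acts)]
```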
Brief commentary
This is a very Google-flavored paper. The method is simple and blunt and can be summarized in one sentence: knowledge distillation plus multi-task learning can train a strong model that handles multiple modalities. But beyond brute force, the Google researchers may also be pointing out a path for future AI research, namely that different modalities can share as many parameters as possible for joint representation.
That said, the model is not yet strong enough: it does not fully beat the original baselines on the CV and NLP tasks, suggesting that joint learning of the two modalities currently shows more mutual interference than mutual promotion.
The new model is even weaker on zero-shot downstream tasks. Considering that BERT can produce many surprising zero-shot results, the ViT-BERT approach in this paper is clearly far from the final form of multimodal joint modeling, and its representation of the two modalities can still be improved.
Author: Minimum
| Shenyan Technology |
Shenyan Technology was founded in January 2018. It is a Zhongguancun high-tech enterprise and an AI service provider with world-leading artificial intelligence technology. Built on core technology in computer vision, natural language processing, and data mining, the company has launched four platform products: the Shenyan intelligent data annotation platform, the Shenyan AI development platform, the Shenyan automatic machine learning platform, and the Shenyan AI open platform, providing enterprises with one-stop AI platform services covering data processing, model building and training, privacy computing, and industry algorithms and solutions.