Share | Jointly Pre-Training Transformers on Unpaired Images and Text
2022-06-11 04:54:00 [Shenlan Shenyan AI]
Recently this column has focused on multimodal machine translation, and work on multimodal joint representation has been very active lately, so the author is planning to read through several recent papers with you.
Today's article covers a knowledge-distillation work from Google: the capabilities of BERT and ViT are distilled into a new model that can represent both text and images.
Paper information
Title: Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text
Authors: Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, Matthew Brown
Institutions: Google, UCLA
Research objective: build a model that represents text and images in a unified way
Method introduction
The following describes the model architecture and how it is trained:
Model architecture: a unified Transformer (Unified Transformer)
Borrowing the idea of multi-task learning, the author wants the CV and NLP tasks to share as many parameters as possible. In the proposed Unified Transformer, CV and NLP use different tokenizers and embeddings only at the input stage and different classifier heads only at the output layer; all other parameters are shared.
Specifically, the image tokenizer and embedding are similar to ViT: a picture is split into small patches, each patch is linearly projected into an image embedding, and a position embedding is added to carry location information.

ViT splits a picture into small patches, linearly projects them into image embeddings, and adds position embeddings as location information.
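To make the ViT-style image tokenizer concrete, here is a minimal PyTorch sketch; the module name, patch size, and embedding dimension are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches and map each patch to an embedding (ViT-style)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to "cut into patches, then linearly project".
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        # One learnable position embedding per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, images):               # images: (B, 3, H, W)
        x = self.proj(images)                # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)     # (B, num_patches, dim)
        return x + self.pos_embed            # add positional information
```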
The text tokenizer and embedding are similar to BERT: words are split into subwords with a BPE vocabulary and converted into token embeddings, to which position embeddings and segment embeddings are added.
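Similarly, a minimal sketch of the BERT-style text embedding (token + position + segment); the vocabulary size and maximum length below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TextEmbedding(nn.Module):
    """BERT-style embedding: subword token + position + segment embeddings."""
    def __init__(self, vocab_size=30522, max_len=512, num_segments=2, dim=768):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.pos_embed = nn.Embedding(max_len, dim)
        self.seg_embed = nn.Embedding(num_segments, dim)

    def forward(self, token_ids, segment_ids):  # both: (B, L) integer tensors
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_embed(token_ids)
                + self.pos_embed(positions)
                + self.seg_embed(segment_ids))
```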

After images and text are converted into embeddings, they are fed into the Unified Transformer, which is agnostic to the input's original modality and to the specific task.
Finally, the author treats both pre-training and downstream fine-tuning as different classification tasks and uses a separate MLP + softmax classifier for each task. The figure below shows the overall model:

The model uses modality-specific tokenize + embedding strategies for the two modalities and task-specific output-layer parameters for different tasks; all other model parameters are shared.
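Putting the pieces together, the sketch below shows the overall design as described: modality-specific input embeddings, a shared transformer body, and one classifier head per task. It reuses the PatchEmbedding and TextEmbedding sketches above; the mean-pooling step and all hyperparameters are simplifying assumptions, not the paper's exact setup:

```python
import torch.nn as nn

class UnifiedTransformer(nn.Module):
    """Shared transformer; only input embeddings and output heads differ per modality/task."""
    def __init__(self, task_classes, dim=768, depth=12, heads=12):
        super().__init__()
        self.image_embed = PatchEmbedding(dim=dim)   # modality-specific inputs (see sketches above)
        self.text_embed = TextEmbedding(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # shared by all tasks
        # One classifier head per task, e.g. task_classes = {"imagenet": 1000, "mlm": 30522}.
        self.heads = nn.ModuleDict({t: nn.Linear(dim, c) for t, c in task_classes.items()})

    def forward(self, inputs, task, modality, segment_ids=None):
        if modality == "image":
            x = self.image_embed(inputs)
        else:
            x = self.text_embed(inputs, segment_ids)
        x = self.encoder(x)                          # modality- and task-agnostic body
        return self.heads[task](x.mean(dim=1))       # pool, then task-specific logits
```

Because every task is reduced to classification over one head's logits, pre-training and fine-tuning can share the same interface in this design.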
Model training: knowledge distillation and gradient masking
During pre-training, the author uses BERT and ViT as teacher models and distills the knowledge in the teachers' logits into the student model (the author's model).
While requiring the model to predict the true label of the input, the author also wants the model's outputs to stay as close as possible to the teacher models' logits, and proposes a loss of the form

L = L_CE + α · KL(p_teacher ‖ p_student)

The first term is the cross-entropy with the ground-truth label; the second term is the KL divergence between the teacher's and the student's logits; α is a hyperparameter that balances the two terms.
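A minimal sketch of this objective; the exact weighting and any temperature scaling used in the paper may differ, so the `alpha` placement below simply mirrors the formula as described above:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Cross-entropy on the true labels plus KL divergence to the teacher's logits."""
    ce = F.cross_entropy(student_logits, labels)
    # KL(p_teacher || p_student), computed from the two probability distributions.
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1),
                  reduction="batchmean")
    return ce + alpha * kl
```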
Besides distilling knowledge from BERT and ViT, the author also proposes a gradient-masking strategy to prevent the optimization difficulty caused by conflicting gradients between the CV task and the NLP task. Specifically, a mask matrix M is introduced, and the global gradient is changed from the direct sum of the CV-task and NLP-task gradients to a masked sum:

g = g_CV + M ⊙ g_NLP

where g_NLP and g_CV are the gradients of the NLP task and the CV task, respectively. The mask matrix is dynamically updated a few times during training (the author notes that the update cost is negligible compared with training the main network) so as to preserve the most salient gradient entries of the NLP task (since it is the NLP task's mask). The update algorithm follows the ICLR'19 best paper, The Lottery Ticket Hypothesis (link: https://openreview.net/pdf?id=rJl-b3RcF7).
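A minimal sketch of the masked gradient combination; the top-k magnitude criterion and the keep ratio below are illustrative assumptions, loosely inspired by the lottery-ticket-style selection the author cites, not the paper's exact update rule:

```python
import torch

@torch.no_grad()
def update_mask(nlp_grads, keep_ratio=0.5):
    """Rebuild a binary mask that keeps the largest-magnitude NLP-task gradient entries."""
    masks = {}
    for name, g in nlp_grads.items():
        k = max(1, int(keep_ratio * g.numel()))
        threshold = g.abs().flatten().topk(k).values.min()
        masks[name] = (g.abs() >= threshold).float()
    return masks

def combine_gradients(cv_grads, nlp_grads, masks):
    """Global gradient = CV gradient + mask * NLP gradient (element-wise)."""
    return {name: cv_grads[name] + masks[name] * nlp_grads[name] for name in cv_grads}
```

The mask would only be recomputed occasionally during training, so its cost stays small relative to updating the main network.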
Experimental results
For pre-training, the author uses ILSVRC-2012 ImageNet image classification as the CV pre-training task and masked language modeling on BooksCorpus and English Wikipedia as the NLP pre-training task, then selects several CV and NLP tasks for fine-tuning experiments. Fine-tuned BERT and ViT results serve as the baselines. The experiments are listed below:

After fine-tuning, the author's ViT-BERT achieves competitive performance with room for improvement: its average score on the CV tasks is close to ViT's, but its average on the NLP tasks still lags behind BERT's (the author attributes this to the Recognizing Textual Entailment task, i.e. RTE, dragging the score down, because there is not enough fine-tuning data for RTE).
Meanwhile, the two baselines, ViT and BERT, fine-tune poorly on tasks from the other modality, showing that fine-tuning alone cannot bridge the modality gap.
The author also runs some qualitative experiments and finds that, after a period of training, the lower layers respond strongly to only one of the image or NLP tasks, while the higher layers respond strongly to both modalities. This is consistent with the mainstream view on interpreting neural networks: as the network gets deeper, features become more abstract, and the higher layers no longer care about the specific task format and can handle more abstract semantic information.

Each sub-heatmap shows the activation output of one transformer block, going deeper from left to right and through more training steps from top to bottom. Red means a neuron responds strongly to at least one image, green means it responds strongly to at least one text, yellow (red + green) means it responds strongly to both images and text, and black means it responds to neither images nor text.
A brief comment
This is a very "Google-flavored" paper. The method is simple and brute-force and can be summarized in one sentence: knowledge distillation plus multi-task learning can train a strong model that handles multiple modalities. But while using sheer scale to work wonders, the Google researchers may also have pointed out a path for future AI research: different modalities can share as many parameters as possible for joint representation.
That said, the model's performance is not strong enough; it does not fully beat the original baselines on the CV and NLP tasks, suggesting that joint learning of the two modalities currently shows more mutual interference than mutual promotion.
The new model is even weaker on zero-shot downstream tasks. Considering that BERT can produce many surprising zero-shot results, the ViT-BERT approach in this paper is clearly far from the final form of multimodal joint modeling, and its representation of the two modalities still has room to improve.
Author: Minimum
| About Shenyan Technology |

Shenyan Technology, founded in January 2018, is a Zhongguancun high-tech enterprise and an AI service provider with world-leading artificial intelligence technology. Built on core technologies in computer vision, natural language processing, and data mining, the company has launched four platform products, the Shenyan intelligent data annotation platform, the Shenyan AI development platform, the Shenyan automatic machine learning platform, and the Shenyan AI open platform, providing enterprises with one-stop AI platform services covering data processing, model building and training, privacy computing, and industry algorithms and solutions.