
Institute of Automation, Chinese Academy of Sciences: A Survey of Recent Vision-Language Pre-training

2022-06-10 16:48:00 PaperWeekly


Paper title:

VLP: A Survey on Vision-Language Pre-training

Paper link:

https://arxiv.org/abs/2202.09061


Abstract

In the last few years, the emergence of pre-trained models has ushered single-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. Extensive work has shown that these models benefit downstream single-modal tasks and avoid training new models from scratch. Can such pre-trained models be applied to multimodal tasks? Researchers have explored this question and made significant progress.

This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), covering both image-text and video-text pre-training. To give readers a better grasp of VLP, we first review its recent progress from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. We then summarize the specific VLP models, and finally discuss new frontiers of VLP. To the best of our knowledge, this is the first survey of the VLP field. We hope this review can shed light on future research in VLP.


Introduction

Getting machines to respond in ways similar to humans has been a long-standing goal of AI researchers. To enable machines to perceive and think, researchers have proposed a series of related tasks, such as face recognition, reading comprehension, and human-machine dialogue, to train and evaluate machine intelligence in specific respects. Concretely, domain experts manually construct standard datasets, on which relevant models are then trained and evaluated.

However, limited by the available techniques, models often had to be trained on large amounts of labeled data to become stronger and more capable. The recent emergence of pre-trained models based on the Transformer architecture has alleviated this problem. These models are first pre-trained via self-supervised learning, which typically uses auxiliary tasks (pre-training objectives) to automatically mine supervision signals from large-scale unlabeled data, thereby learning general representations.

Then, with only a small amount of manually labeled data for fine-tuning on downstream tasks, they can achieve impressive results. Since BERT emerged in natural language processing (NLP), various pre-trained models have sprung up in single-modal fields, such as the Vision Transformer (ViT) in computer vision (CV) and Wav2Vec in speech. Extensive work has shown that they benefit downstream single-modal tasks and avoid training new models from scratch.

As in single-modal fields, high-quality labeled data is also scarce in the multimodal field. A natural question is whether the pre-training approach above can be applied to multimodal tasks. Researchers have explored this question and made significant progress. In this paper, we focus on mainstream vision-language pre-training (VLP), covering image-text and video-text pre-training.

VLP mainly learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, we expect the model to associate "dog" in the text with "dog" in the image. In video-text pre-training, we expect the model to map objects/actions in the text to objects/actions in the video. To achieve this, the VLP objectives and model architecture must be cleverly designed so that the model can mine the associations between modalities.

To help readers better understand VLP, we comprehensively review its recent progress from five important aspects:

1) Feature extraction: this part covers how VLP models preprocess and represent images, video, and text (see Section 3);

2) Model architecture: we introduce VLP model architectures from two perspectives: single-stream versus dual-stream from the perspective of multimodal fusion, and encoder-only versus encoder-decoder from the perspective of overall architectural design (see Section 4);

3) Pre-training objectives: pre-training objectives are the heart of VLP, mainly guiding the model to learn vision-language associations. We summarize the objectives used during pre-training and divide them into completion, matching, temporal, and particular types (see Section 5);

4) Pre-training datasets: data are crucial to VLP. We briefly introduce the mainstream VLP corpora and their sizes (see Section 6);

5) Downstream tasks: various tasks require the cooperation of vision and language knowledge. We divide them into five categories: classification, regression, retrieval, generation, and other tasks, and also discuss their basic details and goals (see Section 7).

We then summarize the specific state-of-the-art (SOTA) VLP models (see Section 8). Finally, we conclude the paper and discuss new frontiers of VLP (see Section 9).

To the best of our knowledge, this is the first survey of the VLP field. We hope our review helps researchers better understand this field and inspires them to design better models.


Feature extraction

This section describes how VLP models preprocess and represent images, video, and text to obtain the corresponding features.

3.1 Feature preprocessing

Image feature preprocessing mainly includes three approaches: region features based on object detection, grid features based on CNNs, and patch features based on ViT.
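As a concrete illustration of the third route, here is a minimal ViT-style patch extractor (PyTorch; all shapes and hyperparameters are illustrative choices of ours, since the survey itself prescribes no code):

```python
import torch
import torch.nn as nn

# Minimal ViT-style patch embedding: a 224x224 image becomes a sequence
# of (224/16)^2 = 196 patch tokens of dimension 768.
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        # A strided convolution is equivalent to cutting the image into
        # non-overlapping patches and linearly projecting each one.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768)

patches = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```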

Video feature preprocessing: the video is first split into frames to obtain an image sequence, and each frame is then processed with the image feature preprocessing methods above.
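One common concrete choice (our assumption; the survey does not mandate a sampling scheme) is to sample frames uniformly and reuse the image pipeline per frame:

```python
import torch

def sample_frames(video: torch.Tensor, num_frames: int = 8) -> torch.Tensor:
    """Uniformly sample `num_frames` from a (T, C, H, W) clip.

    Each sampled frame can then be passed through the same region-,
    grid-, or patch-based extractor used for still images.
    """
    t = video.shape[0]
    idx = torch.linspace(0, t - 1, num_frames).long()
    return video[idx]                  # (num_frames, C, H, W)

clip = torch.randn(64, 3, 224, 224)    # a 64-frame clip (illustrative)
frames = sample_frames(clip)           # (8, 3, 224, 224)
```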

Text feature preprocessing: this mainly follows BERT's preprocessing, segmenting the input sentence into a subword sequence and inserting [CLS] and [SEP] at the beginning and end; the final input representation is the sum of word embeddings, position embeddings, and segment embeddings.
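As one concrete realization (using the HuggingFace transformers library, which is our tooling choice rather than the survey's), the preprocessing looks like this:

```python
from transformers import BertTokenizer

# BERT-style text preprocessing: subword segmentation plus [CLS]/[SEP].
tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("a dog playing on the grass", return_tensors="pt")

print(tok.convert_ids_to_tokens(enc["input_ids"][0].tolist()))
# ['[CLS]', 'a', 'dog', 'playing', 'on', 'the', 'grass', '[SEP]']

# `token_type_ids` carries the segment embedding indices; position
# embeddings are added inside the model, so the final input is
# word embedding + position embedding + segment embedding.
print(enc["token_type_ids"])
```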

3.2 Feature representation

To make full use of single-modal pre-trained models, VLP models can feed visual or textual features into Transformer encoders. Specifically, a VLP model can use a standard, randomly initialized Transformer encoder to generate visual or textual representations. In addition, VLP models can use a pre-trained vision Transformer, such as ViT or DeiT, to encode ViT-based patch features, and a pre-trained text Transformer, such as BERT, to encode textual features. For brevity, we refer to these Transformers as Xformer.

See Section 2 of the paper for more details.


Model architecture

In this section, we introduce VLP model architectures from two perspectives: (1) single-stream versus dual-stream, from the perspective of multimodal fusion; and (2) encoder-only versus encoder-decoder, from the perspective of overall architectural design.

[Figure 1: (a) single-stream and (b) dual-stream architectures]

4.1 Single-stream versus Dual-stream

The single-stream architecture concatenates text and visual features and feeds them into a single Transformer block, as shown in Figure 1(a).

The dual-stream architecture does not concatenate text and visual features; instead, they are fed independently into two separate Transformer blocks, as shown in Figure 1(b).
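The following minimal sketch contrasts the two fusion styles (PyTorch; sequence lengths and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

d = 768
text = torch.randn(1, 16, d)     # 16 text tokens (illustrative)
vision = torch.randn(1, 196, d)  # 196 visual tokens (illustrative)

# Single-stream: concatenate both sequences and run one Transformer,
# whose self-attention mixes the modalities, as in Figure 1(a).
single = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True),
    num_layers=1)
fused = single(torch.cat([text, vision], dim=1))   # (1, 212, 768)

# Dual-stream: each modality keeps its own encoder, and cross-modal
# interaction happens via cross-attention between the two streams,
# as in Figure 1(b). Here text queries attend to visual keys/values.
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=12,
                                   batch_first=True)
text_attending_vision, _ = cross_attn(query=text, key=vision, value=vision)
```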

4.2 Encoder-only versus Encoder-decoder

Many VLP models adopt an encoder-only architecture, where the cross-modal representation is fed directly into an output layer to produce the final outputs. By comparison, other VLP models advocate a Transformer encoder-decoder architecture, where the cross-modal representation is first fed into a decoder and then into an output layer.
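A minimal sketch of the difference (again PyTorch with illustrative shapes; neither variant is tied to a specific model in the survey):

```python
import torch
import torch.nn as nn

d, vocab = 768, 30522
memory = torch.randn(1, 212, d)   # some cross-modal representation

# Encoder-only: the cross-modal representation goes straight into a
# task-specific output layer (here a toy two-way classification head).
cls_head = nn.Linear(d, 2)
logits = cls_head(memory[:, 0])   # read out the [CLS] position

# Encoder-decoder: the representation is consumed as `memory` by a
# Transformer decoder, which produces tokens step by step before the
# output layer; this suits generation tasks such as captioning.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=d, nhead=12, batch_first=True),
    num_layers=1)
tgt = torch.randn(1, 5, d)        # embeddings of 5 tokens generated so far
out = decoder(tgt=tgt, memory=memory)    # (1, 5, 768)
token_logits = nn.Linear(d, vocab)(out)  # (1, 5, vocab)
```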

See Section 3 of the paper for more details.


Pre-training objectives

This section describes how VLP models are pre-trained with different pre-training objectives, which are essential for learning universal vision-language representations. We summarize the pre-training objectives into four categories: completion, matching, temporal, and particular types.

Completion objectives learn to understand a modality by reconstructing the masked elements from the unmasked remainder; they include Masked Language Modeling, Prefix Language Modeling, Masked Vision Modeling, etc. (a code sketch of this family and the matching family follows this list);

Matching objectives unify vision and language into a shared latent space to produce universal vision-language representations; they include Vision-Language Matching, Vision-Language Contrastive Learning, Word-Region Alignment, etc.;

Temporal objectives learn good representations by reordering a disrupted input sequence; they are used mainly in video-related pre-training, e.g., Frame Order Modeling;

Particular types cover the remaining pre-training objectives, such as visual question answering and visual captioning.
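To make the first two families concrete, here is a minimal sketch of a masked-language-modeling target and a CLIP-style contrastive loss (PyTorch; batch size, vocabulary, and temperature are illustrative assumptions, not values from the survey):

```python
import torch
import torch.nn.functional as F

# Completion (Masked Language Modeling): hide ~15% of token positions
# and train the model to reconstruct them from the unmasked remainder.
input_ids = torch.randint(0, 30522, (1, 16))
mask = torch.rand(input_ids.shape) < 0.15
labels = input_ids.masked_fill(~mask, -100)  # loss only on masked slots
# In practice the masked input positions are also replaced by a [MASK]
# token before encoding; the loss is then
# F.cross_entropy(logits.view(-1, vocab), labels.view(-1), ignore_index=-100)

# Matching (Vision-Language Contrastive Learning, InfoNCE): pull each
# image embedding toward its paired text embedding and push it away
# from the other pairs in the batch.
img = F.normalize(torch.randn(8, 256), dim=-1)  # 8 image embeddings
txt = F.normalize(torch.randn(8, 256), dim=-1)  # 8 paired text embeddings
logits = img @ txt.t() / 0.07                   # similarity / temperature
targets = torch.arange(8)                       # i-th image matches i-th text
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```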

See Section 4 of the paper for more details.


Pre-training datasets

[Table 1: mainstream pre-training corpora and their sizes]

Most VLP datasets are built by combining public datasets from different multimodal tasks. However, some previous works, such as VideoBERT, ImageBERT, ALIGN, and CLIP, process massive data collected from the Internet and train on their own datasets. Some mainstream corpora and their sizes are shown in Table 1.


Downstream tasks

Various tasks require the cooperation of vision and language knowledge. In this section, we introduce the basic details and goals of such tasks and divide them into five categories: classification, regression, retrieval, generation, and other tasks, where the classification, regression, and retrieval tasks are also known as understanding tasks.

Classification tasks mainly include Visual Question Answering (VQA), Natural Language for Visual Reasoning (NLVR), Visual Commonsense Reasoning (VCR), etc.;

Regression tasks include Multi-modal Sentiment Analysis (MSA);

Retrieval tasks mainly refer to vision-language retrieval tasks;

Generation tasks include Visual Dialogue (VD), Visual Captioning (VC), etc.;

Other tasks include Multi-modal Machine Translation (MMT), Vision-Language Navigation (VLN), etc.

See Section 6 of the paper for more details.


SOTA VLP models

Based on the five aspects of VLP models discussed above, we summarize the representative VLP models in the table below:

[Table: summary of representative SOTA VLP models]

See Section 7 of the paper for more details.


Summary and new frontiers

In this paper, we present the first survey of VLP. We review its recent progress in terms of feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks, and summarize the specific SOTA VLP models in detail. We hope our review helps researchers better understand VLP and stimulates new work that advances this field. Building on existing work, VLP can be further developed in the following directions:

1) Incorporating acoustic information. Most previous work on multimodal pre-training has emphasized the joint modeling of language and vision while ignoring the information hidden in audio. Although the semantic information in audio may overlap with that in language, audio provides additional signals such as emotion and acoustic boundaries. Moreover, pre-training with audio enables models to handle downstream tasks with acoustic input.

So far, joint modeling and representation across text, vision, and audio remains an open problem awaiting further study. Some frontier works have illuminated the future of this research area. Unlike previous VLP models, VATT takes raw audio as input and learns multimodal representations via noise contrastive estimation (NCE).

Unlike VATT, OPT combines various multi-level masking strategies to learn cross-modal representations across text, images, and audio, and can also generate text and images. Other works, such as AudioCLIP and MERLOT Reserve, likewise demonstrate their own approaches to learning cross-modal representations over the three modalities;

2) Knowledgeable learning and cognition. Although existing VLP models have achieved remarkable performance, they essentially fit large-scale multimodal datasets. Making VLP models knowledgeable is important for future VLP. For the input vision and text, rich external commonsense world knowledge and illustrative scenarios can be used to augment the input and accelerate model training and inference. Solving this problem requires a unified cognitive model architecture, knowledge-guided pre-training objectives, and support for interacting with new knowledge;

3) Prompt tuning. Currently, fine-tuning is the dominant method for transferring the knowledge of VLP models to downstream tasks. However, as model scale grows, each downstream task keeps its own set of fine-tuned parameters, leading to parameter inefficiency. Moreover, diverse downstream tasks make the design of the pre-training and fine-tuning stages cumbersome, creating a gap between them.

Recently, prompt tuning has attracted increasing attention in NLP. By designing discrete or continuous prompts and casting specific downstream tasks as MLM, these models can (a) reduce the computational cost of fine-tuning large numbers of parameters and (b) bridge the gap between pre-training and fine-tuning. Prompt tuning is a promising approach to eliciting the linguistic and world knowledge distributed in pre-trained language models (PLMs). The next step is to improve it and migrate it to multimodal scenarios, breaking the traditional paradigm and addressing the pain points of VLP.
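As a minimal sketch of the continuous-prompt idea in code (PyTorch; a generic frozen encoder stands in for the VLP backbone, and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

d, n_prompts = 768, 10

# A stand-in for a pre-trained backbone, kept entirely frozen.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=12, batch_first=True),
    num_layers=2)
for p in backbone.parameters():
    p.requires_grad = False

# The only new trainable weights: a few continuous prompt vectors
# prepended to the embedded input of every downstream example.
prompt = nn.Parameter(torch.randn(1, n_prompts, d))

x = torch.randn(2, 16, d)  # embedded downstream inputs (batch of 2)
out = backbone(torch.cat([prompt.expand(x.size(0), -1, -1), x], dim=1))
# Only `prompt` (plus a lightweight task head on `out`) is optimized,
# so each downstream task adds a handful of parameters instead of a
# full fine-tuned copy of the backbone.
```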

Acknowledgements

Thanks to the Tianqiao and Chrissy Chen Institute (TCCI) for supporting PaperWeekly. TCCI focuses on brain discovery, brain function, and brain health.
