ICML 2022 | ByteDance AI Lab proposes X-VLM, a multimodal model that learns multi-grained vision-language alignment
2022-06-25 03:41:00 【QbitAI】
A note up front
Vision-language pre-training improves performance on many downstream vision-language tasks, for example image-text retrieval and image-based question answering or reasoning. A natural question is: beyond pushing benchmark numbers on open academic tasks ever higher with bigger models, more data, and more tricks, what practical uses does a multimodal pre-trained model actually have?

In response, the ByteDance AI Lab Research team proposed X-VLM, the first model to learn multi-grained alignment between vision and language. Experiments show that this pre-training approach is highly efficient: the model does not have to be very large, and it does not require massive pre-training data. With only 216M parameters, X-VLM achieves excellent performance on a wide range of multimodal tasks, such as image-text retrieval, image-based question answering or reasoning, visual grounding, and image caption generation. In ByteDance's real application scenarios, X-VLM already outperforms many models commonly used in industry; it has been deployed online and serves products such as Toutiao. The paper has been accepted by ICML 2022.

Paper: https://arxiv.org/abs/2111.08276
Code: https://github.com/zengyan-97/X-VLM
For example, because X-VLM learns multi-grained vision-language alignment, it can generate captions that more accurately describe the objects in an image and the relations between them. This capability has been applied in a ByteDance public-welfare project. Mr. Zhao, who is visually impaired, often uses Toutiao to follow current affairs and news, and has long hoped to "see" all the information content just like everyone else. More than two-thirds of Toutiao articles contain images. To help visually impaired users read these images, the Toutiao app recently adopted X-VLM's generation capability to automatically recognize images and produce descriptions of them.

To let them "see" every picture, we made a small improvement.
In addition, X-VLM's understanding and generation capabilities are also used in the automatic homework-checking feature of the Dali smart learning lamp. The figure below shows a fill-in-the-phrase question and the result predicted by the model:

The smart learning lamp's automatic checking feature has been widely praised by parents, and this capability is still being continuously optimized.

Research background

Existing multimodal pre-training models can be roughly divided into two categories:
1) Methods that rely on an object detector to extract object-centric features (for example: vehicles, people, trees, backpacks) to represent an image. These methods can learn object-level vision-language alignment, as shown in Figure 1(a). They either use a pre-trained object detector directly, or fold the object-detection process into the multimodal pre-training;
2) Methods that encode the whole image with a ResNet or a Vision Transformer and only learn alignment between the image and the text, as shown in Figure 1(b).
Both approaches have problems. First, detection-based methods identify all possible objects in an image, some of which are irrelevant to the paired text. Moreover, the object-centric visual features they extract may lose the information between objects (which can be regarded as a kind of contextual information), and they can only recognize a limited number of objects, making it difficult to predefine a suitable set of object categories. The second approach is simple and direct, but it struggles to learn fine-grained vision-language alignment, for example object-level alignment. Such fine-grained alignment has been shown by previous work to be very helpful for visual reasoning and visual grounding tasks.
In fact, the following public data is available for multimodal pre-training: 1) images and their captions; 2) region annotations, for example the text "man crossing the street" in Figure 1 is associated with a specific region of the image (previous work, however, simply aligned such region descriptions with the whole image); 3) object labels, such as "backpack", which previous work used to train object detectors.
Different from previous practice, in this paper the authors propose X-VLM, which uses the above data in a unified way to efficiently learn multi-grained vision-language alignment. It avoids the costly object-detection process and is not limited to learning only image-level or object-level alignment. Concretely, the authors propose using the patch embeddings of the Vision Transformer to flexibly represent visual concepts of various granularities, as shown in Figure 1(c): for example, the visual concept "backpack" consists of 2 patches, while the visual concept "man crossing the street" consists of more patches.
Therefore, the key to how X-VLM learns multi-grained vision-language alignment is:
1) use patch embeddings to flexibly represent visual concepts of various granularities, and directly pull visual concepts of different granularities toward their corresponding texts, optimized with the commonly used contrastive loss, matching loss, and MLM loss;
2) furthermore, for the same image, given different texts, require the model to predict the coordinates of the visual concept at the corresponding granularity, optimized with a bounding-box coordinate regression loss and an intersection-over-union loss. Experiments show that this pre-training approach is very efficient: the model does not need to be very large, nor does it need massive pre-training data, and X-VLM performs excellently on downstream multimodal understanding/generation tasks.
Method

X-VLM consists of an image encoder, a text encoder, and a cross-modal encoder.
The left side of Figure 2 shows the encoding process for a visual concept (which can be an object, a region, or the whole image). The image encoder, based on a Vision Transformer, splits the input image into patches and encodes them. Then, for any given bounding box, the representations of all patches inside the box are averaged to flexibly obtain a global representation of the region. This global representation is concatenated with all the patch representations inside the box, arranged in their original order, to form the representation of the visual concept corresponding to that bounding box. In this way the model obtains encodings of the image itself (I) and of the visual concepts within it (V1, V2, V3). The texts corresponding to the visual concepts, such as image captions, region descriptions, or object labels, are encoded one by one by the text encoder.
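To make this region-encoding step concrete, here is a minimal PyTorch-style sketch written from the description above. It is not the official X-VLM code; the tensor names, the patch-grid geometry, and the box format are illustrative assumptions.

```python
import torch

def encode_region(patch_embeds, box, grid_size):
    """Build the representation of a visual concept (object / region / image).

    patch_embeds: (N, D) patch embeddings from the Vision Transformer,
                  laid out row-major over a grid_size x grid_size grid.
    box:          (x1, y1, x2, y2) bounding box in patch-grid coordinates.
    Returns a (K+1, D) sequence: one averaged "global" token followed by
    the in-box patch embeddings in their original order.
    """
    x1, y1, x2, y2 = box
    idx = [r * grid_size + c
           for r in range(y1, y2)
           for c in range(x1, x2)]                          # patches inside the box
    region_patches = patch_embeds[idx]                      # keep original order
    global_token = region_patches.mean(dim=0, keepdim=True)  # average pooling
    return torch.cat([global_token, region_patches], dim=0)

# Example: a 14x14 patch grid (e.g., a 224x224 image with 16x16 patches)
patch_embeds = torch.randn(14 * 14, 768)
backpack = encode_region(patch_embeds, box=(3, 5, 5, 9), grid_size=14)    # small object
whole_image = encode_region(patch_embeds, box=(0, 0, 14, 14), grid_size=14)
```

The same routine covers every granularity: a small object uses a few patches, a region uses more, and the whole image uses them all.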
X-VLM adopts a common model structure; the difference lies in the pre-training method. The authors optimize the model with the following two types of losses:
First, for the same image, given different texts, for example T (text), T1 (text1), T2 (text2), T3 (text3), the model is required to predict the bounding box of the corresponding visual concept in the image:


x^cls_j is the output vector of the cross-modal encoder at the [CLS] position. The Sigmoid function normalizes the predicted bounding box. The corresponding ground truth b_j consists, in order, of the normalized center x-coordinate, center y-coordinate, width, and height. This loss is the sum of the bounding-box coordinate regression loss (L1) and the generalized intersection-over-union loss (GIoU). The authors argue that requiring the model, for the same image, to predict the visual concept corresponding to each different text lets it learn multi-grained vision-language alignment more effectively. This is also the first time such a loss has been used in multimodal pre-training.
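The following is a hedged sketch of what this bounding-box objective could look like in PyTorch, based only on the description above (a Sigmoid-normalized prediction from the [CLS] vector, trained with L1 plus GIoU). The MLP head sizes and the hand-rolled GIoU implementation are assumptions for illustration, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cxcywh_to_xyxy(b):
    # (cx, cy, w, h) -> (x1, y1, x2, y2), all normalized to [0, 1]
    cx, cy, w, h = b.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def giou_loss(pred, target):
    """Generalized IoU loss for matched box pairs in (x1, y1, x2, y2) format."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=1e-6)
    # smallest box enclosing both
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    enclose = ((ex2 - ex1) * (ey2 - ey1)).clamp(min=1e-6)
    giou = iou - (enclose - union) / enclose
    return (1 - giou).mean()

class BoxHead(nn.Module):
    """Predicts a normalized (cx, cy, w, h) box from the cross-modal [CLS] vector."""
    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, cls_vec):
        return torch.sigmoid(self.mlp(cls_vec))   # keeps coordinates in [0, 1]

def bbox_loss(cls_vec, gt_cxcywh, head):
    pred = head(cls_vec)
    l1 = F.l1_loss(pred, gt_cxcywh)
    giou = giou_loss(cxcywh_to_xyxy(pred), cxcywh_to_xyxy(gt_cxcywh))
    return l1 + giou
```

Because the [CLS] vector comes from the cross-modal encoder, the predicted box changes with the text: the same image paired with "backpack" or "man crossing the street" should yield differently sized boxes.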
Second, patch embeddings are used to flexibly represent visual concepts of various granularities, and the model is directly optimized to pull texts and visual concepts of different granularities together, covering object-text, region-text, and image-text alignment. The authors use three losses commonly employed in multimodal pre-training (a minimal code sketch of all three appears after the list below), namely:
1) Contrastive loss:

y_v2t, y_t2v ∈ R^{bsz×bsz} are the ground-truth similarities, with 1 on the diagonal and 0 elsewhere.
p_v2t, p_t2v ∈ R^{bsz×bsz} are the similarities computed by the model from the outputs of the text encoder and the image encoder.
2) Matching loss:

p_match is computed from the cross-modal encoder and predicts whether a given (visual concept, text) pair matches (in other words, a 0/1 classification). For each positive example, the authors sample a negative pair.
3) Masked Language Modeling loss:

T̂ is the text with some tokens randomly replaced by [MASK], and p_j(V, T̂) is the probability distribution over the vocabulary computed by the cross-modal encoder from its output vector at the position of token t_j.
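Below is a minimal, self-contained PyTorch sketch of what these three alignment losses can look like, written from the descriptions above rather than taken from the official X-VLM repository. The temperature value, head dimension, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_loss(vis_feats, txt_feats, temperature=0.07):
    """In-batch contrastive loss between visual concepts and texts.

    vis_feats, txt_feats: (bsz, D) projected features from the image and
    text encoders. The ground-truth similarity matrix is 1 on the diagonal
    (matched pairs) and 0 elsewhere, so the targets are simply arange(bsz).
    """
    v = F.normalize(vis_feats, dim=-1)
    t = F.normalize(txt_feats, dim=-1)
    logits_v2t = v @ t.T / temperature          # (bsz, bsz) similarity scores
    logits_t2v = logits_v2t.T
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits_v2t, targets) +
            F.cross_entropy(logits_t2v, targets)) / 2

class MatchHead(nn.Module):
    """0/1 classifier on the cross-modal [CLS] vector: does the pair match?"""
    def __init__(self, dim=768):
        super().__init__()
        self.fc = nn.Linear(dim, 2)

    def forward(self, cls_vec):
        return self.fc(cls_vec)

def matching_loss(head, pos_cls, neg_cls):
    """pos_cls: [CLS] vectors of matched (visual concept, text) pairs;
    neg_cls: [CLS] vectors of sampled negative pairs."""
    logits = torch.cat([head(pos_cls), head(neg_cls)], dim=0)
    labels = torch.cat([
        torch.ones(pos_cls.size(0), dtype=torch.long, device=pos_cls.device),
        torch.zeros(neg_cls.size(0), dtype=torch.long, device=neg_cls.device),
    ], dim=0)
    return F.cross_entropy(logits, labels)

def mlm_loss(token_logits, labels):
    """token_logits: (bsz, seq_len, vocab) vocabulary distributions predicted
    by the cross-modal encoder for the masked text, conditioned on the visual
    concept. labels: original token ids at [MASK] positions, -100 elsewhere
    (the conventional ignore index for cross_entropy)."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)
```

In X-VLM these objectives are applied to pairs at every granularity, so the same losses cover object-text, region-text, and image-text alignment.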
Experiments
The authors experiment with the 4M and 16M image datasets commonly used for medium-sized models in multimodal pre-training, as shown in the table below:

Here, the annotation count (# Ann) is the total number of region annotations and object labels. As can be seen, some datasets have no image captions, for example Visual Genome (VG), while some datasets have no annotations, for example CC-3M/12M.

Table 2 shows performance on image-text retrieval (MSCOCO and Flickr30K). Even though previous methods were pre-trained on larger amounts of in-house data or used larger models, X-VLM trained on the 4M image dataset can surpass them.

Table 3 shows model performance on visual question answering and reasoning (VQA v2.0 and NLVR2), visual grounding (RefCOCO+), and image caption generation (COCO Caption). For a fair comparison, X-VLM follows the fine-tuning approaches of previous work without any additional tuning. Combining Tables 2 and 3, it can be seen that, compared with previous methods, X-VLM supports more kinds of downstream tasks and achieves very strong performance on these common vision-language tasks.
Summary and discussion
In this paper, the authors propose X-VLM to learn multi-grained vision-language alignment, which avoids the costly object-detection process and is not limited to learning only image-level or object-level alignment. The keys to X-VLM's success are:
1) using patch embeddings to flexibly represent visual concepts of various granularities, and directly pulling visual concepts of different granularities toward their corresponding texts;
2) furthermore, for the same image, given different texts, requiring the model to predict the coordinates of the corresponding visual concept. Experiments show that this pre-training approach is very efficient.
In the experiments, using the commonly adopted 4M and 16M datasets, X-VLM with only 216M trainable parameters surpasses larger models and models pre-trained on huge amounts of data, performing excellently on downstream multimodal understanding/generation tasks. ByteDance engineers have also applied X-VLM to real business scenarios, for example describing image content for visually impaired users and automatically checking pupils' homework. In fact, X-VLM is also very good at tasks such as fine-grained retrieval and visual grounding.
X-VLM's code is now open source, and readers are welcome to try fine-tuning it on their own tasks.

* This article is published with the authorization of QbitAI; the views expressed are solely the author's.
— End —