[Paper close reading] Grounded Language-Image Pre-training (GLIP)
2022-07-27 14:23:00 【joyce_ peng】
1. Background
https://arxiv.org/abs/2112.03857
https://github.com/microsoft/GLIP

The task in this paper is phrase grounding, which is a kind of visual grounding. In phrase grounding, the input is a sentence and an image, and the goal is to draw a box around every object mentioned in the sentence. For other visual grounding tasks and details, see
https://zhuanlan.zhihu.com/p/388504127
GLIP can do both object detection and grounding.
- Object detection:
Reaches SOTA in object detection, performs well zero-shot, and can also be used for zero-shot object detection tasks.
Compared with conventional object detection, it carries rich semantics.
- Grounding:
Compared with conventional grounding, it can also perform object detection.
2. Contributions
Contributions
- Object detection and phrase grounding are unified for pre-training
- Visual semantics are expanded
- Strong transfer learning ability
Performance
- Trained on 27M grounding data. Shows strong zero-shot and few-shot transfer performance on object recognition tasks
- Zero-shot: 49.8 AP on COCO val, 26.9 AP on LVIS val
- After fine-tuning: 60.8 AP on COCO val
- On 13 downstream object detection tasks, a 1-shot GLIP can compete with Dynamic Head
3. Method
3.1 Method 1: Unifying detection and grounding

1. Background:
For a detection dataset:
During training, the inputs are the label names (person, hairdryer), the boxes, and the image.
At test time, the input is an image, and the model predicts boxes and label names.
The training process is as follows:
2. Detection as grounding:
The input to a grounding model is a phrase, the boxes of the nouns in the phrase, and the image.
To turn an object detection model into a grounding model, a prompt is used to convert the label names into a phrase.
For example, COCO has 80 labels: the 80 label names are joined with commas and prefixed with “Detect:” to form a short sentence.
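The prompt construction described above can be sketched in a few lines of Python. This is only an illustration: the function name `build_detection_prompt` and the character-span bookkeeping are assumptions made here, not code from the GLIP repository.

```python
def build_detection_prompt(class_names, prefix="Detect: "):
    """Join label names into one prompt and record where each label sits in it."""
    spans = {}
    cursor = len(prefix)
    for i, name in enumerate(class_names):
        if i > 0:
            cursor += len(", ")  # separator written before every label but the first
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name)
    prompt = prefix + ", ".join(class_names)
    return prompt, spans


# Three labels stand in for the 80 COCO label names.
prompt, spans = build_detection_prompt(["person", "bicycle", "hairdryer"])
print(prompt)
print(spans)
```

Keeping the character span of every label makes it possible to map tokens of the prompt back to the label they came from later on.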
Going from Formula 2 to Formula 3, the size of the target matrix T changes from N×c to N×M.
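For reference, the two formulas can be written out as below. This is a reconstruction in the notation of the GLIP paper rather than a copy of the post's original images: N is the number of region features, d the feature dimension, c the number of classes, and M the number of sub-word tokens in the prompt.

```latex
% Formula 2: classification in a standard detector
O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
S_{\mathrm{cls}} = O W^{\top}, \qquad
\mathcal{L}_{\mathrm{cls}} = \mathrm{loss}(S_{\mathrm{cls}};\, T),
\qquad O \in \mathbb{R}^{N \times d},\; W \in \mathbb{R}^{c \times d},\; T \in \{0,1\}^{N \times c}

% Formula 3: the same step reformulated as grounding
O = \mathrm{Enc}_I(\mathrm{Img}), \qquad
P = \mathrm{Enc}_L(\mathrm{Prompt}), \qquad
S_{\mathrm{ground}} = O P^{\top},
\qquad P \in \mathbb{R}^{M \times d},\; S_{\mathrm{ground}} \in \mathbb{R}^{N \times M}
```

The classifier weight W is replaced by the text features P, so the score matrix gets one column per token instead of one column per class, which is why T has to be expanded to N×M.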
Building tokens: in the flow chart above, the number of sub-word tokens M is always larger than the number of phrases c, for four reasons: 1) some phrases take up more than one token position, e.g. traffic light; 2) some phrases are split into sub-words, e.g. toothbrush is split into tooth#, #brush; 3) some tokens are added, such as commas and Detect; 4) a [NoObj] token is appended at the end. During training, if a phrase is a positive match, all of its sub-words are positive matches. At test time, the probabilities of a phrase's tokens are averaged to give the phrase probability.
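A minimal sketch of the two rules just described (expanding targets to token level for training, averaging token probabilities at test time). The tokenization is hand-written for illustration and is not the output of a real tokenizer; `expand_targets` and `phrase_scores` are hypothetical names.

```python
# Hypothetical tokenization of "Detect: person, traffic light, toothbrush" plus [NoObj].
tokens = ["detect", ":", "person", ",", "traffic", "light", ",",
          "tooth", "##brush", "[NoObj]"]
# Token indices owned by each of the c = 3 phrases; added tokens belong to no phrase.
phrase_to_tokens = [[2], [4, 5], [7, 8]]


def expand_targets(box_to_phrase, phrase_to_tokens, num_tokens):
    """Turn per-box phrase matches (N x c) into token-level targets (N x M)."""
    targets = []
    for phrase_idx in box_to_phrase:
        row = [0] * num_tokens  # added tokens such as ',' and [NoObj] stay negative
        if phrase_idx is not None:
            for t in phrase_to_tokens[phrase_idx]:
                row[t] = 1  # every sub-word of a positive phrase is positive
        targets.append(row)
    return targets


def phrase_scores(token_probs, phrase_to_tokens):
    """Average the token probabilities of each phrase for one box."""
    return [sum(token_probs[t] for t in idxs) / len(idxs) for idxs in phrase_to_tokens]


# Box 0 matches "traffic light"; box 1 matches nothing.
targets = expand_targets([1, None], phrase_to_tokens, len(tokens))
# Made-up token probabilities for a single box at test time.
probs = [0.0, 0.0, 0.125, 0.0, 0.75, 0.25, 0.0, 0.5, 0.0, 0.0]
scores = phrase_scores(probs, phrase_to_tokens)
print(targets[0])
print(scores)
```

Here M = 10 and c = 3, which shows concretely why M is larger than c even for a three-label prompt.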
3. Linking detection and grounding: with the method above, a grounding model can be pre-trained on detection data, so a GLIP model can be transferred to zero-shot detection.
3.2 Method 2: Deep fusion of vision and language

The formulas for the fusion part are as follows:
O0 is the feature from the visual backbone, and P0 is the feature from the text backbone.
X-MHA is the cross-modality multi-head attention module.
L is the number of DyHeadModules in DyHead; the BERT layers are newly added.
This kind of attention is common in multimodal models, e.g. co-attention and guided attention. Other multimodal attention work can be consulted for further optimization.
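To make the fusion step concrete, here is a deliberately simplified sketch of one bidirectional cross-attention step in plain Python. It assumes a single head and identity projection matrices, and it leaves out the DyHeadModule and BERT layer that would consume the fused features, so it illustrates the idea rather than the paper's actual X-MHA. All function names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add(a, b):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(a, b)]


def cross_modal_attention(O, P):
    """O: N x d visual features, P: M x d text features (identity projections assumed)."""
    d = len(O[0])
    attn = [[v / math.sqrt(d) for v in row] for row in matmul(O, transpose(P))]  # N x M
    O_t2i = matmul(softmax_rows(attn), P)             # text-to-image context, N x d
    P_i2t = matmul(softmax_rows(transpose(attn)), O)  # image-to-text context, M x d
    return O_t2i, P_i2t


def fuse(O, P):
    """Residual fusion; the results would feed the next DyHeadModule / BERT layer."""
    O_t2i, P_i2t = cross_modal_attention(O, P)
    return add(O, O_t2i), add(P, P_i2t)


O = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy region features
P = [[1.0, 0.0], [0.0, 2.0]]              # 2 toy token features
O_next, P_next = fuse(O, P)
print(O_next)
print(P_next)
```

The same attention matrix is used in both directions (once as is, once transposed), so each region feature is updated with text context and each token feature with visual context in the same step.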
Advantages of deep fusion:
- Improves phrase grounding performance
- Makes the visual features language-aware
3.3 Method 3: Pre-training with semantically rich data
Grounding datasets are semantically very rich. Object detection datasets have no more than 2,000 categories, whereas a grounding dataset such as Flickr30K contains about 44K (4.4w) different phrases, which is a different order of magnitude.
How the grounding data is scaled up:
- Train a teacher GLIP on gold data (detection + grounding)
- Use the teacher model to predict boxes on 24M web image-text data; noun phrases are extracted with an NLP parser, giving 58.4M different noun phrases
- Train the student model on the gold data plus the pseudo-labeled grounding data
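The three steps above amount to a pseudo-labeling loop, sketched below with toy stand-ins. `toy_parser` and `toy_teacher` are hypothetical placeholders for the NLP parser and the teacher GLIP, and the 0.5 confidence cutoff is an assumption of this sketch.

```python
def pseudo_label(teacher, web_pairs, extract_noun_phrases, min_score=0.5):
    """Let the teacher ground each caption's noun phrases and keep confident boxes."""
    labeled = []
    for image, caption in web_pairs:
        phrases = extract_noun_phrases(caption)
        boxes = [
            (phrase, box, score)
            for phrase, box, score in teacher(image, caption, phrases)
            if score >= min_score
        ]
        if boxes:
            labeled.append({"image": image, "caption": caption, "boxes": boxes})
    return labeled


def build_student_data(gold_data, pseudo_data):
    """The student is trained on gold data plus the pseudo-labeled grounding data."""
    return list(gold_data) + list(pseudo_data)


def toy_parser(caption):
    # Stand-in for an NLP parser: treats comma-separated chunks as noun phrases.
    return [chunk.strip() for chunk in caption.split(",") if chunk.strip()]


def toy_teacher(image, caption, phrases):
    # Stand-in for the teacher model: a fixed box per phrase, scores from a lookup table.
    scores = {"a small vial of vaccine": 0.9, "a blurry background": 0.2}
    return [(p, (0, 0, 10, 10), scores.get(p, 0.0)) for p in phrases]


web_pairs = [("img_001.jpg", "a small vial of vaccine, a blurry background")]
pseudo = pseudo_label(toy_teacher, web_pairs, toy_parser)
student_data = build_student_data([{"image": "gold_001.jpg"}], pseudo)
print(pseudo)
print(len(student_data))
```

Note that the whole phrase is stored as the label of the box, which is what lets the student see phrases that the teacher only localized by context.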
Effect of scaling up:
The student model performs better than the teacher model. For example, the teacher may not be able to localize a word such as vaccine, but it can localize a small vial; once the sub-words are right, the whole phrase is right. When pseudo-labeled data is generated for the student, the whole phrase a small vial of vaccine is given to the student model as the label to learn from.
4. Experimental results

FourODs (2.66M data) is a collection of 4 detection datasets: Objects365, OpenImages, VG (excluding COCO images), and ImageNetBoxes.
GoldG+ contains 1.3M data, including Flickr30K, VG Caption, and GQA.
GoldG is GoldG+ with the COCO data removed.
4.1 Transfer to detection datasets
Zero-shot on COCO:
- Image-text data does not bring an improvement
- Model C improves noticeably over model B
- Objects365 covers all 80 COCO categories

Results on LVIS:
LVIS: a large-scale, fine-grained vocabulary annotation dataset with 1000+ categories; even pineapple chunks on a pizza are annotated.
Gold grounding data helps (model C vs. model B).
4.2 On grounding datasets
Flickr30K: an image-text grounding dataset; GoldG includes this dataset.
4.3 Ablation: effect of the detection datasets
O365: 0.66M
GoldG: 0.8M
FourODs: 2.66M
However, FourODs does not perform as well as O365+GoldG.
4.4 Other
If localization is poor, descriptive words can be added to the prompt to help the model locate the object; the example in the figure adds flat and round.