"Target detection" + "visual understanding" realizes the understanding of the input image
2022-07-01 04:00:00 【AI vision netqi】
This paper proposes GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and vision-language (VL) understanding tasks (e.g., VQA, image captioning).
Paper: https://arxiv.org/pdf/2206.05836.pdf
Code: https://github.com/microsoft/GLIP
The smallest pre-trained model is 2.5 GB.
01 Overview
GLIPv2 elegantly unifies localization pre-training and vision-language pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure, but also achieves mutual benefit between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves SoTA performance on various localization and understanding tasks. The model also shows:

- strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks;
- superior grounding capability on VL understanding tasks.
02 Background
Recently, there has been broad interest in building general-purpose vision systems, also known as vision foundation models, which can solve various vision tasks simultaneously, e.g., image classification, object detection, and vision-language (VL) understanding. Of particular interest is the unification of localization tasks (e.g., object detection and segmentation) and VL understanding tasks (e.g., VQA and image captioning).

Localization pre-training benefits VL tasks, and the two-stage "localization -> VLP" pre-training process is common practice in the VL community. A long-standing challenge is unifying localization and understanding, aiming for mutual benefit between the two kinds of tasks, a simpler pre-training procedure, and lower pre-training cost.

However, the two kinds of tasks appear quite different: localization tasks are vision-only and require fine-grained outputs (e.g., bounding boxes or pixel masks), while VL understanding tasks emphasize the fusion of the two modalities and require high-level semantic outputs (e.g., answers or captions).
03 New Framework

Left: GLIPv2, a pre-trained grounded VL understanding model, unifies various localization and VL understanding tasks. The two kinds of tasks mutually benefit each other and enable new capabilities such as language-guided detection/segmentation and grounded VQA/captioning. Right: additional examples from ODinW (detection), LVIS (segmentation), VQA, and COCO Captioning.
A Unified VL Formulation and Architecture
At the core of GLIPv2's unified formulation is the classification-to-matching technique, which reformulates any task-specific, fixed-vocabulary classification problem as a task-agnostic, open-vocabulary vision-language matching problem. The best-known example is CLIP, which reformulates image classification as image-text matching; this lets the model learn directly from raw image-text data and achieve strong zero-shot results on open-vocabulary classification tasks. In GLIPv2, the semantic classification linear layer in every conventional single-modality vision model is replaced with a vision-language matching dot-product layer.
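To make the classification-to-matching idea concrete, here is a minimal PyTorch sketch; this is not the actual GLIPv2 implementation, and the layer names, projection layers, and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VLMatchingHead(nn.Module):
    """Replaces a fixed-vocabulary linear classifier with a
    vision-language matching dot product (classification-to-matching).

    Instead of `logits = Linear(num_classes)(region_feats)`, class logits
    become similarities between region features and text (word) features."""

    def __init__(self, vision_dim: int, text_dim: int, embed_dim: int = 256):
        super().__init__()
        # Hypothetical projections into a shared embedding space.
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, region_feats: torch.Tensor,
                word_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, vision_dim), detector region embeddings
        # word_feats:   (num_words, text_dim), prompt token embeddings
        v = self.vision_proj(region_feats)   # (num_regions, embed_dim)
        t = self.text_proj(word_feats)       # (num_words, embed_dim)
        # Matching scores: one logit per (region, word) pair
        # instead of one logit per fixed class.
        return v @ t.t()                     # (num_regions, num_words)

# Usage: "class logits" come from matching regions against prompt words.
head = VLMatchingHead(vision_dim=1024, text_dim=768)
scores = head(torch.randn(100, 1024), torch.randn(7, 768))
print(scores.shape)  # torch.Size([100, 7])
```

Because the vocabulary lives in the text prompt rather than in the classifier weights, swapping vocabularies requires no new parameters, which is what enables open-vocabulary and zero-shot transfer.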

GLIPv2 Pre-training
GLIPv2 is pre-trained with three losses: the phrase grounding loss Lground, a VL reformulation of the object detection task; the region-word contrastive loss Linter from the novel region-word level contrastive learning task; and the standard masked language modeling loss Lmlm from BERT.
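Based on this description, the overall pre-training objective is presumably the sum of the three terms; the decomposition of the grounding term into a localization loss plus the intra-image alignment loss Lintra is assumed from the figure caption below:

```latex
\mathcal{L}_{\mathrm{GLIPv2}}
  = \underbrace{\mathcal{L}_{\mathrm{loc}} + \mathcal{L}_{\mathrm{intra}}}_{\mathcal{L}_{\mathrm{ground}}}
  + \mathcal{L}_{\mathrm{inter}}
  + \mathcal{L}_{\mathrm{mlm}}
```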

Transfer GLIPv2 to Localization and VL Tasks
We introduce two simple ways of transferring GLIPv2 to various downstream tasks. Moreover, GLIPv2 adds grounding ability to traditional VL tasks (e.g., VQA), effectively turning every task we consider into a "grounded VL understanding" task.
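As a sketch of how a fixed detection vocabulary becomes a language-guided detection query, the categories can simply be joined into one text prompt; the helper below is hypothetical, and the period-delimited format follows the common GLIP-style prompt convention:

```python
def vocabulary_to_prompt(class_names: list[str]) -> str:
    """Join detection category names into a single text prompt so the
    matching head can score each region against each prompt phrase.
    Hypothetical helper; separator follows the GLIP-style convention
    of period-delimited phrases."""
    return ". ".join(class_names) + "."

# Zero-shot transfer to a new vocabulary is just a new prompt:
print(vocabulary_to_prompt(["person", "bicycle", "traffic light"]))
# person. bicycle. traffic light.
```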

GLIPv2 pre-training losses: the intra-image alignment loss Lintra (right) takes features after VL fusion and computes the loss over region-word pairs within each image-text pair; the inter-image contrastive loss Linter (left) takes features before VL fusion and computes the loss over all region-word pairs across a batch of image-text pairs. Label propagation is used to determine the off-diagonal blocks of the Linter target matrix.
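A minimal sketch of the batch-level region-word contrastive idea follows. It is illustrative only: the target matrix is passed in directly (in GLIPv2 its off-diagonal blocks come from label propagation), and a simple binary cross-entropy stands in for whatever exact formulation the paper uses:

```python
import torch
import torch.nn.functional as F

def inter_image_contrastive_loss(region_feats: torch.Tensor,
                                 word_feats: torch.Tensor,
                                 target: torch.Tensor) -> torch.Tensor:
    """Illustrative inter-image region-word contrastive loss.

    region_feats: (R, D) region features pooled from the whole batch (pre-fusion)
    word_feats:   (W, D) word features pooled from the whole batch (pre-fusion)
    target:       (R, W) 0/1 matrix marking positive region-word pairs;
                  in GLIPv2 the off-diagonal blocks come from label propagation.
    """
    # Similarity of every cross-batch (region, word) pair.
    logits = region_feats @ word_feats.t()   # (R, W)
    return F.binary_cross_entropy_with_logits(logits, target.float())

# Toy usage with random features: R regions and W words across a batch.
R, W, D = 8, 6, 256
loss = inter_image_contrastive_loss(torch.randn(R, D), torch.randn(W, D),
                                    torch.randint(0, 2, (R, W)))
print(loss.item())
```

Because the negatives come from other image-text pairs in the batch, this loss supplies many more contrastive pairs than the within-image alignment loss alone.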
04 Experiments and Visualization


