"Object detection" + "visual understanding": understanding and translating an input image (with source code)
2022-07-01 11:01:00 【Computer Vision Research Institute】

Paper: https://arxiv.org/pdf/2206.05836.pdf
Code: https://github.com/microsoft/GLIP
Computer Vision Institute column
Author: Edison_G
This article introduces GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and vision-language (VL) understanding tasks (e.g., VQA, image captioning).
01
Summary
GLIPv2 elegantly unifies localization pre-training and vision-language pre-training (VLP) through three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word-level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefit between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows:
Strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks;
Superior grounding capability on VL understanding tasks.
02
Background
Recently, there has been broad interest in building general-purpose vision systems, also called vision foundation models, that can solve various vision tasks simultaneously, such as image classification, object detection, and vision-language (VL) understanding. Of particular interest is the unification of localization tasks (e.g., object detection and segmentation) and VL understanding tasks (e.g., VQA and image captioning).
Localization pre-training benefits VL tasks, and the two-stage "localization -> VLP" pre-training procedure is common practice in the VL community. A long-standing challenge is the unification of localization and understanding, which aims to achieve mutual benefit between the two kinds of tasks, simplify the pre-training procedure, and reduce pre-training cost.
However, the two kinds of tasks appear very different: localization tasks are vision-only and require fine-grained output (e.g., bounding boxes or pixel masks), whereas VL understanding tasks emphasize fusion between the two modalities and require high-level semantic output (e.g., answers or captions).
03
New framework

Left: GLIPv2, a pre-trained grounded VL understanding model, unifies various localization and VL understanding tasks. These two kinds of tasks mutually benefit each other and enable new capabilities such as language-guided detection/segmentation and grounded VQA/captioning. Right: Additional examples from ODinW (detection), LVIS (segmentation), VQA, and COCO Captioning.
A Unified VL Formulation and Architecture
The core of GLIPv2's unified formulation is the classification-to-matching trick, which reformulates any task-specific, fixed-vocabulary classification problem as a task-agnostic, open-vocabulary vision-language matching problem. The best example is CLIP's reformulation of image classification as image-text matching, which lets the model learn directly from raw image-text data and achieve strong zero-shot results on open-vocabulary classification tasks. In GLIPv2, every semantic classification linear layer in a traditional single-modality vision model is replaced with a vision-language matching dot-product layer.
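To make the classification-to-matching idea concrete, here is a minimal NumPy sketch. It is not code from the GLIP repository: the function names `matching_logits` and `classify_regions`, the toy vocabulary, and the hand-written embeddings are all hypothetical, and a real model would obtain the region and word features from trained image and text encoders.

```python
import numpy as np

def matching_logits(region_feats: np.ndarray, word_feats: np.ndarray) -> np.ndarray:
    """Score every region against every word with a dot product.

    region_feats: (num_regions, dim) visual features (hypothetical).
    word_feats:   (num_words, dim) text features (hypothetical).
    Returns a (num_regions, num_words) matrix of alignment scores.
    """
    return region_feats @ word_feats.T

def classify_regions(region_feats, word_feats, vocabulary):
    """Pick the best-matching word for each region.

    The vocabulary is supplied at call time rather than baked into a
    fixed-size linear layer, which is what makes the setup open-vocabulary.
    """
    logits = matching_logits(region_feats, word_feats)
    return [vocabulary[i] for i in logits.argmax(axis=1)]

# Toy example: three category words with one-hot "text embeddings".
vocabulary = ["person", "dog", "bicycle"]
word_feats = np.eye(3)
region_feats = np.array([
    [0.1, 0.9, 0.0],   # most similar to "dog"
    [0.8, 0.1, 0.1],   # most similar to "person"
])

print(classify_regions(region_feats, word_feats, vocabulary))  # ['dog', 'person']
```

Swapping in a different vocabulary only changes `word_feats`; no classifier weights need to be resized, which is the practical benefit of replacing a linear classification layer with a matching layer.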

GLIPv2 Pre-training
GLIPv2 is pre-trained with three pre-training losses: the phrase grounding loss Lground, from a vision-language reformulation of the object detection task; the region-word contrastive loss Linter, from a novel region-word-level contrastive learning task; and the standard masked language modeling loss Lmlm proposed in BERT.
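Written as a formula, the three losses named above combine into a single pre-training objective. The unweighted sum below is an assumption made for illustration; the text here does not state how the terms are weighted.

```latex
\mathcal{L}_{\text{GLIPv2}}
  = \mathcal{L}_{\text{ground}}
  + \mathcal{L}_{\text{inter}}
  + \mathcal{L}_{\text{mlm}}
```

Here \mathcal{L}_{\text{ground}}, \mathcal{L}_{\text{inter}}, and \mathcal{L}_{\text{mlm}} correspond to Lground, Linter, and Lmlm in the text.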

Transfer GLIPv2 to Localization and VL Tasks
We introduce two simple ways to transfer GLIPv2 to various downstream tasks. In addition, GLIPv2 can perform traditional VL tasks (e.g., VQA) while remaining grounded, effectively turning every task we consider into a "grounded VL understanding" task.

GLIPv2 pre-training losses: the intra-image alignment loss Lintra (right) takes features after VL fusion and computes the loss over region-word pairs within each image-text pair; the inter-image contrastive loss Linter (left) takes features before VL fusion and computes the loss over all region-word pairs across a batch of image-text pairs. Label propagation is used to determine the off-diagonal blocks of the Linter target matrix.
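The difference between the two losses in the caption comes down to which region-word pairs get scored. The NumPy sketch below only illustrates the shapes involved; it is not the GLIP implementation. The function names are hypothetical, the features are random, and the same toy features are used for both functions, whereas the caption states that Lintra uses post-fusion features and Linter uses pre-fusion features. The loss computation and the label propagation step are not shown.

```python
import numpy as np

def intra_image_logits(regions: np.ndarray, words: np.ndarray) -> np.ndarray:
    """Region-word scores within each image-text pair.

    regions: (batch, num_regions, dim), words: (batch, num_words, dim).
    Returns (batch, num_regions, num_words): each image's regions are
    compared only with the words of its own caption.
    """
    return np.einsum("bnd,bmd->bnm", regions, words)

def inter_image_logits(regions: np.ndarray, words: np.ndarray) -> np.ndarray:
    """Region-word scores across the whole batch.

    Returns (batch * num_regions, batch * num_words): every region is
    compared with every word from every caption in the batch.
    """
    batch, num_regions, dim = regions.shape
    num_words = words.shape[1]
    flat_regions = regions.reshape(batch * num_regions, dim)
    flat_words = words.reshape(batch * num_words, dim)
    return flat_regions @ flat_words.T

# Toy batch: 2 image-text pairs, 3 regions and 5 words each, 4-dim features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(2, 3, 4))
words = rng.normal(size=(2, 5, 4))

intra = intra_image_logits(regions, words)   # shape (2, 3, 5)
inter = inter_image_logits(regions, words)   # shape (6, 10)
print(intra.shape, inter.shape)
```

With identical input features, the diagonal blocks of `inter` reproduce `intra`; the off-diagonal blocks are the extra cross-image pairs whose targets, according to the caption, are set by label propagation.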
04
Experiment and visualization



