
"Target detection" + "visual understanding" to realize the understanding and translation of the input image (with source code)

2022-07-01 11:01:00 Computer Vision Research Institute


Paper: https://arxiv.org/pdf/2206.05836.pdf

Code: https://github.com/microsoft/GLIP

Computer Vision Research Institute column

Author: Edison_G

This work proposes GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and vision-language (VL) understanding tasks (e.g., VQA, image captioning).

01

Overview

GLIPv2 elegantly unifies localization pre-training and vision-language pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a new region-word-level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefit between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves SoTA performance on various localization and understanding tasks. The model also shows:

  • strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks;

  • superior grounding ability on VL understanding tasks.

02

Background

Recently, there has been broad interest in building general-purpose vision systems, also known as vision foundation models, which can solve a variety of visual tasks at the same time, such as image classification, object detection, and vision-language (VL) understanding. Of particular interest is the unification of localization tasks (e.g., object detection and segmentation) and VL understanding tasks (e.g., VQA and image captioning).

Localization pre-training benefits VL tasks, and the two-stage "localization -> VLP" pre-training procedure is common practice in the VL community. A long-standing challenge is unifying localization and understanding, which aims to achieve mutual benefit between these two kinds of tasks, simplify the pre-training procedure, and reduce pre-training cost.

However, the two kinds of tasks seem very different: localization tasks are vision-only and require fine-grained outputs (e.g., bounding boxes or pixel masks), whereas VL understanding tasks emphasize the fusion of the two modalities and require high-level semantic outputs (e.g., answers or captions).

03

New Framework


Left: GLIPv2, a pre-trained grounded VL understanding model, unifies various localization and VL understanding tasks. These two kinds of tasks mutually benefit each other, and enables new capabilities such as language-guided detection/segmentation and grounded VQA/captioning. Right: Additional examples from ODinW (detection), LVIS (segmentation), VQA, and COCO Captioning.

A Unified VL Formulation and Architecture

The core of GLIPv2's unified formulation is the classification-to-matching technique, which reformulates any task-specific fixed-vocabulary classification problem as a task-agnostic open-vocabulary vision-language matching problem. The best-known example is CLIP, which recasts image classification as image-text matching; this allows the model to learn directly from raw image-text data and achieve strong zero-shot results on open-vocabulary classification tasks. In GLIPv2, every semantic classification linear layer in a traditional uni-modal vision model is replaced with a vision-language matching dot-product layer.
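As a rough illustration of the classification-to-matching idea, here is a minimal PyTorch-style sketch (not the official GLIP/GLIPv2 code; the class and parameter names are invented for the example). The fixed-vocabulary linear head is replaced by a dot product between projected region features and word features from a text encoder:

```python
import torch
import torch.nn as nn

class LinearClassificationHead(nn.Module):
    """Traditional fixed-vocabulary head: one weight row per class."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, region_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, feat_dim) -> (num_regions, num_classes)
        return self.fc(region_feats)

class VLMatchingHead(nn.Module):
    """Open-vocabulary head: logits are dot products between projected
    region features and word/phrase features from a text encoder."""
    def __init__(self, feat_dim: int, text_dim: int, proj_dim: int = 256):
        super().__init__()
        self.region_proj = nn.Linear(feat_dim, proj_dim)
        self.word_proj = nn.Linear(text_dim, proj_dim)

    def forward(self, region_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # region_feats: (num_regions, feat_dim), word_feats: (num_words, text_dim)
        r = self.region_proj(region_feats)      # (num_regions, proj_dim)
        w = self.word_proj(word_feats)          # (num_words, proj_dim)
        return r @ w.t()                        # (num_regions, num_words) matching scores
```

Because the "classes" are now just words in the text prompt, the same head can serve detection (prompt = a list of category names) and grounding (prompt = a free-form caption).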


GLIPv2 Pre-training

GLIPv2 is pre-trained with three losses: the phrase grounding loss Lground, a vision-language reformulation of the object detection task; the region-word contrastive loss Linter, from the new region-word-level contrastive learning task; and the standard masked language modeling loss Lmlm introduced in BERT.
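If the three terms are simply combined by summation (a straightforward reading of the description above; any loss weighting used in the actual implementation is not spelled out here), the overall pre-training objective can be written as:

L = Lground + Linter + Lmlm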


Transfer GLIPv2 to Localization and VL Tasks

We introduce two simple ways to transfer GLIPv2 to various downstream tasks. In addition, GLIPv2 brings grounding ability to traditional VL tasks (e.g., VQA), effectively turning every task we consider into a "grounded VL understanding" task.


GLIPv2 pre-training losses: the intra-image alignment loss Lintra (right) takes features after VL fusion and computes the loss over region-word pairs within each image-text pair; the inter-image contrastive loss Linter (left) takes features before VL fusion and computes the loss over all region-word pairs across a batch of image-text pairs. Label propagation is used to determine the off-diagonal blocks of the Linter target matrix.
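To make the inter-image term concrete, here is a minimal sketch (for illustration only, not the released GLIPv2 code). It assumes the pre-fusion region and word features and a 0/1 target matrix are already available, and it uses a symmetric soft-label cross entropy, which requires torch >= 1.10:

```python
import torch
import torch.nn.functional as F

def inter_image_contrastive_loss(region_feats: torch.Tensor,
                                 word_feats: torch.Tensor,
                                 target: torch.Tensor) -> torch.Tensor:
    """Illustrative batch-level region-word contrastive loss.

    region_feats: (R, D) region features from all images in the batch, taken before VL fusion
    word_feats:   (W, D) word features from all captions in the batch, taken before VL fusion
    target:       (R, W) 0/1 matrix of positive region-word pairs; diagonal blocks come from
                  grounding annotations, off-diagonal blocks from label propagation
    """
    target = target.float()
    logits = region_feats @ word_feats.t()        # (R, W) similarity scores

    # Turn each row/column of positives into a distribution, then apply symmetric
    # soft-label cross entropy: regions vs. all words, and words vs. all regions.
    tgt_rw = target / target.sum(dim=1, keepdim=True).clamp(min=1.0)
    tgt_wr = target.t() / target.t().sum(dim=1, keepdim=True).clamp(min=1.0)
    loss_rw = F.cross_entropy(logits, tgt_rw)
    loss_wr = F.cross_entropy(logits.t(), tgt_wr)
    return 0.5 * (loss_rw + loss_wr)
```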

04

Experiments and Visualization


 THE END 

Please contact the official account for reprint authorization.


The Computer Vision Research Institute learning group is waiting for you to join!

ABOUT

Computer Vision Research Institute

The Computer Vision Research Institute works mainly in the field of deep learning, and is devoted to research directions such as face detection, face recognition, multi-object detection, object tracking, and image segmentation. The institute will continue to share the latest papers, algorithms, and frameworks. What is different in this round of changes is that we will focus on "research": going forward we will also share the hands-on practice for each field, so that we can truly move beyond pure theory and develop the habit of programming with our hands and thinking with our heads!

VX:2311123606



Original site

Copyright notice
This article was created by [Computer Vision Research Institute]. Please include a link to the original when reprinting. Thank you.
https://yzsam.com/2022/182/202207011054152722.html