"Object Detection" + "Visual Understanding": Understanding Input Images
2022-07-01 03:33:00 【AI视觉网奇】
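This work proposes GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and vision-language (VL) understanding tasks (e.g., VQA, image captioning).
Paper: https://arxiv.org/pdf/2206.05836.pdf
Code: https://github.com/microsoft/GLIP
The smallest pre-trained model is 2.5 GB.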
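01 Overview
GLIPv2 unifies localization pre-training and vision-language pre-training (VLP) through three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a new region-word level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also lets localization and understanding tasks benefit each other. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on a variety of localization and understanding tasks. The model also shows:
strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks;
superior grounding capability on VL understanding tasks.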
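02 Background
Recently, there has been broad interest in building general-purpose vision systems, also called vision foundation models, that can solve a variety of vision tasks at once, such as image classification, object detection, and vision-language (VL) understanding. Of particular interest is the unification of localization tasks (e.g., object detection and segmentation) and VL understanding tasks (e.g., VQA and image captioning).
Localization pre-training benefits VL tasks, and the two-stage "localization->VLP" pre-training procedure is common practice in the VL community. A long-standing challenge is to unify localization and understanding so that the two kinds of tasks benefit each other, the pre-training procedure becomes simpler, and the pre-training cost goes down.
However, the two kinds of tasks appear very different: localization tasks are vision-only and require fine-grained outputs (e.g., bounding boxes or pixel masks), whereas VL understanding tasks emphasize fusion between the two modalities and require high-level semantic outputs (e.g., answers or captions).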
03 New Framework

Left: GLIPv2, a pre-trained grounded VL understanding model, unifies various localization and VL understanding tasks. These two kinds of tasks mutually benefit each other and enable new capabilities such as language-guided detection/segmentation and grounded VQA/captioning. Right: Additional examples from ODinW (detection), LVIS (segmentation), VQA, and COCO Captioning.
A Unified VL Formulation and Architecture
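At the core of GLIPv2's unified formulation is the classification-to-matching trick, which reformulates any task-specific, fixed-vocabulary classification problem as a task-agnostic, open-vocabulary vision-language matching problem. The best-known example is CLIP's reformulation of image classification as image-text matching, which lets the model learn directly from raw image-text data and achieve strong zero-shot results on open-vocabulary classification tasks. In GLIPv2, every semantic classification linear layer in a traditional single-modality vision model is replaced with a vision-language matching dot-product layer.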
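The classification-to-matching idea can be illustrated with a minimal sketch. It is not GLIPv2's implementation: the 3-dimensional embedding table `TOY_TEXT_EMBEDDINGS` and the helpers `encode_prompt` and `classify_regions` are hypothetical stand-ins for a real language encoder and detection head. The point is only that once class scores come from dot products with text embeddings, extending the vocabulary means adding a phrase to the prompt rather than retraining a linear layer.

```python
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def match_logits(region_feats: List[Vector], text_feats: List[Vector]) -> List[List[float]]:
    """Alignment scores S = O @ P^T: one row per region, one column per phrase."""
    return [[dot(o, p) for p in text_feats] for o in region_feats]


# Hypothetical stand-in for a language encoder: hand-picked 3-d embeddings.
TOY_TEXT_EMBEDDINGS: Dict[str, Vector] = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "umbrella": [0.0, 0.0, 1.0],
}


def encode_prompt(categories: List[str]) -> List[Vector]:
    """Look up one embedding per category name in the prompt."""
    return [TOY_TEXT_EMBEDDINGS[c] for c in categories]


def classify_regions(region_feats: List[Vector], categories: List[str]) -> List[str]:
    """Assign each region the category whose text embedding scores highest."""
    logits = match_logits(region_feats, encode_prompt(categories))
    labels = []
    for row in logits:
        best = max(range(len(row)), key=lambda j: row[j])
        labels.append(categories[best])
    return labels


# Two made-up region features; the second one points along the "umbrella" axis.
regions = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.8]]

# With a two-word prompt the second region is forced into "dog".
closed_vocab = classify_regions(regions, ["cat", "dog"])
# Adding a word to the prompt changes the label set without touching any weights.
open_vocab = classify_regions(regions, ["cat", "dog", "umbrella"])
```

A fixed linear classifier would need a new output row, and retraining, to recognize "umbrella"; here the same region features are simply scored against one more text embedding.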

GLIPv2 Pre-training
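GLIPv2 is pre-trained with three pre-training losses: the phrase grounding loss Lground, which comes from the vision-language reformulation of the object detection task; the region-word contrastive loss Linter, which comes from the new region-word level contrastive learning task; and the standard masked language modeling loss Lmlm proposed in BERT.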
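Read as a single objective, the three losses can be summarized as below. The unweighted sum is an assumption made here for illustration, since the text does not say how the terms are weighted.

```latex
\mathcal{L}_{\text{GLIPv2}}
  = \mathcal{L}_{\text{ground}}
  + \mathcal{L}_{\text{inter}}
  + \mathcal{L}_{\text{mlm}}
```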

Transfer GLIPv2 to Localization and VL Tasks
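We introduce two ways to easily transfer GLIPv2 to a variety of downstream tasks. In addition, GLIPv2 can perform conventional VL tasks (e.g., VQA) while localizing at the same time, effectively turning every task we consider into a "grounded VL understanding" task.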

GLIPv2 pre-training losses: the intra-image alignment loss Lintra (right) takes features after VL fusion and computes the loss over region-word pairs within each image-text pair; the inter-image contrastive loss Linter (left) takes features before VL fusion and computes the loss over all region-word pairs across a batch of image-text pairs. Label propagation is used to determine the off-diagonal blocks of the Linter target matrix.
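The batch-level loss described in the caption can be sketched as follows. This is a toy illustration under stated assumptions, not the released code: label propagation is approximated by exact string match between a region's annotated phrase and phrases in other images' texts, the loss is a softmax cross-entropy taken along both axes of the score matrix, and the names `build_targets`, `row_cross_entropy`, and `inter_image_loss` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def build_targets(region_labels: List[List[str]], phrases: List[List[str]]) -> Matrix:
    """Target matrix over ALL regions (rows) and ALL phrases (columns) in a batch.

    region_labels[b][i] is the phrase annotated for region i of image b;
    phrases[b][j] is the j-th phrase in the text paired with image b.
    Assumption: a region is a positive for any phrase with the same string,
    including phrases from other images (the off-diagonal blocks).
    """
    flat_regions = [lab for labs in region_labels for lab in labs]
    flat_phrases = [p for ps in phrases for p in ps]
    return [[1.0 if lab == p else 0.0 for p in flat_phrases] for lab in flat_regions]


def similarity(region_feats: Matrix, word_feats: Matrix) -> Matrix:
    """Dot-product scores between every region and every phrase in the batch."""
    return [[sum(x * y for x, y in zip(o, w)) for w in word_feats] for o in region_feats]


def row_cross_entropy(scores: Matrix, targets: Matrix) -> float:
    """Mean softmax cross-entropy over rows that have at least one positive."""
    total, count = 0.0, 0
    for s_row, t_row in zip(scores, targets):
        n_pos = sum(t_row)
        if n_pos == 0:
            continue  # no positive in this row, so it contributes nothing
        m = max(s_row)
        log_z = m + math.log(sum(math.exp(s - m) for s in s_row))
        total += -sum(t / n_pos * (s - log_z) for s, t in zip(s_row, t_row))
        count += 1
    return total / count if count else 0.0


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def inter_image_loss(scores: Matrix, targets: Matrix) -> float:
    """Region-to-phrase plus phrase-to-region cross-entropy."""
    return row_cross_entropy(scores, targets) + row_cross_entropy(
        transpose(scores), transpose(targets)
    )


# Batch of two image-text pairs with made-up annotations.
region_labels = [["cat"], ["cat", "dog"]]
phrases = [["cat", "sofa"], ["dog", "cat"]]
targets = build_targets(region_labels, phrases)

# Made-up embeddings for the flattened phrases: cat, sofa, dog, cat.
word_feats = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0], [0.0, 4.0, 0.0], [4.0, 0.0, 0.0]]
aligned_regions = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
misaligned_regions = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

aligned_loss = inter_image_loss(similarity(aligned_regions, word_feats), targets)
misaligned_loss = inter_image_loss(similarity(misaligned_regions, word_feats), targets)
```

In `targets`, the first row (the "cat" region of image 0) is positive for the "cat" phrase of image 1 as well as its own, which is the off-diagonal entry that label propagation fills in. Without it, the same concept appearing in another image would be treated as a negative.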
04 Experiments and Visualizations


