Reproducing DETR and DAB-DETR with EasyCV: the right way to use object queries
2022-07-25 18:59:00 【51CTO】
DETR is one of the most novel object detection frameworks of recent years and the first truly end-to-end detection algorithm. It dispenses with tedious components such as RPN, anchors, and NMS: an image goes in and detection boxes come out. DETR's success is mainly due to the strong modeling capacity of the Transformer and to the Hungarian matching algorithm, which solves the problem of learning a one-to-one match between predicted boxes and ground-truth boxes.
Although DETR reaches accuracy comparable to Mask R-CNN, it has been criticized for needing 500 training epochs, for slow convergence, and for low accuracy on small objects. A series of follow-up works address these issues. One of the most notable is Deformable DETR, which has also become a must-have in detection today. Its contribution is not just extending deformable convolution to the Transformer. More importantly, it provides many useful techniques for training DETR-style detectors: a two-stage scheme modeled on Mask R-CNN, splitting the query embedding into content and reference points, extending DETR to multi-scale training, and predicting boxes via look forward once. After Deformable DETR, everyone seemed to have found the right way to use the DETR framework. In particular, the questions of what an object query means and how to use it better for detection have produced a lot of valuable work, such as Anchor DETR and Conditional DETR, among which DAB-DETR is especially thorough. DAB-DETR treats the object query as two parts, content and reference points, where the reference points are represented explicitly as a four-dimensional xywh vector. The decoder predicts xywh residuals to update the detection box iteratively, and the xywh vector is used to introduce positional attention, which helps DETR converge faster. This article reproduces DETR and DAB-DETR with EasyCV and explains in detail how to use object queries correctly to improve the performance of DETR-style detectors.
DETR
DETR uses a set loss function as the supervision signal for end-to-end training and predicts all objects at once. The set loss function uses a bipartite matching algorithm to match predicted (pred) objects with ground-truth (gt) objects. Treating object detection directly as a set prediction problem keeps training simple and avoids anchors, NMS, and similar components.
DETR makes two main contributions: the architecture and the set prediction loss.
1.Architecture

DETR first uses a CNN to embed the input image into a two-dimensional representation, then flattens it into a one-dimensional sequence and feeds it, together with the positional encoding, into the encoder. The decoder takes a small, fixed number of learned object queries (which can be understood as positional embeddings) and the encoder output as input. Finally, each output embedding of the decoder is passed to a shared feed-forward network (FFN), which predicts either a detection (class and bounding box) or the "no object" class.
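To make the data flow above concrete, here is a small sketch that only traces tensor shapes through a DETR-style pipeline (batch dimension omitted). The function name `detr_shapes` is hypothetical, and the default values (a backbone stride of 32, a hidden size of 256, 91 classes) are assumptions chosen for illustration, not settings taken from this article.

```python
def detr_shapes(height, width, stride=32, d_model=256,
                num_queries=100, num_classes=91):
    """Trace tensor shapes through a DETR-style pipeline (no batch dimension).

    All default values are illustrative assumptions.
    """
    # The CNN backbone downsamples the image by `stride` (ceil division).
    fh = -(-height // stride)
    fw = -(-width // stride)
    return {
        # 2D representation produced by the CNN
        "feature_map": (d_model, fh, fw),
        # flattened to a 1D sequence of tokens for the encoder
        "encoder_tokens": (fh * fw, d_model),
        # a small fixed number of learned object queries
        "object_queries": (num_queries, d_model),
        # one output embedding per query
        "decoder_output": (num_queries, d_model),
        # shared FFN heads: classes plus one extra "no object" class
        "class_logits": (num_queries, num_classes + 1),
        "boxes": (num_queries, 4),
    }


if __name__ == "__main__":
    for name, shape in detr_shapes(800, 1216).items():
        print(f"{name:16s} {shape}")
```

Note that the number of predictions depends only on the number of object queries, not on the image size.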
1.1 Transformer

1.1.1 Encoder
The feature map output by the backbone is flattened into a one-dimensional representation, giving a $d \times HW$ feature map, which is combined with the positional encoding and used as the encoder input. Each encoder layer consists of Multi-Head Self-Attention and an FFN. Unlike the standard Transformer encoder, and because the encoder is permutation invariant, DETR adds the positional encoding to every Multi-Head Self-Attention layer to preserve the position sensitivity that object detection needs.
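As an illustration of what such a positional encoding can look like, the sketch below builds a fixed sine/cosine encoding for one cell of the feature map. It is a simplified version under stated assumptions: the function names are hypothetical, the temperature of 10000 and the 256-dimensional output are assumed defaults, and practical implementations usually also normalize the row and column indices first.

```python
import math


def sine_encoding(pos, dim=128, temperature=10000.0):
    """Encode one scalar position as `dim` interleaved sin/cos values."""
    out = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        out.append(math.sin(pos / freq))
        out.append(math.cos(pos / freq))
    return out


def position_encoding_2d(row, col, dim=256):
    """Concatenate the encodings of a cell's row index and column index."""
    half = dim // 2
    return sine_encoding(row, half) + sine_encoding(col, half)


if __name__ == "__main__":
    pe = position_encoding_2d(3, 5)
    print(len(pe), pe[:4])
```

Because the attention layers themselves do not know where a token came from, adding a vector like this to the queries and keys of each attention layer is what lets the model tell two cells with identical content apart.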
1.1.2 Decoder
Because the decoder is also permutation invariant, the N object queries (which can be understood as learned positional embeddings for different objects) must differ from one another in order to produce embeddings for different objects, and they are added to every Multi-Head Attention layer.
The decoder transforms the N object queries into N output embeddings, and the FFN then independently decodes each output embedding into one of N predictions, each containing a box and a class. Because both Self-Attention and Encoder-Decoder Attention are applied to the input embeddings, the model can reason globally using the relations between objects. Unlike the standard Transformer decoder, each DETR decoder layer outputs the N objects in parallel, whereas the standard Transformer decoder is autoregressive and outputs N objects serially, predicting only one element of the output sequence at a time.
1.1.3 FFN
The FFN consists of a 3-layer perceptron and one linear projection layer. It predicts the normalized center coordinates, height, and width of each box, together with its class. DETR predicts a fixed-size set of N boxes, where N is usually larger than the actual number of objects (DETR defaults to 100 and DAB-DETR to 300), and an extra empty class indicates that a predicted box contains no object.
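A short sketch of how such outputs might be turned into final detections: convert the normalized (cx, cy, w, h) boxes into pixel corner coordinates and drop the queries whose most likely class is the empty class. The function names are hypothetical, and placing the empty class at the last index is an assumption made for this example. Real post-processing often applies a score threshold instead of a plain argmax.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalised (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


def decode_predictions(class_probs, boxes, img_w, img_h):
    """Keep queries whose most likely class is not the empty class.

    Assumption: the last entry of each probability vector is "no object".
    """
    detections = []
    for probs, box in zip(class_probs, boxes):
        label = max(range(len(probs)), key=probs.__getitem__)
        if label == len(probs) - 1:
            continue  # this query predicts "no object"
        detections.append((label, probs[label],
                           cxcywh_to_xyxy(box, img_w, img_h)))
    return detections


if __name__ == "__main__":
    probs = [[0.1, 0.8, 0.1], [0.05, 0.05, 0.9]]
    boxes = [(0.5, 0.5, 0.25, 0.5), (0.1, 0.1, 0.1, 0.1)]
    print(decode_predictions(probs, boxes, img_w=200, img_h=100))
```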
2.Set prediction loss
The main difficulty in training DETR is how to score the predictions (class, position, number) against the gt. The loss proposed by DETR first produces an optimal bipartite matching between pred and gt (fixing a one-to-one correspondence between them) and then optimizes the loss. Let $y$ denote the set of gt objects and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, $y$ can be regarded as a set of size $N$ padded with the empty class $\varnothing$ (no object). We search over the permutations $\sigma \in \mathfrak{S}_N$ of $N$ elements that pair up the two sets. The permutation with the smallest loss is the optimal bipartite matching:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$$

where $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ is the matching loss between the gt element $y_i$ and the prediction with index $\sigma(i)$. The matching is computed with the Hungarian algorithm. The matching loss takes into account the accuracy of both the pred class and the pred box. Each gt element $i$ can be viewed as $y_i = (c_i, b_i)$, where $c_i$ is the class label (possibly the empty class) and $b_i$ is the gt box. For the prediction with index $\sigma(i)$, the predicted probability of class $c_i$ is written $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box is written $\hat{b}_{\sigma(i)}$.
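The matching step can be illustrated with a tiny brute-force search over all permutations. This is only a sketch for small N: it is not the Hungarian algorithm itself (practical code typically calls a solver such as `scipy.optimize.linear_sum_assignment`), and the pair cost used here, negative class probability plus an L1 box distance, is a simplified assumption. The names `match_cost` and `best_assignment` are hypothetical.

```python
from itertools import permutations


def match_cost(gt, pred):
    """Cost of pairing one gt with one prediction (simplified assumption).

    gt is (label, box); pred is (class_probs, box). Padded "no object"
    targets use label None and contribute no cost.
    """
    label, box = gt
    if label is None:
        return 0.0
    probs, pred_box = pred
    l1 = sum(abs(a - b) for a, b in zip(box, pred_box))
    return -probs[label] + l1


def best_assignment(gts, preds):
    """Return sigma such that preds[sigma[i]] is matched to gts[i].

    The gt list is padded with "no object" entries up to len(preds),
    then every permutation is scored and the cheapest one is returned.
    """
    n = len(preds)
    padded = list(gts) + [(None, None)] * (n - len(gts))

    def total(sigma):
        return sum(match_cost(padded[i], preds[sigma[i]]) for i in range(n))

    return min(permutations(range(n)), key=total)


if __name__ == "__main__":
    gts = [(0, (0.5, 0.5, 0.2, 0.2)), (1, (0.1, 0.1, 0.1, 0.1))]
    preds = [([0.1, 0.9], (0.1, 0.1, 0.1, 0.1)),
             ([0.9, 0.1], (0.5, 0.5, 0.2, 0.2)),
             ([0.2, 0.8], (0.9, 0.9, 0.1, 0.1))]
    print(best_assignment(gts, preds))
```

The brute-force search costs N! evaluations, which is why a polynomial-time assignment solver is used in practice. The result is the same kind of object: one prediction index per (padded) gt slot.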
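The first step finds the one-to-one matching between pred and gt. The second step computes the Hungarian loss over the matched pairs:

$$\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \, \mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

Here $\hat{\sigma}$ is the matching found in the first step, $c_i$ and $b_i$ are the class and box of the $i$-th gt element, and $\hat{p}_{\hat{\sigma}(i)}$ and $\hat{b}_{\hat{\sigma}(i)}$ are the class probabilities and box of the prediction matched to it. $\mathcal{L}_{box}$ combines the L1 loss and the generalized IoU loss:

$$\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{iou} \, \mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1} \, \lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$$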
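The two-part loss can be sketched in plain Python as follows. The weights `lambda_iou=2.0` and `lambda_l1=5.0` are assumed example values, the empty class is assumed to be the last class index, and the sketch omits the down-weighting of the empty-class term that implementations commonly apply. Boxes are normalized (cx, cy, w, h) tuples and all function names are hypothetical.

```python
import math


def to_xyxy(box):
    """Convert (cx, cy, w, h) to corner format (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(a, b):
    """Generalized IoU of two (cx, cy, w, h) boxes, in [-1, 1]."""
    ax0, ay0, ax1, ay1 = to_xyxy(a)
    bx0, by0, bx1, by1 = to_xyxy(b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    # Smallest axis-aligned box enclosing both inputs
    hull = (max(ax1, bx1) - min(ax0, bx0)) * (max(ay1, by1) - min(ay0, by0))
    return inter / union - (hull - union) / hull


def box_loss(gt_box, pred_box, lambda_iou=2.0, lambda_l1=5.0):
    """Weighted sum of the generalized IoU loss and the L1 distance."""
    l1 = sum(abs(g - p) for g, p in zip(gt_box, pred_box))
    return lambda_iou * (1.0 - giou(gt_box, pred_box)) + lambda_l1 * l1


def hungarian_loss(gts, preds, sigma):
    """Sum the class and box terms over the pairs fixed by `sigma`.

    gts: list of (label, box); preds: list of (class_probs, box);
    sigma[i] is the prediction index matched to gt slot i. Slots beyond
    len(gts) are padded "no object" targets (assumed last class index).
    """
    total = 0.0
    for i, j in enumerate(sigma):
        probs, pred_box = preds[j]
        if i < len(gts):
            label, box = gts[i]
            total += -math.log(probs[label]) + box_loss(box, pred_box)
        else:
            total += -math.log(probs[-1])
    return total
```

Generalized IoU is used in addition to L1 because the L1 distance alone scales with box size, while an IoU-based term is scale invariant and still gives a useful signal when the boxes do not overlap.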

DAB-DETR
DAB-DETR treats the object query as two parts, content and reference points, where the reference points are represented explicitly as a four-dimensional xywh vector. The decoder predicts xywh residuals to update the detection box iteratively, and the xywh vector is used to introduce positional attention, which helps DETR converge faster.


Before DAB-DETR, a lot of work had explored in depth how to model reference points. Conditional DETR learns an xy reference point from a 256-dimensional learnable vector and then injects the positional information into the transformer decoder. Anchor DETR treats the reference point as xy, learns a 256-dimensional vector from it, injects the positional information into the transformer decoder, and obtains the xy through layer-by-layer iteration. Deformable DETR derives an xywh reference anchor from a 256-dimensional learnable vector and obtains the detection box through layer-by-layer iteration. DAB-DETR is more thorough and combines the strengths of all of these: it learns a 256-dimensional vector from xywh, injects the positional information into the transformer decoder, and obtains the detection box through layer-by-layer iteration. With this, the way to use reference points becomes clear: represent them explicitly as xywh, learn a 256-dimensional vector from that to inject positional information, let each transformer decoder layer learn an xywh residual, and obtain the final detection box by accumulating the residuals layer by layer.
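The layer-by-layer residual update can be sketched as below. The assumption here is that the residual is added in logit space (inverse sigmoid) and mapped back with a sigmoid so that the box stays in the normalized range, which is a common way to implement this kind of refinement. The residuals are passed in directly for illustration, whereas in a real model each decoder layer would predict them. All function names are hypothetical.

```python
import math


def inverse_sigmoid(x, eps=1e-5):
    """Logit of x, clamped away from 0 and 1 for numerical safety."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def refine_anchor(anchor, delta):
    """Apply one layer's xywh residual to a normalised xywh anchor."""
    return tuple(sigmoid(inverse_sigmoid(a) + d) for a, d in zip(anchor, delta))


def decode_layers(anchor, deltas_per_layer):
    """Refine the anchor once per decoder layer; return every intermediate box."""
    boxes = []
    for delta in deltas_per_layer:
        anchor = refine_anchor(anchor, delta)
        boxes.append(anchor)
    return boxes


if __name__ == "__main__":
    start = (0.5, 0.5, 0.2, 0.2)
    deltas = [(0.1, 0.0, 0.0, 0.0), (0.1, 0.0, 0.0, 0.0)]
    for layer, box in enumerate(decode_layers(start, deltas), start=1):
        print(layer, [round(v, 4) for v in box])
```

The last element of the returned list plays the role of the final detection box, and the earlier ones correspond to the intermediate boxes of the shallower decoder layers.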

In addition, to make full use of the more explicit xywh representation of reference points, DAB-DETR introduces Width & Height-Modulated Multi-Head Cross-Attention. Put simply, it brings the xywh position into cross-attention to obtain positional attention. This greatly speeds up decoder convergence: the original DETR effectively has to learn positional attention over the whole image, whereas DAB-DETR can focus directly on the key positions. This is also why Deformable DETR converges faster. In essence, sparse sampling of the more important positions speeds up decoder convergence.
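The idea of width and height modulation can be sketched as a positional attention score in which the x part and the y part of the query-key similarity are rescaled separately by the anchor's width and height. This is a simplified reading of the mechanism and not the EasyCV implementation: the function names are hypothetical, the reference width and height `ref_wh` would be predicted from the content query in a real model but are passed in here, and the encoding settings and the normalization by the square root of `dim` are assumptions.

```python
import math


def sine_encoding(pos, dim=128, temperature=10000.0):
    """Encode a normalised coordinate in [0, 1] as `dim` sin/cos values."""
    angle = pos * 2 * math.pi
    out = []
    for i in range(dim // 2):
        freq = temperature ** (2 * i / dim)
        out += [math.sin(angle / freq), math.cos(angle / freq)]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def modulated_position_score(anchor, key_xy, ref_wh, dim=128):
    """Positional attention logit between one anchor query and one key cell.

    anchor: (x, y, w, h) of the query; key_xy: (x, y) of the key position;
    ref_wh: reference (w, h), an input here but learned in a real model.
    """
    x_q, y_q, w_q, h_q = anchor
    x_k, y_k = key_xy
    w_ref, h_ref = ref_wh
    # The x similarity is scaled by w_ref / w_q, the y similarity by h_ref / h_q.
    score_x = dot(sine_encoding(x_q, dim), sine_encoding(x_k, dim)) * (w_ref / w_q)
    score_y = dot(sine_encoding(y_q, dim), sine_encoding(y_k, dim)) * (h_ref / h_q)
    return (score_x + score_y) / math.sqrt(dim)


if __name__ == "__main__":
    anchor = (0.5, 0.5, 0.2, 0.2)
    print(modulated_position_score(anchor, (0.5, 0.5), (0.2, 0.2)))
    print(modulated_position_score(anchor, (0.9, 0.9), (0.2, 0.2)))
```

In this sketch the score peaks at the anchor center, and the two ratios change how strongly each axis contributes, so the positional prior is no longer forced to be the same along x and y.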
Reproduction results

Tutorial
Next, we use a practical example to show how to train the DAB-DETR algorithm with EasyCV. You can also follow the link for the detailed steps.
1. Install dependencies
If you are running in a local development environment, refer to the link to set up the environment. If you use PAI-DSW for the experiment, there is no need to install the dependencies, because the environment is already built into the PAI-DSW docker image.
2. Data preparation
You can download the COCO2017 dataset, or use the sample COCO data we provide.
The layout of data/coco is as follows:
3. Model training and evaluation
Take vitdet-base as an example. In EasyCV, configuration files control the model parameters, the data input and augmentation methods, and the training strategy. An experiment can be configured for training just by modifying the parameter settings in the configuration file. You can download the sample configuration file directly.
Check the EasyCV installation location
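One generic way to locate an installed package is to ask Python's import machinery where it would load it from. The helper below is a sketch that works for any importable package. The name `package_location` is hypothetical, and passing `"easycv"` assumes the package is installed under that import name.

```python
import importlib.util
import os


def package_location(name):
    """Return the directory a package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None  # not installed, or a namespace package without a file
    return os.path.dirname(spec.origin)


if __name__ == "__main__":
    # Assumption: EasyCV is importable as "easycv" in this environment.
    print(package_location("easycv"))
```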
Run the training command
Run the evaluation command
Reference
Code implementation :
DETR https://github.com/alibaba/EasyCV/tree/master/easycv/models/detection/detectors/detr
DAB-DETR https://github.com/alibaba/EasyCV/tree/master/easycv/models/detection/detectors/dab_detr
Previous EasyCV posts:
Reproducing ViTDet with EasyCV: single-scale features surpass FPN https://zhuanlan.zhihu.com/p/528733299
An introduction to the MAE self-supervised algorithm and its reproduction with EasyCV https://zhuanlan.zhihu.com/p/515859470
EasyCV open source | An out-of-the-box visual self-supervised + Transformer algorithm library https://zhuanlan.zhihu.com/p/505219993