Reproducing DETR and DAB-DETR with EasyCV: the right way to use object queries

2022-07-25 18:59:00 【51CTO】

DETR is one of the most novel object detection frameworks of recent years and the first truly end-to-end detection algorithm. It does away with tedious components such as the RPN, anchors and NMS: an image goes in and detection boxes come out. DETR's success is mainly due to the strong modeling capacity of the Transformer, together with the Hungarian matching algorithm, which solves the problem of learning a one-to-one matching between predicted boxes and ground-truth boxes.

Although DETR reaches accuracy comparable to Mask R-CNN, it has been criticized for needing 500 training epochs, converging slowly, and having low accuracy on small objects. A series of follow-up works addresses these issues. One of the most notable is Deformable DETR, which has become a staple of current detection work. Its contribution is not only extending deformable convolution to the Transformer; more importantly, it provides many useful techniques for training DETR-style detectors, such as a two-stage scheme modeled on Mask R-CNN, splitting the query embedding into content and reference points, extending DETR to multi-scale training, and predicting boxes with "look forward once". After Deformable DETR, the community seemed to have found the right way to use the DETR framework.

In particular, the questions of what an object query actually means and how to use it better for detection have produced a lot of valuable work, such as Anchor DETR and Conditional DETR. Among these, DAB-DETR is especially thorough. DAB-DETR treats the object query as two parts, content and reference points, where the reference points are represented explicitly as a four-dimensional xywh vector. The decoder then predicts residuals of xywh to iteratively update the detection box, and the xywh vector is used to introduce positional attention, which helps DETR converge faster. This article reproduces DETR and DAB-DETR with EasyCV and explains in detail how to use object queries correctly to improve the performance of the DETR detection framework.

DETR

DETR uses a set loss function as the supervision signal for end-to-end training and predicts all objects at once. The set loss function uses a bipartite matching algorithm to match predicted objects to gt objects. Treating object detection directly as a set prediction problem keeps the training process simple and avoids hand-designed components such as anchors and NMS.

DETR's main contributions fall into two parts: the architecture and the set prediction loss.

1. Architecture

DETR first uses a CNN to embed the input image into a two-dimensional representation, then flattens that representation into a one-dimensional sequence and feeds it to the encoder together with the positional encoding. The decoder takes a small, fixed number of learned object queries (which can be understood as positional embeddings) and the encoder output as input. Finally, each output embedding produced by the decoder is passed to a shared feed-forward network (FFN), which predicts either a detection (class and bounding box) or the "no object" class.

1.1 Transformer

1.1.1 Encoder

The feature map output by the backbone is flattened into a one-dimensional representation, giving a $d \times HW$ feature map, which is combined with the positional encoding and used as the encoder input. Each encoder layer consists of Multi-Head Self-Attention and an FFN. Unlike the standard Transformer encoder, and because the encoder is permutation-invariant, DETR adds the positional encoding at every Multi-Head Self-Attention layer to keep the position sensitivity that object detection needs.

       
# Code for a single encoder layer
from typing import Optional

import torch.nn as nn
from torch import Tensor


# Note: _get_activation_fn is a helper defined elsewhere in the codebase.
class TransformerEncoderLayer(nn.Module):

    def __init__(self,
                 d_model,
                 nhead,
                 dim_feedforward=2048,
                 dropout=0.1,
                 activation='relu',
                 normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward(self,
                src,
                src_mask: Optional[Tensor] = None,
                src_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None):
        # The positional encoding is added to queries and keys only;
        # the values are the raw src features.
        q = k = self.with_pos_embed(src, pos)
        src2 = self.self_attn(
            q,
            k,
            value=src,
            attn_mask=src_mask,
            key_padding_mask=src_key_padding_mask)[0]
        # Residual connection + LayerNorm around self-attention
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        # Residual connection + LayerNorm around the feed-forward block
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src))))
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
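To see why the positional encoding matters here, the toy below strips self-attention down to its core. It is a minimal plain-Python sketch, not EasyCV code: it assumes a single head and identity projections (no learned weights), and the token values and `pos` vectors are made up for illustration. The only thing it keeps from the layer above is that queries and keys are `src + pos` while values are the raw `src`.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(src, pos=None):
    """Single-head attention with identity projections (illustrative only).

    Queries and keys are src + pos; values are the raw src tokens.
    """
    if pos is None:
        qk = src
    else:
        qk = [[s + p for s, p in zip(tok, pe)] for tok, pe in zip(src, pos)]
    dim = len(src[0])
    scale = math.sqrt(dim)
    out = []
    for q in qk:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in qk]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, src))
                    for d in range(dim)])
    return out


# Hypothetical tokens and positional vectors, chosen only for the demo.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pos = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
order = [2, 0, 1]
shuffled = [tokens[i] for i in order]

plain = self_attention(tokens)
plain_shuffled = self_attention(shuffled)
with_pos = self_attention(tokens, pos)
with_pos_shuffled = self_attention(shuffled, pos)
```

Without `pos`, shuffling the input tokens merely shuffles the outputs in the same way, so the layer cannot tell where a token came from. With `pos` added to queries and keys, the same token produces a different output depending on the position it occupies.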
      

1.1.2 Decoder

Because the decoder is also permutation-invariant, its $N$ object queries (which can be understood as learned positional embeddings for different objects) must differ from one another in order to produce embeddings for different objects, and they are added at every Multi-Head Attention layer. The decoder transforms the $N$ object queries into $N$ output embeddings, and the FFN then decodes each output embedding independently into one of $N$ predictions, each containing a box and a class. Because the input embeddings go through both Self-Attention and Encoder-Decoder Attention, the model can reason globally using the relations between objects. Unlike the standard Transformer decoder, each DETR decoder layer outputs the $N$ objects in parallel, whereas the standard Transformer decoder is autoregressive and produces its output serially, predicting only one element of the output sequence at a time.

       
# Code for a single decoder layer
from typing import Optional

import torch.nn as nn
from torch import Tensor


# Note: _get_activation_fn is a helper defined elsewhere in the codebase.
class TransformerDecoderLayer(nn.Module):

    def __init__(self,
                 d_model,
                 nhead,
                 dim_feedforward=2048,
                 dropout=0.1,
                 activation='relu',
                 normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.multihead_attn = nn.MultiheadAttention(
            d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward(self,
                tgt,
                memory,
                tgt_mask: Optional[Tensor] = None,
                memory_mask: Optional[Tensor] = None,
                tgt_key_padding_mask: Optional[Tensor] = None,
                memory_key_padding_mask: Optional[Tensor] = None,
                pos: Optional[Tensor] = None,
                query_pos: Optional[Tensor] = None):
        # Self-attention among the object queries; query_pos is added to
        # queries and keys only.
        q = k = self.with_pos_embed(tgt, query_pos)
        tgt2 = self.self_attn(
            q,
            k,
            value=tgt,
            attn_mask=tgt_mask,
            key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)
        # Cross-attention: object queries attend to the encoder output (memory)
        tgt2 = self.multihead_attn(
            query=self.with_pos_embed(tgt, query_pos),
            key=self.with_pos_embed(memory, pos),
            value=memory,
            attn_mask=memory_mask,
            key_padding_mask=memory_key_padding_mask)[0]
        tgt = tgt + self.dropout2(tgt2)
        tgt = self.norm2(tgt)
        # Feed-forward block
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout3(tgt2)
        tgt = self.norm3(tgt)
        return tgt
      

1.1.3 FFN

The FFN consists of a 3-layer perceptron and a linear projection layer. It predicts the normalized center coordinates, height and width of the box, as well as the class. DETR predicts a fixed-size set of $N$ boxes, where $N$ is usually larger than the actual number of objects (DETR defaults to 100 and DAB-DETR to 300), and an extra empty class indicates that a predicted box contains no object.

       
import torch.nn as nn
import torch.nn.functional as F


class MLP(nn.Module):
    """ Very simple multi-layer perceptron (also called FFN)"""

    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers
        h = [hidden_dim] * (num_layers - 1)
        self.layers = nn.ModuleList(
            nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))

    def forward(self, x):
        # ReLU after every layer except the last one
        for i, layer in enumerate(self.layers):
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        return x
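As an illustration of how such outputs can be read, the sketch below turns per-query class logits and normalized (cx, cy, w, h) boxes into detections. It is plain Python written for this article, not EasyCV's post-processing: it assumes the empty class sits at the last logit index, and the function names, logits and image size are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to absolute (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)


def decode(class_logits, boxes, img_w, img_h):
    """Keep the queries whose most likely class is not the empty class."""
    detections = []
    for logits, box in zip(class_logits, boxes):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        if label == len(probs) - 1:  # assumed layout: last index = "no object"
            continue
        detections.append(
            (label, probs[label], cxcywh_to_xyxy(box, img_w, img_h)))
    return detections


# Two hypothetical queries, two real classes plus the empty class.
logits = [[4.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1)]
detections = decode(logits, boxes, img_w=640, img_h=480)
```

Here the second query is dropped because its highest-scoring class is the empty class, which is how a fixed set of $N$ predictions can represent a variable number of objects.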
      

2. Set prediction loss

The main difficulty in training DETR is how to score the predictions (class, position, number) against the gt. The loss proposed by DETR first produces an optimal bipartite matching between pred and gt (establishing a one-to-one correspondence between them) and then optimizes the loss. In the notation of the DETR paper, let $y$ denote the set of gt objects and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ the set of $N$ predictions. Assuming $N$ is larger than the number of objects in the image, $y$ can be regarded as a set of size $N$ padded with the empty class (no object). Searching over the permutations $\sigma \in \mathfrak{S}_N$ of the $N$ elements, the permutation with the lowest loss is the optimal bipartite matching:

$$\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$$

where $\mathcal{L}_{match}(y_i, \hat{y}_{\sigma(i)})$ is the matching loss between gt element $y_i$ and the prediction with index $\sigma(i)$. The matching is computed with the Hungarian algorithm. The matching loss takes both the predicted class and the predicted box into account. Each gt element $i$ can be seen as $y_i = (c_i, b_i)$, where $c_i$ is the class label (possibly the empty class) and $b_i$ is the gt box. For the prediction with index $\sigma(i)$, the predicted probability of class $c_i$ is written $\hat{p}_{\sigma(i)}(c_i)$ and the predicted box is written $\hat{b}_{\sigma(i)}$.
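The search over permutations can be made concrete with a brute-force sketch. This is only an illustration in plain Python: it enumerates every permutation, which is feasible for a handful of queries, whereas the Hungarian algorithm finds the same assignment in polynomial time. The toy cost (negative class probability plus L1 box distance, zero for padded slots) and all data values are assumptions made for the example.

```python
from itertools import permutations


def match_cost(gt, pred):
    """Toy matching cost: -p(class) + L1 box distance; 0 for padded slots."""
    if gt is None:  # padded "no object" slot contributes nothing
        return 0.0
    label, box = gt
    probs, pred_box = pred
    return -probs[label] + sum(abs(a - b) for a, b in zip(box, pred_box))


def brute_force_match(gts, preds):
    """Return sigma such that gt i is matched to prediction sigma[i]."""
    padded = list(gts) + [None] * (len(preds) - len(gts))
    best = min(
        permutations(range(len(preds))),
        key=lambda sigma: sum(
            match_cost(g, preds[j]) for g, j in zip(padded, sigma)))
    return best[:len(gts)]


# Hypothetical data: 2 gt objects, 3 predictions (class probs, cxcywh box).
gts = [(0, (0.2, 0.2, 0.1, 0.1)), (1, (0.7, 0.7, 0.2, 0.2))]
preds = [
    ([0.1, 0.9], (0.7, 0.7, 0.2, 0.2)),
    ([0.3, 0.7], (0.5, 0.5, 0.5, 0.5)),
    ([0.8, 0.2], (0.2, 0.2, 0.1, 0.1)),
]
sigma = brute_force_match(gts, preds)
```

In this example the first gt object is matched to the third prediction and the second gt object to the first prediction; the remaining prediction is left for the empty class.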

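The first step finds the one-to-one matching between pred and gt; the second step computes the Hungarian loss over the matched pairs:

$$\mathcal{L}_{Hungarian}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \mathcal{L}_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$$

where $\hat{\sigma}$ is the optimal assignment from the first step, $c_i$ and $b_i$ are the class label and box of gt element $i$, $\hat{p}$ and $\hat{b}$ are the predicted class probability and box, and $\mathcal{L}_{box}$ combines the L1 loss and the generalized IoU loss:

$$\mathcal{L}_{box}(b_i, \hat{b}_{\sigma(i)}) = \lambda_{iou} \mathcal{L}_{iou}(b_i, \hat{b}_{\sigma(i)}) + \lambda_{L1} \lVert b_i - \hat{b}_{\sigma(i)} \rVert_1$$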
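The box term can be written out in a few lines of plain Python. This is a sketch under stated assumptions rather than EasyCV's implementation: boxes are normalized (cx, cy, w, h) with positive area, the IoU term is taken as 1 - GIoU, and the default weights 2 and 5 are the values used by the original DETR configuration, assumed here for illustration.

```python
def to_xyxy(box):
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def giou(a, b):
    """Generalized IoU of two (cx, cy, w, h) boxes with positive area."""
    ax1, ay1, ax2, ay2 = to_xyxy(a)
    bx1, by1, bx2, by2 = to_xyxy(b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest box enclosing both a and b
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(gt, pred, lambda_iou=2.0, lambda_l1=5.0):
    """lambda_iou * (1 - GIoU) + lambda_l1 * L1 distance."""
    l1 = sum(abs(g - p) for g, p in zip(gt, pred))
    return lambda_iou * (1.0 - giou(gt, pred)) + lambda_l1 * l1


left_half = (0.25, 0.5, 0.5, 1.0)
right_half = (0.75, 0.5, 0.5, 1.0)
loss_same = box_loss(left_half, left_half)
loss_apart = box_loss(left_half, right_half)
```

Unlike plain IoU, the GIoU term still changes when two boxes do not overlap, because it also accounts for the empty space inside their enclosing box; this keeps the loss informative for predictions that miss the target entirely.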

       
# HungarianMatcher computes cost_bbox, cost_class and cost_giou to match
# predicted boxes to gt boxes one-to-one, then returns the matched index pairs;
# the index pairs are later used to compute the loss.
# (Excerpt from the matcher's forward method; linear_sum_assignment is
# scipy.optimize.linear_sum_assignment.)

# Final cost matrix
C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou
C = C.view(bs, num_queries, -1).cpu()

# Number of gt boxes per image, used to split the cost matrix per image
sizes = [len(v['boxes']) for v in targets]
indices = [
    linear_sum_assignment(c[i])
    for i, c in enumerate(C.split(sizes, -1))
]
return [(torch.as_tensor(i, dtype=torch.int64),
         torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
      

DAB-DETR

DAB-DETR treats the object query as two parts, content and reference points. The reference points are represented explicitly as a four-dimensional xywh vector; the decoder predicts residuals of xywh to iteratively update the detection box, and the xywh vector is used to introduce positional attention, which helps DETR converge faster.

Before DAB-DETR, a lot of work had explored in depth how to set the reference points. Conditional DETR learns an xy reference point from a 256-dimensional learnable vector and then introduces the positional information into the transformer decoder. Anchor DETR takes the reference point to be xy, learns a 256-dimensional vector from it to introduce positional information into the transformer decoder, and obtains xy through layer-by-layer iteration. Deformable DETR learns an xywh reference anchor from a 256-dimensional vector and obtains the detection box through layer-by-layer iteration. DAB-DETR is more thorough and combines the strengths of these approaches: it learns a 256-dimensional vector from xywh, introduces the positional information into the transformer decoder, and obtains the detection box through layer-by-layer iteration. With this, the way to use reference points becomes clear: represent them explicitly as xywh, learn a 256-dimensional vector from them to introduce positional information, let each transformer decoder layer learn a residual of xywh, and obtain the final detection box by accumulating the residuals layer by layer.
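The layer-by-layer update can be sketched as follows. This is a minimal plain-Python illustration, assuming the residual is added in logit (inverse-sigmoid) space so that the refined xywh stays in (0, 1); the function names, the initial anchor and the per-layer residuals are hypothetical values, not outputs of a trained model.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def refine(anchor, delta):
    """One decoder layer's update: add the predicted residual in logit space."""
    return tuple(sigmoid(inverse_sigmoid(a) + d) for a, d in zip(anchor, delta))


anchor = (0.5, 0.5, 0.2, 0.2)           # initial normalized xywh reference
layer_deltas = [                        # hypothetical residuals, one per layer
    (0.4, -0.2, 0.5, 0.3),
    (0.1, 0.0, 0.1, 0.1),
    (0.0, 0.0, 0.0, 0.0),
]
for delta in layer_deltas:
    anchor = refine(anchor, delta)
```

Each layer only has to predict a small correction to the previous layer's box, and a zero residual leaves the box unchanged.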

       
# DAB-DETR explicitly splits the object query into two parts: content and pos.
# Excerpt only: the first assignment belongs in __init__, the rest in the
# forward pass.

# query_embed is represented explicitly as xywh (the pos attribute) and is
# mapped by an MLP to a 256-dimensional pos feature
self.query_embed = nn.Embedding(num_queries, query_dim)

# get sine embedding for the query vector
# (nn.Embedding has no .sigmoid(); apply it to the weight tensor)
reference_points = self.query_embed.weight.sigmoid()
obj_center = reference_points[..., :2]
query_sine_embed = gen_sineembed_for_position(obj_center)
query_pos = self.ref_point_head(query_sine_embed)

# content_embed is initialized as an all-zero 256-dimensional feature
tgt = torch.zeros(
    self.num_queries,
    bs,
    self.embed_dims,
    device=self.query_embed.weight.device)
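For intuition about the sine embedding step, here is a plain-Python sketch of a sinusoidal embedding of a normalized box center. It is written for illustration and is not the `gen_sineembed_for_position` used by EasyCV: the 128 features per coordinate, the temperature of 10000, the 2π scaling and the y-then-x concatenation order are assumptions modeled on common DETR-style implementations.

```python
import math


def sine_embed(coord, num_feats=128, temperature=10000.0):
    """Sinusoidal embedding of one normalized coordinate in [0, 1]."""
    angle = coord * 2.0 * math.pi
    feats = []
    for i in range(num_feats):
        # Feature pairs (2k, 2k+1) share one frequency: sin, then cos
        freq = temperature ** (2 * (i // 2) / num_feats)
        value = angle / freq
        feats.append(math.sin(value) if i % 2 == 0 else math.cos(value))
    return feats


def embed_center(x, y, num_feats=128):
    """Concatenate the y and x embeddings into one positional vector."""
    return sine_embed(y, num_feats) + sine_embed(x, num_feats)


center_embedding = embed_center(0.3, 0.7)
```

With these assumed sizes, an xy center becomes a 256-dimensional vector, matching the feature width mentioned above; an MLP such as `ref_point_head` can then map it to the positional query.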
      

In addition, to make full use of xywh as a more explicit representation of the reference points, DAB-DETR introduces Width & Height-Modulated Multi-Head Cross-Attention. Simply put, it injects the xywh position into cross-attention to obtain positional attention. This improvement can greatly speed up decoder convergence: the original DETR effectively has to learn positional attention over the whole image, whereas DAB-DETR can focus directly on the key locations. This is also why Deformable DETR converges faster; in essence, sparse sampling of the more important locations speeds up the convergence of the decoder.

       
# An MLP learns a reference width and height that rescale query_sine_embed,
# adjusting where attention is placed and further speeding up convergence

# modulated HW attentions
if self.modulate_hw_attn:
    refHW_cond = self.ref_anchor_head(output).sigmoid()  # nq, bs, 2
    query_sine_embed[..., self.d_model // 2:] *= (
        refHW_cond[..., 0] / obj_center[..., 2]).unsqueeze(-1)
    query_sine_embed[..., :self.d_model // 2] *= (
        refHW_cond[..., 1] / obj_center[..., 3]).unsqueeze(-1)
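The rescaling in the excerpt above can be restated on a single vector in plain Python. This sketch assumes the same layout as the excerpt, with the first half of the embedding encoding y and the second half encoding x; the function name and the numbers are made up for illustration.

```python
def modulate(query_sine_embed, ref_wh, anchor_wh):
    """Scale the x-half by w_ref / w and the y-half by h_ref / h.

    Assumed layout: first half of the vector encodes y, second half encodes x.
    """
    half = len(query_sine_embed) // 2
    w_ref, h_ref = ref_wh
    w, h = anchor_wh
    y_part = [v * (h_ref / h) for v in query_sine_embed[:half]]
    x_part = [v * (w_ref / w) for v in query_sine_embed[half:]]
    return y_part + x_part


# A 4-dimensional toy embedding; the anchor is twice as wide as the reference.
modulated = modulate([1.0, 1.0, 1.0, 1.0], ref_wh=(0.2, 0.4), anchor_wh=(0.4, 0.4))
```

In this toy case only the x-half is scaled (by 0.5), which shows how the anchor's width and height act on the two axes of the positional query independently.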
      

Reproduction results

[Figure: reproduction results; image not available]

Tutorial

Next, we use a practical example to show how to train DAB-DETR with EasyCV. The detailed steps can also be found via the link.

1. Install dependencies

If you are running in a local development environment, you can set up the environment by following the link. If you are using PAI-DSW for the experiment, there is no need to install the dependencies, since the environment is already built into the PAI-DSW docker image.

2. Data preparation

You can download the COCO2017 dataset, or use the sample COCO data we provide:

       
wget http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/data/small_coco_demo/small_coco_demo.tar.gz && tar -zxf small_coco_demo.tar.gz
mkdir -p data/ && mv small_coco_demo data/coco
      

The layout of data/coco is as follows:

       
data/coco/
├── annotations
│   ├── instances_train2017.json
│   └── instances_val2017.json
├── train2017
│   ├── 000000005802.jpg
│   ├── 000000060623.jpg
│   ├── 000000086408.jpg
│   ├── 000000118113.jpg
│   ├── 000000184613.jpg
│   ├── 000000193271.jpg
│   ├── 000000222564.jpg
│   ├── ...
│   └── 000000574769.jpg
└── val2017
    ├── 000000006818.jpg
    ├── 000000017627.jpg
    ├── 000000037777.jpg
    ├── 000000087038.jpg
    ├── 000000174482.jpg
    ├── 000000181666.jpg
    ├── 000000184791.jpg
    ├── 000000252219.jpg
    ├── ...
    └── 000000522713.jpg
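Before starting a long training run, it can help to confirm that the data directory has this layout. The helper below is a small standard-library sketch written for this article, not part of EasyCV; the function names are hypothetical and the required paths are simply the ones shown in the tree above.

```python
from pathlib import Path

# Paths (relative to the dataset root) taken from the layout shown above.
REQUIRED = (
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
    "train2017",
    "val2017",
)


def missing_entries(root):
    """Return the required paths (relative to root) that do not exist."""
    root = Path(root)
    return [rel for rel in REQUIRED if not (root / rel).exists()]


def count_images(root, split):
    """Count the .jpg files directly inside root/split."""
    return sum(1 for _ in (Path(root) / split).glob("*.jpg"))
```

For example, `missing_entries("data/coco")` returns an empty list when everything is in place, and `count_images("data/coco", "train2017")` gives a quick sanity check on the number of training images.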
      

3. Model training and evaluation

This example uses the DAB-DETR R50 configuration (dab_detr_r50_8x2_50e_coco.py). In EasyCV, configuration files control the model parameters, the data input and augmentation methods, and the training strategy, so an experiment can be set up for training just by modifying the parameters in the configuration file. You can download the sample configuration file directly.

Check where easycv is installed:

       
# Check where easycv is installed (run in Python)
import easycv
print(easycv.__file__)

# Then add the EasyCV root directory to PYTHONPATH (run in the shell)
export PYTHONPATH=$PYTHONPATH:root/EasyCV
      

Run the training command

       
Single machine, 8 GPUs:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port=29500 \
    tools/train.py configs/detection/dab-detr/dab_detr_r50_8x2_50e_coco.py \
    --work_dir easycv/dab_detr --launcher pytorch
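If you launch the same command with different GPU counts, a small helper can assemble it for you. The function below is a hypothetical convenience written for this article; it only reuses the flags that appear in the command above, and whether the configuration needs other changes (for example batch size or learning rate) when fewer GPUs are used is not covered here.

```python
import shlex


def build_train_command(config, work_dir, gpus=8, master_port=29500):
    """Assemble the launch command shown above for a given number of GPUs."""
    devices = ",".join(str(i) for i in range(gpus))
    args = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={gpus}",
        f"--master_port={master_port}",
        "tools/train.py", config,
        "--work_dir", work_dir,
        "--launcher", "pytorch",
    ]
    # shlex.quote protects paths that contain spaces or shell metacharacters
    return (f"CUDA_VISIBLE_DEVICES={devices} "
            + " ".join(shlex.quote(a) for a in args))


command = build_train_command(
    "configs/detection/dab-detr/dab_detr_r50_8x2_50e_coco.py",
    "easycv/dab_detr")
```

Calling it with `gpus=2`, for instance, produces the same command restricted to devices 0 and 1 with `--nproc_per_node=2`.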
       

Run the evaluation command

       
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=8 --master_port=29500 \
    tools/eval.py configs/detection/dab-detr/dab_detr_r50_8x2_50e_coco.py \
    easycv/dab_detr/epoch_50.pth --launcher pytorch --eval
      