2022-07-03 14:53:00 【Hali_Botebie】
Paper: End-to-End Object Detection with Transformers
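DETR recasts object detection as a set prediction task. Using a transformer encoder-decoder architecture and bipartite matching, it produces the set of predictions directly from the input image. Unlike SOTA detectors, it needs no proposals (Faster R-CNN), no anchors (YOLO), no centers (CenterNet), and no cumbersome NMS. It predicts boxes and classes directly, relies on the Hungarian algorithm for bipartite matching, and combines a CNN with a transformer to perform object detection.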
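Two ingredients are essential to this detection framework: ① a set prediction loss that forces a one-to-one matching between predicted boxes and ground truth; ② an architecture that predicts a set of objects and models the relations among them. Both are described in turn below.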
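The network consists mainly of a CNN and a Transformer. Through self-attention, the Transformer explicitly models the pairwise interactions between all elements of a sequence, which makes this kind of architecture well suited to constrained set prediction problems. DETR's characteristics are: prediction in a single pass, end-to-end training, a set loss function, and bipartite matching.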
Backbone -> transformer -> Prediction
CNN ->encoder+decoder -> FFN
Input image size
- 3 × W × H
Size of the output feature map f
- 2048 × (W/32) × (H/32)
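As a quick check of these shapes, the sketch below computes the feature-map size and the length of the flattened sequence that reaches the transformer. It assumes a standard ResNet-50 with a total stride of 32 whose padded convolutions round the spatial size up; the function name `backbone_output_shape` and the 800 × 1216 example are illustrative and not taken from the paper.

```python
import math

def backbone_output_shape(height, width, stride=32, channels=2048):
    """Return (C, H', W') of the last backbone stage for an H x W image.

    Assumes the padded convolutions round up, so H' = ceil(H / stride).
    """
    return channels, math.ceil(height / stride), math.ceil(width / stride)

c, h, w = backbone_output_shape(800, 1216)
seq_len = h * w  # number of tokens after flattening the H' x W' grid
print((c, h, w), seq_len)  # (2048, 25, 38) 950
```

The sequence length grows with the image area, which is one reason self-attention over the feature map is expensive for large inputs.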
The detailed description of the transformer used in DETR, with positional encodings passed at every attention layer.
Image features from the CNN backbone are passed through the transformer encoder, together with spatial positional encoding that are added to queries and keys at every multihead self-attention layer.
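(That is, the spatial positional encodings are added to the queries and keys of every multi-head self-attention layer, while the CNN features form the encoder input.)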
Then, the decoder receives queries (initially set to zero), output positional encoding (object queries), and encoder memory, and produces the final set of predicted class labels and bounding boxes through multiple multihead self-attention and decoder-encoder attention. The first self-attention layer in the first decoder layer can be skipped.
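The Transformer encoder first reduces the channel dimension of the input feature map and flattens it into a sequence, then feeds it into the structure shown in the left half of the figure below. Together with the spatial positional encodings, the sequence passes through several layers of multi-head self-attention, normalization, and FFN, yielding an encoded feature sequence with one element per feature-map location (the encoder memory). For how each self-attention block works, see Liu Yan (刘岩), 详解Transformer (Attention Is All You Need), or the paper: https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf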
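Next, the encoder output is passed to the Transformer decoder shown in the right half of the figure above, together with N object queries, and the output sequence is decoded in parallel (rather than element by element as in machine translation). Unlike the traditional autoregressive mechanism, every decoder layer decodes all N objects at once. Because the decoder is permutation-invariant (reordering the inputs only reorders the outputs), the N input embeddings must differ from one another to produce different results. Position matters as much as the content of each pixel, so DETR borrows positional encoding from NLP and adds it at every layer. The authors put considerable effort into handling position; whenever a transformer is applied to image inputs, position needs careful attention.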
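The permutation argument can be checked numerically. The following is a minimal sketch in plain Python, assuming a single attention head with identity projections (a simplification for illustration, not the DETR decoder itself). It shows that reordering the inputs only reorders the outputs, and that identical inputs with no positional term produce identical outputs.

```python
import math
import random

def self_attention(tokens):
    """Single-head dot-product self-attention with identity projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exp = [math.exp(s - m) for s in scores]
        total = sum(exp)
        weights = [e / total for e in exp]
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def close(a, b, tol=1e-9):
    """Element-wise comparison of two lists of vectors."""
    return all(math.isclose(x, y, abs_tol=tol)
               for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)
tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
perm = [3, 0, 4, 1, 2]

out = self_attention(tokens)
out_perm = self_attention([tokens[i] for i in perm])
# Reordering the inputs only reorders the outputs.
equivariant = close([out[i] for i in perm], out_perm)

# Five identical inputs (all zeros, no positional term) give five identical outputs.
same = self_attention([[0.0] * 4 for _ in range(5)])
all_rows_equal = all(row == same[0] for row in same)
print(equivariant, all_rows_equal)
```

This is why the N decoder inputs need distinct learned embeddings: without them every slot would predict the same object.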
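Shared-parameter FFNs (a 3-layer perceptron with ReLU activations and hidden dimension d, plus a linear projection layer) independently decode each of the N output embeddings into the final detections, consisting of class scores and box coordinates. The FFN predicts the normalized center coordinates, height, and width of the box w.r.t. the input image, and the linear layer predicts the class label using a softmax function.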
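Because the boxes are predicted as normalized (center_x, center_y, w, h), they have to be rescaled before drawing or evaluation. A minimal sketch of that conversion follows; the helper name `cxcywh_to_xyxy` and the 640 × 480 image size are assumptions for illustration.

```python
def cxcywh_to_xyxy(box, img_w, img_h):
    """Convert a normalized (cx, cy, w, h) box to absolute (x0, y0, x1, y1) pixels."""
    cx, cy, w, h = box
    return ((cx - w / 2) * img_w, (cy - h / 2) * img_h,
            (cx + w / 2) * img_w, (cy + h / 2) * img_h)

# A box centered in a 640 x 480 image, covering 20% of the width and 40% of the height.
x0, y0, x1, y1 = cxcywh_to_xyxy((0.5, 0.5, 0.2, 0.4), img_w=640, img_h=480)
print(x0, y0, x1, y1)
```

The corners come out at roughly (256, 144) and (384, 336), up to floating-point rounding.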
$L_{box}$ uses the GIoU [2] proposed in the paper Generalized Intersection over Union, combined with an L1 loss on the box coordinates; GIoU is briefly introduced below.
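For reference, the box loss in the DETR paper can be written as a weighted sum of the two terms. Here $b_i$ is a ground-truth box, $\hat{b}_{\sigma(i)}$ is the prediction matched to it, and $\lambda_{iou}$, $\lambda_{L1}$ are scalar hyperparameters:

```latex
L_{box}(b_i, \hat{b}_{\sigma(i)}) =
    \lambda_{iou}\, L_{iou}(b_i, \hat{b}_{\sigma(i)})
  + \lambda_{L1}\, \lVert b_i - \hat{b}_{\sigma(i)} \rVert_1
```

The L1 term alone would penalize large and small boxes differently for the same relative error, which is why it is paired with the scale-invariant GIoU term.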
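In the paper, the authors mainly compare against the classic detection framework Faster R-CNN. The results are as follows (methods with the DC5 suffix add a dilation to the last stage of the backbone and remove a stride from the first convolution of that stage to increase the feature resolution):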
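As the figure above shows, although the DETR framework is simple, its accuracy is on par with the classic Faster R-CNN: DETR improves detection of large objects but performs worse on small objects. The approach is novel. It applies a sequence-prediction idea similar to machine translation, breaks with the traditional design of object detectors, and reduces the detector's dependence on prior knowledge and post-processing, which makes the framework simpler while matching Faster R-CNN.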
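- Like IoU, GIoU can be used as a distance: as a loss, 1 - GIoU satisfies the basic requirements of a loss function.
- GIoU is insensitive to scale.
- GIoU is a lower bound of IoU; as the two boxes approach complete overlap, GIoU converges to IoU.
- IoU takes values in [0, 1], whereas GIoU has a symmetric range, [-1, 1]. It reaches its maximum of 1 when the two boxes coincide and approaches its minimum of -1 when they do not intersect and are infinitely far apart, which makes GIoU a very good distance measure.
- Whereas IoU only considers the overlapping region, GIoU also takes the non-overlapping regions into account, so it better reflects how the two boxes overlap.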
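These properties can be verified on a few boxes. The sketch below is a plain-Python version of IoU and GIoU for axis-aligned boxes in (x0, y0, x1, y1) format, written for illustration rather than copied from the DETR code base; the function names are assumptions.

```python
def box_area(b):
    """Area of an (x0, y0, x1, y1) box; empty boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def iou_and_giou(a, b):
    """IoU and GIoU of two boxes given as (x0, y0, x1, y1) with x0 < x1, y0 < y1."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    iou = inter / union
    # C is the smallest axis-aligned box enclosing both a and b.
    enclose = box_area((min(a[0], b[0]), min(a[1], b[1]),
                        max(a[2], b[2]), max(a[3], b[3])))
    giou = iou - (enclose - union) / enclose
    return iou, giou

same = iou_and_giou((0, 0, 2, 2), (0, 0, 2, 2))              # identical boxes
partial = iou_and_giou((0, 0, 2, 2), (1, 1, 3, 3))           # partial overlap
far = iou_and_giou((0, 0, 1, 1), (1000, 1000, 1001, 1001))   # disjoint, far apart
print(same, partial, far)
```

For disjoint boxes IoU is 0 regardless of distance, while GIoU keeps decreasing toward -1 as the boxes move apart, so the loss 1 - GIoU still provides a useful signal.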
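A demo implementation of DETR in a minimal number of lines, with the following differences from the DETR in the paper:
- learned positional encoding (instead of sine)
- positional encoding is passed at the input (instead of in attention)
- fc bbox predictor (instead of MLP)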
import torch
from torch import nn
from torchvision.models import resnet50


class DETRdemo(nn.Module):
    """
    Demo DETR implementation.

    Demo implementation of DETR in minimal number of lines, with the
    following differences wrt DETR in the paper:
    * learned positional encoding (instead of sine)
    * positional encoding is passed at input (instead of attention)
    * fc bbox predictor (instead of MLP)
    The model achieves ~40 AP on COCO val5k and runs at ~28 FPS on Tesla V100.
    Only batch size 1 supported.
    """
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()

        # create ResNet-50 backbone
        self.backbone = resnet50()
        del self.backbone.fc

        # create conversion layer
        self.conv = nn.Conv2d(2048, hidden_dim, 1)

        # create a default PyTorch transformer
        self.transformer = nn.Transformer(
            hidden_dim, nheads, num_encoder_layers, num_decoder_layers)

        # prediction heads, one extra class for predicting empty (no-object) slots
        # note that in baseline DETR linear_bbox layer is 3-layer MLP
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)

        # output positional encodings (object queries)
        self.query_pos = nn.Parameter(torch.rand(100, hidden_dim))

        # spatial positional encodings
        # note that in baseline DETR we use sine positional encodings
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, inputs):
        # propagate inputs through ResNet-50 up to avg-pool layer
        x = self.backbone.conv1(inputs)
        x = self.backbone.bn1(x)
        x = self.backbone.relu(x)
        x = self.backbone.maxpool(x)

        x = self.backbone.layer1(x)
        x = self.backbone.layer2(x)
        x = self.backbone.layer3(x)
        x = self.backbone.layer4(x)

        # convert from 2048 to 256 feature planes for the transformer
        h = self.conv(x)

        # construct positional encodings
        H, W = h.shape[-2:]
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)

        # propagate through the transformer
        h = self.transformer(pos + 0.1 * h.flatten(2).permute(2, 0, 1),
                             self.query_pos.unsqueeze(1)).transpose(0, 1)

        # finally project transformer outputs to class labels and bounding boxes
        return {'pred_logits': self.linear_class(h),
                'pred_boxes': self.linear_bbox(h).sigmoid()}
MLP (also called FFN)
Very simple multi-layer perceptron (also called FFN)
import torch
import torch.nn as nn
import torch.nn.functional as F


# define the network
class MLP(nn.Module):
    """ Very simple multi-layer perceptron (also called FFN)"""

    def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
        super().__init__()
        self.num_layers = num_layers
        h = [hidden_dim] * (num_layers - 1)
        # pair consecutive sizes: input -> hidden -> ... -> hidden -> output
        self.layers = nn.ModuleList(nn.Linear(n, k) for n, k in zip([input_dim] + h, h + [output_dim]))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            # ReLU after every layer except the last
            x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
        return x


net = MLP(1, 2, 3, 4)
print(net)
Current directory: E:\Git_repo\pytorch-cifar100\torch2caffe
activate.bat OCR_ONNX_CUDA & cd "E:\Git_repo\pytorch-cifar100\torch2caffe" & python MLP.py
Process started (PID=14236) >>>
MLP(
  (layers): ModuleList(
    (0): Linear(in_features=1, out_features=2, bias=True)
    (1): Linear(in_features=2, out_features=2, bias=True)
    (2): Linear(in_features=2, out_features=2, bias=True)
    (3): Linear(in_features=2, out_features=3, bias=True)
  )
)
<<< Process finished (PID=14236). (Exit code 0)
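The layer sizes in this output follow directly from the `zip` in the constructor. The sketch below reproduces just that pairing without torch; the helper name `mlp_layer_dims` is hypothetical, and the second call uses sizes assumed from the demo's `hidden_dim=256` default and a 3-layer box head with 4 outputs.

```python
def mlp_layer_dims(input_dim, hidden_dim, output_dim, num_layers):
    """Return the (in_features, out_features) pair of every Linear layer."""
    h = [hidden_dim] * (num_layers - 1)
    return list(zip([input_dim] + h, h + [output_dim]))

print(mlp_layer_dims(1, 2, 3, 4))      # [(1, 2), (2, 2), (2, 2), (2, 3)]
print(mlp_layer_dims(256, 256, 4, 3))  # [(256, 256), (256, 256), (256, 4)]
```

Shifting the size list by one position and zipping it with itself gives each layer's input and output width in a single expression.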
SetCriterion( loss )
This class computes the loss for DETR.
The process happens in two steps:
1) we compute hungarian assignment between ground truth boxes and the outputs of the model
2) we supervise each pair of matched ground-truth / prediction (supervise class and box)
1. def loss_labels(self, outputs, targets, indices, num_boxes, log=True):
    """Classification loss (NLL)
    targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
    """
2. def loss_boxes(self, outputs, targets, indices, num_boxes):
    """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
    targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
    The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
    """
3. def loss_masks(self, outputs, targets, indices, num_boxes):
    """Compute the losses related to the masks: the focal loss and the dice loss.
    targets dicts must contain the key "masks" containing a tensor of dim [nb_target_boxes, h, w]
    """
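To make the classification step concrete, here is a minimal plain-Python sketch of how matched indices could be turned into per-query class targets and a weighted NLL. It is not the SetCriterion code: the helper names, the down-weighting factor of 0.1 for the no-object class, and the normalization by the sum of weights are assumptions made for illustration.

```python
import math

def class_targets(num_queries, matches, target_labels, no_object):
    """Class index for every query: matched queries take the label of their
    ground-truth box, all other queries take the no-object index."""
    targets = [no_object] * num_queries
    for pred_idx, tgt_idx in matches:
        targets[pred_idx] = target_labels[tgt_idx]
    return targets

def weighted_nll(probs, targets, no_object, no_object_weight=0.1):
    """Weighted mean of -log p(target); no-object queries are down-weighted."""
    weights = [no_object_weight if t == no_object else 1.0 for t in targets]
    losses = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

# 4 queries, 2 ground-truth boxes with labels [1, 0]; class index 2 means "no object".
matches = [(0, 1), (1, 0)]  # (prediction index, target index) pairs
targets = class_targets(4, matches, target_labels=[1, 0], no_object=2)
print(targets)  # [0, 1, 2, 2]

uniform = [[1 / 3] * 3 for _ in range(4)]  # an uninformative classifier
loss = weighted_nll(uniform, targets, no_object=2)
print(loss)
```

Down-weighting the no-object class keeps the many unmatched queries from dominating the classification loss.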
""" This is the DETR module that performs object detection """
import torch
from torch import nn


class PositionEmbeddingLearned(nn.Module):
    """
    Absolute pos embedding, learned.
    """
    def __init__(self, num_pos_feats=256):
        super().__init__()
        self.row_embed = nn.Embedding(50, num_pos_feats)
        self.col_embed = nn.Embedding(50, num_pos_feats)
        self.reset_parameters()

    def reset_parameters(self):
        nn.init.uniform_(self.row_embed.weight)
        nn.init.uniform_(self.col_embed.weight)

    def forward(self):  # , tensor_list: NestedTensor):
        # x = tensor_list.tensors
        x = torch.ones([2, 2])  # dummy 2 x 2 input used to inspect the encoding
        h, w = x.shape[-2:]
        i = torch.arange(w, device=x.device)
        j = torch.arange(h, device=x.device)
        x_emb = self.col_embed(i)
        y_emb = self.row_embed(j)
        print(torch.cat([x_emb.unsqueeze(0).repeat(h, 1, 1), y_emb.unsqueeze(1).repeat(1, w, 1)], dim=-1))
        # pos = torch.cat([
        #     x_emb.unsqueeze(0).repeat(h, 1, 1),
        #     y_emb.unsqueeze(1).repeat(1, w, 1),
        # ], dim=-1).permute(2, 0, 1).unsqueeze(0).repeat(x.shape[0], 1, 1, 1)
        # return pos
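The `unsqueeze`/`repeat`/`cat` sequence is easier to follow on a toy example. The sketch below mimics it with plain Python lists instead of torch tensors (an assumption made so the indexing is visible): the vector at grid location (row i, column j) is the column embedding of j followed by the row embedding of i. The name `build_pos` and the toy table values are hypothetical.

```python
def build_pos(row_embed, col_embed, h, w):
    """Row-major list of h*w vectors; entry (i, j) is col_embed[j] + row_embed[i]
    (list concatenation, mirroring torch.cat(..., dim=-1))."""
    return [col_embed[j] + row_embed[i] for i in range(h) for j in range(w)]

# Toy tables with 2 features per axis; the values encode the index for readability.
row_embed = [[10 + i, 10 + i] for i in range(50)]
col_embed = [[20 + j, 20 + j] for j in range(50)]

pos = build_pos(row_embed, col_embed, h=2, w=3)
print(len(pos), len(pos[0]))  # 6 4
print(pos[4])                 # [21, 21, 11, 11] -> location (row 1, col 1)
```

Each location gets a distinct vector while only 50 row and 50 column embeddings need to be learned, rather than one embedding per pixel.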
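For efficiency reasons, the targets do not include no_object. In general, therefore, there are more predictions than targets. In that case a 1-to-1 matching of the best predictions is performed, while the remaining predictions stay unmatched (and are thus treated as non-objects).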
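The 1-to-1 matching can be illustrated on a tiny cost matrix. The sketch below uses exhaustive search over assignments as a stand-in for the Hungarian algorithm, which is only feasible for very small inputs; in practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be used. The function name `match_bruteforce` and the cost values are made up for illustration.

```python
from itertools import permutations

def match_bruteforce(cost):
    """cost[p][t] is the matching cost of prediction p and target t.

    Returns the (pred_idx, tgt_idx) pairs of the cheapest one-to-one assignment.
    Exhaustive search stands in for the Hungarian algorithm here.
    """
    num_preds, num_tgts = len(cost), len(cost[0])
    best = min(permutations(range(num_preds), num_tgts),
               key=lambda preds: sum(cost[p][t] for t, p in enumerate(preds)))
    return sorted((p, t) for t, p in enumerate(best))

# 4 predictions, 2 targets: more predictions than targets, as in DETR.
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.5],
        [0.7, 0.3]]
matches = match_bruteforce(cost)
unmatched = sorted(set(range(len(cost))) - {p for p, _ in matches})
print(matches, unmatched)  # [(0, 1), (1, 0)] [2, 3]
```

Predictions 2 and 3 receive no target and would be supervised as no-object in the classification loss.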
Printed structure of the transformer (only layer 0 of the encoder and decoder stacks is shown):
(encoder): TransformerEncoder(
  (layers): ModuleList(
    (0): TransformerEncoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(in_features=512, out_features=2048, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=2048, out_features=512, bias=True)
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
    )
  )
)
(decoder): TransformerDecoder(
  (layers): ModuleList(
    (0): TransformerDecoderLayer(
      (self_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=512, out_features=512, bias=True)
      )
      (multihead_attn): MultiheadAttention(
        (out_proj): _LinearWithBias(in_features=512, out_features=512, bias=True)
      )
      (linear1): Linear(in_features=512, out_features=2048, bias=True)
      (dropout): Dropout(p=0.1, inplace=False)
      (linear2): Linear(in_features=2048, out_features=512, bias=True)
      (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (norm3): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (dropout1): Dropout(p=0.1, inplace=False)
      (dropout2): Dropout(p=0.1, inplace=False)
      (dropout3): Dropout(p=0.1, inplace=False)
    )
  )
)
