OD-Model [4]: SSD
2022-08-02 03:16:00 【zzzyzh】
SSD: Single Shot MultiBox Detector
1. Abstract & Introduction
1.1. Abstract
We propose a method for detecting objects in images using a single deep neural network. Our method, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales. At prediction time, the network generates scores for the presence of each object category in each default box and adjusts the box to better match the object shape. In addition, the network combines predictions from multiple feature maps at different resolutions to handle objects of different sizes naturally. Relative to methods that require object proposals, SSD is simple because it completely eliminates proposal generation and the subsequent pixel or feature resampling stages, and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.
1.2. Introduction
This paper presents the first deep-network-based object detector that does not resample pixels or features for bounding box hypotheses and is still as accurate as approaches that do. This enables a significant increase in speed for high-accuracy detection.
We are not the first to do this, but by adding a series of improvements we managed to greatly improve accuracy over previous attempts. The improvements include using small convolutional filters to predict object categories and bounding box offsets, using separate predictors (filters) for detections of different aspect ratios, and applying these filters to multiple feature maps from the later stages of the network in order to perform detection at multiple scales. With these modifications, especially the use of multiple layers for prediction at different scales, we can achieve high accuracy with relatively low-resolution input, which further increases detection speed.
- We introduce SSD, a single-shot detector for multiple categories that is faster and more accurate than the previous single-shot detector (YOLO), and in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
- The core of SSD is the use of small convolutional filters applied to feature maps to predict category scores and box offsets for a fixed set of default bounding boxes.
- To achieve high detection accuracy, the paper generates predictions at different scales from feature maps of different scales, and explicitly separates predictions by aspect ratio.
- These design features lead to simple end-to-end training and high accuracy, even on low-resolution input images, further improving the trade-off between speed and accuracy.
Problems with Faster R-CNN:
- Poor detection of small objects
  - Prediction is performed on only one feature layer
    - That feature map is abstracted to a relatively high level, so details are lost
    - Predicting small objects relies on detail information
- The model is large and detection is relatively slow
  - The prediction process is split into two stages
2. The Single Shot Detector (SSD)
2.1. Model
2.1.1. Multi-scale feature maps for detection
Convolutional feature layers are added to the end of the truncated base network. These layers decrease in size progressively and allow detection predictions at multiple scales. The convolutional model used to predict detections is different for each feature layer.
- Input
  - The image fed to the network must be an RGB image of size 300x300 (the input image is resized before it is fed into the network)
- Network structure
  - VGG16 serves as the base network. Its two fully connected layers are replaced by ordinary convolutional layers, and several convolutional layers are appended to the end of the network to obtain more feature maps for prediction
  - Convx_y
    - x: the x-th group of convolutional layers
    - y: the y-th convolutional layer within the x-th group
  - VGG-16 through Conv5_3 layer -> all layers of the VGG16 model up to and including Conv5_3
    - The output of Conv4_3 is a $38 \times 38 \times 512$ feature matrix, which is SSD's 1st prediction feature layer
    - max_pool5 of VGG16 is changed from the original $2 \times 2$-s2 (kernel 2, stride 2) to $3 \times 3$-s1 (kernel 3, stride 1), so the output of max_pool5 has the same size as the output of Conv5_3
  - Conv6 -> FC6 of the VGG16 model
    - Because max_pool5 has been modified, the output of Conv6 has half the spatial size of Conv4_3
    - Implemented as a $3 \times 3 \times 1024$ convolutional layer
  - Conv7 -> FC7 of the VGG16 model
    - The output of Conv7 is a $19 \times 19 \times 1024$ feature matrix, which is SSD's 2nd prediction feature layer
    - Implemented as a $1 \times 1 \times 1024$ convolutional layer
  - Conv8_2
    - The output of Conv8_2 is a $10 \times 10 \times 512$ feature matrix, which is SSD's 3rd prediction feature layer
    - Passes through a $1 \times 1 \times 256$ convolutional layer
    - Passes through a $3 \times 3 \times 512$-s2 convolutional layer
  - Conv9_2
    - The output of Conv9_2 is a $5 \times 5 \times 256$ feature matrix, which is SSD's 4th prediction feature layer
    - Passes through a $1 \times 1 \times 128$ convolutional layer
    - Passes through a $3 \times 3 \times 256$-s2 convolutional layer
  - Conv10_2
    - The output of Conv10_2 is a $3 \times 3 \times 256$ feature matrix, which is SSD's 5th prediction feature layer
    - Passes through a $1 \times 1 \times 128$ convolutional layer
    - Passes through a $3 \times 3 \times 256$-s1 convolutional layer
  - Conv11_2
    - The output of Conv11_2 is a $1 \times 1 \times 256$ feature matrix, which is SSD's 6th prediction feature layer
    - Passes through a $1 \times 1 \times 128$ convolutional layer
    - Passes through a $3 \times 3 \times 256$-s1 convolutional layer
  - Notes:
    - The $3 \times 3$ convolutions of Conv8_2 and Conv9_2 use stride = 2, padding = 1
    - The $3 \times 3$ convolutions of Conv10_2 and Conv11_2 use stride = 1, padding = 0
    - The first prediction layer tends to detect smaller objects; as the level of abstraction increases, later prediction layers detect larger objects
  - [Figure: VGG16 model structure]
- Output
  - Across the six prediction feature layers the network produces 8732 predicted boxes (called default boxes in the paper, the same concept as anchors in Faster R-CNN). A set of small convolutional filters then predicts the object category and position offsets for these boxes (classification and localization), and the NMS (non-maximum suppression) algorithm produces the final detection results.
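The spatial sizes of the extra layers follow from the usual convolution output-size formula. The sketch below is a minimal check, assuming square feature maps and the kernel, stride, and padding values noted above; `conv_out`, `EXTRA_LAYERS`, and `extra_layer_sizes` are illustrative names, not part of any SSD codebase.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output side length of a convolution applied to a square feature map."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) of the 3x3 convolution that ends each extra block.
# The 1x1 convolution before it (stride 1, no padding) keeps the spatial size.
EXTRA_LAYERS = [
    ("Conv8_2", 3, 2, 1),
    ("Conv9_2", 3, 2, 1),
    ("Conv10_2", 3, 1, 0),
    ("Conv11_2", 3, 1, 0),
]


def extra_layer_sizes(conv7_size: int = 19) -> dict:
    """Propagate the Conv7 output size through the extra feature layers."""
    sizes = {}
    size = conv7_size
    for name, kernel, stride, padding in EXTRA_LAYERS:
        size = conv_out(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


print(extra_layer_sizes())
# {'Conv8_2': 10, 'Conv9_2': 5, 'Conv10_2': 3, 'Conv11_2': 1}
```

Starting from the $19 \times 19$ output of Conv7, this reproduces the 10, 5, 3, 1 progression of the later prediction layers.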
2.1.2. Convolutional predictors for detection
Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture figure. For a feature layer of size $m \times n$ with $p$ channels, the basic element for predicting the parameters of a potential detection is a small $3 \times 3 \times p$ kernel that produces either a score for a category or a shape offset relative to the default box coordinates. At each of the $m \times n$ locations where the kernel is applied, it produces an output value. The bounding box offset outputs are measured relative to the default box position at each feature map location.
Specifically, for each of the $k$ default boxes at a given location, we compute $c$ category scores and $4$ offsets relative to the original default box shape. This results in a total of $(c+4)k$ filters applied around each location in the feature map, yielding $(c+4)kmn$ outputs for an $m \times n$ feature map.
- $(c + 4) \times k = c \times k + 4 \times k$
  - $c \times k$: the category scores predicted for each default box
    - $c$ category scores per default box
      - The scores include the background (the first score is the background score)
      - $k$ default boxes are generated at every location on the feature map
  - $4 \times k$: the bounding box regression parameters of each default box
    - The order is $(x, y, w, h)$
    - Only $4 \times k$ parameters are generated per location (4 per default box), regardless of which category each default box belongs to
    - This differs from the Faster R-CNN detection head, which generates $4 \times c$ bounding box regression parameters per proposal because it regresses a separate box for each category
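The channel arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative, and the example value $c = 21$ (20 object categories plus background) is an assumption chosen for the example, not something fixed by the text above.

```python
def predictor_channels(num_classes: int, boxes_per_cell: int) -> tuple:
    """Output channels of the class and box predictors for one feature layer."""
    cls_channels = num_classes * boxes_per_cell  # c * k category scores
    loc_channels = 4 * boxes_per_cell            # 4 * k offsets, shared by all categories
    return cls_channels, loc_channels


def total_outputs(m: int, n: int, num_classes: int, boxes_per_cell: int) -> int:
    """(c + 4) * k * m * n outputs for an m x n feature map."""
    cls_channels, loc_channels = predictor_channels(num_classes, boxes_per_cell)
    return (cls_channels + loc_channels) * m * n


# Assumed example: c = 21 (background included), k = 4, on a 38 x 38 feature map.
print(predictor_channels(21, 4))    # (84, 16)
print(total_outputs(38, 38, 21, 4)) # 144400
```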
2.1.3. Default boxes and aspect ratios
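For the multiple feature maps at the top of the network, we associate a set of default bounding boxes with each feature map cell. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-category scores that indicate the presence of a category instance in each of those boxes. Our default boxes are similar to the anchors used in Faster R-CNN, but we apply them to several feature maps of different resolutions. Allowing different default box shapes on several feature maps lets us efficiently discretize the space of possible output box shapes.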
- (a): the annotated original image
- (b): an $8 \times 8$ feature map; it is less abstract, retains more detail, and predicts smaller objects
  - The cat is predicted on the $8 \times 8$ feature map, where the two blue default boxes match the cat well
- (c): a $4 \times 4$ feature map; it is more abstract, retains less detail, and predicts larger objects
  - The dog is predicted on the $4 \times 4$ feature map, where the single red default box matches the dog well
- Default boxes are placed on different feature layers for prediction
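To make the tiling concrete, the sketch below places one default box center in the middle of every cell of a square feature map, in normalized image coordinates. The center convention $((j + 0.5)/f, (i + 0.5)/f)$ is taken from the SSD paper rather than from the notes above, and `default_box_centers` is an illustrative name.

```python
def default_box_centers(fmap_size: int) -> list:
    """Normalized (cx, cy) centers, one per feature map cell, listed row by row."""
    return [
        ((j + 0.5) / fmap_size, (i + 0.5) / fmap_size)
        for i in range(fmap_size)   # row index -> cy
        for j in range(fmap_size)   # column index -> cx
    ]


centers = default_box_centers(4)
print(len(centers))   # 16
print(centers[0])     # (0.125, 0.125)
print(centers[-1])    # (0.875, 0.875)
```

A coarse $4 \times 4$ map gives 16 widely spaced centers, while an $8 \times 8$ map gives 64 denser ones, which is why the finer map is better suited to small objects.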
2.2. Training
2.2.1. Choosing scales and aspect ratios for default boxes
The scale and aspect ratio settings of the default boxes:
- scale: the object scale
  - Each layer has a pair $(s_k, s_{k+1})$
  - For aspect ratio 1, an additional default box is added with scale $s_k' = \sqrt{s_k s_{k+1}}$
- aspect: the set of aspect ratios used at each scale
  - For Conv4_3, Conv10_2 and Conv11_2
    - 4 default boxes
    - 3 ratios: $1:1, 2:1, 1:2$
  - For Conv7, Conv8_2 and Conv9_2
    - 6 default boxes
    - 5 ratios: $1:1, 2:1, 1:2, 3:1, 1:3$
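The settings above can be turned into per-cell box shapes as follows. This is a sketch under stated assumptions: the width and height rule $w = s_k \sqrt{a_r}$, $h = s_k / \sqrt{a_r}$ comes from the SSD paper, not from the notes above, and the scale values 0.2 and 0.34 are illustrative only.

```python
import math


def default_box_shapes(s_k: float, s_k_next: float, aspect_ratios: list) -> list:
    """Normalized (width, height) of every default box at one feature map cell."""
    shapes = []
    for ar in aspect_ratios:
        # Same area s_k^2 for every ratio; only the shape changes.
        shapes.append((s_k * math.sqrt(ar), s_k / math.sqrt(ar)))
    # Extra square box at the intermediate scale sqrt(s_k * s_{k+1}).
    s_prime = math.sqrt(s_k * s_k_next)
    shapes.append((s_prime, s_prime))
    return shapes


# 3 ratios -> 4 boxes (Conv4_3, Conv10_2, Conv11_2)
print(len(default_box_shapes(0.2, 0.34, [1, 2, 1 / 2])))            # 4
# 5 ratios -> 6 boxes (Conv7, Conv8_2, Conv9_2)
print(len(default_box_shapes(0.2, 0.34, [1, 2, 1 / 2, 3, 1 / 3])))  # 6
```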
Default boxes of all these shapes are generated at every position of each prediction feature layer, so the total number of default boxes is:
$$38 \times 38 \times 4 + 19 \times 19 \times 6 + 10 \times 10 \times 6 + 5 \times 5 \times 6 + 3 \times 3 \times 4 + 1 \times 1 \times 4 = 8732$$
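The total can be verified from the feature map sizes and the number of default boxes per cell listed for each prediction layer. A minimal sketch, with `FEATURE_LAYERS` and `count_default_boxes` as illustrative names:

```python
# (layer name, feature map side length, default boxes per cell)
FEATURE_LAYERS = [
    ("Conv4_3", 38, 4),
    ("Conv7", 19, 6),
    ("Conv8_2", 10, 6),
    ("Conv9_2", 5, 6),
    ("Conv10_2", 3, 4),
    ("Conv11_2", 1, 4),
]


def count_default_boxes(layers: list) -> int:
    """Sum of size * size * boxes_per_cell over all prediction layers."""
    return sum(size * size * k for _, size, k in layers)


print(count_default_boxes(FEATURE_LAYERS))  # 8732
```

Note that Conv4_3 alone contributes $38 \times 38 \times 4 = 5776$ boxes, roughly two thirds of the total.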
2.2.2. Selection of positive and negative samples
Matching strategy
- Each ground-truth box is matched to the default box with the highest IoU
- Any default box whose IoU with any ground-truth box is greater than 0.5 is also treated as a positive sample
Hard negative mining
Everything other than the positive samples can be treated as negative samples. Instead of using all of them, we sort the negatives by the highest confidence loss of each default box and pick the top ones, so that the ratio between negatives and positives is at most $3:1$. We found that this leads to faster optimization and more stable training.
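The two matching rules and the 3:1 negative selection can be sketched in plain Python. This is an illustrative version, not the paper's implementation: boxes are assumed to be `(xmin, ymin, xmax, ymax)` tuples, the per-box confidence losses are assumed to be already computed, at least one default box is assumed to exist, and all function names are hypothetical.

```python
def iou(a: tuple, b: tuple) -> float:
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes: list, gt_boxes: list, threshold: float = 0.5) -> dict:
    """Return {default box index: ground-truth index} for all positive samples."""
    matches = {}
    # Rule 2: a default box is positive if its best IoU exceeds the threshold.
    for i, d in enumerate(default_boxes):
        best_j, best_iou = None, threshold
        for j, g in enumerate(gt_boxes):
            value = iou(d, g)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matches[i] = best_j
    # Rule 1: every ground-truth box claims its highest-IoU default box.
    for j, g in enumerate(gt_boxes):
        best_i = max(range(len(default_boxes)), key=lambda i: iou(default_boxes[i], g))
        matches[best_i] = j
    return matches


def hard_negatives(conf_losses: list, positives: dict, ratio: int = 3) -> list:
    """Negative indices with the highest confidence loss, at most ratio * #positives."""
    negatives = [i for i in range(len(conf_losses)) if i not in positives]
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return negatives[: ratio * len(positives)]
```

For example, with default boxes `[(0, 0, 1, 1), (0, 0, 0.5, 0.5)]` and a single ground-truth box `(0, 0, 0.3, 0.3)`, no IoU exceeds 0.5, yet rule 1 still assigns the ground truth to the second default box, so every ground-truth box gets at least one positive sample.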
2.2.3. Training objective
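The overall objective is a weighted sum of the confidence loss and the localization loss:
$$L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha L_{loc}(x, l, g) \right)$$
where $N$ is the number of matched positive samples and $\alpha = 1$.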
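- Classification loss
  - The confidence loss is the softmax loss over the multi-class confidences $(c)$.
  - Parameters
    - $\hat{c}_i^p$: the predicted probability that the $i$-th default box belongs to category $p$, the category of its matched GT box
    - $x_{ij}^p \in \{0, 1\}$: match indicator, equal to 1 when the $i$-th default box is matched to the $j$-th GT box of category $p$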
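Written out, the confidence loss described above takes the following form in the SSD paper. The sets $Pos$ and $Neg$ (positive and negative default boxes), the raw scores $c_i^p$, and the background probability $\hat{c}_i^0$ are notation from the paper rather than from the notes above:

$$L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \qquad \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)}$$

The first sum penalizes positive boxes that assign low probability to their matched category, and the second penalizes negative boxes that assign low probability to the background.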
- Localization loss
  - Parameters
    - $l_i^m$: the predicted regression parameters for the $i$-th positive sample
    - $\hat{g}_j^m$: the regression parameters of the $j$-th GT box matched to positive sample $i$
      - $\hat{g}_j^{cx} = (g_j^{cx} - d_i^{cx}) / d_i^w$
        - $g_j^{cx}$: x coordinate of the center of the GT box
        - $d_i^{cx}$: x coordinate of the center of the $i$-th default box
        - $d_i^w$: width of the $i$-th default box
      - $\hat{g}_j^{cy} = (g_j^{cy} - d_i^{cy}) / d_i^h$
        - $g_j^{cy}$: y coordinate of the center of the GT box
        - $d_i^{cy}$: y coordinate of the center of the $i$-th default box
        - $d_i^h$: height of the $i$-th default box
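The regression targets and the localization loss for a single positive sample can be sketched as below. The center offsets follow the definitions above; the log-scale width and height targets and the Smooth L1 penalty are taken from the SSD paper, not from these notes, and all function names are illustrative.

```python
import math


def encode(gt: tuple, default: tuple) -> tuple:
    """Regression targets of a GT box relative to a default box; both are (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,   # center x offset, normalized by default box width
        (g_cy - d_cy) / d_h,   # center y offset, normalized by default box height
        math.log(g_w / d_w),   # log width ratio (from the paper)
        math.log(g_h / d_h),   # log height ratio (from the paper)
    )


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def loc_loss(pred: tuple, gt: tuple, default: tuple) -> float:
    """Smooth L1 loss between predicted offsets and encoded targets for one positive box."""
    target = encode(gt, default)
    return sum(smooth_l1(l - g) for l, g in zip(pred, target))


box = (0.5, 0.5, 0.2, 0.2)
print(encode(box, box))                  # (0.0, 0.0, 0.0, 0.0)
print(loc_loss((0, 0, 0, 0), box, box))  # 0.0
```

Normalizing the center offsets by the default box size keeps the targets on a similar numeric range for small and large default boxes.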
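- Because SSD is also a one-stage method (a single network), its running speed is comparable to YOLO, while it remains effective at detecting objects of different aspect ratios. This is because the algorithm uses default boxes of multiple aspect ratios and sizes at every feature map cell, which is the core of the algorithm.
- The initial scales and aspect ratios of the default boxes must be set manually. The base sizes and shapes of the default boxes cannot be learned directly by the network. Since every feature layer uses default boxes of different sizes and shapes, tuning depends heavily on experience.
- Recognition of small objects is still relatively poor and does not reach the level of Faster R-CNN. This is mainly because small objects are trained on lower-level features (small objects have larger IoU at the lower levels), and lower-level features are not nonlinear enough to be trained to sufficient accuracy.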