YOLOV3

2022-07-26 00:35:00 【I want to send SCI】

YOLOv3 performance: the author plots YOLOv3's results directly onto RetinaNet's results figure.

The left figure shows results averaged over different IoU thresholds. The right figure uses a single IoU threshold of 0.5.

mAP is the average of the AP values computed at all of the IoU thresholds.

Why is the first figure lower than the second? Because the first figure includes thresholds all the way up to 0.95, where the predicted box has to coincide almost perfectly with the label. That is a very demanding requirement, so the scores are definitely not that strong and come out a little low. In the second figure the threshold is 0.5, so an overlap of 0.5 is enough, and the result is bound to be higher than at 0.95.
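To make the averaging concrete, here is a minimal sketch. It assumes the COCO convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, and the AP values are made up purely for illustration (they are not results from any paper).

```python
def coco_style_map(ap_by_threshold):
    """Average the AP values over all IoU thresholds."""
    return sum(ap_by_threshold.values()) / len(ap_by_threshold)


# IoU thresholds 0.50, 0.55, ..., 0.95 (assumed COCO convention)
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical AP values: AP drops as the threshold gets stricter
fake_ap_values = [0.55 - 0.05 * i for i in range(10)]
fake_ap = dict(zip(thresholds, fake_ap_values))

averaged = coco_style_map(fake_ap)
print("AP at IoU 0.5:", fake_ap[0.5])
print("mAP over all thresholds:", averaged)
```

Because the stricter thresholds pull the average down, the averaged number is always lower than the AP at 0.5 alone, which is why the first figure looks worse.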

Anyway, the author said that judging performance at high thresholds is unscientific, hahahaha.

YOLOv2 uses Darknet-19, which has 19 layers.

YOLOv3 uses Darknet-53. The 53 means 52 convolutional layers plus one fully connected layer, with residual connections added inside.

The backbone network is the most important part! Everything downstream is obtained by post-processing the features extracted by the backbone. The backbone is the supplier of ingredients. The object detection head, or the keypoint detection head, is the cook.

The 52 is the total count of all convolutional layers added up (not the number of residual blocks). Add the final fully connected layer and you get 53.
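The count can be checked with a little arithmetic. This sketch assumes the usual Darknet-53 layout: one stem convolution, five stride-2 downsampling convolutions, and residual stages of 1, 2, 8, 8 and 4 blocks with two convolutions per block. The function name is just for illustration.

```python
# Residual blocks in each of the five stages (assumed Darknet-53 layout)
RESIDUAL_BLOCKS_PER_STAGE = [1, 2, 8, 8, 4]


def count_darknet53_layers(blocks_per_stage=RESIDUAL_BLOCKS_PER_STAGE):
    """Return (number of conv layers, total layers including the FC layer)."""
    stem_convs = 1                                # first 3x3 convolution
    downsample_convs = len(blocks_per_stage)      # one stride-2 conv per stage
    residual_convs = 2 * sum(blocks_per_stage)    # each block: 1x1 conv + 3x3 conv
    convs = stem_convs + downsample_convs + residual_convs
    return convs, convs + 1                       # +1 for the fully connected layer


convs, total = count_darknet53_layers()
print(convs, total)
```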

After training this backbone on the 1000 categories of ImageNet, remove the global average pooling layer at the end, and it acts as a feature extractor.

Note that some of the convolutions inside have a stride of 2. It is the stride of 2 that causes the downsampling.
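Here is a small sketch of why a stride of 2 halves the feature map. It uses the standard convolution output-size formula and assumes a 3x3 kernel with padding 1, which is an assumption for illustration rather than something stated in these notes.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample_chain(size, num_stride2_convs=5):
    """Apply several stride-2 convolutions in a row and record each size."""
    sizes = [size]
    for _ in range(num_stride2_convs):
        size = conv_out_size(size)
        sizes.append(size)
    return sizes


print(downsample_chain(416))
```

Five stride-2 convolutions in a row give a total downsampling factor of 2^5 = 32.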

Features at three scales are obtained from the input image and used in the subsequent multi-scale object detection.

These three scales are downsampled by factors of 32, 16 and 8 respectively.

For a 416*416 input image, downsampling gives: 416/32 = 13 (13*13), 416/16 = 26 (26*26), 416/8 = 52 (52*52)

Because the classification head at the end has been taken off, it becomes a fully convolutional network. There is no fully connected layer, so it is compatible with input images of any size.

256, 608, 416: anything works as long as it is a multiple of 32, because the total downsampling factor is 32.
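The grid sizes for any valid input can be worked out like this. It is a minimal sketch, and the function name is only illustrative.

```python
def grid_sizes(input_size, strides=(32, 16, 8)):
    """Return the side length of the feature map at each downsampling factor."""
    if input_size % max(strides) != 0:
        raise ValueError("input size must be a multiple of 32")
    return [input_size // s for s in strides]


for size in (256, 416, 608):
    print(size, grid_sizes(size))
```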

Starting from the second column: good accuracy, a small amount of computation, more efficient GPU operation, and FPS. Darknet-53 is more bloated and a little slow, but it still scores higher than v1 and Darknet-19.

Floating-point throughput: the higher it is, the more efficiently the GPU is being used.

v1: grid cell = 7, 24 convolutional layers, 2 fully connected layers, bounding boxes

v2: grid cell = 13, Darknet-19 (18 convolutional layers + 1 fully connected layer), anchors (prior boxes: an anchor box that has mostly been detecting tall, thin objects is itself tall and thin)

Diagram drawn by Jiang Dabai (江大白) on Zhihu

Input: 416*416*3. Output: feature maps at three sizes, 13*13*255, 26*26*255 and 52*52*255.

255 = 3 * 85. The 3: every grid cell generates 3 anchors, and every anchor corresponds to one prediction box. Each prediction box has 5 + 80 dimensions. The 5: x, y, w, h, c. The 80: conditional probabilities for the 80 categories of the COCO dataset.
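As a sketch of that arithmetic, the 255 numbers of one grid cell can be cut into 3 prediction boxes of 85 numbers each. The channel ordering used here (anchor by anchor, then x, y, w, h, c, then the classes) is an assumption for illustration, and the field names are made up.

```python
def head_channels(num_anchors=3, num_classes=80):
    """Number of output channels per grid cell."""
    return num_anchors * (5 + num_classes)


def split_cell_vector(vec, num_anchors=3, num_classes=80):
    """Split one grid cell's output vector into one dict per anchor."""
    per_box = 5 + num_classes
    assert len(vec) == num_anchors * per_box
    boxes = []
    for a in range(num_anchors):
        chunk = vec[a * per_box:(a + 1) * per_box]
        boxes.append({
            "xywh": chunk[0:4],         # box position and size
            "confidence": chunk[4],     # the "c" in xywhc
            "class_probs": chunk[5:],   # 80 conditional class probabilities
        })
    return boxes


cell = [0.0] * head_channels()
boxes = split_cell_vector(cell)
print(head_channels(), len(boxes), len(boxes[0]["class_probs"]))
```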

13*13*255: each cell corresponds to a 32*32 region of the original image, which means the 13*13 scale is responsible for predicting large objects.

This is because 416/13 = 32. The 13 is the number of grid cells along each side.

26*26*255: 16*16 region, medium objects

52*52*255: 8*8 region, small objects

The 13*13 features are upsampled by 2x (13*2 = 26) and concatenated with the backbone's 26*26 features. After further processing, we get 26*26*255.

concat: think of stacking two exercise books. The two books have different thicknesses, and you just pile them up along the thickness direction.
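The exercise-book picture looks like this in NumPy. It is a minimal sketch: the channel counts (256 and 512) are only illustrative, and the "thickness direction" is the last axis of an (H, W, C) array.

```python
import numpy as np

# Two feature maps with the same height and width but different "thickness"
upsampled = np.zeros((26, 26, 256), dtype=np.float32)   # illustrative channels
backbone = np.ones((26, 26, 512), dtype=np.float32)     # illustrative channels

# Stack along the channel axis: height and width stay the same,
# and the channel counts add up
stacked = np.concatenate([upsampled, backbone], axis=-1)
print(stacked.shape)
```

Note that concat keeps both sets of channels side by side. It does not add the values together, which is what a residual connection does.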

The 26*26 features are likewise upsampled by 2x (26*2 = 52) and concatenated with the backbone's 52*52 features. After processing, we get the 52*52*255 features.

In other words, the final 52*52*255 features fuse in the earlier 26*26 features, and through them the 13*13 features as well.
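The upsample-then-concat chain can be sketched with NumPy. This is a simplification under stated assumptions: nearest-neighbour upsampling is used, the convolution layers between the steps are left out, and the tiny channel counts are placeholders.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) array."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def fuse(deep, shallow):
    """Upsample the deeper feature map and stack it onto the shallower one."""
    return np.concatenate([upsample2x(deep), shallow], axis=-1)


# Placeholder backbone features at the three scales
c13 = np.zeros((13, 13, 4))
c26 = np.zeros((26, 26, 2))
c52 = np.zeros((52, 52, 1))

f26 = fuse(c13, c26)   # 13 -> 26, then concat with the 26*26 features
f52 = fuse(f26, c52)   # 26 -> 52, then concat with the 52*52 features
print(f26.shape, f52.shape)
```

The 52*52 result contains channels that came from the 26*26 scale, which in turn contain channels from the 13*13 scale.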

This takes advantage of the abstract, semantic features of the deep layers, and also makes full use of the fine-grained, pixel-level features of the shallow layers, such as edges, corners and structure.

Multi-scale feature fusion allows detection of objects at different scales.

Conditional probability: assuming the box already contains an object, the probability that it is a cat, the probability that it is a dog, and so on.
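A common way to turn these conditional probabilities into final per-class scores is to multiply them by the box confidence. The sketch below assumes that convention, and the numbers are made up.

```python
def class_scores(confidence, conditional_probs):
    """Score per class = P(object in box) * P(class | object in box)."""
    return [confidence * p for p in conditional_probs]


# Hypothetical box: 50% sure there is an object; if so, 80% cat, 20% dog
scores = class_scores(0.5, [0.8, 0.2])
print(scores)
```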

Backbone, neck, head

Backbone: extracts features. Neck: fuses features (FPN). Head: makes the final prediction.

The backbone is a fully convolutional network with no fully connected layer, so it is compatible with different input sizes that are multiples of 32.

There are 9 anchors in total.

It is no longer only about which grid cell the object's center falls in. Look at which anchor has the largest IoU with the object, and that anchor (prediction box) is the one that predicts the object.

An anchor that does not have the largest IoU is not a positive sample.
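Picking the anchor with the largest IoU can be sketched as follows. The 9 anchor sizes are the COCO anchors from the YOLOv3 paper. Comparing only widths and heights (as if both boxes shared the same center) is a simplifying assumption, and the function names are illustrative.

```python
# The 9 COCO anchors (width, height) in pixels, from small to large
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]


def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, using only width/height."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(gt_wh, anchors=ANCHORS):
    """Index of the anchor whose shape overlaps the labeled box the most."""
    ious = [wh_iou(gt_wh, a) for a in anchors]
    return max(range(len(anchors)), key=lambda i: ious[i])


idx = best_anchor((120, 95))      # hypothetical labeled box, 120 wide, 95 tall
scale = idx // 3                  # 0: 52*52 scale, 1: 26*26 scale, 2: 13*13 scale
print(idx, ANCHORS[idx], scale)
```

Only this one anchor becomes the positive sample for the object.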

Confidence is a posterior probability. In the visualization, you can see that each box comes with a number.

The YOLOv3 process

The yellow box is the manually labeled box for the dog. Its center point falls in the red cell.

The red grid cell has three anchors. Find the anchor with the largest IoU with the labeled box and use it to predict the object.

YOLOv1 has at most 98 prediction boxes.

The larger the input image, the more grid cells you get. The number of prediction boxes is the number of grid cells * 3, so the number of prediction boxes across the three scales is also large.
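The box counts can be worked out directly. This sketch assumes YOLOv1 uses a 7*7 grid with 2 boxes per cell (which gives the 98 mentioned above) and that YOLOv3 uses 3 anchors per cell at strides 32, 16 and 8.

```python
def yolov1_boxes(grid=7, boxes_per_cell=2):
    """Total prediction boxes in YOLOv1."""
    return grid * grid * boxes_per_cell


def yolov3_boxes(input_size, strides=(32, 16, 8), anchors_per_cell=3):
    """Total prediction boxes in YOLOv3 across the three scales."""
    return sum((input_size // s) ** 2 * anchors_per_cell for s in strides)


print(yolov1_boxes())        # 7 * 7 * 2
print(yolov3_boxes(416))     # (13*13 + 26*26 + 52*52) * 3
print(yolov3_boxes(608))     # (19*19 + 38*38 + 76*76) * 3
```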

Selection of positive and negative samples, based on IoU:
Greater than the threshold and the largest IoU: positive sample
Greater than the threshold but not the largest IoU: ignored
Less than the threshold: negative sample (blue and green)
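The three rules can be written as a small function. This is a sketch that follows the rules exactly as listed in these notes. The threshold of 0.5 is an assumed value, an IoU exactly equal to the threshold is treated as passing, and the IoU values are made up.

```python
def assign_samples(ious, threshold=0.5):
    """Label each anchor as 'positive', 'ignore' or 'negative' for one object."""
    best = max(range(len(ious)), key=lambda i: ious[i])
    labels = []
    for i, iou in enumerate(ious):
        if iou < threshold:
            labels.append("negative")   # below the threshold
        elif i == best:
            labels.append("positive")   # passes the threshold and has the largest IoU
        else:
            labels.append("ignore")     # passes the threshold but is not the largest
    return labels


# Hypothetical IoUs between three anchors and one labeled box
print(assign_samples([0.3, 0.6, 0.8]))
```

Ignored anchors contribute nothing to the loss: they overlap the object too well to be punished as background, but they are not the one responsible for predicting it.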