
YOLOv3 explained in full, from the perspective of data encoding

2022-07-28 12:11:00 【alex1801】

This article summarizes the development of the YOLO series. At its core, training a neural network means comparing predicted values against ground-truth values, and the most fundamental difference between tasks lies in how the ground truth is encoded. The key to understanding a deep learning model is therefore to understand how its ground truth is encoded, and this article explains YOLOv3 from that angle.

1、 Introducing the problem

Deep learning was first used to solve classification problems. For a 10-class task, for example, the category is encoded in one-hot form.

In a classification problem we input an image and output a category. Take four categories as an example: pedestrian, bicycle, motorcycle and car. The image is a matrix of numbers, so it is natural to think of describing the four categories with four numbers: 1 for pedestrian, 2 for bicycle, 3 for motorcycle, 4 for car. The model would then predict one of the numbers 1 to 4. Does that work?

Intuitively, this encoding places the four categories at different distances from one another (1 is closer to 2 than to 4), whereas in practice the categories have no such ordering. One-hot encoding is therefore introduced: each category is encoded as a 0/1 vector, so that the categories are independent of one another.
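The difference between the two encodings can be checked with a few lines of Python. This is only a minimal sketch; the names CLASSES, one_hot and squared_distance are made up for illustration and do not come from any YOLO code base.

```python
# Hypothetical class list, in the order used in the text above.
CLASSES = ["pedestrian", "bicycle", "motorcycle", "car"]

def one_hot(label, classes=CLASSES):
    """Return a 0/1 vector with a single 1 at the index of `label`."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# With integer codes, pedestrian (1) is 3 away from car (4) but only 1 away
# from bicycle (2). With one-hot codes, every pair of distinct classes is
# equally far apart.
print(one_hot("motorcycle"))
print(squared_distance(one_hot("pedestrian"), one_hot("car")))
print(squared_distance(one_hot("pedestrian"), one_hot("bicycle")))
```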

2、 Ground-truth encoding for object detection

What if you also want to locate the car in the image? We can let the neural network output a few more units that describe a bounding box. Specifically, the network outputs 4 more numbers, denoted 𝑏𝑥, 𝑏𝑦, 𝑏ℎ and 𝑏𝑤, which parameterize the bounding box of the detected object.

With four categories, the output of the network is these four numbers plus a class label, or rather the probability of each class. The target label 𝑦 is defined as follows, where 𝑝𝑐 indicates whether an object is present:

𝑦 = [𝑝𝑐, 𝑏𝑥, 𝑏𝑦, 𝑏ℎ, 𝑏𝑤, 𝑐1, 𝑐2, 𝑐3, 𝑐4]
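As a rough sketch, the label vector for an image containing at most one object could be built like this. The function name encode_single_target and the convention of filling everything with zeros when no object is present are assumptions made for the example, not something the original article specifies.

```python
def encode_single_target(box=None, class_index=None, num_classes=4):
    """Build y = [pc, bx, by, bh, bw, c1..cN] for at most one object.

    `box` is (bx, by, bh, bw); pass None when the image contains no object.
    The all-zero vector for "no object" is an assumption of this sketch.
    """
    if box is None:
        return [0.0] * (1 + 4 + num_classes)
    class_one_hot = [0.0] * num_classes
    class_one_hot[class_index] = 1.0
    return [1.0, *box, *class_one_hot]

# A car (index 3 of pedestrian, bicycle, motorcycle, car) with a made-up box.
y = encode_single_target(box=(0.5, 0.6, 0.3, 0.4), class_index=3)
print(y)
```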

This approach can only classify and locate a single object, because the output dimension of a neural network is fixed and it cannot predict an unknown number of objects. How, then, can multiple objects be detected?

2.1、 Explanation from the perspective of sliding windows

Suppose a 14*14*3 image passes through a series of convolutions and the final output is a 1*1*4 vector. This 1*1*4 vector is the classification result for the image. If the output vector is changed to 1*1*(1+4+4), it gives both the classification and the localization of the object in the image.

With the same network and a 16*16*3 input, the convolutional network effectively runs 4 times and outputs 4 labels (a 2*2 grid of vectors). Mapped back to the original image, this is equivalent to a sliding window of size 14*14 with stride 2.

With the same network and a 28*28*3 input, the final output is 8*8*4. Mapped back to the original image, this is again equivalent to a sliding window of size 14*14 with stride 2.
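The output sizes quoted above follow from the usual sliding-window count, (input - window) / stride + 1 per side. A small sketch, with the hypothetical helper name num_windows:

```python
def num_windows(input_size, window=14, stride=2):
    """Number of window positions along one side of a square input."""
    return (input_size - window) // stride + 1

for size in (14, 16, 28):
    n = num_windows(size)
    print(f"{size}*{size} input -> {n}*{n} output positions")
```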

2.2、 Explanation from the perspective of position prediction

Now consider the encoding problem of object detection from another angle. The localization scheme above can recognize the size and position of one object in the image, but what should be done to recognize several?

A direct idea is to increase the dimension of the output vector. For four categories the original output vector has (1+4+4) elements; increasing the count gives (1+4+4)*N, where N is the maximum number of objects to predict. Does that work?

The problem is that, although the count can barely meet the requirement, there is no relationship between the position of an object in the image and the position of its slot in the vector. It is therefore natural to think of the following: to detect four objects, have the network reduce the image to a final 2*2 grid of four cells, with one vector attached to each cell. This gives the encoding form used today. Viewed layer by layer, the last feature map consists of (1+4+4) channels of size 2*2, and the ground truth can be encoded in the same layout.

3、 YOLOv3 output encoding

Take a 416*416 input as an example. In practice the image can be of any size; the author first rescales the image to 416*416, and the box coordinates are rescaled to the corresponding positions. In the accompanying figure, the coordinates of the target for the 416*416 image are (cx, cy, bw, bh) = (5.86, 7.12, 1.41, 2.13); these values appear to be expressed in units of grid cells on the 13*13 output, where each cell covers 32*32 pixels.

v3-tiny has two YOLO output layers with the same structure; we take the 13*13 one as an example. For three-category detection the output is batch*(4+1+3)*13*13. When the center point of an object falls into a grid cell, that cell is responsible for predicting the object; cells that contain no object center are assigned 0.
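The grid assignment can be sketched as follows, using the example box (5.86, 7.12, 1.41, 2.13) and assuming it is given in grid-cell units. The function encode_grid_target, the channel-last nested-list layout, and the class index 0 are all assumptions chosen for readability; the article does not state the class of the example object, and anchors are ignored here.

```python
GRID = 13
NUM_CLASSES = 3

def encode_grid_target(objects, grid=GRID, num_classes=NUM_CLASSES):
    """Encode objects into a grid x grid list of [x, y, w, h, conf, c1..cN].

    `objects` is a list of (cx, cy, bw, bh, class_index) in grid-cell units.
    Cells that contain no object center stay all zero.
    """
    depth = 4 + 1 + num_classes
    target = [[[0.0] * depth for _ in range(grid)] for _ in range(grid)]
    for cx, cy, bw, bh, cls in objects:
        col, row = int(cx), int(cy)               # cell containing the center
        cell = target[row][col]
        cell[0:4] = [cx - col, cy - row, bw, bh]  # offset inside the cell, then size
        cell[4] = 1.0                             # objectness
        cell[5 + cls] = 1.0                       # one-hot class
    return target

target = encode_grid_target([(5.86, 7.12, 1.41, 2.13, 0)])
print(target[7][5])   # the responsible cell: row 7, column 5
print(target[0][0])   # an empty cell
```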

3.1、 Relative-value encoding for coordinate regression

For coordinate regression the author does not use the original pixel sizes but relative values. Relative values constrain the range of the targets, which speeds up training and reduces coordinate drift.

The predicted center point x, y is the sum of a fractional part and an integer part cx, cy (the coordinates of the grid cell). The fractional part is obtained by passing the raw outputs tx, ty through the sigmoid function, which constrains it to 0~1. Because the sigmoid saturates, most of the raw predictions tx, ty end up within the range -4~4.
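A minimal sketch of the center decoding, bx = sigmoid(tx) + cx and by = sigmoid(ty) + cy, in grid-cell units. The helper names are hypothetical; the loop simply shows how quickly the sigmoid saturates outside roughly -4~4.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode_center(tx, ty, cx, cy):
    """Raw outputs (tx, ty) plus cell coordinates (cx, cy) -> box center."""
    return sigmoid(tx) + cx, sigmoid(ty) + cy

# The sigmoid is already close to 0 or 1 at -4 and 4.
for t in (-4, -2, 0, 2, 4):
    print(t, round(sigmoid(t), 3))

print(decode_center(0.0, 0.0, 5, 7))  # raw zeros land in the middle of cell (5, 7)
```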

For width and height, a simple constraint such as a relu activation would ensure that the prediction is positive. In v1, the YOLO author considered that large and small boxes contribute unequally to the loss, so the loss function uses sqrt(w) and sqrt(h), which reduces the effect of box size on the loss.
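The effect of the square root can be seen with a toy comparison. The widths below are made-up numbers chosen only to illustrate the idea: the same absolute error of 5 is applied to a small box and to a large box.

```python
import math

def plain_error(w_true, w_pred):
    return (w_true - w_pred) ** 2

def sqrt_error(w_true, w_pred):
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

small = (10.0, 15.0)     # small box, error of 5
large = (100.0, 105.0)   # large box, same error of 5

# Plain squared error treats both cases identically.
print(plain_error(*small), plain_error(*large))
# With square roots, the error on the small box is penalized more.
print(sqrt_error(*small), sqrt_error(*large))
```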

3.2、 Anchors and anchor boxes (anchor)

With grid cells alone, one cell can predict only one target. When objects in the image are dense, the center points of several targets easily fall into the same cell. Different targets also have different sizes and aspect ratios.

This is where the anchor mechanism comes in. YOLOv3 uses 3 anchors per output layer, and the Faster R-CNN series uses 9 by default. We take two anchors as an example:

Without anchors, the vector for the grid cell above is:

With two anchors, the vector becomes:

When the center point of a target falls in this cell, the anchor whose size is most similar to the target is the one chosen to predict it.

In real projects a threshold is often set: every anchor box with iou > th is allowed to predict the target, and when no anchor box exceeds the threshold, the most similar anchor box is chosen.
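The matching rule just described can be sketched as below. It assumes that similarity is measured as the IoU of widths and heights with both boxes placed at the same center, which is a common choice but is not stated in the article. The anchor sizes and the 0.5 threshold are made-up values.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, each given as (w, h)."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def assign_anchors(box_wh, anchors, threshold=0.5):
    """Return indices of anchors with IoU above the threshold.

    Falls back to the single most similar anchor when none exceeds it.
    """
    ious = [wh_iou(box_wh, a) for a in anchors]
    matched = [i for i, v in enumerate(ious) if v > threshold]
    if not matched:
        matched = [max(range(len(anchors)), key=lambda i: ious[i])]
    return matched

# Hypothetical anchors in grid-cell units: tall, wide, large square.
anchors = [(1.0, 2.0), (2.0, 1.0), (4.0, 4.0)]
print(assign_anchors((1.41, 2.13), anchors))  # the tall anchor matches
print(assign_anchors((8.0, 8.0), anchors))    # nothing passes, best anchor is used
```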

The actual width and height predictions are likewise made relative to the size of the anchor box, as follows:

bw = pw * exp(tw), bh = ph * exp(th)

Plot of the function y = exp(x):

As the plot shows, most of the predictions are constrained to between -2 and 2. At inference time, the th, tw predicted for each anchor are combined with that anchor's height and width ph, pw to produce the final bh, bw.
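A short sketch of this step, using the standard YOLOv3 relation bw = pw * exp(tw), bh = ph * exp(th) and its inverse for building training targets. The function names and the anchor size (1.0, 2.0) are hypothetical.

```python
import math

def decode_size(tw, th, pw, ph):
    """Raw outputs (tw, th) and anchor size (pw, ph) -> box size (bw, bh)."""
    return pw * math.exp(tw), ph * math.exp(th)

def encode_size(bw, bh, pw, ph):
    """Inverse mapping used for training targets: tw = log(bw / pw)."""
    return math.log(bw / pw), math.log(bh / ph)

# A raw prediction of 0 reproduces the anchor size exactly.
print(decode_size(0.0, 0.0, 1.0, 2.0))

# Encode the example box against a hypothetical anchor, then decode it again.
tw, th = encode_size(1.41, 2.13, 1.0, 2.0)
print(tw, th)
print(decode_size(tw, th, 1.0, 2.0))
```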

4、 YOLO output vector format

The output is batch * 13 * 13 * (tx, ty, tw, th, conf, one-hot-class). Assuming 80 target classes in total, the output vector for each anchor is batch * 13 * 13 * (4+1+80); with 3 anchors per cell this gives 3 * (4+1+80) = 255 values per cell.


Original site

Copyright notice
This article was created by [alex1801]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/197/202207131135014912.html