YOLOv3 complete explanation: from the perspective of data encoding
2022-07-28 12:11:00 【alex1801】
This article summarizes the development of the YOLO series. In essence, training a neural network means comparing the predicted values against the ground truth, and the most fundamental difference between tasks is how the ground truth is encoded. The key to understanding a deep learning model is therefore to understand how its ground truth is encoded, and that is the perspective taken here.
1、 Introducing the problem
Deep learning was first used to solve classification problems. For a 10-class classification task, the class is encoded in one-hot form.

For a classification problem, we want to input an image and output a class. Take four classes as an example: pedestrian, bicycle, motorcycle and car. An image is a matrix of numbers, so it is natural to think of describing the four classes with four numbers: 1, pedestrian; 2, bicycle; 3, motorcycle; 4, car. The model would then predict one of the numbers 1~4. Is that good enough?
With this encoding the numeric distances between the four classes differ (1 is closer to 2 than to 4), whereas in reality the classes have no such ordering. One-hot encoding is therefore introduced: each class is encoded as a 0/1 vector, so that the classes are independent of each other.
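The difference between the two encodings can be sketched in plain Python. The class order and the helper names `one_hot` and `squared_distance` are assumptions made for illustration only.

```python
# Hypothetical class list, in the order used in the text.
CLASSES = ["pedestrian", "bicycle", "motorcycle", "car"]


def one_hot(name, classes=CLASSES):
    """Return a 0/1 vector with a single 1 at the index of `name`."""
    vec = [0] * len(classes)
    vec[classes.index(name)] = 1
    return vec


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# With integer labels 1..4, "pedestrian" is closer to "bicycle" than to "car".
print(abs(1 - 2), abs(1 - 4))

# With one-hot vectors, every pair of distinct classes is equally far apart.
print(squared_distance(one_hot("pedestrian"), one_hot("bicycle")),
      squared_distance(one_hot("pedestrian"), one_hot("car")))
```

The integer labels give distances of 1 and 3, while the one-hot vectors give the same distance for both pairs.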
2、 Ground-truth encoding for object detection
If you also want to locate the car in the image, how do you do that? We can let the neural network output a few more units that describe a bounding box. Specifically, the network outputs 4 more numbers, denoted 𝑏𝑥, 𝑏𝑦, 𝑏ℎ and 𝑏𝑤, which parameterize the bounding box of the detected object.

With four classes, the output of the network is these four numbers plus a class label, or the probability of each class. The target label 𝑦 is defined as follows, where 𝑝𝑐 indicates whether an object is present:
𝑦 = [𝑝𝑐, 𝑏𝑥, 𝑏𝑦, 𝑏ℎ, 𝑏𝑤, 𝑐1, 𝑐2, 𝑐3, 𝑐4]
This approach can only classify and locate a single target (the output dimension of a neural network is fixed, so it cannot predict an unknown number of targets!). How can multiple targets be detected?
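Before turning to multiple targets, the single-target label 𝑦 can be sketched as follows. The function name `encode_single_target`, the normalized box values and the convention of filling an empty label with zeros are assumptions for illustration, not something the article specifies.

```python
def encode_single_target(box=None, class_index=None, num_classes=4):
    """Build y = [pc, bx, by, bh, bw, c1..cN] for at most one object.

    `box` is (bx, by, bh, bw); None means the image contains no object.
    """
    if box is None:
        # No object: pc = 0 and, in this sketch, the other entries are set to 0.
        return [0.0] * (1 + 4 + num_classes)
    classes = [0.0] * num_classes
    classes[class_index] = 1.0
    return [1.0, *box, *classes]


# A car (fourth class, index 3) in the middle of the image; values are made up.
y = encode_single_target(box=(0.5, 0.5, 0.3, 0.6), class_index=3)
print(len(y), y)
```

The vector always has 1+4+4 = 9 entries, which is exactly why it cannot describe an unknown number of targets.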
2.1、 Explanation from the sliding-window perspective
Suppose a 14*14*3 image passes through a series of convolutions and the final output is a 1*1*4 vector. This 1*1*4 vector is the classification result for the image. If the output vector is changed to 1*1*(1+4+4), it gives both the class and the location of the object in the image.

Run the same network on a slightly larger 16*16*3 image: the convolutional network effectively runs 4 times and outputs 4 labels (a 2*2*4 output). Mapped back to the input image, this is equivalent to sliding a 14*14 window with a stride of 2.
Feed the same network a 28*28*3 input and the final output is an 8*8*4 vector. Mapped back to the input image, this is again equivalent to sliding a 14*14 window with a stride of 2.

2.2、 Explanation from the position-prediction perspective
Now consider the encoding problem in object detection from another angle. The localization scheme above can recognize the size and position of one target in the image, but what if there are several?
A direct idea is to increase the dimension of the output vector. The output vector for four classes was (1+4+4); now multiply it: (1+4+4)*N, where N is the maximum number of targets to predict. Is that good enough?
The problem: the count can barely be made to fit, but there is no relationship between the position of a target in the image and the position of its slot in the vector! It is therefore natural, when detecting up to four targets, to have the network reduce the image to four final grid cells and attach the vector to each cell, which gives the encoding used today. Viewed channel by channel, the last feature layer is (1+4+4) channels of 2*2 feature maps, and the ground truth can be encoded in the same layout.

3、 YOLOv3 output encoding
Take a 416*416 input as an example. In practice the image can be of any size; the author first resizes the image to 416*416 and scales the box coordinates to the corresponding positions at that size. In the figure below, the target's coordinates at the 416*416 image size are: (cx, cy, bw, bh) = (5.86, 7.12, 1.41, 2.13).

v3-tiny has two YOLO outputs with the same structure; take the 13*13 one as an example. For three-class detection, the output is: batch*(4+1+3)*13*13. The grid cell that an object's center point falls into is responsible for predicting that object, and cells that contain no object center are assigned 0.
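The cell-assignment rule can be sketched for a single image as a (4+1+3)*13*13 truth tensor built from nested lists. Everything named here (`encode_grid_target`, the channel order x, y, w, h, confidence, classes, and the pixel values of the sample object) is an assumption for illustration; the box is stored in raw grid units, without the relative encoding introduced in section 3.1.

```python
GRID = 13          # grid cells per side
NUM_CLASSES = 3
IMAGE_SIZE = 416   # side of the resized input image, in pixels


def encode_grid_target(objects, grid=GRID, num_classes=NUM_CLASSES):
    """Build a (4+1+C) x grid x grid truth tensor as nested lists.

    `objects` holds (cx, cy, w, h, class_index) in pixels of the resized image.
    Assumed channel order: x, y, w, h, confidence, then one-hot classes.
    """
    channels = 4 + 1 + num_classes
    target = [[[0.0] * grid for _ in range(grid)] for _ in range(channels)]
    cell = IMAGE_SIZE / grid  # 32 pixels per cell for 416 / 13
    for cx, cy, w, h, cls in objects:
        gx, gy = cx / cell, cy / cell      # center in grid units
        col, row = int(gx), int(gy)        # cell that contains the center
        values = [gx, gy, w / cell, h / cell, 1.0]
        values += [1.0 if i == cls else 0.0 for i in range(num_classes)]
        for ch, v in enumerate(values):
            target[ch][row][col] = v
    return target


# One made-up object whose center is at pixel (187.5, 227.8), class index 1.
target = encode_grid_target([(187.5, 227.8, 45.0, 68.0, 1)])
print(target[4][7][5])   # confidence of the responsible cell (row 7, column 5)
print(target[4][0][0])   # a cell without an object center stays 0
```

Only the cell at row 7, column 5 receives non-zero values; every other cell keeps its initial 0, as described above.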
3.1、 Relative-value encoding for coordinate regression
For coordinate regression the author does not use the original sizes but relative values. Relative values constrain the range, which speeds up training and reduces coordinate drift.

The center point x, y is predicted as a fractional part, derived from tx, ty, plus an integer part cx, cy (the coordinates of the grid cell). tx and ty are passed through a sigmoid function, whose output is constrained to 0~1. Because the sigmoid saturates, the raw predictions tx, ty mostly fall in the range -4~4.

For width and height, the simplest constraint is a relu activation, which guarantees a non-negative prediction. In v1, the YOLO author considered that large and small boxes contribute unequally to the loss, so the loss function uses sqrt(w) and sqrt(h) to reduce the effect of box size on the loss.

3.2、 Anchor points and anchor boxes (anchor)
With grid cells alone, one cell can predict only one target. When objects in the image are dense, the center points of several targets can easily fall into the same cell. Different targets also have different sizes and aspect ratios.

This is where the anchor mechanism comes in. YOLOv3 uses 3 anchors per output scale, and the Faster R-CNN series uses more (9 in the original paper). Take two anchors as an example:

Without anchors, the vector for the grid cell above is:
![]()
With two anchors, the vector becomes:
![]()
When a target's center point falls in this grid cell, the anchor whose size is most similar to the target is made responsible for predicting it.
In practical engineering a threshold is often set: every anchor box with iou>th may predict the target, and when no anchor box exceeds the threshold, the most similar anchor box is chosen.
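One way to sketch this matching rule is to compare widths and heights only, as if the box and the anchor shared the same center. The two anchors, the 0.5 threshold and the function names `wh_iou` and `match_anchors` are assumptions for illustration.

```python
def wh_iou(box_wh, anchor_wh):
    """IoU of two boxes that share the same center, using only width/height."""
    bw, bh = box_wh
    aw, ah = anchor_wh
    inter = min(bw, aw) * min(bh, ah)
    return inter / (bw * bh + aw * ah - inter)


def match_anchors(box_wh, anchors, threshold=0.5):
    """Return indices of anchors with IoU > threshold, else the single best one."""
    ious = [wh_iou(box_wh, a) for a in anchors]
    matched = [i for i, v in enumerate(ious) if v > threshold]
    if not matched:
        # Fallback: no anchor passes the threshold, so take the most similar one.
        matched = [max(range(len(anchors)), key=lambda i: ious[i])]
    return matched


# Two made-up anchors in grid units: one tall, one wide.
ANCHORS = [(1.0, 2.0), (2.0, 1.0)]

print(match_anchors((1.41, 2.13), ANCHORS))  # tall box from the running example
print(match_anchors((4.0, 1.5), ANCHORS))    # wide box that passes no threshold
```

The tall box exceeds the threshold only for the tall anchor, while the wide box falls back to the wide anchor because neither IoU is above 0.5.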
Width and height are likewise predicted relative to the size of the anchor box, as follows:

Plot of the function y = exp(x):

As the plot shows, most predictions are constrained to between -2 and 2. At inference time, the tw, th predicted for each anchor are combined with that anchor's width and height pw, ph to produce the final bw, bh.
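A minimal sketch of the size encoding, using the standard YOLOv2/v3 form bw = pw * exp(tw) and bh = ph * exp(th) together with its inverse. The anchor size is made up, the box size (1.41, 2.13) is the running example treated as grid units, and the function names are hypothetical.

```python
import math


def encode_size(bw, bh, pw, ph):
    """Truth encoding: tw = ln(bw / pw), th = ln(bh / ph)."""
    return math.log(bw / pw), math.log(bh / ph)


def decode_size(tw, th, pw, ph):
    """Inference decoding: bw = pw * exp(tw), bh = ph * exp(th)."""
    return pw * math.exp(tw), ph * math.exp(th)


pw, ph = 1.0, 2.0    # made-up anchor width and height
tw, th = encode_size(1.41, 2.13, pw, ph)
print(round(tw, 3), round(th, 3))

bw, bh = decode_size(tw, th, pw, ph)
print(bw, bh)

# tw in [-2, 2] already covers scale factors from about 0.14 to 7.4.
print(round(math.exp(-2), 2), round(math.exp(2), 2))
```

When the anchor is close to the true box, tw and th stay near 0, which keeps the regression targets in a small range.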
4、 YOLO output vector format
The output is: batch * 13 * 13 * (tx, ty, tw, th, conf, one-hot-class). Assuming a total of 80 target classes, the output vector is: batch * 13 * 13 * (4+1+80).
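The resulting shapes can be checked with a few lines of arithmetic. The function name `yolo_output_shape` and the batch size of 8 are assumptions; the `num_anchors` parameter extends the formula above to the case where each cell carries one vector per anchor, as in section 3.2.

```python
def yolo_output_shape(batch, grid, num_classes, num_anchors=1):
    """Shape of one output head: batch * grid * grid * anchors*(4 + 1 + classes)."""
    return (batch, grid, grid, num_anchors * (4 + 1 + num_classes))


print(yolo_output_shape(batch=8, grid=13, num_classes=80))
print(yolo_output_shape(batch=8, grid=13, num_classes=80, num_anchors=3))
```

With 80 classes this gives 85 values per cell for a single prediction, and 255 values per cell when three anchors are attached to each cell.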
Further reading:
1、 Online curve generator
Draw polynomials online / Function curve graph tool - Online calculator - Script house online tool