当前位置:网站首页>On anchors in object detection
On anchors in object detection
2022-06-27 09:43:00 【Xiaobai learns vision】
Click on the above “ Xiaobai studies vision ”, Optional plus " Star standard " or “ Roof placement ”
Heavy dry goods , First time delivery Guide to Jishi
This paper uses a series of questions , It clearly explains the anchors The concept and significance of .
today , I will discuss an elegant concept introduced in the object detector —— Anchors, How it helps detect objects in an image , And they are compared with the traditional two-stage detector Anchor What's the difference? .
As usual , Let's look at some of the questions ,anchors Was introduced as a solution .
At the beginning of use anchors Before , Let's see how the two-stage object detector works , And how they actually promote the development of single-stage detectors .
Two stage object detector : The traditional two-stage object detector detects objects in the image in two stages :
1、 The first stage : The first stage traverses the input image and the output area where the object may appear ( It is called the suggested area or the area of interest ). This process can be done by external algorithms ( for example :selective search) Or a neural network .
2、 The second stage : The second stage is a neural network , It accepts these regions of interest , And classify it into a target object class .
For the sake of simplicity , I will introduce a famous two-stage detector —— Faster-RCNN. Both stages involve a neural network .
1、 The first neural network predicts where an object may appear ( Also known as objectness score ). It is basically a vision ( object ) And the classification of the background . This network is called the regional advice network , also called RPN.
2、 After extracting the regional recommendations , Crop the corresponding position in the input image , Send it to the next neural network for classification , Suppose there is N A target class . This network predicts what objects exist at that location .
step 2 It looks very simple , Because it can be reduced to image classification , Divide the target object into N One of the categories .
Let's delve into chapter 1 Step .
(a) How does this neural network predict the location of these targets ?
(b) If we can train neural network to classify foreground and background , So why not train it to predict all at once N What about a class ?
(a) The solution anchors,(b) The answer is yes , We can use a single network to perform N-way object detection , Such a network is known as a single-stage target detector . Single stage detector and Faster-RCNN The network in the first phase of is almost the same .
I said, SSD and RPN It's almost the same , Because they are conceptually the same , But the architecture is different .
How to detect objects in images by neural networks ?
Solution (1) —— Single target detection : Let's use the simplest case , Find a single object in an image . Given an image , The neural network must output the class of the object and the coordinates of its bounding box in the image . So the network must output 4+C A digital , among C It's the number of categories .
The input image can be directly passed through a set of convolution layers, and then the final convolution output can be converted into a 4+C Dimension vector , among , front 4 A number indicates the position of an object ( such as minx, miny, maxx, maxy), hinder C A number represents the score of category probability .
Solution (2) —— Multi target detection : This can be done by extending the above method to N Objects to achieve . therefore , The output of the network is not 4+C The number of , It is *N*(4+C)* Numbers .
Take a size of H x W x 3 Let it pass through a set of convolution layers to get a size of H x W x d Convolution of ,d Is the depth or number of channels .

Pass the image through ConvNet Get the output characteristic diagram
Consider the output convolution above volume. hypothesis volume The size is 7×7×512. Use N Size is 3 x 3 x 512 Filter for ,stride=2, padding=1, The resulting size is 4 x 4 x N Output volume.
We take this size as 4 x 4 x N Try to infer its meaning from the output of .
In the output characteristic diagram, there are 16 individual cells, We can say , Every cell There is a receiving domain ( Or feel the wild ), Corresponds to a point in the original image . Every one of these cell There are N A number related to it . As I pointed out earlier ,N It's the number of categories , We can say , Every cell It's all about feature map Information about the object appearing at the corresponding position in . In the same way , There is another parallel conv head , Among them is 4 Size is 3 x 3 x 512 Filter for , Apply to the same conv volume On , To get another size of 4 x 4 x 4 Output —— This corresponds to the offset of the bounding box .
Now we know how to use a neural network to predict multiple targets . But wait a minute , How do we calculate this output as 4x4xn Of cell What about the loss ?
Now let's drill down to the... Used by the output layer N Of the filters . from N Take out one of the filters , See how it passes on feature map Convolute to get the output .

3 x 3 The filter is in 7 x 7 Convolution on the characteristic graph of ,stride = 2
This 3x3 The filter can be used in 7x7 Move on the grid of 16 A different place , And make predictions ( As mentioned earlier ), This is very obvious .
We know , In the grid 16 individual cell Corresponds to a specific location in the layer before it . Look at the chart below . Output the first in the grid cell There is a size of 3x3 The reference frame of .

first cell Can be associated with a specific location in the input image , Predict from this location
Similarly , Each... In the output cell Can be associated with a specific location in the input image , Predict from this location .
So there is 16 There are four such reference positions ( The size is 3x3) —— Each position has its own coordinates relative to the input image . Now through these reference positions , We can achieve two goals :
1、 Classified loss : If N One of the objects fell here 16 Reference locations , I.e ground truth Of the bounding box IOU≥ A certain threshold , Then we know which one to match ground truth 了 .
2、 Return to loss : Why do we need this ? Suppose an object falls into one of the reference frames , We can simply output the actual coordinates of these reference positions relative to the input image . The reason is that objects don't have to be square . therefore , We are not naive enough to output a set of fixed frame coordinates , But by outputting 4 An offset value is used to adjust the default coordinates of these reference positions . Now we know ground truth box Coordinates and corresponding reference position coordinates , We can simply use L1/L2 Distance to calculate the regression loss .
It is different from the task of image classification in which only the output vector needs to be matched , Here we have 16 Reference positions to match . This means that the network can predict 16 An object . The number of objects to be predicted can be increased by predicting multiple feature map levels or by increasing the so-called reference positions on the feature map .
These reference locations are anchor boxes perhaps default boxes.
In the example above , only one anchor box , That is, only one prediction is made for each filter position .
Usually , stay feature map in , Every filter Positions can be predicted many times —— This means that there are as many references as there are predictions needed .
Suppose that each filter The location is 3 A reference .

Every filter There are three positions boxes —— One is 3x3( Orange ), One is 1x3( Blue ), The other is 3x1( green )
As we saw before , The output is anchor Box function , So if you refer to /anchor Quantity change of , The size of the output will also change . therefore , The network output is not 1 individual anchor Dot 4x4xN( and 4x4x4), But because of anchor Count =3, So the output is 4x4x(N*3).
Generally speaking , The output shape of a single-stage detector can be written as :
The shape of the classification header :HxWxNA
Return to the shape of the head :HxWx4A
In style ,A For the use of anchrs The number of .
A problem !
Every filter There are multiple positions anchors/ What is the meaning of the reference box ?
This enables the network to predict multiple targets of different sizes at each given location in the image .
This variant of a single-stage detector that uses a convolution layer at the end to obtain an output is called SSD, The variant that uses the full connection layer at the end to get the output is called YOLO.
I hope I have put anchor The concept of has become easy to understand .anchor It is always a difficult concept to grasp , There is still some information about anchor The unsolved problem of . I want to answer these questions in the following article . See you then :)
The good news !
Xiaobai learns visual knowledge about the planet
Open to the outside world

download 1:OpenCV-Contrib Chinese version of extension module
stay 「 Xiaobai studies vision 」 Official account back office reply : Extension module Chinese course , You can download the first copy of the whole network OpenCV Extension module tutorial Chinese version , Cover expansion module installation 、SFM Algorithm 、 Stereo vision 、 Target tracking 、 Biological vision 、 Super resolution processing and other more than 20 chapters .
download 2:Python Visual combat project 52 speak
stay 「 Xiaobai studies vision 」 Official account back office reply :Python Visual combat project , You can download, including image segmentation 、 Mask detection 、 Lane line detection 、 Vehicle count 、 Add Eyeliner 、 License plate recognition 、 Character recognition 、 Emotional tests 、 Text content extraction 、 Face recognition, etc 31 A visual combat project , Help fast school computer vision .
download 3:OpenCV Actual project 20 speak
stay 「 Xiaobai studies vision 」 Official account back office reply :OpenCV Actual project 20 speak , You can download the 20 Based on OpenCV Realization 20 A real project , Realization OpenCV Learn advanced .
Communication group
Welcome to join the official account reader group to communicate with your colleagues , There are SLAM、 3 d visual 、 sensor 、 Autopilot 、 Computational photography 、 testing 、 Division 、 distinguish 、 Medical imaging 、GAN、 Wechat groups such as algorithm competition ( It will be subdivided gradually in the future ), Please scan the following micro signal clustering , remarks :” nickname + School / company + Research direction “, for example :” Zhang San + Shanghai Jiaotong University + Vision SLAM“. Please note... According to the format , Otherwise, it will not pass . After successful addition, they will be invited to relevant wechat groups according to the research direction . Please do not send ads in the group , Or you'll be invited out , Thanks for your understanding ~边栏推荐
- 最全H桥电机驱动模块L298N原理及应用
- [diffusion model]
- 细说物体检测中的Anchors
- Five page Jump methods for wechat applet learning
- Shortcut key bug, reproducible (it seems that bug is the required function [funny.Gif])
- [200 opencv routines] 212 Draw a slanted rectangle
- This application failed to start because it could not find or load the QT platform plugin
- Freemarker
- On June 23, the video address of station B in the third episode of rust chat room
- JS 文件上传下载
猜你喜欢

ucore lab4

手把手带你玩摄像头模组

ucore lab3

Installation and use of SVN version controller

QT运行显示 This application failed to start because it could not find or load the Qt platform plugin

Take you to play with the camera module

产品力对标海豹/Model 3,长安深蓝SL03预售17.98万起

Your brain is learning automatically when you sleep! Here comes the first human experimental evidence: accelerate playback 1-4 times, and the effect of deep sleep stage is the best

Video file too large? Use ffmpeg to compress it losslessly

前馈-反馈控制系统设计(过程控制课程设计matlab/simulink)
随机推荐
QT运行显示 This application failed to start because it could not find or load the Qt platform plugin
leetcode:968. 监控二叉树【树状dp,维护每个节点子树的三个状态,非常难想权当学习,类比打家劫舍3】
Semi supervised learning—— Π- Introduction to model, temporary assembling and mean teacher
Principle and application of the most complete H-bridge motor drive module L298N
unity--newtonsoft.json解析
【STM32】HAL库 STM32CubeMX教程十二—IIC(读取AT24C02 )
小白也能看懂的网络基础 03 | OSI 模型是如何工作的(经典强推)
1098 Insertion or Heap Sort(堆排序解释)(PAT甲级)
如何获取GC(垃圾回收器)的STW(暂停)时间?
导师邀请你继续跟他读博,你会不会立马答应?
Use of bufferedwriter and BufferedReader
.NET 中的引用程序集
Only one ConfirmCallback is supported by each RabbitTemplate 解决办法
提高效率 Or 增加成本,开发人员应如何理解结对编程?
【系统设计】邻近服务
js的数组拼接「建议收藏」
产品力对标海豹/Model 3,长安深蓝SL03预售17.98万起
【SO官方采访】为何使用Rust的开发者如此深爱它
Video file too large? Use ffmpeg to compress it losslessly
Bluetooth health management device based on stm32