2D Human Pose Estimation: Numerical Coordinate Regression with Convolutional Neural Networks (DSNT)
2022-06-10 15:49:00 【light169】
Reference: [Paper reading notes] Numerical Coordinate Regression with Convolutional Neural Networks — Time machine ゚'s CSDN blog
Paper: Numerical Coordinate Regression with Convolutional Neural Networks
Code: GitHub - anibali/dsntnn: PyTorch implementation of DSNT
I. Paper overview
This paper proposes an approach for learning coordinates directly from images. Mainstream methods supervise a heatmap built with a Gaussian kernel, but recovering coordinates from the learned heatmap in post-processing introduces quantization error (for example, with a heatmap downsampled 4x, the expected quantization error is 2 pixels).
The paper presents a new processing module called the differentiable spatial to numerical transform (DSNT). DSNT adds no extra parameters and supervises the coordinates directly. It operates on the heatmap, as sketched in the figure below: the heatmap is passed through a softmax to obtain a probability distribution over pixel locations, and that distribution is dot-multiplied with fixed X and Y coordinate matrices to produce the expected coordinate values. The supervision loss is defined on this expectation.

Although the headline idea is direct coordinate regression, in practice the heatmap is still constrained, and with a non-trivial weight. Viewed from another angle, what the method actually does is supervise the heatmap and add a coordinate regression term as a regularizer. That coordinate supervision effectively reduces the quantization loss incurred when converting the heatmap to coordinates, as well as the mismatch between a pure heatmap regression loss and the coordinate metric we actually care about. The heatmap loss term itself is carefully chosen, however: adding no heatmap loss at all outperforms many heatmap loss formulations.
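As a concrete taste of the pipeline before going further, here is a minimal sketch using the author's dsntnn package (linked above); the function names follow the repository README and may differ across versions:

```python
import torch
import dsntnn  # pip install dsntnn -- the linked anibali/dsntnn package

# Unnormalized heatmaps from some CNN backbone: (batch, joints, height, width)
raw_heatmaps = torch.randn(2, 17, 64, 64)

# Normalize each channel into a probability distribution over pixels...
heatmaps = dsntnn.flat_softmax(raw_heatmaps)
# ...then take the expected (x, y) coordinate under that distribution.
coords = dsntnn.dsnt(heatmaps)
print(coords.shape)  # torch.Size([2, 17, 2]), coordinates in (-1, 1)
```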
However, DSNT cannot directly handle keypoints that are absent from the image (such as in half-body shots), nor multi-person scenes. For some applications this is an unavoidable problem.
II. Innovations
Numerical coordinate regression underlies a large number of practical tasks: human keypoint detection, facial landmark detection, object keypoint detection, and 3D pose. The essence of all of these problems can be summarized as numerical coordinate regression, so this paper studies a general solution to the whole class rather than a task-specific one; for ease of comparison, it uses human pose estimation as the running example.
Specifically, mainstream keypoint regression currently follows two approaches:
(1) Regress the coordinate points directly with a fully connected layer, as in YOLOv1. The advantage is that the output is the coordinate point itself, so training and inference are fast and the model trains end-to-end, fully differentiably. The disadvantage is the lack of spatial generalization ability; in other words, the spatial information in the feature map is lost.
Spatial generalization refers to the ability of knowledge acquired at one location during training to transfer to another location at inference time. For example, if the training images always show a ball in the upper-left corner but at test time the ball appears in the lower-right corner, a network that can still detect or recognize it has spatial generalization ability. Coordinate regression clearly needs this ability, because we cannot train on every object at every position. Fully convolutional models have it thanks to weight sharing; fully connected layers do not: the Network in Network paper (2014) already pointed out that fully connected layers are prone to overfitting, hampering the generalization ability of the overall network. In other words, regressing coordinates through a fully connected output greatly damages spatial generalization. A theoretical analysis is straightforward: with a ball in the upper-left corner during training, after the feature map is reshaped into a one-dimensional vector, only the fully connected weights attached to the upper part of the image are activated, and the weights for the lower half are never trained. Present a test image with a ball in the lower-right corner and, because those lower-half weights were never trained, the network in theory cannot predict it; that is, it has no spatial generalization ability. Convolution, thanks to weight sharing, avoids this effectively. To sum up: weights learned through a fully connected head depend heavily on the distribution of the training data and overfit very easily; I have found this problem to be serious in real keypoint projects.
A fully connected network connects every input unit to every neuron of the hidden layer, so for an $h \times w$ input and $k$ hidden units the parameter count becomes

$$h \times w \times k.$$
Weight sharing means scanning the image with the same filter; each filter extracts one feature and produces one feature map.

Without weight sharing, by contrast, each position needs its own filter, so the number of kernel parameters tracks the size of the image pixel matrix, i.e.

$$h \times w \times k \times k$$

for a $k \times k$ kernel at every position of an $h \times w$ image.
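As a back-of-the-envelope check (my own illustration, not from the paper), PyTorch makes the parameter-count contrast easy to see:

```python
import torch.nn as nn

h, w, k_hidden = 64, 64, 1024

# Fully connected: every pixel connects to every hidden unit.
fc = nn.Linear(h * w, k_hidden)
print(sum(p.numel() for p in fc.parameters()))  # 64*64*1024 + 1024 = 4,195,328

# Convolution with weight sharing: one 3x3 filter reused at every position.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
print(sum(p.numel() for p in conv.parameters()))  # 3*3 weights + 1 bias = 10
```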
(2) Predict a Gaussian heatmap, then take the argmax: the index of the peak is the coordinate point, as in CornerNet, Grid R-CNN, CPN, and so on. Take single-person pose estimation as an example: the input is an image containing exactly one person, the output is a Gaussian heatmap for every keypoint, and the label is a Gaussian map generated around each ground-truth keypoint. If each person requires 17 keypoints, the predicted output feature map is (batch, h_o, w_o, 17); each channel is the heatmap predicting one joint, and applying argmax to each channel yields integer coordinates.
The Gaussian-heatmap output achieves higher accuracy than direct coordinate regression, not because the heatmap representation itself is better expressed, but because its large output feature map yields strong spatial generalization. This also explains why, if we still regress coordinates directly as in (1) but drop the fully connected layer for a fully convolutional head, accuracy still falls short of the Gaussian heatmap: even with fully convolutional outputs, networks like YOLOv2 and SSD produce very small output feature maps, so their spatial generalization remains inferior to method (2).
From a numerical standpoint, direct coordinate regression is clearly attractive: the output is a floating-point number with no loss of precision, whereas the Gaussian heatmap output is necessarily integer-valued, which introduces a lower bound on the theoretical error. Suppose the input image is 512x512 and the output is downsampled 4x to 128x128. If a keypoint sits at (507, 507), then after shrinking 4x, even if the Gaussian heatmap is recovered without any error, there is still an error of up to 507 - 126*4 = 3 pixels. That 3 is the theoretical error lower bound, and it grows with the downsampling factor. Most current practice therefore compromises between speed and accuracy by downsampling 4x.
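The arithmetic behind that lower bound, written out:

```python
stride = 4                        # downsampling factor: 512 -> 128
x_true = 507                      # ground-truth coordinate in the input image
x_heatmap = x_true // stride      # 126: best integer cell in the 128x128 map
x_recovered = x_heatmap * stride  # 504: coordinate after mapping back
print(x_true - x_recovered)       # 3 pixels: the unavoidable quantization error
```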
The advantage of this approach is accuracy that is usually higher than fully connected direct coordinate regression. The disadvantages are also obvious: the model is not fully differentiable from input to output, because going from heatmap to coordinates relies on an offline argmax (since argmax is non-differentiable, some papers substitute a soft-argmax). And because the required output feature map is large, training and inference are slow and memory consumption is high.
In the heatmap-to-coordinates step, the drawbacks are: (1) operations such as argmax are non-differentiable, so the coordinates cannot be learned directly; (2) converting the heatmap to coordinates incurs quantization error, and the larger the downsampling factor between the heatmap and the input resolution, the larger that error. Even more notable, supervision is applied on the heatmap, so the loss function is divorced from our actual metric (coordinate error): at inference we use only the numerical coordinates of one pixel (or a few), yet training computes the loss over all pixels.

The first picture is the target heatmap; the second and third are two hypothetical predictions. Intuitively, the second prediction is more accurate, but under MSE loss the third map actually has a smaller loss than the second. This is a real problem: it leads to inaccurate keypoint predictions.
To sum up: although Gaussian heatmap prediction is usually more accurate than regression, it has several very troublesome problems: (1) the output map is large, causing high memory consumption and slow inference and training; (2) there is a theoretical error lower bound; (3) MSE loss can bias the learned result; (4) the model is not fully differentiable.
The table below compares the pros and cons of the three ways of obtaining coordinates: heatmap, fully connected, and DSNT. As the table shows, heatmap is not fully differentiable and performs poorly at low resolution; fully connected has no spatial generalization and overfits easily; DSNT has all the advantages.
| Method | Fully differentiable | Spatial generalization | Good at low resolution |
| --- | --- | --- | --- |
| Heatmap + argmax | No | Yes | No |
| Fully connected | Yes | No | Yes |
| DSNT | Yes | Yes | Yes |
A personal take: DSNT can obtain the coordinates directly while keeping spatial generalization for two reasons: (1) it supervises the heatmap, and the supervision target is a Gaussian distribution, which is symmetric; (2) the coordinate-axis matrices X and Y are well designed, namely

$$X_{i,j} = \frac{2j - (n+1)}{n} \quad \text{and} \quad Y_{i,j} = \frac{2i - (m+1)}{m},$$

whose monotonicity along each axis makes the construction symmetric in the two coordinate axes.
Given the pros and cons of the two mainstream methods above, can we design a model that simultaneously has the fully differentiable training of method (1) and the spatial generalization of method (2)? This paper designs the differentiable spatial to numerical transform (DSNT) module to bridge that gap. The module has no trainable parameters, can make predictions on low-resolution Gaussian maps, and its main role is to let gradients flow from the coordinate loss back into the Gaussian heatmap without adding extra parameters or computation.
The figure below shows the difference between heatmaps learned with direct heatmap supervision and heatmaps learned with DSNT coordinate supervision (nominally no supervision on the heatmap itself, though in fact a regularization term is added): the DSNT results are more concentrated.
Before the heatmap is fed into DSNT, it must be normalized into a probability distribution; normalized means the values are non-negative and sum to 1. Four normalization attempts are listed in the table below, and softmax was finally chosen as the rectifier function.
IV. Network structure
Let's look at the concrete implementation. First, be clear about the DSNT module's input and output: suppose the original CNN input is (batch, h, w, 3) and its output is (batch, h//4, w//4, 17), representing 17-keypoint regression. DSNT acts on each channel independently, and its output is (batch, 17, 2): the x, y coordinates of the 17 keypoints.
(1) The Gaussian heatmap output by each channel is first normalized, defined as

$$\hat{Z}_{i,j} = \frac{\phi(Z_{i,j})}{\sum_{u=1}^{m}\sum_{v=1}^{n}\phi(Z_{u,v})},$$

where $\phi$ is a non-negative rectifier function. Ignoring the normalizing denominator, the design choice reduces to the rectifier $\phi$ itself. The author designed four normalization methods, i.e., four choices of $\phi$, with softmax (taking $\phi = \exp$) among them; the paper's table lists the exact set.
As can be seen, the choice of normalization method makes little difference to the final result, but since softmax shows better numbers in the table, the author chose softmax as the normalization function.
Why normalize to 0~1 at all? Because the author wants the DSNT input to be a discrete probability distribution, which is used in what follows.
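A sketch of these normalization variants (my reconstruction from the description above; the paper's exact set of rectifiers should be checked against its table):

```python
import torch
import torch.nn.functional as F

def normalize_heatmap(z, mode="softmax"):
    """Rectify a raw heatmap z of shape (batch, joints, h, w) and rescale it
    so that each channel is non-negative and sums to 1, i.e. a probability
    distribution over pixel locations."""
    flat = z.flatten(2)  # (batch, joints, h*w)
    if mode == "softmax":
        flat = F.softmax(flat, dim=-1)  # rectifier exp, then divide by sum
    else:
        # Hypothetical alternative rectifiers, each followed by sum-normalization.
        rectifier = {"abs": torch.abs,
                     "relu": F.relu,
                     "sigmoid": torch.sigmoid}[mode]
        flat = rectifier(flat)
        flat = flat / flat.sum(dim=-1, keepdim=True).clamp_min(1e-12)
    return flat.view_as(z)
```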
(2) Convert to coordinate points
First define two constant matrices X and Y whose width and height match the input to DSNT. For an $m \times n$ heatmap, with $i = 1..m$ indexing rows and $j = 1..n$ indexing columns, the entries are computed as

$$X_{i,j} = \frac{2j - (n+1)}{n}, \qquad Y_{i,j} = \frac{2i - (m+1)}{m},$$

so each entry stores the x or y coordinate of its own cell, rescaled into (-1, 1). Treating the normalized heatmap $\hat{Z}$ as a probability mass function over pixel positions, the predicted coordinates are the expectation of that distribution, computed via Frobenius inner products:

$$\mathrm{DSNT}(\hat{Z}) = \left[\, \langle \hat{Z}, X \rangle_F ,\; \langle \hat{Z}, Y \rangle_F \,\right].$$
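As a from-scratch PyTorch sketch of this step (my own illustration of the formulas above, not the author's code):

```python
import torch

def dsnt_coords(heatmaps):
    """heatmaps: (batch, joints, m, n), each channel already normalized to a
    probability distribution. Returns (batch, joints, 2) holding the (x, y)
    coordinate expectations, each in the range (-1, 1)."""
    m, n = heatmaps.shape[-2:]
    # Per-cell coordinate values, as in the paper: (2j - (n+1)) / n, etc.
    xs = (2 * torch.arange(1, n + 1, dtype=heatmaps.dtype) - (n + 1)) / n
    ys = (2 * torch.arange(1, m + 1, dtype=heatmaps.dtype) - (m + 1)) / m
    # Expectation = Frobenius inner product of the heatmap with X / Y.
    ex = (heatmaps * xs.view(1, 1, 1, n)).sum(dim=(-2, -1))
    ey = (heatmaps * ys.view(1, 1, m, 1)).sum(dim=(-2, -1))
    return torch.stack([ex, ey], dim=-1)

# Sanity check: a delta distribution at the center of a 5x5 map maps to (0, 0).
hm = torch.zeros(1, 1, 5, 5)
hm[0, 0, 2, 2] = 1.0
print(dsnt_coords(hm))  # tensor([[[0., 0.]]])
```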
V. Loss
The loss design is simple: since this is coordinate point regression, the natural choice is the Euclidean distance between the two points:

$$\mathcal{L}_{euc}(\hat{\mu}, p) = \lVert p - \hat{\mu} \rVert_2,$$

where $p$ is the label and $\hat{\mu}$ is the output of the DSNT module.
You may wonder why, even though the loss is just a two-norm between two points, the gradient acts on the entire heatmap. The reason is that the predicted value is an expectation (a mean), and every position of the map contributes to that mean, so backpropagation spreads the gradient over the whole heatmap.
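This is easy to verify numerically; a small check of my own, reusing the dsnt_coords sketch from the previous section:

```python
import torch

raw = torch.randn(1, 1, 8, 8, requires_grad=True)
probs = torch.softmax(raw.flatten(2), dim=-1).view_as(raw)
pred = dsnt_coords(probs)                 # (1, 1, 2) expected coordinates
target = torch.tensor([[[0.25, -0.5]]])   # an arbitrary label in (-1, 1)
loss = torch.norm(pred - target)          # two-norm between two points
loss.backward()
print((raw.grad != 0).float().mean())     # ~1.0: every pixel receives gradient
```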
The loss above alone would certainly work, but without any constraint the same keypoint can produce heatmaps of wildly different shapes. This has no impact on the final coordinates, yet it feels too unconstrained: the network might learn Gaussian heatmaps with very large or very small variance. For stability we really want Gaussian heatmaps with small variance, and experiments show that introducing some prior helps network training. So the question becomes how to add that prior.
The simplest, most direct way to add a prior is through a regularization loss, so we introduce a Gaussian-heatmap prior as a regularization term. The overall loss becomes:

$$\mathcal{L}(\hat{Z}, p) = \mathcal{L}_{euc}(\hat{\mu}, p) + \lambda\,\mathcal{L}_{reg}(\hat{Z}).$$
(1) Variance regularization
The first regularizer that comes to mind is controlling the variance. As mentioned earlier, we want to learn a Gaussian heatmap whose variance is small, so naturally we can write:

$$\operatorname{Var}[c_x] = \left\langle \hat{Z},\,(X-\mu_x)^{\odot 2}\right\rangle_F, \qquad \mathcal{L}_{var}(\hat{Z}) = \left(\operatorname{Var}[c_x]-\sigma_t^2\right)^2 + \left(\operatorname{Var}[c_y]-\sigma_t^2\right)^2.$$
The first equation computes the variance under the heatmap distribution (shown for x; y is analogous), using the predicted coordinate as the mean. The second is the final variance regularization term: constraining the x and y variances to the same target magnitude $\sigma_t^2$ helps the network learn a Gaussian heatmap.
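A sketch of this variance regularizer under the definitions above (my own illustration; sigma_t is the target standard deviation hyperparameter):

```python
import torch

def variance_reg(heatmaps, coords, sigma_t):
    """heatmaps: (batch, joints, m, n) probability maps;
    coords: (batch, joints, 2) DSNT outputs (x, y);
    sigma_t: target standard deviation. Returns a (batch, joints) penalty."""
    m, n = heatmaps.shape[-2:]
    xs = (2 * torch.arange(1, n + 1, dtype=heatmaps.dtype) - (n + 1)) / n
    ys = (2 * torch.arange(1, m + 1, dtype=heatmaps.dtype) - (m + 1)) / m
    mu_x = coords[..., 0, None, None]  # broadcastable predicted means
    mu_y = coords[..., 1, None, None]
    # Var[c] = E[(c - mu)^2] under the heatmap distribution.
    var_x = (heatmaps * (xs.view(1, 1, 1, n) - mu_x) ** 2).sum(dim=(-2, -1))
    var_y = (heatmaps * (ys.view(1, 1, m, 1) - mu_y) ** 2).sum(dim=(-2, -1))
    return (var_x - sigma_t ** 2) ** 2 + (var_y - sigma_t ** 2) ** 2
```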
(2) Distribution regularization
In fact, it's easy to think of another better regular . What you want is to get the Gauss heat map ? So since you've transformed into a probability distribution , So I want his output distribution to be bivariate Gauss , Then you must use KL Divergence ,KL Divergence is largely used to measure the similarity between two distributions , stay GAN It is widely used in . So we have :

Is a bivariate Gaussian distribution , This design can force the network to learn Gauss heat map . because KL Divergence has :1. Always greater than or equal to 0, But not necessarily less than 1; 2. Asymmetry , So the author introduces a better KL Divergence variant :Jensen-Shannon, It is KL Variation of divergence , The range of values is 0~1, And it is symmetrical .
In fact, we can consider it from another angle , for example SVM Loss equally , We can think of coordinate prediction as loss It's regular Loss, And the distribution is regular loss It is the Lord. Loss, From this point of view , In fact, the method proposed in this paper is not different from the original direct Gauss heat map regression , Or directly regression Gauss heat map , However, an additional coordinate regression regular term is added , In order to get the coordinates , Introduced DSNT It's just a module .
The author compares these choices experimentally, with the following results:

It can be seen that JS performs best, so the final regularizer chosen is the JS distribution regularization.

The figure above shows the heatmap shapes learned under different variance settings. The JS regularizer works well, but the variance setting clearly has a large influence; in a concrete project this is critical and needs careful design. Note also that the variance regularizer only constrains the variances (not the shape of the distribution), so it cannot learn a true Gaussian heatmap directly; instead the heat splits into four spots around the joint.
VI. Experiments
See the original paper for the experimental details.