
2D Human Pose Estimation: Numerical Coordinate Regression with Convolutional Neural Networks (DSNT)

2022-06-10 15:49:00 light169

References:
- 【Paper reading notes】Numerical Coordinate Regression with Convolutional Neural Networks — "Time machine゚" blog, CSDN
- Numerical Coordinate Regression: Gaussian heatmap vs. coordinate regression — Zhu Xiaomeng's blog, CSDN

Paper: Numerical Coordinate Regression with Convolutional Neural Networks
Code: GitHub - anibali/dsntnn: PyTorch implementation of DSNT

1. Overview of the paper

This paper proposes an approach for learning numerical coordinates directly from images. Mainstream methods supervise a heatmap built from a Gaussian kernel, but recovering coordinates from the learned heatmap in post-processing introduces quantization error: with a 4× downsampled heatmap, coordinates are quantized to a stride-4 grid, so the expected quantization error is about 2 pixels.

The paper presents a new processing method called the differentiable spatial to numerical transform (DSNT), which adds no extra parameters and supervises the coordinates directly. DSNT operates on the heatmap, as illustrated in the figure below: the heatmap is passed through a softmax to obtain a probability distribution, and this distribution is dot-multiplied with fixed coordinate-axis matrices X and Y to obtain the expected value of the coordinates. The supervision loss is defined on this expectation.

Although the idea of the paper is mainly direct coordinate regression, in practice the heatmap is still constrained, and with a non-trivial weight. Viewed from another angle, what the method actually does could equally be described as supervising the heatmap and adding a coordinate regularization term. Supervising this regularized expectation effectively reduces both the quantization loss of converting heatmaps to coordinates and the mismatch between the training loss and the coordinate metric that plain heatmap regression suffers from. The heatmap loss term is also carefully chosen: adding no heatmap loss at all outperforms many heatmap loss formulations.

However, for keypoints that are absent from the image (e.g., half-body shots) and for multi-person scenes, DSNT offers no direct solution. For some scenarios this is an unavoidable problem.

2. Innovations

Numerical coordinate regression appears in a large number of practical tasks, such as human keypoint detection, facial landmark detection, object keypoint detection, and 3D pose estimation. The essence of all these problems is numerical coordinate regression, so the paper studies a common solution to this class of problems rather than a task-specific one, using human pose estimation as the illustrative task for comparison.

Specifically:

There are currently two mainstream approaches to keypoint regression:

(1) Use a fully connected layer to regress the coordinates directly, as in YOLOv1. The advantage is that the output is already a coordinate point, training and inference can be fast, and the model is end-to-end fully differentiable. The disadvantage is the lack of spatial generalization: the spatial information in the feature map is lost.

Spatial generalization is the ability of knowledge acquired at one location during training to transfer to another location at inference time. For example, if during training a ball only ever appears in the upper-left corner of the image, but at test time it appears in the lower-right corner and the network can still detect or recognize it, then the model generalizes spatially. Coordinate regression needs this ability badly, because we cannot train with every object at every position.

Fully convolutional models have this ability because of weight sharing. For fully connected layers, by contrast, the 2014 Network in Network paper points out that fully connected layers are prone to overfitting, thus hampering the generalization ability of the overall network. In other words, regressing coordinates through a fully connected output largely destroys spatial generalization. This is easy to analyze theoretically: if the ball is in the upper-left corner during training, then after the feature map is flattened into a one-dimensional vector, the activated weights of the fully connected layer all lie in the upper part, and the weights for the lower part are never trained. When a test image has the ball in the lower-right corner, the flattened vector hits exactly those untrained weights, so in theory it cannot be predicted; there is no spatial generalization. Convolution, thanks to weight sharing, avoids this effectively. In summary: the weights learned by the fully connected approach depend heavily on the distribution of the training data and overfit very easily; I have seen this problem severely in my own keypoint projects.
 

In a fully connected layer, every input unit connects to every hidden-layer neuron, so the parameter count is W(width) × H(height) × N(hidden nodes).

Weight sharing means scanning the image with the same filter, which amounts to extracting one feature and producing one feature map.

Without weight sharing, the number of kernel parameters would match the size of the image pixel matrix, i.e., W(width) × H(height) × C(channels).
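To make the contrast concrete, here is a toy parameter count (all sizes here are made-up illustrative numbers, not values from the paper):

```python
# Toy comparison of parameter counts (illustrative sizes, not from the paper).
W, H, C = 128, 128, 16   # feature map width, height, channels
N = 34                   # output units, e.g. 17 keypoints x 2 coordinates
K = 3                    # conv kernel size

fc_params = W * H * C * N    # fully connected: every unit connects to every output
conv_params = K * K * C * N  # shared conv kernels: independent of W and H

print(f"fully connected: {fc_params:,} weights")   # 8,912,896
print(f"shared conv:     {conv_params:,} weights") # 4,896
```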

(2) Predict a Gaussian heatmap, then take the index of the peak via argmax as the coordinate, as in CornerNet, Grid R-CNN, CPN, and others. Take single-person pose estimation as an example: the input is an image containing exactly one person, and the label is a set of Gaussian heatmaps generated from each ground-truth keypoint. If 17 keypoints are to be regressed per person, the predicted output feature map is (batch, h_o, w_o, 17): each channel is a heatmap predicting one joint, and taking the argmax of each channel yields integer coordinates.
The Gaussian-heatmap output achieves higher accuracy than direct coordinate regression, not because the heatmap is inherently a better-expressed output, but because its output feature map is large, which yields strong spatial generalization. This naturally explains why, if we keep the direct-regression formulation of (1) but replace the fully connected layer with a fully convolutional head, accuracy still falls short of the Gaussian heatmap: even with fully convolutional outputs, networks like YOLOv2 and SSD have very small output feature maps, so their spatial generalization still lags behind method (2).
From a numerical point of view, directly regressing the coordinates ought to be the better way: the output is a floating-point number with no loss of precision, while the Gaussian-heatmap output must be an integer, which introduces a lower bound on the theoretical error. Suppose the input image is 512×512 and the output is downsampled 4× to 128×128, and suppose a keypoint is located at (507, 507). After the 4× reduction, even if the Gaussian heatmap is recovered without any error, there is still an error of up to 507 − 126×4 = 3 pixels; this 3 is the theoretical error lower bound. If the downsampling factor grows, the lower bound rises. Hence most current practice compromises between speed and accuracy by downsampling 4×.
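A tiny sketch reproducing the arithmetic above (using the article's floor-rounding convention):

```python
# Quantization error for the 512 -> 128 (4x downsample) example above.
scale = 4
x = 507                           # ground-truth coordinate in the input image
x_heatmap = x // scale            # best integer cell on the 128x128 heatmap: 126
x_recovered = x_heatmap * scale   # coordinate recovered from the argmax: 504
print(x - x_recovered)            # 3 -> the theoretical error lower bound
```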
 

The advantage of this approach is that its accuracy is usually higher than regressing coordinate points directly with a fully connected layer. The disadvantages are obvious: it is not a fully differentiable model from input to output, because going from heatmap to coordinates is done offline via argmax (in fact, since argmax is non-differentiable, some papers substitute a soft-argmax). And because the required output feature map is large, training and inference are slow and memory consumption is high.

In the heatmap-to-coordinates step, the drawbacks are: (1) operations like argmax are non-differentiable, so coordinates cannot be learned directly; (2) converting the heatmap to coordinates incurs quantization error, and the larger the downsampling factor relative to the input resolution, the larger the error. Even more notable, supervision is applied on the heatmap, which separates the loss function from the metric we care about (measured in coordinates): at inference we use only one pixel (or a few) to compute the numerical coordinates, but during training the loss covers all pixels.

The first image is the target heatmap; the second and third are two hypothetical predictions. Intuitively, the second prediction is more accurate, yet under an MSE loss the third image actually has a smaller loss than the second. This is a problem: it can make the predicted keypoints inaccurate.
To sum up, although Gaussian-heatmap prediction is usually more accurate than regression, it has several troublesome problems: (1) the output map is large, causing heavy memory use and slow inference and training; (2) there is a lower bound on the theoretical error; (3) the MSE loss may bias the learned result; (4) it is not a fully differentiable model.
 

The following table compares the pros and cons of the three ways of obtaining coordinates: heatmap, fully connected, and DSNT. As the table shows, heatmap is not fully differentiable and performs poorly at low resolution; fully connected lacks spatial generalization and overfits easily; DSNT has all the advantages.

Personal opinion: DSNT obtains coordinates directly while keeping spatial generalization for two reasons: (1) it supervises the heatmap, and the supervision target is a Gaussian distribution, which is symmetric; (2) the coordinate-axis matrices X and Y are well designed: each varies along only one axis (1×n and n×1 patterns), making the transform symmetric with respect to the two coordinate axes.

Given the pros and cons of the two mainstream methods above, can we design a model that has both the fully differentiable training of method (1) and the spatial generalization of method (2)? The paper designs a differentiable spatial to numerical transform (DSNT) module to bridge this gap. The module has no trainable parameters and can predict on low-resolution Gaussian maps; its main role is to let the gradient flow from the coordinate loss back into the Gaussian heatmap, without adding extra parameters or computation.

The figure below shows the difference between supervising the heatmap directly and supervising the coordinates through DSNT (which looks like the heatmap is unsupervised, but it is not: a regularization term is added); the DSNT results are more concentrated.

Before being fed into DSNT, the heatmap must be normalized into a probability distribution: non-negative values that sum to 1. The table below lists four normalization attempts; softmax was finally chosen as the rectifier function.

4. Network structure

Let's start with the concrete implementation. First, be clear about the input and output of the DSNT module: suppose the CNN input is (batch, h, w, 3) and its output is (batch, h//4, w//4, 17) for 17-keypoint regression. DSNT acts on each channel, and the output is (batch, 17, 2): the x, y coordinates of the 17 keypoints.
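As a sketch of how this looks with the official dsntnn package (following the usage shown in the repository's README; the backbone here is a placeholder for any fully convolutional feature extractor):

```python
import torch
from torch import nn
import dsntnn

class CoordRegressionNetwork(nn.Module):
    def __init__(self, backbone, backbone_channels, n_locations=17):
        super().__init__()
        self.backbone = backbone  # any fully convolutional feature extractor
        # 1x1 conv maps backbone features to one unnormalized heatmap per keypoint
        self.hm_conv = nn.Conv2d(backbone_channels, n_locations,
                                 kernel_size=1, bias=False)

    def forward(self, images):
        features = self.backbone(images)
        unnormalized = self.hm_conv(features)         # (batch, 17, h//4, w//4)
        heatmaps = dsntnn.flat_softmax(unnormalized)  # normalize to probability maps
        coords = dsntnn.dsnt(heatmaps)                # (batch, 17, 2), values in (-1, 1)
        return coords, heatmaps
```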

(1) Normalize the Gaussian heatmap output of each channel; define the result as $\hat{Z}$.

Ignoring the normalizing denominator, this can be written as $\hat{Z} = \phi(Z)$. The author designs four candidate normalization methods, specifically:

[Table: the four candidate rectification functions φ and their results.]
As can be seen, different normalization choices give similar final results, but since softmax performs a bit better in the table, the author chose softmax as the normalization function.

Why normalize to 0–1? Because the author wants the input to DSNT to be a discrete probability distribution, which is needed in what follows.
  
 (2) Convert to coordinate points

First define two matrices X and Y whose width and height equal those of the DSNT input; their entries are computed as follows:
For an m×n heatmap, with i indexing rows and j indexing columns from 1, the paper defines

$$X_{i,j} = \frac{2j - (n + 1)}{n}, \qquad Y_{i,j} = \frac{2i - (m + 1)}{m}$$

so that every pixel center is mapped into the range (-1, 1). With the normalized heatmap $\hat{Z}$ interpreted as a probability distribution, the DSNT output is the expected coordinate, computed with Frobenius inner products:

$$\mathrm{DSNT}(\hat{Z}) = \left[\, \langle \hat{Z}, X \rangle_F,\; \langle \hat{Z}, Y \rangle_F \,\right], \qquad \langle A, B \rangle_F = \sum_{i,j} A_{i,j} B_{i,j}$$
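A minimal pure-PyTorch sketch of these formulas (independent of the dsntnn package; `dsnt_expectation` is a hypothetical helper name for this article, useful for checking the expectation by hand):

```python
import torch

def dsnt_expectation(heatmaps):
    """heatmaps: (batch, k, m, n) probability maps (non-negative, summing to 1)."""
    batch, k, m, n = heatmaps.shape
    # Pixel centers mapped into (-1, 1), per the X and Y formulas above.
    xs = (2 * torch.arange(1, n + 1, dtype=heatmaps.dtype) - (n + 1)) / n  # (n,)
    ys = (2 * torch.arange(1, m + 1, dtype=heatmaps.dtype) - (m + 1)) / m  # (m,)
    # Marginalize over rows/columns, then take the weighted mean position.
    expected_x = (heatmaps.sum(dim=2) * xs).sum(dim=2)  # (batch, k)
    expected_y = (heatmaps.sum(dim=3) * ys).sum(dim=2)  # (batch, k)
    return torch.stack([expected_x, expected_y], dim=-1)  # (batch, k, 2)

# Sanity check: a delta distribution at the center should map to (0, 0).
hm = torch.zeros(1, 1, 5, 5)
hm[0, 0, 2, 2] = 1.0
print(dsnt_expectation(hm))  # tensor([[[0., 0.]]])
```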

5. Loss

The loss design is simple: since this is coordinate regression, the natural loss is the Euclidean distance between the predicted and ground-truth points, namely:

$$\mathcal{L}_{euc}(\hat{Z}, p) = \lVert p - \mathrm{DSNT}(\hat{Z}) \rVert_2$$
Here $p$ is the label and $\mathrm{DSNT}(\hat{Z})$ is the module's output.
You may wonder why, even though the loss is just a two-norm between two points, the gradient acts on the entire heatmap. The reason is that the predicted value is an expectation (a weighted mean): every position in the map contributes to that mean, so backpropagation spreads the gradient over the whole heatmap.
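A quick sketch that makes this visible, reusing the hypothetical `dsnt_expectation` helper defined above:

```python
torch.manual_seed(0)
logits = torch.randn(1, 1, 5, 5, requires_grad=True)
probs = torch.softmax(logits.flatten(2), dim=-1).view_as(logits)  # flat softmax
coords = dsnt_expectation(probs)
target = torch.tensor([[[0.4, -0.4]]])
loss = torch.norm(coords - target, dim=-1).mean()
loss.backward()
print(logits.grad)  # nonzero across (almost) the whole map, not just at the peak
```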

The loss above alone would certainly work, but without any constraint the same keypoint can produce heatmaps of all sorts of shapes. That has no impact on the final coordinate, yet it feels too unconstrained: the network might learn Gaussian-like heatmaps with very large or very small variance. For stability we actually want a Gaussian heatmap with small variance, and experiments show that introducing some prior helps training. So we now discuss how to add a prior.

The simplest, most direct way to add a prior is to regularize the loss; we introduce a Gaussian-heatmap prior as a regularization term. The overall loss becomes:
$$\mathcal{L}(\hat{Z}, p) = \mathcal{L}_{euc}(\hat{Z}, p) + \lambda\, \mathcal{L}_{reg}(\hat{Z})$$
(1) Variance regularization
The first regularizer one thinks of controls the variance. As mentioned, we hope to learn a Gaussian heatmap with small variance, so naturally we can write:
$$\mathrm{Var}[c_x] = \left\langle \hat{Z},\, (X - \mu_x)^{\odot 2} \right\rangle_F, \qquad \mathcal{L}_{var}(\hat{Z}) = \left(\mathrm{Var}[c_x] - \sigma_t^2\right)^2 + \left(\mathrm{Var}[c_y] - \sigma_t^2\right)^2$$
The first equation computes the variance along x under the predicted distribution ($\mathrm{Var}[c_y]$ is defined analogously with $Y$ and $\mu_y$); it depends only on the predicted distribution and its mean coordinates. The second is the final variance regularization term: by constraining the variances along x and y to the same target magnitude $\sigma_t^2$, it helps the network learn a Gaussian heatmap.
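A sketch of this regularizer in the same pure-PyTorch style (again reusing the hypothetical `dsnt_expectation` helper; here `sigma_t` is expressed in the same normalized (-1, 1) units as X and Y):

```python
def variance_reg(heatmaps, sigma_t=0.1):
    """Penalize deviation of the per-axis heatmap variance from sigma_t**2."""
    batch, k, m, n = heatmaps.shape
    xs = (2 * torch.arange(1, n + 1, dtype=heatmaps.dtype) - (n + 1)) / n
    ys = (2 * torch.arange(1, m + 1, dtype=heatmaps.dtype) - (m + 1)) / m
    mu = dsnt_expectation(heatmaps)           # (batch, k, 2)
    mu_x, mu_y = mu[..., 0:1], mu[..., 1:2]   # (batch, k, 1) each
    var_x = (heatmaps.sum(dim=2) * (xs - mu_x) ** 2).sum(dim=2)
    var_y = (heatmaps.sum(dim=3) * (ys - mu_y) ** 2).sum(dim=2)
    return (var_x - sigma_t**2) ** 2 + (var_y - sigma_t**2) ** 2  # (batch, k)
```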

(2) Distribution regularization

In fact, a better regularizer is easy to think of. What we want is a Gaussian heatmap, and we have already turned the heatmap into a probability distribution, so we can require the output distribution to be a bivariate Gaussian. The natural tool is a divergence: the KL divergence measures the similarity between two distributions and is widely used, e.g., in GANs. So we have:
$$\mathcal{L}_D(\hat{Z}, p) = D\!\left(\hat{Z} \,\Vert\, \mathcal{N}(p,\, \sigma_t^2 I_2)\right)$$
Here $\mathcal{N}(p, \sigma_t^2 I_2)$ is a bivariate Gaussian distribution, and this design forces the network to learn a Gaussian heatmap. However, the KL divergence (1) is always ≥ 0 but not necessarily ≤ 1, and (2) is asymmetric, so the author introduces a better variant: the Jensen-Shannon divergence, a variation of KL whose value lies in 0–1 and which is symmetric.
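Putting the pieces together with the dsntnn package (following its README; `target_coords` stands for the normalized ground-truth coordinates):

```python
# coords, heatmaps come from the CoordRegressionNetwork sketched earlier.
euc_losses = dsntnn.euclidean_losses(coords, target_coords)
reg_losses = dsntnn.js_reg_losses(heatmaps, target_coords, sigma_t=1.0)
loss = dsntnn.average_loss(euc_losses + reg_losses)
loss.backward()
```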
We can also look at this from another angle, as with SVM losses: treat the coordinate-prediction loss as the regularizer and the distribution regularizer as the main loss. From that point of view, the method proposed here is not so different from plain direct Gaussian-heatmap regression: it still regresses a Gaussian heatmap, just with an extra coordinate-regression regularization term, and DSNT is merely the module introduced to read out the coordinates.

The author compares the options experimentally, with the following results:
[Table: accuracy under the different regularizers.]
As can be seen, JS performs better, so the final choice is the JS distribution regularizer.

[Figure: heatmap shapes learned under different regularizers and target variances.]
The figure above shows the heatmap shapes learned under different variance settings. The JS regularizer works well, but the variance clearly has a large influence; this is critical in a concrete project and needs careful tuning. Also visible: under the variance regularizer, because it only constrains the per-axis variances (to equal values around the mean), it cannot force a proper Gaussian heatmap directly; instead the mass splits into four spots around the joint.

6. Experiments

For the experiments, see: Numerical Coordinate Regression: Gaussian heatmap vs. coordinate regression — Zhu Xiaomeng's blog, CSDN.
