2D human pose estimation with residual log likelihood estimation (RLE) [link only]
2022-06-10 15:49:00 【light169】
【References】 Focus on Chapter 4
- [ICCV 2021 Oral] Learning the underlying error distribution: Human Pose Regression with Residual Log-likelihood Estimation (RLE), paper notes - Zhihu; RLE recasts the glory of regression methods: where do regression and heatmap approaches agree and differ? | Pose estimation, ICCV 2021 reading notes - Zhihu
- Understanding RLE (Residual Log-likelihood Estimation) from scratch | Pose estimation, ICCV 2021 Oral - Zhihu
- Flow-based generative models - Zhihu (notes on Li Hongyi's lecture)
Treat the errors made during training as samples, and use maximum likelihood estimation (MLE) together with a flow-based generative model to learn the underlying error distribution.
This is an ICCV 2021 Oral paper that I ran into while following recent work on pose estimation. Its core idea is the sentence quoted above: although the paper targets human pose estimation, the idea of learning the error distribution can be extended to almost any regression task.
1. Starting from the Gaussian heatmap
As is well known, pose estimation is split into two camps, coordinate regression and heatmap regression, and I started out on the heatmap side. The heatmap used in heatmap regression has always been a hand-designed two-dimensional isotropic Gaussian. For example, suppose the output heatmap has a resolution of 64×64 and $\sigma = 2$; then the value at location $(x, y)$ on the heatmap is

$$H(x, y) = \exp\left(-\frac{(x - \mu_x)^2 + (y - \mu_y)^2}{2\sigma^2}\right), \tag{1}$$

where $(\mu_x, \mu_y)$ is the ground-truth coordinate of the keypoint in the input image.
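To make the formula concrete, here is a minimal NumPy sketch of rendering such a heatmap. The function name and the example keypoint location (30, 20) are my own choices for illustration, not from the paper:

```python
import numpy as np

def gaussian_heatmap(size=64, center=(30.0, 20.0), sigma=2.0):
    """Render H[y, x] = exp(-((x - mu_x)^2 + (y - mu_y)^2) / (2 * sigma^2))."""
    coords = np.arange(size, dtype=np.float64)
    x, y = np.meshgrid(coords, coords)  # x varies along columns, y along rows
    mu_x, mu_y = center
    return np.exp(-((x - mu_x) ** 2 + (y - mu_y) ** 2) / (2.0 * sigma ** 2))

H = gaussian_heatmap()
# The peak (value 1.0) sits exactly at the ground-truth pixel.
assert H[20, 30] == 1.0
```

The target a heatmap network regresses against is just this rendered array, one per keypoint.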
But why a Gaussian distribution? I didn't understand the rationale behind the Gaussian heatmap until I read this paper.
2. The distribution of the coordinate error
It is easier to start from the loss function of coordinate regression. Coordinate regression directly predicts the keypoint coordinates $\hat{\mu}$ (some methods also predict $\hat{\sigma}$, the standard deviation of the error distribution), and the MSE (mean squared error) loss is:

$$\mathcal{L}_{\text{MSE}} = \|\hat{\mu} - \mu_g\|_2^2. \tag{2}$$

Here we focus on a single keypoint, say the left shoulder, with ground truth $\mu_g$. This loss is very intuitive: it pushes the predicted coordinate toward the ground-truth coordinate, and it can be derived from maximum likelihood estimation.
2.1 Maximum likelihood estimation
[Maximum likelihood estimation (MLE)] Let the samples follow a normal distribution; the likelihood is the product of the densities of the observed samples. Here we assume the error $\varepsilon = \mu_g - \hat{\mu}$ follows a Gaussian distribution with mean 0 and variance $\hat{\sigma}^2$:

$$\varepsilon \sim \mathcal{N}(0, \hat{\sigma}^2). \tag{3}$$

As for why this assumption is reasonable, see the last part of the answer below, on the central limit theorem.
What is the essence of the least squares method? - Ma's answer - Zhihu
The difference from plain least squares is that here we also predict $\hat{\sigma}$: for different keypoints / human bodies, the variance of the error distribution is different. Maximizing the likelihood of $\varepsilon$ is then equivalent to minimizing the negative log-likelihood

$$\mathcal{L} = \log\hat{\sigma} + \frac{\|\mu_g - \hat{\mu}\|_2^2}{2\hat{\sigma}^2} + \text{const}. \tag{4}$$
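A tiny numerical sketch of this negative log-likelihood (per coordinate, constant dropped); the function name and test values are mine, for illustration only:

```python
import numpy as np

def gaussian_nll(mu_hat, sigma_hat, mu_gt):
    """Negative log-likelihood of mu_gt under N(mu_hat, sigma_hat^2),
    dropping the constant 0.5 * log(2 * pi)."""
    return np.log(sigma_hat) + (mu_gt - mu_hat) ** 2 / (2.0 * sigma_hat ** 2)

# With sigma fixed at 1, the NLL is just half the squared error (plus a constant),
# i.e. it reduces to the MSE loss of equation (2):
assert np.isclose(gaussian_nll(0.3, 1.0, 0.5), 0.5 * (0.5 - 0.3) ** 2)
```

Note the trade-off the log term creates: predicting a large $\hat{\sigma}$ discounts the squared error but pays a $\log\hat{\sigma}$ penalty, so the network is pushed to report an honest uncertainty.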

Output density
Viewed from another angle (the perspective of the original paper), we can say that coordinate regression predicts a Gaussian distribution with mean $\hat{\mu}$ and variance $\hat{\sigma}^2$; that is, the probability that the true keypoint lies at some location $x$ follows

$$P_\Theta(x \mid \mathcal{I}) = \frac{1}{2\pi\hat{\sigma}^2}\exp\left(-\frac{\|x - \hat{\mu}\|_2^2}{2\hat{\sigma}^2}\right). \tag{5}$$

This is where we can explain why heatmap regression adopts a Gaussian heatmap. The sample we actually observe is $\mu_g$, so by maximum likelihood estimation we need to maximize $P_\Theta(\mu_g \mid \mathcal{I})$. If we further assume that $\hat{\sigma}$ is constant, then (5) is, up to normalization, exactly the heatmap formula (1).
2.2 Is it really a Gaussian distribution? (The motivation of the paper)
Other works cited in the paper point out that assuming a Laplacian distribution, which corresponds to an L1 loss, performs better. A distributional assumption closer to the true error distribution should yield better performance. The authors therefore propose using a flow-based generative model to learn the underlying error distribution.
3. The core idea
3.1 Flow-based generative model

(Figure: flow-based generative model; source: Li Hongyi's lecture slides)

First, a brief introduction to the flow-based generative model (hereinafter "flow"). The goal is to train a generator $G$ that transforms samples $z$ from a simple distribution $\pi(z)$ into samples of a complex distribution $p_G(x)$:

$$x = G(z), \qquad z \sim \pi(z).$$

For the complex distribution, what we observe are samples $\{x^1, \dots, x^m\}$ (for example, real anime faces in an anime-avatar generation task), so by maximum likelihood estimation we need to maximize the probability of these samples:

$$G^* = \arg\max_G \sum_{i=1}^{m} \log p_G(x^i).$$

By the change-of-variables formula, the density of $x$ under the generator is

$$p_G(x) = \pi\big(G^{-1}(x)\big)\,\left|\det \frac{\partial G^{-1}(x)}{\partial x}\right|.$$

There is no need to dig into why just yet; it is enough to know that $G$ is designed so that its inverse (and the Jacobian determinant) is easy to compute. The whole training process maps each observed sample of the complex distribution back through $G^{-1}$ and maximizes its probability under the formula above.
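The change-of-variables formula can be sanity-checked with the simplest possible "flow", a one-dimensional affine map. This toy example is mine, not from the paper:

```python
import numpy as np

# A one-dimensional "flow": x = G(z) = a * z + b, with z ~ N(0, 1).
a, b = 2.0, 1.0

def standard_normal(z):
    return np.exp(-z ** 2 / 2.0) / np.sqrt(2.0 * np.pi)

def density_via_flow(x):
    """p_G(x) = pi(G^{-1}(x)) * |det dG^{-1}/dx|.
    Here G^{-1}(x) = (x - b) / a, so the Jacobian is 1/a."""
    z = (x - b) / a
    return standard_normal(z) * abs(1.0 / a)

# An affine push-forward of N(0, 1) is N(b, a^2); verify against the closed form.
x = 2.5
expected = np.exp(-(x - b) ** 2 / (2.0 * a ** 2)) / (a * np.sqrt(2.0 * np.pi))
assert np.isclose(density_via_flow(x), expected)
```

A real flow stacks many such invertible layers (with nonlinear couplings), but the bookkeeping of inverses and Jacobian determinants is exactly this.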
The next question is how to use a flow to convert a simple distribution into a complex one in our setting: what plays the role of the variable (the coordinate $\mu_g$ itself, the error, or something else)? This leads to the three designs in the original paper.
3.2 Basic Design
Directly learn a flow that transforms the Gaussian distribution with predicted mean $\hat{\mu}$ and variance $\hat{\sigma}^2$ into the underlying distribution of the keypoint coordinate, taking each ground truth $\mu_g$ as the observed sample during training.

(Figure: basic design)

What's wrong with this design? The original paper says:
"Therefore, φ will learn to fit the distribution of $\mu_g$ across all images. Nevertheless, the distribution that we want to learn is about how the output deviates from the ground truth conditioning on the input image, not the distribution of the ground truth itself across all images."
My take on why this design is poor: the simple (base) distribution is different for almost every training sample, and the complex distribution of each keypoint of each human body is also different. A given complex distribution gets exactly one sample to train the mapping from its base distribution to itself, and the next iteration switches to a different base distribution, so the training samples are scattered.
3.3 Reparameterization
Learn a flow that transforms the standard normal distribution into the underlying distribution of the normalized error $\bar{x} = (x - \hat{\mu})/\hat{\sigma}$. Via the reparameterization trick $x = \hat{\mu} + \hat{\sigma}\cdot\bar{x}$, this achieves the goal of the basic design while avoiding its scattered-samples problem.

(Figure: direct likelihood estimation with reparameterization)

This design effectively takes all the errors produced during training as samples. But at the beginning, the coordinate prediction error is large and the flow has only just started training, so the flow may learn a wrong error distribution, while the regressor in turn uses that wrong error distribution (as its loss function) for guidance; the two cannot bootstrap each other.
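A numerical sketch of this design's loss (my own simplification: a fixed stand-in density takes the place of the flow-learned one):

```python
import numpy as np

def reparam_nll(mu_hat, sigma_hat, mu_gt, log_p_phi):
    """Negative log-likelihood under x = mu_hat + sigma_hat * z, z ~ p_phi.
    Change of variables gives P(x) = p_phi((x - mu_hat) / sigma_hat) / sigma_hat."""
    z_bar = (mu_gt - mu_hat) / sigma_hat
    return -(log_p_phi(z_bar) - np.log(sigma_hat))

# Stand-in for the flow-learned density: if p_phi were the standard normal,
# the loss reduces to the usual Gaussian NLL of section 2.1.
def log_std_normal(z):
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)

loss = reparam_nll(0.3, 0.5, 0.7, log_std_normal)
expected = 0.5 * ((0.7 - 0.3) / 0.5) ** 2 + 0.5 * np.log(2.0 * np.pi) + np.log(0.5)
assert np.isclose(loss, expected)
```

In the real design, `log_p_phi` is the flow's tractable log-density, trained jointly with the regressor; that joint training is precisely where the cold-start problem described above comes from.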
3.4 Residual Log-likelihood Estimation
Learn a flow that transforms the standard normal distribution into the quotient (the "residual") of the underlying error distribution over a simple prior distribution. This mainly solves the cold-start problem of the previous design (I think that is a fair way to put it?).

(Figure: residual log-likelihood estimation with reparameterization)

From the loss-function point of view, the learned density is factored as $P_{\Theta,\phi}(\bar{x}) = \frac{1}{s}\,Q(\bar{x})\,G_\phi(\bar{x})$, which effectively adds a hand-designed loss term $-\log Q$:

$$\mathcal{L}_{\text{rle}} = -\log Q(\bar{\mu}_g) - \log G_\phi(\bar{\mu}_g) + \log s + \log\hat{\sigma}, \qquad \bar{\mu}_g = \frac{\mu_g - \hat{\mu}}{\hat{\sigma}},$$

where $Q$ is the standard normal distribution (a Laplacian can also be used) and $s$ is a normalizing constant.
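A sketch of the residual loss above, with $Q$ taken as the standard normal and `log_G_phi` a placeholder for the flow-learned residual term; this is an illustration, not the paper's implementation:

```python
import numpy as np

def rle_loss(mu_hat, sigma_hat, mu_gt, log_G_phi, log_s=0.0):
    """L = -log Q(z) - log G_phi(z) + log s + log sigma_hat,
    with z = (mu_gt - mu_hat) / sigma_hat and Q the standard normal prior."""
    z = (mu_gt - mu_hat) / sigma_hat
    log_Q = -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi)
    return -log_Q - log_G_phi(z) + log_s + np.log(sigma_hat)

# With G_phi == 1 everywhere (log G_phi == 0) the loss degenerates to the plain
# Gaussian NLL of section 2.1: even an untrained flow leaves a sensible loss,
# which is how the residual design avoids the cold-start problem.
loss = rle_loss(0.2, 0.5, 0.6, log_G_phi=lambda z: 0.0)
expected = 0.5 * 0.8 ** 2 + 0.5 * np.log(2.0 * np.pi) + np.log(0.5)
assert np.isclose(loss, expected)
```

The $-\log Q$ term behaves like the familiar hand-designed Gaussian (or Laplacian) loss, while $-\log G_\phi$ learns only the correction on top of it.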
4. References:
- Li Hongyi's course: Flow-based Generative Model
- Flow-based generative models - Zhihu (text version)
- Li Hongyi - Flow-based Generative Model - Bilibili
- Paper: https://arxiv.org/pdf/2107.11291.pdf
- Code: https://github.com/Jeff-sjtu/res-loglikelihood-regression