
2D human pose estimation: Simple Baseline (SBL)

2022-06-10 15:48:00 【light169】

Paper: Simple Baselines for Human Pose Estimation and Tracking
Code: GitHub - leoxiaobin/pose.pytorch: Simple Baselines for Human Pose Estimation and Tracking

Simple Baselines is a 2018 work from MSRA; the network structure is shown in the figure below. The name fits, because the network really is simple: a head is added on top of ResNet, and this head consists of just a few deconvolutional layers that raise the resolution of the feature map output by ResNet. As noted many times before, pose estimation needs high resolution. "Deconvolutional layer" is a loose term here. Reading the source code shows that each such layer actually packages a transposed convolution, BatchNorm, and ReLU into one block. The key component is the transposed convolution, which can be thought of as the reverse of a convolution in terms of spatial shape: it upsamples where a strided convolution downsamples, although it is not a true mathematical inverse.
From the figure, the Simple Baselines network looks somewhat like a single module of Hourglass, with two differences: ① it does not use Hourglass-style skip connections; ② it is single-stage, whereas Hourglass is multi-stage. Surprisingly, it still outperforms Hourglass. Personally, I see two likely reasons. First, Simple Baselines uses ResNet as its backbone, which has stronger feature extraction ability than Hourglass. Second, Hourglass upsamples with simple nearest-neighbor upsampling, whereas deconvolutional layers are used here, and the latter works better (this structure is still used later in MSRA's Higher-HRNet).

SBL Network structure

SBL (Simple Baseline) [7] provides a baseline framework for human pose estimation. SBL attaches a deconvolution module after the backbone network to predict heatmaps: a few deconvolution layers are added after ResNet to generate the heatmaps directly. Compared with other models, the upsampling structure is replaced by deconvolution. Upsampling and convolution parameters are combined in the deconvolution layers in a simpler way, without skip connections.

What Hourglass, CPN, and SBL have in common is that all three use three upsampling steps and three levels of nonlinearity (from the deepest features) to obtain high-resolution feature maps and heatmaps.

In the figure above, (a) is the Hourglass network, (b) is CPN, and (c) is SimplePose from this paper; the difference in structural complexity is visible at a glance.
The first two structures need to build a pyramid feature structure, such as FPN, or build one from ResNet.
SimplePose does not need a pyramid feature structure. It attaches a deconvolution module directly to ResNet and outputs the result, which is the simplest way to generate heatmaps from deep, low-resolution features.
The specific structure is as follows. First, take the feature map output by the last residual stage of ResNet (named C5). Then, append three deconvolution modules (each module is Deconv + BatchNorm + ReLU; each deconvolution has 256 channels, a 4×4 kernel, stride 2, and padding 1). Finally, a 1×1 convolution layer generates the output heatmaps for the k keypoints.
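To see what these deconvolution parameters do to the resolution, here is a small arithmetic sketch. It uses the standard output-size formula for a transposed convolution (dilation 1). The 256×192 input size and the backbone stride of 32 are assumptions made for illustration, and the function names are hypothetical.

```python
def deconv_out(size, kernel=4, stride=2, pad=1, output_pad=0):
    """Output length of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * pad + kernel + output_pad


def head_resolution(height, width, backbone_stride=32, num_deconv=3):
    """Heatmap size after the backbone and `num_deconv` deconvolution modules."""
    # C5 size: the backbone is assumed to downsample by `backbone_stride`.
    h, w = height // backbone_stride, width // backbone_stride
    for _ in range(num_deconv):
        h, w = deconv_out(h), deconv_out(w)
    return h, w


print(head_resolution(256, 192))
```

With kernel 4, stride 2, and padding 1, each module exactly doubles the spatial size, so under these assumptions an 8×6 C5 feature map becomes a 64×48 heatmap, one quarter of the input resolution.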
Mean squared error (MSE) is used as the loss between the predicted heatmaps and the target heatmaps.
The target heatmap for joint k is generated by applying a 2D Gaussian centered on the ground-truth location of the k-th joint.
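The target generation and loss can be sketched in plain Python as follows. The Gaussian width `sigma=2.0`, the 64×48 heatmap size, and the function names are illustrative assumptions, not values taken from the paper or its code.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Heatmap holding an unnormalized 2D Gaussian centered at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error between two heatmaps of the same shape."""
    total = sum((p - t) ** 2
                for prow, trow in zip(pred, target)
                for p, t in zip(prow, trow))
    return total / (len(pred) * len(pred[0]))


# One joint whose ground-truth location is (x=20, y=30) on a 64x48 heatmap.
target = gaussian_heatmap(64, 48, cx=20, cy=30)
pred = [[0.0] * 48 for _ in range(64)]   # an all-zero "prediction"
loss = mse(pred, target)
```

In a real network the loss would be averaged over all k joint heatmaps; this sketch shows a single joint only.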

These models show that how to generate high-resolution feature maps is a key issue in pose estimation. SimplePose uses deconvolution to increase the feature map resolution, while Hourglass and CPN use upsampling plus skip connections. Of course, this comparison alone is not enough to judge which approach is better.

Description of the pose tracking problem

The winner [11] of the ICCV'17 PoseTrack Challenge [2] addresses multi-person pose tracking by first using Mask R-CNN [12] to estimate human poses in each frame, and then performing online tracking frame by frame with a greedy bipartite matching algorithm.
In short, the greedy matching algorithm works as follows. In the first frame of the video, each detected person is given an id. In each subsequent frame, a similarity is computed between every detected person and the people detected in the previous frame using some metric (the one mentioned in the paper is the IoU of the detection boxes). A pair whose similarity is high (above a threshold) is given the same id and removed from the candidates. These steps are repeated until no instance in the current frame is similar to a remaining previous instance, at which point the remaining instances are assigned new ids.
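The greedy matching described above can be sketched as below, using box IoU as the similarity. This is a minimal illustration under stated assumptions, not the challenge winner's implementation: the box format `(x1, y1, x2, y2)`, the threshold of 0.5, and all names are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(prev, curr_boxes, next_id, threshold=0.5):
    """Give ids to curr_boxes using prev = {id: box}; returns ({id: box}, next_id)."""
    # All (similarity, previous id, current index) pairs, best first.
    pairs = sorted(
        ((iou(pbox, cbox), pid, ci)
         for pid, pbox in prev.items()
         for ci, cbox in enumerate(curr_boxes)),
        reverse=True,
    )
    assigned, used_ids, used_boxes = {}, set(), set()
    for score, pid, ci in pairs:
        if score < threshold:
            break                      # nothing similar enough remains
        if pid in used_ids or ci in used_boxes:
            continue                   # already matched, i.e. "deleted"
        assigned[pid] = curr_boxes[ci]
        used_ids.add(pid)
        used_boxes.add(ci)
    # Unmatched detections start new tracks.
    for ci, cbox in enumerate(curr_boxes):
        if ci not in used_boxes:
            assigned[next_id] = cbox
            next_id += 1
    return assigned, next_id


prev = {0: (0, 0, 10, 10), 1: (20, 20, 30, 30)}
curr = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
tracks, next_id = greedy_match(prev, curr, next_id=2)
```

Here the first two current boxes inherit ids 1 and 0 from the previous frame, and the third box, which overlaps nothing, receives the new id 2.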

The method proposed in this paper keeps the main pipeline of that approach and makes two improvements:
First, in addition to the detection network, optical flow is used to supplement the detection boxes, which addresses missed detections by the detection network (for example, in Figure 2 the leftmost person is not detected by the detection network).
Second, Object Keypoint Similarity (OKS) replaces the detection-box IoU for computing similarity, because IoU may be unreliable when people move quickly.
OKS is a measure of keypoint distance between two poses.
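For reference, the COCO keypoint evaluation defines OKS as follows. The symbol names follow the COCO convention and are not taken from this article: $d_i$ is the Euclidean distance between the two poses at keypoint $i$, $s$ is the object scale (with $s^2$ the object area), $k_i$ is a per-keypoint constant that controls falloff, and $v_i$ is the visibility flag of keypoint $i$.

$OKS=\frac{\sum_{i} \exp\left(-\frac{d_{i}^{2}}{2 s^{2} k_{i}^{2}}\right) \delta\left(v_{i}>0\right)}{\sum_{i} \delta\left(v_{i}>0\right)}$

OKS lies between 0 and 1, and equals 1 when all labeled keypoints coincide.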

The new similarity measure proposed in the paper uses optical flow to compute where the keypoints of one frame will appear in another frame, then computes the OKS between these propagated positions and the keypoints detected in that frame, and takes this value as the similarity between people across the two frames.

Joint Propagation using Optical Flow

If a single-image detector (such as Faster R-CNN [27] or R-FCN [16]) is simply applied to video, motion blur and occlusion in the video frames can lead to missed and false detections. As shown in Figure 2, the detector misses the person in black on the left because of fast motion. Temporal information is often used to produce more robust detections [36,35].

The paper proposes to use temporal information, expressed as optical flow, to generate person boxes for the frame being processed from nearby frames.

Specifically, given an instance i in frame $I^{k-1}$ with joint coordinate set $J^{k-1}_i$, and the optical flow field $F_{k-1 \rightarrow k}$ between frames $I^{k-1}$ and $I^{k}$, we can estimate the corresponding joint coordinate set $\hat{J}_{i}^{k}$ in frame $I^{k}$. For a joint location $(x, y)$ in $J^{k-1}_i$, the propagated location in the next frame is $(x + \delta x, y + \delta y)$, where $\delta x$ and $\delta y$ are the flow field values at $(x, y)$. We then compute the bounding box of $\hat{J}_{i}^{k}$ and extend it to obtain a candidate box. The extension used in the experiments is 15%.
When the person detector misses someone in the current frame because of motion blur or occlusion, the boxes propagated from the previous frame can still cover that person. As shown in Figure 2(c), for the person in black on the left, the tracking result from the previous frame in Figure 2(a) is available, so the propagated box successfully contains this person.
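The propagation step can be sketched as follows. The flow field is assumed to be stored as `flow[y][x] = (dx, dy)`, joints are assumed to lie inside the image, and the 15% extension is interpreted here as growing the box width and height by 15% in total around the center; the text does not pin down that detail, so treat it as an assumption. All names are hypothetical.

```python
def propagate_joints(joints, flow):
    """Shift each (x, y) joint by the flow vector stored at flow[y][x]."""
    moved = []
    for x, y in joints:
        dx, dy = flow[int(round(y))][int(round(x))]
        moved.append((x + dx, y + dy))
    return moved


def extended_box(joints, extend=0.15):
    """Tight box around the joints, enlarged by `extend` of its width and height."""
    xs = [x for x, _ in joints]
    ys = [y for _, y in joints]
    x1, x2, y1, y2 = min(xs), max(xs), min(ys), max(ys)
    pad_x = (x2 - x1) * extend / 2   # half of the extra width on each side
    pad_y = (y2 - y1) * extend / 2   # half of the extra height on each side
    return (x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y)


# Toy example: a uniform flow that moves every pixel 3 px right and 1 px down.
flow = [[(3.0, 1.0)] * 100 for _ in range(100)]
joints_prev = [(10, 20), (30, 20), (20, 60)]
joints_hat = propagate_joints(joints_prev, flow)
box = extended_box(joints_hat)
```

The resulting `box` would then be added to the detector's boxes as an extra candidate for the current frame.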

Flow-based Pose Similarity

Using bounding-box IoU (Intersection-over-Union) as the similarity metric ( $S_{Bbox}$ ) to link instances can be problematic. First, when an instance moves fast, the boxes do not overlap. Second, in crowded scenes, boxes that overlap do not necessarily belong to the same instance. A more fine-grained metric is pose similarity ( $S_{Pose}$ ), which uses Object Keypoint Similarity (OKS) to compute the body joint distance between two instances. However, the pose of the same person may differ between frames, so pose similarity can also be problematic. The paper therefore proposes a flow-based pose similarity metric.

Given an instance $J^{k}_i$ in frame $I^{k}$ and an instance $J^{l}_j$ in frame $I^{l}$, the flow-based pose similarity metric is defined as:
$S_{Flow}\left(J_{i}^{k}, J_{j}^{l}\right)=OKS\left(\hat{J}_{i}^{l}, J_{j}^{l}\right)$

where OKS denotes the Object Keypoint Similarity computed between two body poses, and $\hat{J}_{i}^{l}$ is the set of joints propagated from $J^{k}_i$ in frame $I^{k}$ to frame $I^{l}$ using the optical flow field $F_{k \rightarrow l}$.
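A minimal sketch of $S_{Flow}$ follows, assuming every joint is visible, a flow field stored as `flow[y][x] = (dx, dy)`, and caller-supplied values for the object scale and the per-joint constants. The OKS form used is the common COCO one; the constants in the example are made up for illustration, and the function names are hypothetical.

```python
import math


def oks(joints_a, joints_b, scale, k):
    """Object Keypoint Similarity between two poses with the same joint order.

    scale: object scale s (s**2 is the object area); k: per-joint constants.
    All joints are treated as visible in this sketch.
    """
    total = 0.0
    for (xa, ya), (xb, yb), ki in zip(joints_a, joints_b, k):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2 * scale ** 2 * ki ** 2))
    return total / len(joints_a)


def flow_similarity(joints_k, joints_l, flow_k_to_l, scale, k):
    """S_flow: propagate the frame-k joints to frame l, then take the OKS."""
    propagated = []
    for x, y in joints_k:
        dx, dy = flow_k_to_l[int(round(y))][int(round(x))]
        propagated.append((x + dx, y + dy))
    return oks(propagated, joints_l, scale, k)


# A person who moved 5 px to the right between the two frames.
flow = [[(5.0, 0.0)] * 64 for _ in range(64)]
joints_k = [(10, 10), (20, 10)]
joints_l = [(15, 10), (25, 10)]
k_const = [0.1, 0.1]                    # illustrative values only
s_pose = oks(joints_k, joints_l, scale=50.0, k=k_const)
s_flow = flow_similarity(joints_k, joints_l, flow, scale=50.0, k=k_const)
```

In this toy case the flow explains the motion exactly, so `s_flow` reaches its maximum while the plain pose similarity `s_pose` is penalized by the displacement.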

People often disappear and reappear because of occlusion by other people or objects, so considering only two consecutive frames is not enough. The paper therefore uses a flow-based pose similarity over multiple frames, denoted $S_{Multi-flow}$, in which $\hat{J}^k$ is propagated from multiple previous frames. In this way, instances can be relinked even if they disappear in intermediate frames.

Flow-based Pose Tracking Algorithm

First, the pose estimation problem is solved. For the frame being processed, the boxes from the person detector and the boxes generated by propagating joints from previous frames with optical flow are unified using bounding-box non-maximum suppression (NMS). The boxes generated from propagated joints supplement detections missed by the detector. The images cropped and resized with these boxes are then fed to the proposed pose estimation network.
Second, the tracking problem is solved. The tracked instances are stored in a double-ended queue (deque) $Q$ of fixed length $L_Q$, expressed as:
$Q=\left[\mathcal{P}_{k-1}, \mathcal{P}_{k-2}, \ldots, \mathcal{P}_{k-L_{Q}}\right]$

where $\mathcal{P}_{k-i}$ denotes the set of instances tracked in the previous frame $I^{k-i}$, and the length $L_Q$ of $Q$ indicates how many previous frames are considered when matching.
$Q$ can capture links across multiple previous frames and is initialized in the first frame of the video. For the k-th frame, we compute the flow-based pose similarity matrix $M_{sim}$ between the untracked body joints (id is None) and the previous instances in $Q$. Then greedy matching on $M_{sim}$ assigns an id to each body-joint instance $J$, yielding the tracked instance set $\mathcal{P}_k$. Finally, $Q$ is updated by adding the instance set $\mathcal{P}_k$ of the k-th frame.
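The tracking loop can be sketched with Python's `collections.deque`. This is a simplified illustration, not the paper's implementation: the similarity is passed in as a plain callable standing in for the flow-based pose similarity, each id is compared using its most recent appearance in $Q$, and the class name, threshold, and queue length are hypothetical.

```python
from collections import deque


class FlowTracker:
    """Greedy id assignment against the last `max_len` frames (a sketch)."""

    def __init__(self, similarity, max_len=3, threshold=0.5):
        self.similarity = similarity         # callable(pose_a, pose_b) -> float
        self.queue = deque(maxlen=max_len)   # Q, newest frame first
        self.threshold = threshold
        self.next_id = 0

    def update(self, poses):
        """Assign an id to every pose of the current frame; returns {id: pose}."""
        # Most recent tracked pose per id across all frames stored in Q.
        known = {}
        for frame in self.queue:             # iterates newest to oldest
            for pid, pose in frame.items():
                known.setdefault(pid, pose)
        # M_sim flattened into (score, current index, id) triples, best first.
        scores = sorted(
            ((self.similarity(pose, old), ci, pid)
             for ci, pose in enumerate(poses)
             for pid, old in known.items()),
            key=lambda t: t[0], reverse=True,
        )
        assigned, used = {}, set()
        for score, ci, pid in scores:
            if score < self.threshold:
                break
            if ci in used or pid in assigned:
                continue
            assigned[pid] = poses[ci]
            used.add(ci)
        # Poses that matched nothing start new tracks.
        for ci, pose in enumerate(poses):
            if ci not in used:
                assigned[self.next_id] = pose
                self.next_id += 1
        self.queue.appendleft(assigned)      # oldest frame drops off the right
        return assigned


# Toy usage: "poses" are single numbers and similarity decays with distance.
tracker = FlowTracker(lambda a, b: 1.0 / (1.0 + abs(a - b)), max_len=2)
frame1 = tracker.update([0.0, 10.0])
frame2 = tracker.update([10.2])          # the first person is missing here
frame3 = tracker.update([0.3, 10.4])     # and reappears one frame later
```

Because the queue keeps more than one previous frame, the person who is missing in the second frame gets the original id back in the third frame instead of a new one.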