
An overview of 2D human pose estimation

2022-07-04 14:26:00 【Xiaobai learns vision】


Source | Zhihu    Author | Xie Yibin

Link | https://zhuanlan.zhihu.com/p/140060196

Editor | "Deep learning is a trivial matter" official account

This article is reposted with the author's permission. Please do not repost it further.

0. Preface

This article discusses 2D human pose estimation. It covers an introduction to the basic task, the main difficulties, the methods, and my personal thoughts on the problem. I hope you will read it critically and discuss it constructively.

1. Introduction

The goal of 2D human pose estimation is to locate and identify the keypoints of the human body. Connecting these keypoints in joint order yields the skeleton of the body, and therefore its pose.

Before the deep learning era, pose estimation, like other computer vision tasks, relied on carefully hand-designed features, for example pictorial structures. With the strong feature extraction capability of CNNs, the field has made great progress. 2D human pose estimation can be divided into two subtasks: single-person pose estimation (SPPE) and multi-person pose estimation (MPPE).

Single-person pose estimation is the foundation. Here we are given a picture containing one person and must find all of that person's keypoints. The commonly used MPII dataset is a single-person pose estimation dataset.
In multi-person pose estimation we are given a picture containing several people and must find the keypoints of every person in it. There are generally two approaches to this problem: top-down and bottom-up.
- Top-down (from people to keypoints): first use a detector to find the bounding box of every person in the picture, then run SPPE on each person. The method is detection + SPPE; it often achieves better accuracy but is slower.
- Bottom-up (from keypoints to people): first use a model to detect (locate) all keypoints in the picture, then group these keypoints into individual people. This approach can often run in real time, but its accuracy is lower.

2. Difficulties

2D human pose estimation has many difficulties. Many of them can be addressed by optimizing the network structure, more precisely by using multi-scale, multi-resolution features.

- Occlusion (self-occlusion, occlusion by other people): enlarge the receptive field and let the network learn the occlusion relationships on its own.
- People appear at different scales and pictures are taken from different angles: multi-scale feature fusion.
- Highly varied poses: this tests the capacity and depth of the network.
- Lighting: add lighting variation to the data preprocessing by offsetting each channel.
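
The lighting remedy (offsetting each channel during preprocessing) can be sketched as a small augmentation step. This is only a minimal sketch: it assumes an image stored as nested lists of 8-bit RGB tuples, and the function name `shift_channels` and the ±20 range are illustrative choices, not taken from any of the papers discussed here.

```python
import random

def shift_channels(image, max_shift=20, rng=random):
    """Add one random offset per colour channel and clamp to [0, 255]."""
    # One offset per channel, shared by every pixel of the image.
    offsets = [rng.randint(-max_shift, max_shift) for _ in range(3)]
    return [
        [tuple(min(255, max(0, value + offset))
               for value, offset in zip(pixel, offsets))
         for pixel in row]
        for row in image
    ]
```

Because the same offset is applied to a whole channel, the picture keeps its content and only its overall colour balance and brightness change.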

Occlusion is hard to solve, and a lot of work targets it. Besides the network structure, data preprocessing matters as well: for SPPE, try to place the person at the center of the picture. Post-processing matters too, in particular how to reduce the quantization error of the argmax operation on the heatmap. The human body differs from other objects in that there are spatial constraints between its keypoints, so how to best capture the various spatial relationships between human joints is key.
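
To make the argmax quantization error concrete, here is a minimal sketch. It assumes the heatmap is `stride` times smaller than the input image (the stride of 4 used below is an illustrative assumption) and that a coordinate is recovered by taking the nearest heatmap cell and scaling it back.

```python
def quantization_error(coord_px, stride):
    """Error (in input pixels) after snapping a coordinate to the heatmap grid."""
    heatmap_index = round(coord_px / stride)  # the best cell an argmax could return
    recovered_px = heatmap_index * stride     # map the cell back to input pixels
    return abs(coord_px - recovered_px)

print(quantization_error(37.0, 4))  # 1.0
print(quantization_error(36.0, 4))  # 0.0
```

With a stride of 4, the recovered coordinate can be off by up to 2 pixels even when the heatmap itself is perfect, which is why post-processing tries to recover the lost sub-cell position.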

3. Methods

This section introduces pose estimation methods. Space is limited and it is hard to present each paper in full, so please read the original papers yourself. 2D pose estimation methods can be divided into single-person pose estimation (SPPE) and multi-person pose estimation (MPPE), and the multi-person methods are further divided into top-down and bottom-up.

3.1. SPPE

DeepPose (Google, 2014)[1]

Uses AlexNet as the backbone
Directly regresses the coordinates of the joints
Uses a cascade structure to refine the result

Joint training with CNN and Graphical Model (LeCun, 2014)[2]

Starts using heatmaps
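
For readers new to the heatmap representation, the following is a minimal sketch of how a training target is commonly built and decoded: a 2D Gaussian centered on the keypoint, read back with an argmax. The function names and the `sigma` default are illustrative assumptions, not details from the paper.

```python
import math

def gaussian_heatmap(height, width, cx, cy, sigma=1.0):
    """Return a height x width grid whose peak sits at keypoint (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def argmax_2d(heatmap):
    """Return the (x, y) position of the largest value in the heatmap."""
    best = max(
        ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]

heatmap = gaussian_heatmap(8, 8, cx=5, cy=2)
print(argmax_2d(heatmap))  # (5, 2)
```

Compared with regressing coordinates directly, the network predicts one such map per joint, which keeps the spatial layout of the image in the output.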

CPM (CMU, 2016)[3]

Fully convolutional network, trained end to end

CPM is the first classic in SPPE. It comes from CMU (the same group later built on it to create OpenPose, currently the most influential of the bottom-up methods). CPM's innovation lies mainly in its proposed network structure.

The network consists of multiple stages. Taking the second stage as an example, its input has two parts: the heatmap predicted by the previous stage and the feature map computed by the current stage itself. A loss is computed at every stage, which makes the network converge faster and helps improve accuracy.
The heatmap predicted by the previous stage provides rich spatial context, which is very important for recognizing joints
It formally opened the era of end-to-end learning

Stacked Hourglass Network (Jia Deng's group, 2016)[4]

Hourglass has been far-reaching. It is a commonly used backbone, and even today its results are still very good.

The main innovation of this work is its improved network structure which, as its architecture diagram shows, looks like a stack of hourglasses.

Each hourglass module contains a symmetric down-sampling and up-sampling process, and every box represents a sub-module with skip connections

The network uses intermediate supervision: each blue box in the architecture diagram is a predicted heatmap. This speeds up convergence and improves the final results
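
Intermediate supervision, like the per-stage loss in CPM, amounts to adding up a heatmap loss over every stage's prediction. The snippet below is a minimal sketch assuming heatmaps flattened into plain lists and an unweighted sum; the function names are hypothetical and real implementations may weight the stages differently.

```python
def mse(pred, target):
    """Mean squared error between two heatmaps given as flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def intermediate_supervision_loss(stage_predictions, target):
    """Sum the heatmap loss over all stages so that each stage is supervised directly."""
    return sum(mse(pred, target) for pred in stage_predictions)
```

For example, `intermediate_supervision_loss([[0, 0, 0], [0, 1, 0]], [0, 1, 0])` penalizes only the first stage, since the second one already matches the target.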

Fast Human Pose (2019)[5]

The backbone is hourglass. This paper applies knowledge distillation to the pose problem, which is a good attempt.

Pursuing cost-effectiveness requires a compact network structure
A 4-stage hourglass achieves 95% of the performance of the 8-stage one
Halving the number of channels (to 128) causes only a 1% performance drop
Distillation is used to strengthen supervision: students learn knowledge from books (the dataset) and teachers (advanced networks).
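
The "books and teachers" idea can be written as a blended loss: one term compares the student's heatmap with the ground truth, the other with the teacher's prediction. This is a minimal sketch under assumptions: heatmaps are flat lists, both terms use MSE, and the weight `alpha` and its default of 0.5 are illustrative rather than confirmed settings from the paper.

```python
def mse(pred, target):
    """Mean squared error between two heatmaps given as flat lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def distillation_loss(student, ground_truth, teacher, alpha=0.5):
    """Blend the loss against the labels with the loss against the teacher's heatmap."""
    label_term = mse(student, ground_truth)   # learning from the "books"
    teacher_term = mse(student, teacher)      # learning from the "teacher"
    return alpha * label_term + (1 - alpha) * teacher_term
```

The teacher's heatmap is a softer target than the labels, so the second term can still provide a useful signal for joints whose annotation is ambiguous.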

3.2. MPPE

3.2.1. Top-down

G-RMI (Google, 2017)[6]

Uses Faster-RCNN as the person detector. The pose part is based on ResNet and estimates offsets to refine the result.

After Faster-RCNN produces the bounding boxes, the image is cropped: all boxes are made to have the same aspect ratio and are then enlarged to include more image context

ResNet is used as the backbone to estimate a heatmap and an offset vector. After the heatmap is obtained, an argmax is usually needed to get the joint coordinates. Because the network down-samples, the heatmap resolution is necessarily lower than that of the original image, so the coordinates are shifted. The offset vector is therefore estimated as well, to compensate for this quantization error.
OKS-based NMS: the overlap between two candidate poses is measured by keypoint similarity. In object detection, NMS is usually based on IoU, but the output of pose estimation is keypoints, so measuring with keypoint similarity is more suitable.
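
The OKS-based NMS idea can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: it ignores keypoint visibility, treats `scale_area` as the object area and `k` as per-keypoint falloff constants (the values in the usage note are made up), and uses plain greedy suppression with an assumed threshold.

```python
import math

def oks(pose_a, pose_b, scale_area, k):
    """Object keypoint similarity between two poses given as lists of (x, y)."""
    total = 0.0
    for (xa, ya), (xb, yb), ki in zip(pose_a, pose_b, k):
        d2 = (xa - xb) ** 2 + (ya - yb) ** 2
        total += math.exp(-d2 / (2 * scale_area * ki ** 2))
    return total / len(k)

def oks_nms(poses, scores, scale_areas, k, threshold=0.5):
    """Greedy NMS: keep the best-scoring pose, drop poses too similar to a kept one."""
    order = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(oks(poses[i], poses[j], scale_areas[j], k) < threshold for j in kept):
            kept.append(i)
    return kept
```

Given two nearly identical poses and one far away, for example `[[(0, 0), (10, 0)], [(0.1, 0), (10, 0.1)], [(50, 50), (60, 50)]]` with scores `[0.9, 0.8, 0.7]`, areas of 100 and `k = [0.1, 0.1]`, the second pose is suppressed and indices 0 and 2 are kept.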

RMPE (Lu Cewu's group at Shanghai Jiao Tong University, 2017)[7]

This work comes from Lu Cewu's group at Shanghai Jiao Tong University and is one of the better pose estimation works from China. Its open-source code, AlphaPose, has had a wide impact.

SSTN (Symmetric Spatial Transformer Network): extracts the relevant person region from an inaccurate bounding box
STN selects the RoI
SPPE helps the STN obtain an accurate region
PNMS (Parametric Pose Non-Maximum-Suppression) removes redundant poses
PGPG (Pose-Guided Proposals Generator)

Compositional Human Pose Regression (MSRA, 2018)[8]

This is work from Xiao Sun's group at MSRA, and it starts from the output representation. Because the heatmap representation suffers from quantization error and the coordinate representation does not perform well, the author proposes using bones as the representation.

Bones are more stable than joints and carry more geometric information. To avoid the error accumulation caused by directly computing the MSE on individual bones, the author proposes also considering long-range targets, that is, the sum of all bones between two joints. The author re-parameterizes the output representation, which is independent of the network structure and can also be used for 3D.

CPN (Megvii, 2018)[9]

CPN is a top-down method whose network has a U-shaped structure. For keypoint localization, features at different scales play different roles: shallow features help with localization, while deep features help identify which body part it is. Keypoints also vary in how hard they are to locate, so a natural idea is to find the hard-to-locate keypoints and put more effort into them. This leads to hard example mining and RefineNet.

Fusing multi-scale feature information is a major goal of network design. The authors use GlobalNet (global pyramid network, U-shaped) to handle the easy keypoints. GlobalNet includes down-sampling and up-sampling (by interpolation, not transposed convolution).
RefineNet (pyramid refined network) handles the hard keypoints
OHKM (online hard keypoints mining) finds the hard keypoints, similar to OHEM in detection
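
The core of online hard keypoint mining can be sketched in a few lines: compute a loss per keypoint, then back-propagate only the largest ones. This is a minimal sketch; the function name and the choice of `top_k` are assumptions for illustration.

```python
def ohkm_loss(per_keypoint_losses, top_k):
    """Average only the top_k largest per-keypoint losses (the hard keypoints)."""
    hardest = sorted(per_keypoint_losses, reverse=True)[:top_k]
    return sum(hardest) / len(hardest)

print(ohkm_loss([0.1, 0.9, 0.3, 0.5], top_k=2))  # 0.7
```

Easy keypoints with small losses are ignored, so the refinement stage spends its capacity on the keypoints that are still wrong.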

MSPN (Megvii, 2018)[10]

This work improves on CPN: in the spirit of the stacked hourglass structure, it stacks CPN as well. The authors also argue that detector accuracy is not very important, as long as it is good enough.

Improvements in network structure

Simple Baseline (MSRA, 2018)[11]

Work by Bin Xiao and others at MSRA (who later also released the HRNet series and HigherHRNet). It really is simple: the authors just replaced the upsampling part of hourglass and CPN with deconvolution. In the paper they say they wanted to explore "how good could a simple method be?", which is really well put.

The network structure is very simple. The authors use deconvolution for upsampling, and there are no skip connections between the different feature layers, so it is much more concise than the classic Hourglass and CPN structures.

MultiPoseNet (2018)[12]

The authors use two subnets: one outputs keypoint and segmentation heatmaps, and the other is a detector that outputs person bounding boxes. Both outputs are then fed into the PRN (Pose Residual Network) to obtain the final pose. The most important part is the PRN. The authors say it learns pose structures from data and can therefore handle occlusion, but they do not explain this part clearly. The reported results are very good: on a single 1080 Ti it reaches 23 FPS on the COCO dataset, and its accuracy is competitive with SOTA top-down methods.

The backbone, the part that extracts features, is ResNet with two FPNs (two because there are two subnets after it)
The keypoint subnet outputs keypoint and segmentation heatmaps
The person detection subnet detects people and uses RetinaNet as the detector
The pose residual network outputs the final pose; it is said to learn pose structures from data and thus handle occlusion effectively

Deeply learned compositional models (2019)[13]

I do not fully understand what "compositional" means here. Roughly, the human body is treated as a tree structure: children can help their parent, which is what the paper calls bottom-up (not the usual meaning of the term), and a parent can also help its children, which is what the paper calls top-down.

compositional models

HRNet (MSRA, 2019)[14]

Another work by Bin Xiao et al. at MSRA after Simple Baseline. Networks usually contain a down-sampling and an up-sampling process, where the purpose of up-sampling is to recover a high-resolution representation. Based on this observation, this work keeps a high-resolution representation in one branch throughout. Like ResNet, HRNet is a general-purpose backbone, but its impact on pose estimation has been especially large and its results are better.

The different scales are fused with each other instead of forming a serial down-sampling process. This preserves the features at the original resolution, which carry good spatial information.

Enhanced Channel-wise and Spatial Information (ByteDance, 2019)[15]

This work is mainly an improvement of the network structure. The main innovation is adding channel shuffle and attention mechanisms.

Channel Shuffle Module (CSM): reshape-transpose-reshape. The hope is that after this operation the features can be related to context information across channels.

Spatial attention (feature level): the hope is that the network pays attention to task-related regions rather than the whole picture.
Channel-wise attention (channel level): borrowed from SE-Net, it mainly consists of two steps, GAP and sigmoid, so that the network can select better channels for detecting patterns.
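
The reshape-transpose-reshape step of the Channel Shuffle Module can be illustrated on a plain list of channels. This is a minimal sketch: it shuffles channel indices rather than real feature tensors, and the group count is an arbitrary example value.

```python
def channel_shuffle(channels, groups):
    """Reorder a list of channels via reshape -> transpose -> reshape."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # reshape: split the flat list into (groups, per_group)
    grid = [channels[g * per_group:(g + 1) * per_group] for g in range(groups)]
    # transpose to (per_group, groups) and flatten back to a single list
    return [grid[g][c] for c in range(per_group) for g in range(groups)]

print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, neighbouring channels come from different groups, so a following grouped operation sees information from every group.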

Related Parts Help (2019)[16]

This paper is very good. The authors point out that not all keypoints of the human body are related to each other, so predicting all of them from one shared feature works poorly. The keypoints are divided into five groups according to mutual information. The network first learns a shared feature and then splits into five branches, one per group, to learn specific features for related parts. Evaluation: this method is very good and worth learning from. Related keypoints can help each other, whereas predicting unrelated keypoints from a shared feature leads to what the paper calls negative transfer.

Related body parts: the keypoints are divided into multiple groups according to mutual information
Part-based branching network (PBN): learns specific features for each group of related parts

CrowdPose (Lu Cewu's group at Shanghai Jiao Tong University, 2019)[17]

This work mainly addresses multi-person pose estimation in crowded scenes and builds a new crowd benchmark on top of the MPII, COCO and AI Challenger datasets.

In a crowded scene, a single box may contain the keypoints of many other people. This work designs a joint-candidate loss to estimate multi-peak heatmaps, so that all possible joints become candidates.
Person-joint graph: joint nodes are built from the distances between joints, person nodes are built from the detected human proposals, and the edges between them are built according to whether there is a contribution. This yields a person-joint graph and turns the task into a graph problem whose goal is to maximize the total edge weight in the bipartite graph. It is solved with an updated Kuhn-Munkres algorithm.

Single-Stage Multi-Person Pose Machines (NUS, 2019)[18]

The backbone is hourglass. The paper proposes a hierarchical SPR (Structured Pose Representation) that unifies the person instance and the position information, so the task can be done in a single stage. It estimates a root joint at the center of each person, and the other joints are handled by estimating displacements.

Hierarchical SPR
A unique root for each person
Several joint displacements for each joint
Heatmap for the root joint (L2 loss)
Dense displacement map for each joint (smooth L1 loss)
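
Decoding a root-plus-displacement representation like the one listed above can be sketched as follows. This is an illustration under assumptions: the `parents` list encoding the hierarchy is hypothetical, and joints are assumed to be ordered so that each parent appears before its children.

```python
def decode_spr(root, displacements, parents):
    """Recover joint positions from a root joint and per-joint displacements.

    parents[j] is the index of the joint that joint j is displaced from,
    or None when joint j hangs directly off the root.
    """
    joints = [None] * len(displacements)
    for j, (dx, dy) in enumerate(displacements):
        base = root if parents[j] is None else joints[parents[j]]
        joints[j] = (base[0] + dx, base[1] + dy)
    return joints

# A chain root -> joint 0 -> joint 1 -> joint 2
print(decode_spr((10, 10), [(0, -5), (3, 0), (2, 2)], [None, 0, 1]))
# [(10, 5), (13, 5), (15, 7)]
```

Chaining short displacements along the hierarchy avoids predicting one long displacement from the root to a distant joint such as a wrist or ankle.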

3.2.2. Bottom-up

DeepCut (Germany, 2016)[19]

Uses Fast R-CNN and integer linear programming (ILP); relatively slow

Pipeline (bottom-up)
Detect: detect the body keypoints (adapted Fast R-CNN) and represent them as nodes in a graph
Label: classify the detected keypoints by body-part category, such as arm or leg
Partition: group the keypoints that belong to the same person
Pairwise terms are used for the optimization

Associative Embedding (Jia Deng's group, 2016)[20]

Built on the group's earlier stacked hourglass network (it also inspired the group's later CornerNet). The authors' insight is very good: many CV tasks can be viewed as joint detection and grouping. They propose a tag heatmap to group body parts.

Produce detection heatmaps and associative embedding tags together (bottom-up but single-stage), and then match detections to others that share the same embedding tag
The main contribution is the associative embedding tag: when each joint is predicted, a tag value is predicted for it as well, and joints with the same tag value belong to the same person
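
Grouping by tag value can be sketched with a simple greedy procedure. This is a minimal sketch, not the authors' implementation: it assumes scalar tags, a fixed distance threshold, and detections given as hypothetical `(joint_name, x, y, tag)` tuples.

```python
def group_by_tag(detections, threshold=0.5):
    """Greedily assign each detection to the person whose mean tag is closest."""
    people = []  # each person: {"tags": [...], "joints": {name: (x, y)}}
    for name, x, y, tag in detections:
        best = None
        for person in people:
            mean_tag = sum(person["tags"]) / len(person["tags"])
            dist = abs(tag - mean_tag)
            # A person can own each joint type only once.
            if name not in person["joints"] and dist < threshold:
                if best is None or dist < best[0]:
                    best = (dist, person)
        if best is None:
            people.append({"tags": [tag], "joints": {name: (x, y)}})
        else:
            best[1]["tags"].append(tag)
            best[1]["joints"][name] = (x, y)
    return people
```

With detections such as two heads tagged 1.0 and 3.0 and two necks tagged 1.1 and 2.9, the function returns two people, each with one head and the neck whose tag is closest.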

DeeperCut (Germany, 2017)[21]

An improvement based on DeepCut

Uses a deep ResNet architecture to detect body parts
Uses image-conditioned pairwise terms for the optimization, which can prune many candidate nodes; whether a node is important is judged by the distance between candidate nodes

OpenPose (CMU, 2017)[22]

It runs in real time. At their CVPR talk the authors demonstrated it live to the audience on a laptop, which was amazing. It estimates not only the body pose but also the keypoints of the face, hands and feet, that is, the whole body. It is currently the most influential work among bottom-up methods.

The network structure is an improvement on CPM. The network has two branches: one predicts heatmaps and the other predicts PAFs (part affinity fields), which are the key to this work.
A PAF is a vector field along the connection between two joints; you can think of it as a limb. Based on PAFs, the grouping problem is turned into a bipartite graph matching problem, which is solved with the Hungarian algorithm.
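
The edge weights used in that bipartite matching can be sketched as a line integral: sample the field along the segment between two candidate joints and measure how well it aligns with the segment direction. This is a simplified sketch; `paf` is a hypothetical callable returning the field vector at a point, and the number of samples is an arbitrary choice.

```python
import math

def paf_score(paf, joint_a, joint_b, samples=10):
    """Average alignment between the field and the segment joint_a -> joint_b."""
    dx, dy = joint_b[0] - joint_a[0], joint_b[1] - joint_a[1]
    length = math.hypot(dx, dy)
    if length == 0 or samples < 2:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        vx, vy = paf(joint_a[0] + t * dx, joint_a[1] + t * dy)
        total += vx * ux + vy * uy     # dot product with the limb direction
    return total / samples
```

A field that points along the x axis everywhere, `lambda x, y: (1.0, 0.0)`, gives a score near 1 for a horizontal candidate limb and near 0 for a vertical one, so the matching step prefers pairs of joints that the field actually connects.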

PersonLab (2018)[23]

This paper also estimates offsets for refinement and is also multi-task. It estimates three kinds of offsets, short-range, mid-range and long-range, each with a different role.

Short-range offsets to refine heatmaps
Mid-range offsets to predict pairs of keypoints
Greedy decoding to group keypoints into instances

PifPaf (EPFL, 2019)[24]

The main contribution of this work is the two vector fields PIF and PAF (not the PAF of OpenPose).

The network is based on ResNet, and the encoder finally outputs two branches: the PIF and PAF vector fields.
The PIF vector field is 17x5, where 17 is the number of joints and 5 is the number of values used to refine the heatmap.
The PAF vector field is 19x7, where 19 stands for the 19 kinds of limb connections and 7 is the number of values (confidence and offsets) used to refine the limb vector.
The keypoints are given by PIF and the connections between keypoints by PAF. The next step is grouping with greedy decoding.

HigherHRNet (ByteDance, 2020)[25]

Work by Bin Xiao's team at ByteDance, based on the earlier HRNet and associative embedding.

An attempt to use HRNet in a bottom-up method: associative embedding plus a stronger network.

4. Summary

Innovations generally focus on network structure and feature representation. Network structure is a pit that can never be filled: the essence of network design is how to extract and use information better. The output representations are mainly heatmaps and custom vector fields, and hand-designed vector fields may guide network training better. The joints of the human body are not isolated, and making good use of this prior physical relationship can also guide training better.
