Human Pose Estimation Reading List

CNN Cascaded Architecture

DeepPose: Human Pose Estimation via Deep Neural Networks. (2013)
Pose estimation is formulated as a DNN-based regression problem towards body joints.

IEF: Human Pose Estimation with Iterative Error Feedback. (2015)
This paper proposes a framework that expands the expressive power of hierarchical feature extractors to encompass both input and output spaces by introducing top-down feedback. Instead of directly predicting the outputs in one go, it uses a self-correcting model that progressively changes an initial solution by feeding back error predictions, a process called Iterative Error Feedback (IEF).
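The feedback loop described above can be sketched roughly as follows. This is a toy illustration, not the paper's model: `predict_correction` is a hypothetical stand-in for the learned network, and it simply looks at a known target and returns a bounded step toward it. The step limit `max_step` and the number of iterations are assumptions chosen for the example.

```python
import math

def predict_correction(pose, target, max_step=20.0):
    """Hypothetical stand-in for the learned error predictor.

    A real model would infer the correction from the image and the current
    pose estimate; this toy version uses a known target instead.
    Each joint's correction is clipped to at most `max_step` pixels.
    """
    corrections = []
    for (x, y), (tx, ty) in zip(pose, target):
        dx, dy = tx - x, ty - y
        dist = math.hypot(dx, dy)
        scale = min(1.0, max_step / dist) if dist > 0 else 0.0
        corrections.append((dx * scale, dy * scale))
    return corrections

def iterative_error_feedback(initial_pose, target, steps=4):
    """Refine a pose by repeatedly adding the predicted corrections."""
    pose = list(initial_pose)
    for _ in range(steps):
        eps = predict_correction(pose, target)
        pose = [(x + ex, y + ey) for (x, y), (ex, ey) in zip(pose, eps)]
    return pose

mean_pose = [(50.0, 50.0), (60.0, 80.0)]  # initial guess, e.g. a mean pose
true_pose = [(80.0, 90.0), (65.0, 75.0)]  # toy ground truth
refined = iterative_error_feedback(mean_pose, true_pose)
```

Bounding each step means a far-off joint needs several iterations to reach its target, which is the progressive behaviour the summary describes.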

CPM: Convolutional Pose Machines. (2016)
The contribution of this paper is to implicitly model long-range dependencies between variables in structured prediction tasks such as articulated pose estimation. This is achieved by designing a sequential architecture composed of convolutional networks that directly operate on belief maps from previous stages, producing increasingly refined estimates for part locations, without the need for explicit graphical model-style inference.

Recurrent Human Pose Estimation. (2016)
This paper proposes a model that regresses a heatmap representation for each body keypoint and is able to learn and represent both the part appearances and the context of the part configuration. The model combines a feed-forward module with a recurrent module, where the recurrent module can be run iteratively to increase the effective receptive field of the network and thus improve performance.

CPHR: Human pose estimation via Convolutional Part Heatmap Regression. (2016) ☻
This paper proposes a detection-followed-by-regression CNN cascade. The first part of the cascade outputs part detection heatmaps, and the second part performs regression on these heatmaps.

Joint Training of a Convolutional Network and a Graphical Model for Human Pose Estimation. (2014) ☻
This paper proposes a new hybrid architecture that consists of a deep Convolutional Network and a Markov Random Field.

Human Pose Estimation using Deep Consensus Voting. (2016) ☻
This paper proposes a novel approach where each location in the image votes for the position of each keypoint using a convolutional neural net.

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation. (2016 CVPR) ☻
This paper incorporates the DCNN and the expressive mixture-of-parts model into an end-to-end framework.

Structured Feature Learning for Pose Estimation. (2016 CVPR) ☻
This paper uses a CNN to obtain the feature map of each body joint and then uses a bi-directional tree with stacked transform kernels to pass information between neighboring joints. The final feature maps are obtained by concatenating the two updated feature maps from the bi-directional tree. The score map of joint k is predicted from the combined feature maps through a 1×1 convolution across feature maps.
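As a rough illustration of that last step, a 1×1 convolution across feature maps is just a per-pixel weighted sum over channels. The sketch below uses plain Python lists; the function name `conv1x1`, the weights, and the tiny 2×2 maps are all made up for the example and are not taken from the paper.

```python
def conv1x1(feature_maps, weights, bias=0.0):
    """Apply a single-output 1x1 convolution.

    feature_maps: nested lists shaped [C][H][W].
    weights: one weight per input channel, length C.
    Returns one score map shaped [H][W].
    """
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [
            bias + sum(weights[c] * feature_maps[c][i][j] for c in range(channels))
            for j in range(width)
        ]
        for i in range(height)
    ]

# Two 2x2 feature maps combined into one score map for a single joint.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
]
score_map = conv1x1(features, weights=[1.0, 0.5])
```

Because the kernel covers a single pixel, the spatial layout is untouched and only the channel dimension is mixed.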

Multi-Context Attention for Human Pose Estimation. (2017) ☻
This paper proposes to incorporate convolutional neural networks with a multi-context attention mechanism into an end-to-end framework for human pose estimation. It adopts stacked hourglass networks to generate attention maps from features at multiple resolutions with various semantics. A Conditional Random Field (CRF) is used to model the correlations among neighboring regions in the attention map. The paper further combines the holistic attention model, which focuses on the global consistency of the full human body, with the body part attention model, which focuses on detailed descriptions of different body parts. Additionally, it designs novel Hourglass Residual Units (HRUs) to increase the receptive field of the network.

Associative Embedding: End-to-End Learning for Joint Detection and Grouping. (2016) ☻
This paper introduces associative embedding, a novel method for supervising convolutional neural networks for the task of detection and grouping. The network outputs both a heatmap of per-pixel detection scores and a heatmap of per-pixel identity tags. The detections and groups are then decoded from these two heatmaps.
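One way to picture the decoding step is a greedy grouping of detections by tag similarity. The sketch below is a simplified illustration under stated assumptions, not the paper's decoder: it assumes scalar tags, detections that have already been extracted from the heatmaps, and a hypothetical distance `threshold`.

```python
def group_by_tags(detections, threshold=1.0):
    """Greedily group joint detections whose tags are close.

    detections: list of (joint_name, x, y, tag) tuples.
    Returns a list of people, each a dict mapping joint_name -> (x, y).
    """
    people = []  # each entry: {"joints": {...}, "tags": [...]}
    for joint, x, y, tag in detections:
        best, best_dist = None, threshold
        for person in people:
            if joint in person["joints"]:
                continue  # at most one detection per joint type per person
            mean_tag = sum(person["tags"]) / len(person["tags"])
            dist = abs(tag - mean_tag)
            if dist < best_dist:
                best, best_dist = person, dist
        if best is None:
            # No existing person has a close enough tag: start a new one.
            best = {"joints": {}, "tags": []}
            people.append(best)
        best["joints"][joint] = (x, y)
        best["tags"].append(tag)
    return [p["joints"] for p in people]

detections = [
    ("head", 10, 5, 0.1), ("head", 40, 6, 3.0),
    ("neck", 11, 12, 0.2), ("neck", 41, 13, 2.9),
]
people = group_by_tags(detections)
```

In this toy input the two heads have clearly different tags, so each neck is attached to the head whose tag it matches.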

Learning Feature Pyramids for Human Pose Estimation. (2017)
This paper designs Pyramid Residual Modules (PRMs) to enhance the scale invariance of DCNNs. Given input features, the PRMs learn convolutional filters on various scales of input features, which are obtained with different sub-sampling ratios in a multi-branch network. The paper also provides a theoretical derivation that extends the current weight initialization scheme to multi-branch network structures.

Top-down Approach: first detect individual people and then estimate each person’s pose.

RMPE: Regional Multi-Person Pose Estimation. (2016)
This paper proposes a novel regional multi-person pose estimation (RMPE) framework to facilitate pose estimation in the presence of inaccurate human bounding boxes. The framework consists of three components: a Symmetric Spatial Transformer Network (SSTN), Parametric Pose Non-Maximum Suppression (NMS), and a Pose-Guided Proposals Generator (PGPG). The symmetric STN with a parallel single-person pose estimator (SPPE) is introduced to improve the SPPE when it is given imperfect human proposals.

Towards Accurate Multi-person Pose Estimation in the Wild. (2017)
This paper proposes a simple yet powerful top-down approach consisting of two stages. In the first stage, the location and scale of boxes likely to contain people are predicted with the Faster R-CNN detector. In the second stage, the keypoints of the person potentially contained in each proposed bounding box are estimated. For each keypoint type, dense heatmaps and offsets are predicted using a fully convolutional ResNet. The paper also uses a novel form of keypoint-based Non-Maximum Suppression (NMS) instead of the cruder box-level NMS, and a novel form of keypoint-based confidence score estimation instead of box-level scoring.
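To make the heatmap-plus-offset idea concrete, here is a deliberately simplified decoder: take the highest-scoring heatmap cell and refine its position with the offset stored at that cell. The paper's actual fusion of heatmaps and offsets is more involved; the function name `decode_keypoint`, the `stride` of 8, and the toy inputs are assumptions for illustration only.

```python
def decode_keypoint(heatmap, offsets, stride=8):
    """Decode one keypoint from a coarse heatmap and per-cell offsets.

    heatmap: nested lists shaped [H][W] with a score per cell.
    offsets: nested lists shaped [H][W] of (dx, dy) in input-image pixels.
    Returns (x, y, score) in input-image coordinates.
    """
    cells = ((i, j) for i in range(len(heatmap)) for j in range(len(heatmap[0])))
    i, j = max(cells, key=lambda ij: heatmap[ij[0]][ij[1]])
    dx, dy = offsets[i][j]
    # Column index maps to x, row index maps to y.
    return (j * stride + dx, i * stride + dy, heatmap[i][j])

heatmap = [
    [0.1, 0.2, 0.1],
    [0.0, 0.3, 0.9],
]
offsets = [
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],
    [(0.0, 0.0), (0.0, 0.0), (1.5, -2.0)],
]
x, y, score = decode_keypoint(heatmap, offsets)
```

The offset lets the prediction land between grid cells, so localization is not limited to the resolution of the coarse heatmap.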

CPN: Cascaded Pyramid Network for Multi-Person Pose Estimation. (2017)
The proposed network has two stages: GlobalNet and RefineNet. GlobalNet is a feature pyramid network that can successfully localize “simple” keypoints such as eyes and hands but may fail to precisely localize occluded or invisible keypoints. RefineNet explicitly handles the “hard” keypoints by integrating all levels of feature representations from GlobalNet together with an online hard keypoint mining loss.
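The idea behind online hard keypoint mining can be sketched as keeping only the largest per-keypoint losses when forming the training loss. The snippet below is a minimal sketch: the function name `hard_keypoint_loss`, averaging over the kept losses, and the toy loss values are assumptions, and `top_k` is expected to be at least 1.

```python
def hard_keypoint_loss(per_keypoint_losses, top_k):
    """Average only the `top_k` largest per-keypoint losses.

    per_keypoint_losses: one non-negative loss value per keypoint.
    Keypoints with small losses ("easy" ones) contribute nothing.
    """
    hardest = sorted(per_keypoint_losses, reverse=True)[:top_k]
    return sum(hardest) / len(hardest)

# Four keypoints; only the two with the highest loss are kept.
losses = [0.1, 0.9, 0.3, 0.7]
loss = hard_keypoint_loss(losses, top_k=2)
```

Selecting the hardest keypoints per sample focuses the gradient on occluded or ambiguous joints rather than on those that are already well localized.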

Bottom-up Approach: detect individual body joints and then group them into individuals.

Associative Embedding: End-to-End Learning for Joint Detection and Grouping. (2016)
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. (2016) ☻