Body Pose Estimation¶
BodyPoseNet is an NVIDIA-developed multi-person body pose estimation network included in the TAO Toolkit. It predicts the skeleton of every person in a given input image, where a skeleton consists of keypoints and the connections between them. BodyPoseNet follows a single-shot, bottom-up methodology, so no person detector is required, and, unlike top-down methods, compute does not scale linearly with the number of people in the scene. The pose/skeleton output is commonly used as input for applications such as activity/gesture recognition, fall detection, and posture analysis.
The default model predicts the following 18 keypoints:
nose, neck, right_shoulder, right_elbow, right_wrist, left_shoulder, left_elbow, left_wrist,
right_hip, right_knee, right_ankle, left_hip, left_knee, left_ankle, right_eye, left_eye, right_ear, left_ear
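The keypoint names above come directly from the default model; the snippet below is only an illustrative way to hold them in code, assuming a channel-per-keypoint ordering that matches the list order (the ordering used by a particular trained model may differ).

```python
# Illustrative only: the 18 default keypoint names, in list order.
KEYPOINTS = [
    "nose", "neck",
    "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
    "right_eye", "left_eye", "right_ear", "left_ear",
]

# Assumed mapping from keypoint name to its channel index in the
# predicted confidence maps (heatmaps).
KEYPOINT_INDEX = {name: i for i, name in enumerate(KEYPOINTS)}
```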
Model Architecture¶
BodyPoseNet is a fully convolutional model. Its architecture consists of a backbone network (such as VGG), an initial estimation stage that performs pixel-wise prediction of confidence maps (heatmaps) and part affinity fields (PAFs), followed by multistage refinement (0 to N stages) of the initial predictions.
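The following is a minimal tf.keras sketch of that stage layout, not the actual BodyPoseNet definition: the backbone depth, filter counts, number of refinement stages, and the PAF channel count (assumed here to be 38, i.e. 2 per limb for an 18-keypoint skeleton) are illustrative assumptions; the real network is configured through the TAO training spec.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters, n_convs, name):
    """A small stack of 3x3 convolutions with ReLU activations."""
    for i in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu",
                          name=f"{name}_conv{i + 1}")(x)
    return x

def build_pose_net_sketch(input_shape=(256, 256, 3), num_keypoints=18,
                          num_paf_channels=38, num_refinement_stages=2):
    """Backbone -> initial heatmap/PAF heads -> iterative refinement stages."""
    inputs = layers.Input(shape=input_shape)

    # Backbone: a truncated VGG-like feature extractor (illustrative depth).
    x = conv_block(inputs, 64, 2, "backbone_block1")
    x = layers.MaxPool2D(2)(x)
    x = conv_block(x, 128, 2, "backbone_block2")
    x = layers.MaxPool2D(2)(x)
    features = conv_block(x, 256, 3, "backbone_block3")

    # Initial estimation stage: pixel-wise heatmaps and part affinity fields.
    h = conv_block(features, 128, 3, "stage0_heatmap")
    heatmaps = layers.Conv2D(num_keypoints, 1, name="stage0_heatmaps_out")(h)
    p = conv_block(features, 128, 3, "stage0_paf")
    pafs = layers.Conv2D(num_paf_channels, 1, name="stage0_pafs_out")(p)
    outputs = [heatmaps, pafs]

    # Refinement stages: each stage re-predicts heatmaps and PAFs from the
    # backbone features concatenated with the previous stage's predictions.
    for stage in range(1, num_refinement_stages + 1):
        x = layers.Concatenate(name=f"stage{stage}_concat")(
            [features, heatmaps, pafs])
        h = conv_block(x, 128, 3, f"stage{stage}_heatmap")
        heatmaps = layers.Conv2D(num_keypoints, 1,
                                 name=f"stage{stage}_heatmaps_out")(h)
        p = conv_block(x, 128, 3, f"stage{stage}_paf")
        pafs = layers.Conv2D(num_paf_channels, 1,
                             name=f"stage{stage}_pafs_out")(p)
        outputs += [heatmaps, pafs]

    return Model(inputs, outputs, name="bodyposenet_sketch")
```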
Training Algorithm¶
The training algorithm optimizes the network to minimize the loss on the confidence maps (heatmaps) and part affinity fields for a given image and its ground-truth pose labels. This model can be trained using the Body Pose Estimation training app in TAO Toolkit v3.0. A hedged sketch of such an objective is shown below.
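The sketch below shows a simple L2 objective summed over all prediction stages, in the spirit of the loss described in the reference paper; the exact formulation used by the TAO training app (for example, any masking of unlabeled regions or per-stage weighting) is determined by its training configuration and may differ.

```python
import tensorflow as tf

def stagewise_mse_loss(gt_heatmaps, gt_pafs, predictions):
    """Sum of per-stage mean-squared errors against the ground-truth
    confidence maps (heatmaps) and part affinity fields (PAFs).

    `predictions` is assumed to be the model output list ordered as
    [heatmaps_0, pafs_0, heatmaps_1, pafs_1, ...], as produced by the
    illustrative model sketch above.
    """
    total = tf.constant(0.0)
    for i in range(0, len(predictions), 2):
        pred_heatmaps, pred_pafs = predictions[i], predictions[i + 1]
        total += tf.reduce_mean(tf.square(pred_heatmaps - gt_heatmaps))
        total += tf.reduce_mean(tf.square(pred_pafs - gt_pafs))
    return total
```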
Reference¶
Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, Yaser Sheikh (2017). Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.