Body Pose Estimation - NVIDIA Docs

BodyPoseNet is an NVIDIA-developed multi-person body pose estimation network included in the TAO Toolkit. It predicts a skeleton for every person in an input image; each skeleton consists of keypoints and the connections between them. BodyPoseNet follows a single-shot, bottom-up methodology, so it does not need a separate person detector. Unlike top-down approaches, its compute does not scale linearly with the number of people in the scene. The pose (skeleton) output is commonly used as input for applications such as activity and gesture recognition, fall detection, and posture analysis.

The default model predicts the following 18 keypoints:

nose, neck, right_shoulder, right_elbow, right_wrist, left_shoulder, left_elbow, left_wrist,
right_hip, right_knee, right_ankle, left_hip, left_knee, left_ankle, right_eye, left_eye, right_ear, left_ear
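
When post-processing model output, it can help to keep the keypoint names in a lookup table. The following is a minimal sketch that assumes the channel order matches the order listed above; that ordering is an assumption and should be checked against the model card. The names KEYPOINT_NAMES, KEYPOINT_INDEX, and label_keypoints are hypothetical helpers, not part of TAO Toolkit.

```python
# Keypoint names in the order listed in this document.
# Assumption: the model's output channels follow this same order.
KEYPOINT_NAMES = [
    "nose", "neck", "right_shoulder", "right_elbow", "right_wrist",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_hip", "right_knee", "right_ankle",
    "left_hip", "left_knee", "left_ankle",
    "right_eye", "left_eye", "right_ear", "left_ear",
]

# Map each keypoint name to its index for quick lookup.
KEYPOINT_INDEX = {name: i for i, name in enumerate(KEYPOINT_NAMES)}


def label_keypoints(coords):
    """Pair a sequence of 18 (x, y) coordinates with keypoint names."""
    if len(coords) != len(KEYPOINT_NAMES):
        raise ValueError(
            f"expected {len(KEYPOINT_NAMES)} keypoints, got {len(coords)}"
        )
    return dict(zip(KEYPOINT_NAMES, coords))
```

For example, label_keypoints(coords)["left_wrist"] returns the coordinate stored at index KEYPOINT_INDEX["left_wrist"].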

Model architecture

This is a fully convolutional model. Its architecture consists of a backbone network (such as VGG), an initial estimation stage that makes pixel-wise predictions of confidence maps (heatmaps) and part affinity fields, and a multistage refinement (0 to N stages) of those initial predictions.
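
To illustrate how a confidence map relates to keypoint locations, the sketch below picks candidate keypoints as local maxima of a single heatmap channel. It is an illustration using NumPy, not the TAO Toolkit post-processing code; the function name find_peaks, the 4-neighbor comparison, and the threshold value are assumptions made for this example.

```python
import numpy as np


def find_peaks(heatmap, threshold=0.1):
    """Return (row, col) positions that are above `threshold` and
    strictly greater than their four direct neighbors."""
    heatmap = np.asarray(heatmap, dtype=float)
    # Pad with -inf so border pixels can be compared like interior ones.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    center = padded[1:-1, 1:-1]
    is_peak = (
        (center > threshold)
        & (center > padded[:-2, 1:-1])   # neighbor above
        & (center > padded[2:, 1:-1])    # neighbor below
        & (center > padded[1:-1, :-2])   # neighbor to the left
        & (center > padded[1:-1, 2:])    # neighbor to the right
    )
    rows, cols = np.nonzero(is_peak)
    return list(zip(rows.tolist(), cols.tolist()))
```

In a bottom-up pipeline, peaks found this way for each keypoint channel are candidates that the part affinity fields then help group into per-person skeletons.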

Training algorithm

The training algorithm optimizes the network to minimize the loss on the confidence maps (heatmaps) and part affinity fields for a given image and its ground-truth pose labels. This model can be trained using the Body Pose Estimation training app in TAO Toolkit v3.0.
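
The text above does not state the exact form of the loss. As a rough sketch, assuming a mean squared error summed over stages in the spirit of the paper cited under Reference, the objective could look like the following. The function name pose_loss and its argument layout are hypothetical, and the actual TAO Toolkit loss may add weighting or masking that is not shown here.

```python
import numpy as np


def pose_loss(pred_stages, gt_heatmaps, gt_pafs):
    """Sum, over all stages, the mean squared error of the predicted
    heatmaps and part affinity fields against the ground truth.

    pred_stages: list of (heatmaps, pafs) array pairs, one per stage
                 (the initial estimate plus any refinement stages).
    """
    total = 0.0
    for heatmaps, pafs in pred_stages:
        total += np.mean((np.asarray(heatmaps) - gt_heatmaps) ** 2)
        total += np.mean((np.asarray(pafs) - gt_pafs) ** 2)
    return float(total)
```

Summing the per-stage terms means every stage, including the initial estimate, receives a direct training signal rather than only the final refinement.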

Reference

Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, Yaser Sheikh (2017). Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Intended use case

The primary use case for this model is to detect human poses in a given image. BodyPoseNet is commonly used for activity/gesture recognition, fall detection, posture analysis, etc.

More details are captured in its model card hosted at NGC.