PoseClassificationNet

PoseClassificationNet takes a sequence of skeletons (body poses) as network input and predicts the actions of one or more persons in those frames. The model supported in the current version is based on the spatial-temporal graph convolutional network (ST-GCN), the most commonly used baseline for skeleton-based action recognition because of its simplicity and computational efficiency. Unlike pixel-based action recognition, ST-GCN can exploit local patterns and correlations in a spatial-temporal graph of human skeletons. This model can be used to train graph convolutional networks (GCNs) for other purposes through transfer learning. Newer architectures with state-of-the-art performance will be released in the future. TAO Toolkit provides the network backbone for 3D poses.

Preparing the Dataset

PoseClassificationNet requires a sequence of skeletons (body poses) as input. The coordinates must be normalized. For example, 3D joints are expressed relative to the root keypoint (i.e., the pelvis) and normalized by the focal length (1200.0 for 1080p). The entrypoint for dataset conversion generates an array of spatio-temporal sequences from the JSON metadata output by the deepstream-bodypose-3d app.
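As a rough illustration of the normalization described above, the following sketch subtracts the root keypoint from every joint and divides by the focal length. The function name `normalize_pose_3d`, the `(T, V, 3)` input layout, and the root index `0` are assumptions made for this example; the actual conversion is handled by the dataset conversion entrypoint.

```python
import numpy as np

def normalize_pose_3d(joints, root_index=0, focal_length=1200.0):
    """Center 3D joints on the root keypoint and scale by the focal length.

    joints: array of shape (T, V, 3), i.e. T frames of V joints.
    root_index: index of the root keypoint (pelvis); 0 is an assumption.
    """
    joints = np.asarray(joints, dtype=np.float32)
    root = joints[:, root_index:root_index + 1, :]  # shape (T, 1, 3)
    return (joints - root) / focal_length

# Random stand-in data: 4 frames, 34 joints, 3 coordinates.
frames = np.random.default_rng(0).uniform(-600, 600, size=(4, 34, 3))
normalized = normalize_pose_3d(frames)
```

After normalization, the root keypoint sits at the origin in every frame.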

The input data for training or inference are formatted as a NumPy array in five dimensions (N, C, T, V, M):

N: The number of sequences
C: The number of input channels, which is set to 3 in the NGC model
T: The maximum sequence length in frames, which is 300 (10 seconds at 30 FPS) in the NGC model
V: The number of joint points, set to 34 for the NVIDIA format
M: The number of persons. The pre-trained model assumes a single person, but multiple people are also supported

The output of model inference is an array of N elements that gives the predicted action class for each sequence.
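The five-dimensional layout can be sketched with NumPy as follows. This is a minimal example that assumes each raw sequence arrives as a `(t, V, C)` array and that shorter sequences are zero-padded up to T; the helper `pack_sequences` is hypothetical and not part of TAO Toolkit.

```python
import numpy as np

C, T, V, M = 3, 300, 34, 1  # values quoted above for the NGC model

def pack_sequences(sequences, max_frames=T):
    """Pack variable-length (t, V, C) sequences into an (N, C, T, V, M) array."""
    data = np.zeros((len(sequences), C, max_frames, V, M), dtype=np.float32)
    for n, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype=np.float32)[:max_frames]  # truncate long clips
        # (t, V, C) -> (C, t, V); frames beyond t stay zero (assumed padding)
        data[n, :, :seq.shape[0], :, 0] = seq.transpose(2, 0, 1)
    return data

clips = [np.ones((120, V, C)), np.ones((450, V, C))]
data = pack_sequences(clips)
print(data.shape)  # (2, 3, 300, 34, 1)
```

The resulting array can be written with `np.save` to produce the `.npy` file referenced by `data_path`.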

The labels used for training or evaluation are stored as a pickle file that contains a list of two lists, each with N elements. The first list contains N strings of sample names. The second list contains the labeled action class ID of each sequence. The following is an example:

[["xl6vmD0XBS0.json", "OkLnSMGCWSw.json", "IBopZFDKfYk.json", "HpoFylcrYT4.json", "mlAtn_zi0bY.json", ...], [235, 388, 326, 306, 105, ...]]
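A label file in this format can be written and read back with the standard `pickle` module. The sample names and class IDs below are hypothetical placeholders, and the file is written to a temporary directory purely for illustration.

```python
import os
import pickle
import tempfile

sample_names = ["clip_000.json", "clip_001.json", "clip_002.json"]  # hypothetical
class_ids = [2, 0, 4]  # one action class ID per sample

with tempfile.TemporaryDirectory() as tmp_dir:
    label_path = os.path.join(tmp_dir, "train_label.pkl")

    # The file holds a list of two lists: names first, class IDs second.
    with open(label_path, "wb") as f:
        pickle.dump([sample_names, class_ids], f)

    with open(label_path, "rb") as f:
        loaded_names, loaded_ids = pickle.load(f)

# Both lists must have the same number of elements (N).
assert len(loaded_names) == len(loaded_ids)
```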

The graph to model skeletons is defined by two configuration parameters:

graph_layout (string): Must be one of the following candidates:
- nvidia consists of 34 joints. For more information, please refer to AR SDK Programming Guide.
- openpose consists of 18 joints. For more information, please refer to OpenPose.
- human3.6m consists of 17 joints. For more information, please refer to Human3.6M.
- ntu-rgb+d consists of 25 joints. For more information, please refer to NTU RGB+D.
- ntu_edge consists of 24 joints. For more information, please refer to NTU RGB+D.
- coco consists of 17 joints. For more information, please refer to COCO.
graph_strategy (string): Must be one of the following candidates (for more information, refer to the “Partition Strategies” section in this paper):
- uniform: Uniform Labeling
- distance: Distance Partitioning
- spatial: Spatial Configuration
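To make the three strategies more concrete, the sketch below builds neighbor-subset masks for a toy skeleton. It follows the general idea of the partition strategies (one subset for uniform, two for distance, three for spatial) but is only an illustration: the function names, the toy graph, and the choice of center joint are assumptions, and the normalization applied to the adjacency matrices in a real ST-GCN is omitted.

```python
from collections import deque

import numpy as np

def hop_distances(num_joints, edges):
    """All-pairs hop distance on an undirected skeleton graph, via BFS."""
    neighbors = [[] for _ in range(num_joints)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    dist = np.full((num_joints, num_joints), np.inf)
    for src in range(num_joints):
        dist[src, src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if np.isinf(dist[src, v]):
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist

def partition_masks(num_joints, edges, center, strategy):
    """Return a (K, V, V) stack of 0/1 masks; entry [k, i, j] is 1 when
    joint j belongs to subset k of the 1-hop neighborhood of joint i."""
    dist = hop_distances(num_joints, edges)
    near = dist <= 1  # 1-hop neighborhood, including the joint itself
    if strategy == "uniform":
        subsets = [near]
    elif strategy == "distance":
        subsets = [dist == 0, dist == 1]
    elif strategy == "spatial":
        to_center = dist[:, center]
        root, neighbor = to_center[:, None], to_center[None, :]
        subsets = [near & (neighbor == root),  # same distance to center as root
                   near & (neighbor < root),   # closer to center (centripetal)
                   near & (neighbor > root)]   # farther from center (centrifugal)
    else:
        raise ValueError(f"unknown graph_strategy: {strategy}")
    return np.stack(subsets).astype(np.float32)

# Toy 5-joint skeleton (hypothetical, not one of the supported layouts).
toy_edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
for name in ("uniform", "distance", "spatial"):
    print(name, partition_masks(5, toy_edges, center=1, strategy=name).shape)
```

In every strategy the subsets add up to the same 1-hop neighborhood; the strategies differ only in how that neighborhood is split into groups with separate weights.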

Note

All-in-one scripts are provided for processing Kinetics and self-annotated NVIDIA datasets. The preprocessed data and labels of the NVIDIA dataset can be accessed here.

Creating an Experiment Spec File

The spec file for PoseClassificationNet includes model, dataset, and train parameters. Here is an example spec for training a 3D-pose-based model on the NVIDIA dataset. It contains six classes: “sitting_down”, “getting_up”, “sitting”, “standing”, “walking”, “jumping”:

model:
  model_type: ST-GCN
  pretrained_model_path: "/path/to/pretrained_model.pth"
  input_channels: 3
  dropout: 0.5
  graph_layout: "nvidia"
  graph_strategy: "spatial"
  edge_importance_weighting: True
dataset:
  train_dataset:
    data_path: "/path/to/train_data.npy"
    label_path: "/path/to/train_label.pkl"
  val_dataset:
    data_path: "/path/to/val_data.npy"
    label_path: "/path/to/val_label.pkl"
  num_classes: 6
  label_map:
    sitting_down: 0
    getting_up: 1
    sitting: 2
    standing: 3
    walking: 4
    jumping: 5
  batch_size: 16
  num_workers: 1
train:
  optim:
    lr: 0.1
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0001
    lr_scheduler: "MultiStep"
    lr_steps:
    - 10
    - 60
    lr_decay: 0.1
  num_epochs: 70
  checkpoint_interval: 5

Parameter	Data Type	Default	Description
`model`	dict config	–	The configuration for the model architecture
`dataset`	dict config	–	The configuration for the dataset
`train`	dict config	–	The configuration for the training process

model

The model parameter provides options to change the PoseClassificationNet architecture.

model:
  model_type: ST-GCN
  pretrained_model_path: "/path/to/pretrained_model.pth"
  input_channels: 3
  dropout: 0.5
  graph_layout: "nvidia"
  graph_strategy: "spatial"
  edge_importance_weighting: True

Parameter	Datatype	Default	Description	Supported Values
`model_type`	string	ST-GCN	The type of model, which can only be ST-GCN for now. Newer architectures will be supported in the future.	ST-GCN
`pretrained_model_path`	string		The path to the pre-trained model
`input_channels`	unsigned int	3	The number of input channels (dimension of body poses)	>0
`dropout`	float	0.5	The probability to drop hidden units	0.0 ~ 1.0
`graph_layout`	string	nvidia	The layout of the graph for modeling skeletons. It can be nvidia, openpose, human3.6m, ntu-rgb+d, ntu_edge, or coco.	nvidia/openpose/human3.6m/ntu-rgb+d/ntu_edge/coco
`graph_strategy`	string	spatial	The strategy of the graph for modeling skeletons. It can be uniform, distance, or spatial.	uniform/distance/spatial
`edge_importance_weighting`	bool	True	Specifies whether to enable edge importance weighting	True/False

dataset

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_dataset:
    data_path: "/path/to/train_data.npy"
    label_path: "/path/to/train_label.pkl"
  val_dataset:
    data_path: "/path/to/val_data.npy"
    label_path: "/path/to/val_label.pkl"
  num_classes: 6
  label_map:
    sitting_down: 0
    getting_up: 1
    sitting: 2
    standing: 3
    walking: 4
    jumping: 5
  batch_size: 16
  num_workers: 1

Parameter	Datatype	Default	Description	Supported Values
`train_dataset`	dict		The data_path to the data in a NumPy array and label_path to the labels in a pickle file for training
`val_dataset`	dict		The data_path to the data in a NumPy array and label_path to the labels in a pickle file for validation
`num_classes`	unsigned int	6	The number of action classes	>0
`label_map`	dict		A dict that maps the class names to indices
`random_choose`	bool	False	Specifies whether to randomly choose a portion of the input sequence.	True/False
`random_move`	bool	False	Specifies whether to randomly move the input sequence.	True/False
`window_size`	int	-1	The length of the output sequence. A value of -1 specifies the original length.
`batch_size`	unsigned int	64	The batch size for training and validation	>0
`num_workers`	unsigned int	1	The number of parallel workers processing data	>0
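Before training, it can help to check that `label_map` and `num_classes` agree. The helper below is a hypothetical sanity check, not part of TAO Toolkit, and it assumes that class indices are expected to run contiguously from 0 to `num_classes - 1`.

```python
def check_label_map(label_map, num_classes):
    """Return an index-to-name dict, or raise if the indices are inconsistent."""
    indices = sorted(label_map.values())
    if indices != list(range(num_classes)):
        raise ValueError(
            f"label_map indices {indices} do not match num_classes={num_classes}"
        )
    return {index: name for name, index in label_map.items()}

label_map = {
    "sitting_down": 0,
    "getting_up": 1,
    "sitting": 2,
    "standing": 3,
    "walking": 4,
    "jumping": 5,
}
index_to_name = check_label_map(label_map, num_classes=6)
print(index_to_name[2])  # sitting
```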

Note

The input layout is NCTVM, where N is the batch size, C is the number of input channels, T is the sequence length, V is the number of keypoints, and M is the number of people.

train

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.1
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0001
    lr_scheduler: "MultiStep"
    lr_steps:
    - 10
    - 60
    lr_decay: 0.1
  num_epochs: 70
  checkpoint_interval: 5

Parameter	Datatype	Default	Description	Supported Values
`optim`	dict config		The configuration for the SGD optimizer, including the learning rate, learning rate scheduler, weight decay, etc.
`num_epochs`	unsigned int	70	The total number of epochs to run the experiment	>0
`checkpoint_interval`	unsigned int	5	The interval at which the checkpoints are saved	>0
`grad_clip`	float	0.0	The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping.	>=0
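The effect of `grad_clip` can be pictured with the following NumPy sketch of clipping by L2 norm. It assumes the norm is computed jointly over all gradients, which is a common convention but is not confirmed here for TAO Toolkit; the function name is hypothetical.

```python
import numpy as np

def clip_by_l2_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    if max_norm <= 0.0:  # 0.0 means no clipping, as in the table above
        return grads
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # combined L2 norm is 13
clipped = clip_by_l2_norm(grads, max_norm=1.0)
```

Gradients whose combined norm is already below the threshold pass through unchanged.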

optim

The optim parameter defines the config for the SGD optimizer in training, including the learning rate, learning rate scheduler, and weight decay.

optim:
  lr: 0.1
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001
  lr_scheduler: "MultiStep"
  lr_steps:
  - 10
  - 60
  lr_decay: 0.1

Parameter	Datatype	Default	Description	Supported Values
`lr`	float	0.1	The initial learning rate for the training	>0.0
`momentum`	float	0.9	The momentum for the SGD optimizer	>0.0
`nesterov`	bool	True	Specifies whether to enable Nesterov momentum.	True/False
`weight_decay`	float	1e-4	The weight decay coefficient	>0.0
`lr_scheduler`	string	MultiStep	The learning rate scheduler. Two schedulers are provided: `MultiStep` decreases `lr` by a factor of `lr_decay` at the steps set in `lr_steps`; `AutoReduce` decreases `lr` by a factor of `lr_decay` when `lr_monitor` does not decline by more than 0.1% of its previous value.	MultiStep/AutoReduce
`lr_monitor`	string	val_loss	The monitor value for the `AutoReduce` scheduler	val_loss/train_loss
`patience`	unsigned int	1	The number of epochs with no improvement, after which learning rate will be reduced	>0
`min_lr`	float	1e-4	The minimum learning rate in the training	>0.0
`lr_steps`	int list	[10, 60]	The steps to decrease the learning rate for the `MultiStep` scheduler	int list
`lr_decay`	float	0.1	The decreasing factor for the learning rate scheduler	>0.0
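The `MultiStep` schedule with the default values can be reproduced with a few lines of Python. This sketch assumes the learning rate drops as soon as the epoch index reaches a value in `lr_steps`; the exact boundary behavior in TAO Toolkit may differ, and `multistep_lr` is a hypothetical helper.

```python
def multistep_lr(epoch, base_lr=0.1, lr_steps=(10, 60), lr_decay=0.1):
    """Learning rate at a given epoch under a MultiStep schedule."""
    passed = sum(1 for step in lr_steps if epoch >= step)
    return base_lr * (lr_decay ** passed)

for epoch in (0, 9, 10, 59, 60, 69):
    print(epoch, round(multistep_lr(epoch), 6))
```

With these defaults the learning rate is 0.1 for epochs 0 to 9, 0.01 for epochs 10 to 59, and 0.001 from epoch 60 onward.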

Training the Model

Use the following command to run PoseClassificationNet training:

tao model pose_classification train -e <experiment_spec_file>
    -r <results_dir>
    -k <key>
    [train.gpu_ids=<gpu id list>]

Required Arguments

-e, --experiment_spec_file: The path to the experiment spec file.
-r, --results_dir: The path to a folder where the experiment outputs should be written.
-k, --key: The user-specific encoding key to save or load a .tlt model.

Optional Arguments

train.gpu_ids: The GPU indices list for training. If you set more than one GPU ID, multi-GPU training will be triggered automatically.

Here’s an example of using the PoseClassificationNet training command:

tao model pose_classification train -e $DEFAULT_SPEC -r $RESULTS_DIR -k $KEY

Note

Pre-trained models are not designed to be re-trained with input data of varying dimensions. ST-GCN is a lightweight network, and leveraging a pre-trained model typically doesn’t significantly affect the final accuracy.

Evaluating the Model

The evaluation metric of PoseClassificationNet is the accuracy of action recognition.
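For reference, accuracy here means the fraction of sequences whose predicted class matches the label. A minimal sketch, with made-up predictions and labels:

```python
def accuracy(predicted, labels):
    """Fraction of sequences whose predicted class ID equals the label."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return correct / len(labels)

predicted = [2, 2, 0, 4, 5]  # hypothetical model outputs
labels = [2, 3, 0, 4, 5]     # hypothetical ground truth
print(accuracy(predicted, labels))  # 0.8
```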

Use the following command to run PoseClassificationNet evaluation:

tao model pose_classification evaluate -e <experiment_spec_file>
    -r <results_dir>
    -k <key>
    evaluate.checkpoint=<model to be evaluated>
    evaluate.test_dataset.data_path=<path to test data>
    evaluate.test_dataset.label_path=<path to test labels>
    [evaluate.gpu_id=<gpu index>]

Required Arguments

-e, --experiment_spec_file: The experiment spec file to set up the evaluation experiment.
-r, --results_dir: The path to a folder where the experiment outputs should be written.
-k, --key: The encoding key for the .tlt model.
evaluate.checkpoint: The .tlt model to be evaluated.
evaluate.test_dataset.data_path: The path to the test data.
evaluate.test_dataset.label_path: The path to the test labels.

Optional Argument

evaluate.gpu_id: The GPU index used to run the evaluation. You can specify the GPU index used to run evaluation when the machine has multiple GPUs installed. Note that evaluation can only run on a single GPU.

Here’s an example of using the PoseClassificationNet evaluation command:

tao model pose_classification evaluate -e $DEFAULT_SPEC -r $RESULTS_DIR -k $KEY evaluate.checkpoint=$TRAINED_TLT_MODEL evaluate.test_dataset.data_path=$TEST_DATA evaluate.test_dataset.label_path=$TEST_LABEL

Running Inference on the Model

Use the following command to run inference on PoseClassificationNet with the .tlt model.

tao model pose_classification inference -e <experiment_spec>
    -r <results_dir>
    -k <key>
    inference.checkpoint=<inference model>
    inference.output_file=<path to output file>
    inference.test_dataset.data_path=<path to inference data>
    [inference.gpu_id=<gpu index>]

The output will be a text file, where each line corresponds to the predicted action class for an input sequence.

Required Arguments

-e, --experiment_spec: The experiment spec file to set up inference.
-r, --results_dir: The path to a folder where the experiment outputs should be written.
-k, --key: The encoding key for the .tlt model.
inference.checkpoint: The .tlt model to perform inference with.
inference.output_file: The path to the output text file.
inference.test_dataset.data_path: The path to the test data.

Optional Argument

inference.gpu_id: The GPU index used to run the inference. You can specify the GPU index used to run inference when the machine has multiple GPUs installed. Note that inference can only run on a single GPU.

Here’s an example of using the PoseClassificationNet inference command:

tao model pose_classification inference -e $DEFAULT_SPEC -r $RESULTS_DIR -k $KEY inference.checkpoint=$TRAINED_TLT_MODEL inference.output_file=$OUTPUT_FILE inference.test_dataset.data_path=$TEST_DATA

The expected output for the NVIDIA test data would be as follows:

sit
sit
sit_down
...

Exporting the Model

Use the following command to export PoseClassificationNet to .onnx format for deployment:

tao model pose_classification export -e <experiment_spec>
    -r <results_dir>
    -k <key>
    export.checkpoint=<tlt checkpoint to be exported>
    [export.onnx_file=<path to exported file>]
    [export.gpu_id=<gpu index>]

Required Arguments

-e, --experiment_spec: The experiment spec file to set up export.
-r, --results_dir: The path to a folder where the experiment outputs should be written.
-k, --key: The encoding key for the .tlt model.
export.checkpoint: The .tlt model to be exported.

Optional Arguments

export.onnx_file: The path to save the exported model to. The default path is in the same directory as the *.tlt model.
export.gpu_id: The index of the GPU used to run export. If the machine has multiple GPUs, you can specify the GPU index used to run export. Note that export can only run on a single GPU.

Here’s an example of using the PoseClassificationNet export command:

tao model pose_classification export -e $DEFAULT_SPEC -r $RESULTS_DIR -k $KEY export.checkpoint=$TRAINED_TLT_MODEL

Converting the Pose Data

Use the following command to convert the output JSON metadata from the deepstream-bodypose-3d app and generate spatio-temporal sequences of body poses for inference:

tao model pose_classification dataset_convert -e <experiment_spec>
    -r <results_dir>
    -k <key>
    dataset_convert.data=<path to deepstream-bodypose-3d output data>
    [dataset_convert.pose_type=<pose type>]
    [dataset_convert.num_joints=<number of joints>]
    [dataset_convert.input_width=<input width>]
    [dataset_convert.input_height=<input height>]
    [dataset_convert.focal_length=<focal length>]
    [dataset_convert.sequence_length_max=<maximum sequence length>]
    [dataset_convert.sequence_length_min=<minimum sequence length>]
    [dataset_convert.sequence_length=<sequence length for sampling>]
    [dataset_convert.sequence_overlap=<sequence overlap for sampling>]

Required Arguments

-e, --experiment_spec: The experiment spec file to set up dataset conversion
-r, --results_dir: The path to a folder where the experiment outputs should be written
-k, --key: The encoding key for the .tlt model
dataset_convert.data: The output JSON data from the deepstream-bodypose-3d app

Optional Arguments

dataset_convert.pose_type: The pose type, which can be 3dbp, 25dbp, or 2dbp
dataset_convert.num_joints: The number of joint points in the graph layout
dataset_convert.input_width: The width of input images in pixels for normalization
dataset_convert.input_height: The height of input images in pixels for normalization
dataset_convert.focal_length: The focal length of the camera for normalization
dataset_convert.sequence_length_max: The maximum sequence length for defining array shape
dataset_convert.sequence_length_min: The minimum sequence length for filtering short sequences
dataset_convert.sequence_length: The general sequence length for sampling
dataset_convert.sequence_overlap: The overlap between sequences for sampling
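The interaction between the sequence length, overlap, and minimum length can be sketched as a sliding window over the frames of one tracked ID. This is only an illustration under stated assumptions: `sequence_overlap` is treated as a fraction of `sequence_length`, the numeric values are examples rather than documented defaults, and the actual converter may sample differently.

```python
def sample_windows(num_frames, sequence_length=100, sequence_overlap=0.5,
                   sequence_length_min=10):
    """Return (start, end) frame ranges of overlapping segments."""
    if not 0.0 <= sequence_overlap < 1.0:
        raise ValueError("sequence_overlap must be in [0, 1)")
    if num_frames < sequence_length_min:
        return []  # the whole track is too short to keep
    # Step between window starts; overlap is assumed to be a fraction.
    stride = max(1, int(round(sequence_length * (1.0 - sequence_overlap))))
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + sequence_length, num_frames)
        if end - start >= sequence_length_min:
            windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows

print(sample_windows(250))
# [(0, 100), (50, 150), (100, 200), (150, 250)]
```

Each window would then be padded up to `sequence_length_max` frames when it is written into the output array.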

Here’s an example of using the PoseClassificationNet dataset_convert command:

tao model pose_classification dataset_convert -e $DEFAULT_SPEC -r $RESULTS_DIR -k $KEY dataset_convert.data=$BODYPOSE_3D_JSON

The expected output would be a sampled array for each individual tracked ID saved under the directory for results.

Deploying the Model

You can deploy the trained deep learning and computer-vision models on edge devices, such as a Jetson Xavier, Jetson Nano, Tesla, or in the cloud with NVIDIA GPUs. The exported *.onnx model can be used in the TAO Toolkit Triton Apps.

Running PoseClassificationNet Inference on the Triton Sample

The TAO Toolkit Triton Apps provide an inference sample for Pose Classification. It consumes a TensorRT engine and supports running with either (1) a NumPy array of skeleton series or (2) output JSON metadata from the deepstream-bodypose-3d app.

To use this sample, you need to generate the TensorRT engine from an *.onnx model using trtexec, which is described in the next section.

Generating TensorRT Engine Using `trtexec`

For instructions on generating a TensorRT engine using the trtexec command, refer to the trtexec guide for PoseClassificationNet.

Running the Triton Inference Sample

You can generate the TensorRT engine when starting the Triton server using the following command:

bash scripts/start_server.sh

When the server is running, you can get results from a NumPy array of test data with the client using the command mentioned below:

python tao_client.py <path_to_test_data> \
        -m pose_classification_tao \
        -x 1 \
        -b 1 \
        --mode Pose_classification \
        -i https \
        -u localhost:8000 \
        --async \
        --output_path <path_to_output_directory>

Note

The server will perform inference on the input test data. The results are saved as a text file where each line is formatted as [sequence_index], [rank1_pred_score]([rank1_class_index])=[rank1_class_name], [rank2_pred_score]([rank2_class_index])=[rank2_class_name], ..., [rankN_pred_score]([rankN_class_index])=[rankN_class_name]. The expected output for the NVIDIA test data would be as follows:

0, 27.6388(2)=sitting, 12.0806(3)=standing, 7.0409(1)=getting_up, -3.4164(0)=sitting_down, -16.4449(4)=walking, -26.9046(5)=jumping
1, 21.5809(2)=sitting, 8.4994(3)=standing, 5.1917(1)=getting_up, -2.3813(0)=sitting_down, -12.4322(4)=walking, -20.4436(5)=jumping
2, 5.6206(0)=sitting_down, 4.7264(4)=walking, -1.0996(5)=jumping, -2.3501(1)=getting_up, -3.2933(3)=standing, -3.5337(2)=sitting
....
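A result line in this format can be parsed with a small regular expression. The helper below is a hypothetical post-processing sketch, not part of the Triton sample; it relies only on the line format described above.

```python
import re

# Matches one ranked entry such as "27.6388(2)=sitting".
ENTRY = re.compile(r"(-?\d+(?:\.\d+)?)\((\d+)\)=(\w+)")

def parse_result_line(line):
    """Return (sequence_index, [(score, class_index, class_name), ...])."""
    index_text, _, rest = line.partition(",")
    entries = [(float(score), int(idx), name)
               for score, idx, name in ENTRY.findall(rest)]
    return int(index_text), entries

line = "2, 5.6206(0)=sitting_down, 4.7264(4)=walking, -1.0996(5)=jumping"
seq_index, ranked = parse_result_line(line)
print(seq_index, ranked[0])  # 2 (5.6206, 0, 'sitting_down')
```

The first entry of `ranked` is the top prediction for that sequence.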

You can also get inference results from the JSON output of the deepstream-bodypose-3d app using the following command:

python tao_client.py <path_to_json_file> \
        --dataset_convert_config ../dataset_convert_specs/dataset_convert_config_pose_classification.yaml \
        -m pose_classification_tao \
        -x 1 \
        -b 1 \
        --mode Pose_classification \
        -i https \
        -u localhost:8000 \
        --async \
        --output_path <path_to_output_directory>

Note

The server will perform inference on the input JSON file. The results are also saved as a JSON file.

The skeleton sequence of each object is broken into segments by a dataset converter (refer to the figure below). The sequence_length and sequence_overlap are configurable in dataset_convert_config_pose_classification.yaml. The output labels are assigned to frames after a certain period of time.

End-to-End Inference Using Triton

A sample for end-to-end inference from video is also provided in the TAO Toolkit Triton Apps. The sample runs deepstream-bodypose-3d to generate metadata of bounding boxes, tracked IDs, and 2D/3D poses that are saved in JSON format. The client implicitly converts the metadata into arrays of skeleton sequences and sends them to the Triton server. The predicted action for each sequence is returned and appended to the JSON metadata at corresponding frames. A video with overlaid metadata is also generated for visualization.

You can start the Triton server using the following command (only the Pose Classification model will be downloaded and converted into a TensorRT engine):

bash scripts/pose_cls_e2e_inference/start_server.sh

Once the Triton server has started, open up another terminal and run the following command to begin body pose estimation using DeepStream and run Pose Classification on the DeepStream output using the Triton server instance that you previously spun up:

bash scripts/pose_cls_e2e_inference/start_client.sh
