PoseClassificationNet#

PoseClassificationNet takes a sequence of skeletons (body poses) as network input and predicts the actions of one or more persons in those frames. The model supported in the current version is based on the spatial-temporal graph convolutional network (ST-GCN), the most commonly used baseline for skeleton-based action recognition due to its simplicity and computational efficiency. Unlike pixel-based action recognition, ST-GCN exploits local patterns and correlations in a spatial-temporal graph of human skeletons. This model can also be used to train graph convolutional networks (GCNs) for other purposes through transfer learning. Newer architectures with state-of-the-art performance will be released in the future. TAO provides the network backbone for 3D poses.

Note

  • Throughout this documentation, you will see references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.

    • For instructions on creating a dataset using the remote client, see the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, see the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher and not for FTMS Client.

Preparing the Dataset#

PoseClassificationNet requires a sequence of skeletons (body poses) as input. The coordinates need to be normalized: for example, 3D joints are expressed relative to the root keypoint (i.e., the pelvis) and normalized by the focal length (1200.0 for 1080p). The entrypoint for dataset conversion generates an array of spatio-temporal sequences based on the output JSON metadata from the deepstream-bodypose-3d app.
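A minimal Python sketch of this normalization is shown below. It assumes a (V, 3) array of 3D joints and that the root keypoint (pelvis) sits at index 0; the pelvis index is an assumption for illustration, not taken from the metadata specification.

import numpy as np

FOCAL_LENGTH = 1200.0  # normalization constant for 1080p input, as described above
PELVIS_IDX = 0         # assumed index of the root keypoint in the (V, 3) joint array

def normalize_pose3d(joints_3d: np.ndarray) -> np.ndarray:
    """Express 3D joints relative to the pelvis and scale by the focal length."""
    root = joints_3d[PELVIS_IDX]          # (3,) root keypoint
    return (joints_3d - root) / FOCAL_LENGTH

# Example: 34 joints in the NVIDIA layout, random values standing in for real poses
pose = np.random.rand(34, 3).astype(np.float32)
print(normalize_pose3d(pose).shape)       # (34, 3)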

The input data for training or inference are formatted as a NumPy array in five dimensions (N, C, T, V, M):

  • N: The number of sequences

  • C: The number of input channels, which is set to 3 in the NGC model

  • T: The maximum sequence length in frames, which is 300 (10 seconds at 30 FPS) in the NGC model

  • V: The number of joint points, set to 34 for the NVIDIA format

  • M: The number of persons. The pre-trained model assumes a single person per sequence, but multiple people are also supported

The output of model inference is an array of N elements that gives the predicted action class for each sequence.

The labels used for training or evaluation are stored as a pickle file that consists of a list of two lists, each containing N elements. The first list contains N strings of sample names. The second list contains the labeled action class ID of each sequence. The following is an example:

[["xl6vmD0XBS0.json", "OkLnSMGCWSw.json", "IBopZFDKfYk.json", "HpoFylcrYT4.json", "mlAtn_zi0bY.json", ...], [235, 388, 326, 306, 105, ...]]

The graph to model skeletons is defined by two configuration parameters:

  • graph_layout (string): Must be one of the following candidates:

    • nvidia consists of 34 joints. For more information, please refer to AR SDK Programming Guide.

    • openpose consists of 18 joints. For more information, please refer to OpenPose.

    • human3.6m consists of 17 joints. For more information, please refer to Human3.6M.

    • ntu-rgb+d consists of 25 joints. For more information, please refer to NTU RGB+D.

    • ntu_edge consists of 24 joints. For more information, please refer to NTU RGB+D.

    • coco consists of 17 joints. For more information, please refer to COCO.

  • graph_strategy (string): Must be one of the following candidates (for more information, refer to the “Partition Strategies” section in this paper):

    • uniform: Uniform Labeling

    • distance: Distance Partitioning

    • spatial: Spatial Configuration

Note

All-in-one scripts are provided for processing Kinetics and self-annotated NVIDIA datasets. The preprocessed data and labels of the NVIDIA dataset can be accessed here.

Creating an Experiment Spec File#

The spec file for PoseClassificationNet includes model, dataset, and train parameters. The sections below walk through an example spec for training a 3D-pose-based model on the NVIDIA dataset, which contains six classes: “sitting_down”, “getting_up”, “sitting”, “standing”, “walking”, and “jumping”.

Use the following command to get an experiment spec file for PoseClassificationNet:

SPECS=$(tao-client pose_classification get-spec --action train --job_type experiment --id $EXPERIMENT_ID)
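The returned spec is a JSON document whose fields mirror the YAML parameters documented below. As a hedged Python sketch, you could adjust individual fields before submitting the training job; the stubbed spec string and the exact field names are assumptions made for illustration.

import json

# Stub standing in for the string returned by `tao-client ... get-spec`
specs_json = '{"train": {"num_epochs": 10}, "dataset": {"batch_size": 64}}'
specs = json.loads(specs_json)
specs["train"]["num_epochs"] = 20         # field names assumed to mirror the YAML spec below
specs["dataset"]["batch_size"] = 16
print(json.dumps(specs))                  # pass this string back as --specs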

Parameter | Data Type | Default | Description | Supported Values
--- | --- | --- | --- | ---
model | dict config | – | The configuration of the model architecture | –
dataset | dict config | – | The configuration of the dataset | –
train | dict config | – | The configuration of the training task | –
evaluate | dict config | – | The configuration of the evaluation task | –
inference | dict config | – | The configuration of the inference task | –
encryption_key | string | None | The encryption key to encrypt and decrypt model files | –
results_dir | string | /results | The directory where experiment results are saved | –
export | dict config | – | The configuration of the ONNX export task | –
gen_trt_engine | dict config | – | The configuration of the TensorRT engine generation task; only used in TAO Deploy | –

model#

The model parameter provides options to change the PoseClassificationNet architecture.

model:
  model_type: ST-GCN
  pretrained_model_path: "/path/to/pretrained_model.pth"
  input_channels: 3
  dropout: 0.5
  graph_layout: "nvidia"
  graph_strategy: "spatial"
  edge_importance_weighting: True

Parameter | Data Type | Default | Description | Supported Values
--- | --- | --- | --- | ---
model_type | string | ST-GCN | The type of model; only ST-GCN is supported for now. Newer architectures will be supported in the future. | ST-GCN
pretrained_model_path | string | – | The path to the pre-trained model | –
input_channels | unsigned int | 3 | The number of input channels (dimension of body poses) | >0
dropout | float | 0.5 | The probability to drop hidden units | 0.0 ~ 1.0
graph_layout | string | nvidia | The layout of the graph for modeling skeletons | nvidia/openpose/human3.6m/ntu-rgb+d/ntu_edge/coco
graph_strategy | string | spatial | The strategy of the graph for modeling skeletons | uniform/distance/spatial
edge_importance_weighting | bool | True | Specifies whether to enable edge importance weighting | True/False

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_dataset:
    data_path: "/path/to/train_data.npy"
    label_path: "/path/to/train_label.pkl"
  val_dataset:
    data_path: "/path/to/val_data.npy"
    label_path: "/path/to/val_label.pkl"
  num_classes: 6
  label_map:
    sitting_down: 0
    getting_up: 1
    sitting: 2
    standing: 3
    walking: 4
    jumping: 5
  batch_size: 16
  num_workers: 1

Parameter | Data Type | Default | Description | Supported Values
--- | --- | --- | --- | ---
train_dataset | dict | – | The data_path to the data in a NumPy array and the label_path to the labels in a pickle file for training | –
val_dataset | dict | – | The data_path to the data in a NumPy array and the label_path to the labels in a pickle file for validation | –
num_classes | unsigned int | 6 | The number of action classes | >0
label_map | dict | – | A dict that maps the class names to indices | –
random_choose | bool | False | Specifies whether to randomly choose a portion of the input sequence | True/False
random_move | bool | False | Specifies whether to randomly move the input sequence | True/False
window_size | unsigned int | -1 | The length of the output sequence. A value of -1 specifies the original length. | –
batch_size | unsigned int | 64 | The batch size for training and validation | >0
num_workers | unsigned int | 1 | The number of parallel workers processing data | >0

Note

The input layout is NCTVM, where N is the batch size, C is the number of input channels, T is the sequence length, V is the number of keypoints, and M is the number of people.

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.1
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0001
    lr_scheduler: "MultiStep"
    lr_steps:
    - 10
    - 60
    lr_decay: 0.1
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  seed: 1234

Parameter | Data Type | Default | Description | Supported Values
--- | --- | --- | --- | ---
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | –
seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0
resume_training_checkpoint_path | string | – | The intermediate PyTorch Lightning checkpoint to resume training from | –
results_dir | string | /results/train | The directory to save training results | –
optim | dict config | – | The configuration for the SGD optimizer, including the learning rate, learning-rate scheduler, and weight decay | –
grad_clip | float | 0.0 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping. | >=0

optim#

The optim parameter defines the config for the SGD optimizer in training, including the learning rate, learning scheduler, and weight decay.

optim:
  lr: 0.1
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001
  lr_scheduler: "MultiStep"
  lr_steps:
  - 10
  - 60
  lr_decay: 0.1

Parameter | Data Type | Default | Description | Supported Values
--- | --- | --- | --- | ---
lr | float | 0.1 | The initial learning rate for the training | >0.0
momentum | float | 0.9 | The momentum for the SGD optimizer | >0.0
nesterov | bool | True | Specifies whether to enable Nesterov momentum | True/False
weight_decay | float | 1e-4 | The weight decay coefficient | >0.0
lr_scheduler | string | MultiStep | The learning rate scheduler. Two schedulers are provided: MultiStep decreases the learning rate by lr_decay at the epochs given in lr_steps; AutoReduce decreases the learning rate by lr_decay when lr_monitor has not improved by more than 0.1% of its previous value | MultiStep/AutoReduce
lr_monitor | string | val_loss | The value monitored by the AutoReduce scheduler | val_loss/train_loss
patience | unsigned int | 1 | The number of epochs with no improvement after which the learning rate is reduced | >0
min_lr | float | 1e-4 | The minimum learning rate during training | >0.0
lr_steps | int list | [10, 60] | The epochs at which to decrease the learning rate for the MultiStep scheduler | int list
lr_decay | float | 0.1 | The multiplicative factor by which the learning rate is decreased | >0.0

Training the Model#

Use the following command to run PoseClassificationNet training:

TRAIN_JOB_ID=$(tao-client pose_classification experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

Checkpointing and Resuming Training

A PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved to train.results_dir every train.checkpoint_interval epochs, like so:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint is saved as pc_model_latest.pth. Training automatically resumes from pc_model_latest.pth, if it exists in train.results_dir. This will be superseded by train.resume_training_checkpoint_path, if it is provided.

The major implication of this logic is that, to trigger fresh training from scratch, you should do one of the following (a pre-flight check is sketched after this list):

  • Specify a new, empty results directory (Recommended)

  • Remove the latest checkpoint from the results directory
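For example, a small Python pre-flight check along these lines can confirm whether a stale pc_model_latest.pth would be picked up; the directory path is illustrative, not a required location.

from pathlib import Path

results_dir = Path("/results/train")              # illustrative train.results_dir
latest = results_dir / "pc_model_latest.pth"
if latest.exists():
    print(f"{latest} exists: training would resume from it.")
    # latest.unlink()  # uncomment to force fresh training from scratch
else:
    print("No latest checkpoint found: training will start from scratch.")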

Note

Pre-trained models are not designed to be re-trained with input data of varying dimensions. ST-GCN is a lightweight network, and leveraging a pre-trained model typically doesn’t significantly affect the final accuracy.

Evaluating the Model#

The evaluation metric of PoseClassificationNet is the accuracy of action recognition.

Use the following command to run PoseClassificationNet evaluation:

EVAL_JOB_ID=$(tao-client pose_classification experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

Multi-GPU evaluation is currently not supported for Pose Classification.

Running Inference on the Model#

Use the following command to run inference on PoseClassificationNet.

INFERENCE_JOB_ID=$(tao-client pose_classification experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

The output will be a text file, where each line corresponds to the predicted action class for an input sequence.

Multi-GPU inference is currently not supported for Pose Classification.

The expected output for the NVIDIA test data would be as follows:

sit
sit
sit_down
...
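A minimal Python sketch for reading such an output file back, assuming one predicted class name per line; the output file name is illustrative.

# Read the inference output: one predicted action class per input sequence
with open("results.txt") as f:               # illustrative output file name
    predictions = [line.strip() for line in f if line.strip()]

for idx, action in enumerate(predictions):
    print(f"sequence {idx}: {action}")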

Exporting the Model#

Use the following command to export PoseClassificationNet to .onnx format for deployment:

EXPORT_JOB_ID=$(tao-client pose_classification experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

Converting the Pose Data#

Use the following command to convert the output JSON metadata from the deepstream-bodypose-3d app and generate spatio-temporal sequences of body poses for inference:

DS_CONVERT_JOB_ID=$(tao-client pose_classification dataset-run-action --action dataset_convert --id $DATASET_ID --specs "$SPECS")

The expected output is a sampled array for each tracked ID, saved under the results directory.
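As a quick sanity check, you could load one of the converted arrays in Python and confirm it matches the (N, C, T, V, M) layout described earlier; the file path below is illustrative.

import numpy as np

seqs = np.load("results/0.npy")              # illustrative path to one tracked ID's array
print(seqs.shape)                            # expected to be 5-dimensional: (N, C, T, V, M)
assert seqs.ndim == 5, "converted pose sequences should be (N, C, T, V, M)"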

Deploying the Model#

You can deploy the trained deep learning and computer-vision models on edge devices such as a Jetson Xavier or Jetson Nano, on Tesla GPUs, or in the cloud with NVIDIA GPUs. The exported *.onnx model can be used in the TAO Triton Apps.

Running PoseClassificationNet Inference on the Triton Sample#

The TAO Triton Apps provide an inference sample for Pose Classification. It consumes a TensorRT engine and supports running with either (1) a NumPy array of skeleton series or (2) output JSON metadata from the deepstream-bodypose-3d app.

To use this sample, you need to generate the TensorRT engine from an *.onnx model using trtexec, which is described in the next section.

Generating TensorRT Engine Using trtexec#

For instructions on generating a TensorRT engine using the trtexec command, refer to the trtexec guide for PoseClassificationNet.

Running the Triton Inference Sample#

You can generate the TensorRT engine when starting the Triton server using the following command:

bash scripts/start_server.sh

When the server is running, you can get results from a NumPy array of test data with the client using the following command:

python tao_client.py <path_to_test_data> \
                    -m pose_classification_tao model \
                    -x 1 \
                    -b 1 \
                    --mode Pose_classification \
                    -i https \
                    -u localhost:8000 \
                    --async \
                    --output_path <path_to_output_directory>

Note

The server performs inference on the input test data. The results are saved as a text file where each line is formatted as [sequence_index], [rank1_pred_score]([rank1_class_index])=[rank1_class_name], [rank2_pred_score]([rank2_class_index])=[rank2_class_name], ..., [rankN_pred_score]([rankN_class_index])=[rankN_class_name]. The expected output for the NVIDIA test data would be as follows:

0, 27.6388(2)=sitting, 12.0806(3)=standing, 7.0409(1)=getting_up, -3.4164(0)=sitting_down, -16.4449(4)=walking, -26.9046(5)=jumping
1, 21.5809(2)=sitting, 8.4994(3)=standing, 5.1917(1)=getting_up, -2.3813(0)=sitting_down, -12.4322(4)=walking, -20.4436(5)=jumping
2, 5.6206(0)=sitting_down, 4.7264(4)=walking, -1.0996(5)=jumping, -2.3501(1)=getting_up, -3.2933(3)=standing, -3.5337(2)=sitting
....
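A small Python parser for this ranked-output format, assuming the line layout shown above; this is a sketch only and not part of the Triton client.

import re

LINE_RE = re.compile(r"(-?\d+\.\d+)\((\d+)\)=(\w+)")

def parse_result_line(line: str):
    """Split '0, 27.6388(2)=sitting, ...' into the sequence index and ranked predictions."""
    seq_idx, rest = line.split(",", 1)
    ranked = [(name, float(score), int(idx))
              for score, idx, name in LINE_RE.findall(rest)]
    return int(seq_idx), ranked

line = "0, 27.6388(2)=sitting, 12.0806(3)=standing, 7.0409(1)=getting_up"
idx, ranked = parse_result_line(line)
print(idx, ranked[0])   # 0 ('sitting', 27.6388, 2)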

You can also get inference results from the JSON output of the deepstream-bodypose-3d app using the following command:

python tao_client.py <path_to_json_file> \
                    --dataset_convert_config ../dataset_convert_specs/dataset_convert_config_pose_classification.yaml \
                    -m pose_classification_tao model \
                    -x 1 \
                    -b 1 \
                    --mode Pose_classification \
                    -i https \
                    -u localhost:8000 \
                    --async \
                    --output_path <path_to_output_directory>

Note

The server performs inference on the input JSON file. The results are also saved as a JSON file, which follows the same format as the input and adds the predicted "action" to each object at each frame. A sample of the JSON output would be as follows:

[
  ...,
  {
    "batches": [{
        "batch_id": 0,
        "frame_num": 120,
        "ntp_timestamp": 1651865934597373000,
        "objects": [{
            "action": "sitting",
            "bbox": [1058.529785, 566.782471, 223.130005, 341.585083],
            "object_id": 3,
            "pose25d": [1179.673828, 815.848633, -8.2e-05, 0.48291, 219.287964, 814.737305, -0.016891, 0.357422,...],
            "pose3d": [692.748474, 869.897461, 3784.238281, 0.48291, 815.966187, 864.584229, 3776.338867, 0.357422,...]
          }, {
            "action": "standing",
            "bbox": [1652.608154, 29.364517, 151.506958, 285.322723],
            "object_id": 5,
            "pose25d": [1730.824219, 166.724609, -0.000231, 0.827148, 1745.931641, 171.605469, -0.092529, 0.803711,...],
            "pose3d": [4327.349121, -2095.539795, 6736.708984, 0.827148, 4384.155762, -2055.012207, 6693.950195, 0.803711,...]
          },...
        ]
    }],
    "num_frames_in_batch": 1
  },...
]
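A hedged Python sketch of walking this output JSON and collecting the predicted action per object and frame; the field names are taken from the sample above, and the file name is illustrative.

import json

with open("bodypose3d_results.json") as f:    # illustrative path to the output JSON
    frames = json.load(f)

for entry in frames:
    for batch in entry.get("batches", []):
        frame_num = batch["frame_num"]
        for obj in batch.get("objects", []):
            action = obj.get("action", "unknown")
            print(f"frame {frame_num}, object {obj['object_id']}: {action}")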

The skeleton sequence of each object is broken into segments by a dataset converter (refer to the figure below). The sequence_length and sequence_overlap are configurable in dataset_convert_config_pose_classification.yaml. The output labels are assigned to frames after a certain period of time.

(Figure: segmentation of skeleton sequences for end-to-end pose classification inference)
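Conceptually, the segmentation slides a window of sequence_length frames along each track with some overlap between consecutive windows. The Python sketch below is a simplified illustration under that assumption; the stride and padding behavior are illustrative and the real semantics of sequence_length and sequence_overlap are defined by the dataset-convert config.

import numpy as np

def split_into_segments(pose_seq: np.ndarray, seq_len: int = 300, stride: int = 150):
    """Split a (C, T, V, M) pose sequence into overlapping (C, seq_len, V, M) segments."""
    _, total_frames, _, _ = pose_seq.shape
    segments = []
    for start in range(0, max(total_frames - seq_len + 1, 1), stride):
        segment = pose_seq[:, start:start + seq_len]
        if segment.shape[1] < seq_len:                      # zero-pad a short tail segment
            pad = seq_len - segment.shape[1]
            segment = np.pad(segment, ((0, 0), (0, pad), (0, 0), (0, 0)))
        segments.append(segment)
    return np.stack(segments)                               # (N, C, seq_len, V, M)

track = np.random.rand(3, 450, 34, 1).astype(np.float32)    # one tracked person, 450 frames
print(split_into_segments(track).shape)                      # (2, 3, 300, 34, 1)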

End-to-End Inference Using Triton#

A sample for end-to-end inference from video is also provided in the TAO Triton Apps. The sample runs deepstream-bodypose-3d to generate metadata of bounding boxes, tracked IDs, and 2D/3D poses that are saved in JSON format. The client implicitly converts the metadata into arrays of skeleton sequences and sends them to the Triton server. The predicted action for each sequence is returned and appended to the JSON metadata at corresponding frames. A video with overlaid metadata is also generated for visualization.

You can start the Triton server using the following command (only the Pose Classification model will be downloaded and converted into a TensorRT engine):

bash scripts/pose_cls_e2e_inference/start_server.sh

Once the Triton server has started, open up another terminal and run the following command to begin body pose estimation using DeepStream and run Pose Classification on the DeepStream output using the Triton server instance that you previously spun up:

bash scripts/pose_cls_e2e_inference/start_client.sh