PoseClassificationNet
PoseClassificationNet takes a sequence of skeletons (body poses) as network input and predicts the actions of one or more persons in those frames. The model supported in the current version is based on the spatial-temporal graph convolutional network (ST-GCN), which is the most commonly used baseline for skeleton-based action recognition due to its simplicity and computational efficiency. Unlike pixel-based action recognition, ST-GCN exploits local patterns and correlations in a spatial-temporal graph of human skeletons. This model can be used to train graph convolutional networks (GCNs) for other purposes through transfer learning. Newer architectures with state-of-the-art performance will be released in the future. TAO provides the network backbone for 3D poses.
PoseClassificationNet requires a sequence of skeletons (body poses) for input. The coordinates need to be normalized. For example, 3D joints are produced relative to the root keypoint (i.e. pelvis) and normalized by the focal length (1200.0 for 1080P). The entrypoint for dataset conversion generates an array of spatio-temporal sequences based on the output JSON metadata from the deepstream-bodypose-3d app.
The input data for training or inference is formatted as a NumPy array with five dimensions (N, C, T, V, M):
- N: The number of sequences
- C: The number of input channels, which is set to 3 in the NGC model
- T: The maximum sequence length in frames, which is 300 (10 seconds at 30 FPS) in the NGC model
- V: The number of joint points, set to 34 for the NVIDIA format
- M: The number of persons. The pre-trained model assumes a single object, but it can also support multiple people.
The output of model inference is an array of N elements that gives the predicted action class for each sequence.
The labels used for training or evaluation are stored in a pickle file that consists of a list of two lists, each containing N elements. The first list contains N strings of sample names; the second list contains the labeled action class ID of each sequence. The following is an example:
[["xl6vmD0XBS0.json", "OkLnSMGCWSw.json", "IBopZFDKfYk.json", "HpoFylcrYT4.json", "mlAtn_zi0bY.json", ...], [235, 388, 326, 306, 105, ...]]
The graph used to model skeletons is defined by two configuration parameters:
- graph_layout (string): Must be one of the following candidates:
  - nvidia consists of 34 joints. For more information, refer to the AR SDK Programming Guide.
  - openpose consists of 18 joints. For more information, refer to OpenPose.
  - human3.6m consists of 17 joints. For more information, refer to Human3.6M.
  - ntu-rgb+d consists of 25 joints. For more information, refer to NTU RGB+D.
  - ntu_edge consists of 24 joints. For more information, refer to NTU RGB+D.
  - coco consists of 17 joints. For more information, refer to COCO.
- graph_strategy (string): Must be one of the following candidates (for more information, refer to the “Partition Strategies” section of the ST-GCN paper):
  - uniform: Uniform Labeling
  - distance: Distance Partitioning
  - spatial: Spatial Configuration
The spec file for PoseClassificationNet includes model, dataset, and train parameters. Here is an example spec for training a 3D-pose-based model on the NVIDIA dataset, which contains six classes: “sitting_down”, “getting_up”, “sitting”, “standing”, “walking”, and “jumping”:
model:
  model_type: ST-GCN
  pretrained_model_path: "/path/to/pretrained_model.pth"
  input_channels: 3
  dropout: 0.5
  graph_layout: "nvidia"
  graph_strategy: "spatial"
  edge_importance_weighting: True
dataset:
  train_dataset:
    data_path: "/path/to/train_data.npy"
    label_path: "/path/to/train_label.pkl"
  val_dataset:
    data_path: "/path/to/val_data.npy"
    label_path: "/path/to/val_label.pkl"
  num_classes: 6
  label_map:
    sitting_down: 0
    getting_up: 1
    sitting: 2
    standing: 3
    walking: 4
    jumping: 5
  batch_size: 16
  num_workers: 1
train:
  optim:
    lr: 0.1
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0001
    lr_scheduler: "MultiStep"
    lr_steps:
      - 10
      - 60
    lr_decay: 0.1
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  seed: 1234
Parameter | Data Type | Default | Description | Supported Values
---|---|---|---|---
model | dict config | – | The configuration of the model architecture |
dataset | dict config | – | The configuration of the dataset |
train | dict config | – | The configuration of the training task |
evaluate | dict config | – | The configuration of the evaluation task |
inference | dict config | – | The configuration of the inference task |
encryption_key | string | None | The encryption key to encrypt and decrypt model files |
results_dir | string | /results | The directory where experiment results are saved |
export | dict config | – | The configuration of the ONNX export task |
gen_trt_engine | dict config | – | The configuration of the TensorRT engine generation task. Only used in TAO Deploy |
model
The model parameter provides options to change the PoseClassificationNet architecture.
model:
  model_type: ST-GCN
  pretrained_model_path: "/path/to/pretrained_model.pth"
  input_channels: 3
  dropout: 0.5
  graph_layout: "nvidia"
  graph_strategy: "spatial"
  edge_importance_weighting: True
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
model_type | string | ST-GCN | The type of model; only ST-GCN is supported for now. Newer architectures will be supported in the future. | ST-GCN
pretrained_model_path | string | – | The path to the pre-trained model |
input_channels | unsigned int | 3 | The number of input channels (dimension of body poses) | >0
dropout | float | 0.5 | The probability to drop hidden units | 0.0 ~ 1.0
graph_layout | string | nvidia | The layout of the graph for modeling skeletons | nvidia/openpose/human3.6m/ntu-rgb+d/ntu_edge/coco
graph_strategy | string | spatial | The strategy of the graph for modeling skeletons | uniform/distance/spatial
edge_importance_weighting | bool | True | Specifies whether to enable edge importance weighting | True/False
dataset
The dataset parameter defines the dataset source, training batch size, and augmentation.
dataset:
  train_dataset:
    data_path: "/path/to/train_data.npy"
    label_path: "/path/to/train_label.pkl"
  val_dataset:
    data_path: "/path/to/val_data.npy"
    label_path: "/path/to/val_label.pkl"
  num_classes: 6
  label_map:
    sitting_down: 0
    getting_up: 1
    sitting: 2
    standing: 3
    walking: 4
    jumping: 5
  batch_size: 16
  num_workers: 1
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
train_dataset | dict | – | The data_path to the data in a NumPy array and the label_path to the labels in a pickle file for training |
val_dataset | dict | – | The data_path to the data in a NumPy array and the label_path to the labels in a pickle file for validation |
num_classes | unsigned int | 6 | The number of action classes | >0
label_map | dict | – | A dict that maps the class names to indices |
random_choose | bool | False | Specifies whether to randomly choose a portion of the input sequence | True/False
random_move | bool | False | Specifies whether to randomly move the input sequence | True/False
window_size | unsigned int | -1 | The length of the output sequence. A value of -1 specifies the original length. |
batch_size | unsigned int | 64 | The batch size for training and validation | >0
num_workers | unsigned int | 1 | The number of parallel workers processing data | >0
The input layout is NCTVM, where N is the batch size, C is the number of input channels, T is the sequence length, V is the number of keypoints, and M is the number of people.
train
The train parameter defines the hyperparameters of the training process.
train:
  optim:
    lr: 0.1
    momentum: 0.9
    nesterov: True
    weight_decay: 0.0001
    lr_scheduler: "MultiStep"
    lr_steps:
      - 10
      - 60
    lr_decay: 0.1
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  seed: 1234
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0
gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training |
seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0
num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0
checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0
validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0
resume_training_checkpoint_path | string | – | The intermediate PyTorch Lightning checkpoint to resume training from |
results_dir | string | /results/train | The directory to save training results |
optim | dict config | – | The configuration for the SGD optimizer, including the learning rate, learning scheduler, and weight decay |
grad_clip | float | 0.0 | The amount to clip the gradient by the L2 norm. A value of 0.0 specifies no clipping. | >=0
optim
The optim parameter defines the config for the SGD optimizer in training, including the learning rate, learning scheduler, and weight decay.
optim:
  lr: 0.1
  momentum: 0.9
  nesterov: True
  weight_decay: 0.0001
  lr_scheduler: "MultiStep"
  lr_steps:
    - 10
    - 60
  lr_decay: 0.1
Parameter | Datatype | Default | Description | Supported Values
---|---|---|---|---
lr | float | 0.1 | The initial learning rate for the training | >0.0
momentum | float | 0.9 | The momentum for the SGD optimizer | >0.0
nesterov | bool | True | Specifies whether to enable Nesterov momentum | True/False
weight_decay | float | 1e-4 | The weight decay coefficient | >0.0
lr_scheduler | string | MultiStep | The learning scheduler. Two schedulers are provided: MultiStep and AutoReduce. | MultiStep/AutoReduce
lr_monitor | string | val_loss | The monitor value for the AutoReduce scheduler | val_loss/train_loss
patience | unsigned int | 1 | The number of epochs with no improvement, after which the learning rate will be reduced | >0
min_lr | float | 1e-4 | The minimum learning rate in the training | >0.0
lr_steps | int list | [10, 60] | The steps to decrease the learning rate for the MultiStep scheduler | int list
lr_decay | float | 0.1 | The decreasing factor for the learning rate scheduler | >0.0
Use the following command to run PoseClassificationNet training:
tao model pose_classification train [-h] -e <experiment_spec>
[results_dir=<global_results_dir>]
[model.<model_option>=<model_option_value>]
[dataset.<dataset_option>=<dataset_option_value>]
[train.<train_option>=<train_option_value>]
[train.gpu_ids=<gpu indices>]
[train.num_gpus=<number of gpus>]
Required Arguments
The only required argument is the path to the experiment spec:
- -e, --experiment_spec: The experiment specification file to set up the training experiment
Optional Arguments
You can set optional arguments to override the option values in the experiment spec file.
- -h, --help: Show this help message and exit.
- model.<model_option>: The model options.
- dataset.<dataset_option>: The dataset options.
- train.<train_option>: The train options.
- train.optim.<optim_option>: The optimizer options.
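For example, a hypothetical invocation that overrides a few spec options from the command line (the paths and values are placeholders):

tao model pose_classification train -e /path/to/experiment.yaml \
    results_dir=/results \
    dataset.batch_size=32 \
    train.num_gpus=2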
For training, evaluation, and inference, we expose two variables for each respective task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but inconsistent (for example, num_gpus = 1 with gpu_ids = [0, 1]), they are modified to follow the setting with more GPUs (in this example, num_gpus is changed to 2).
Checkpointing and Resuming Training
At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. These are saved in train.results_dir, like so:
$ ls /results/train
'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'
The latest checkpoint is saved as pc_model_latest.pth. Training automatically resumes from pc_model_latest.pth if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path if it is provided.
The major implication of this logic is that, if you wish to trigger fresh training from scratch, you should either:
- Specify a new, empty results directory (recommended)
- Remove the latest checkpoint from the results directory
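For example, to resume from a specific checkpoint instead of the latest one (the paths are placeholders):

tao model pose_classification train -e /path/to/experiment.yaml \
    train.resume_training_checkpoint_path=/results/train/model_epoch_004.pth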
Pre-trained models are not designed to be re-trained with input data of varying dimensions. ST-GCN is a lightweight network, and leveraging a pre-trained model typically does not significantly affect the final accuracy.
The evaluation metric of PoseClassificationNet is the accuracy of action recognition.
Use the following command to run PoseClassificationNet evaluation:
tao model pose_classification evaluate [-h] -e <experiment_spec_file>
evaluate.checkpoint=<model to be evaluated>
evaluate.test_dataset.data_path=<path to test data>
evaluate.test_dataset.label_path=<path to test labels>
[evaluate.<evaluate_option>=<evaluate_option_value>]
[evaluate.gpu_ids=<gpu indices>]
[evaluate.num_gpus=<number of gpus>]
Multi-GPU evaluation is currently not supported for Pose Classification.
Required Arguments
- -e, --experiment_spec: The experiment spec file to set up the evaluation experiment
- evaluate.checkpoint: The .pth model to be evaluated
- evaluate.test_dataset.data_path: The path to the test data
- evaluate.test_dataset.label_path: The path to the test labels
Optional Arguments
- evaluate.gpu_ids: The GPU indices to run evaluation. Defaults to [0].
- evaluate.num_gpus: The number of GPUs to run evaluation. Defaults to 1.
- evaluate.results_dir: The directory to save the evaluation results. Defaults to /results/evaluate.
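For example, a hypothetical evaluation run on the latest training checkpoint (the paths are placeholders):

tao model pose_classification evaluate -e /path/to/experiment.yaml \
    evaluate.checkpoint=/results/train/pc_model_latest.pth \
    evaluate.test_dataset.data_path=/path/to/test_data.npy \
    evaluate.test_dataset.label_path=/path/to/test_label.pkl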
Use the following command to run inference on PoseClassificationNet:
tao model pose_classification inference [-h] -e <experiment_spec>
inference.checkpoint=<inference model>
inference.output_file=<path to output file>
inference.test_dataset.data_path=<path to inference data>
[inference.<infer_option>=<infer_option_value>]
[inference.gpu_ids=<gpu indices>]
[inference.num_gpus=<number of gpus>]
The output will be a text file, where each line corresponds to the predicted action class for an input sequence.
Multi-GPU inference is currently not supported for Pose Classification.
Required Arguments
- -e, --experiment_spec: The experiment spec file to set up the inference experiment
- inference.checkpoint: The .pth model to run inference with
- inference.output_file: The path to the output text file
- inference.test_dataset.data_path: The path to the test data
Optional Arguments
- inference.gpu_ids: The GPU indices to run inference. Defaults to [0].
- inference.num_gpus: The number of GPUs to run inference. Defaults to 1.
- inference.results_dir: The directory to save the inference results. Defaults to /results/inference.
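For example, a hypothetical inference run on the latest training checkpoint (the paths are placeholders):

tao model pose_classification inference -e /path/to/experiment.yaml \
    inference.checkpoint=/results/train/pc_model_latest.pth \
    inference.output_file=/results/inference/results.txt \
    inference.test_dataset.data_path=/path/to/test_data.npy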
The expected output for the NVIDIA test data would be as follows:
sit
sit
sit_down
...
Use the following command to export PoseClassificationNet to .onnx format for deployment:
tao model pose_classification export -e <experiment_spec>
export.checkpoint=<tlt checkpoint to be exported>
export.onnx_file=<path to exported file>
[export.gpu_id=<gpu index>]
Required Arguments
- -e, --experiment_spec: The path to an experiment spec file
- export.checkpoint: The .pth model to export
- export.onnx_file: The path where the .etlt or .onnx model is saved
Optional Arguments
- export.gpu_id: The index of the GPU used to run export. If the machine has multiple GPUs, you can specify the GPU index used to run export. Note that export can only run on a single GPU.
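For example, a hypothetical export of the latest training checkpoint (the paths are placeholders):

tao model pose_classification export -e /path/to/experiment.yaml \
    export.checkpoint=/results/train/pc_model_latest.pth \
    export.onnx_file=/results/export/pose_classification.onnx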
Use the following command to convert the output JSON metadata from the deepstream-bodypose-3d app and generate spatio-temporal sequences of body poses for inference:
tao model pose_classification dataset_convert -e <experiment_spec>
dataset_convert.data=<path to deepstream-bodypose-3d output data>
[dataset_convert.pose_type=<pose type>]
[dataset_convert.num_joints=<number of joints>]
[dataset_convert.input_width=<input width>]
[dataset_convert.input_height=<input height>]
[dataset_convert.focal_length=<focal length>]
[dataset_convert.sequence_length_max=<maximum sequence length>]
[dataset_convert.sequence_length_min=<minimum sequence length>]
[dataset_convert.sequence_length=<sequence length for sampling>]
[dataset_convert.sequence_overlap=<sequence overlap for sampling>]
Required Arguments
- -e, --experiment_spec: The experiment spec file to set up dataset conversion
- dataset_convert.data: The output JSON data from the deepstream-bodypose-3d app
Optional Arguments
- dataset_convert.results_dir: The path to a folder where the experiment outputs should be written
- dataset_convert.pose_type: The pose type, which can be chosen from 3dbp, 25dbp, or 2dbp
- dataset_convert.num_joints: The number of joint points in the graph layout
- dataset_convert.input_width: The width of input images in pixels for normalization
- dataset_convert.input_height: The height of input images in pixels for normalization
- dataset_convert.focal_length: The focal length of the camera for normalization
- dataset_convert.sequence_length_max: The maximum sequence length for defining the array shape
- dataset_convert.sequence_length_min: The minimum sequence length for filtering short sequences
- dataset_convert.sequence_length: The general sequence length for sampling
- dataset_convert.sequence_overlap: The overlap between sequences for sampling
The expected output is a sampled array for each individual tracked ID, saved under the results directory.
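For reference, a hypothetical conversion of deepstream-bodypose-3d output for 3D poses in the 34-joint NVIDIA format from a 1080p camera might look as follows (the paths are placeholders, and the values simply mirror those discussed earlier on this page):

tao model pose_classification dataset_convert -e /path/to/experiment.yaml \
    dataset_convert.data=/path/to/bodypose3d_results.json \
    dataset_convert.pose_type=3dbp \
    dataset_convert.num_joints=34 \
    dataset_convert.input_width=1920 \
    dataset_convert.input_height=1080 \
    dataset_convert.focal_length=1200.0 \
    dataset_convert.sequence_length_max=300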
You can deploy the trained deep learning and computer-vision models on edge devices, such as a Jetson Xavier, Jetson Nano, or Tesla, or in the cloud with NVIDIA GPUs. The exported *.onnx model can be used in the TAO Triton Apps.
Running PoseClassificationNet Inference on the Triton Sample
The TAO Triton Apps provide an inference sample for Pose Classification. It consumes a TensorRT engine and supports running with either (1) a NumPy array of skeleton series or (2) output JSON metadata from the deepstream-bodypose-3d app.
To use this sample, you need to generate the TensorRT engine from an *.onnx model using trtexec, which is described in the next section.
Generating TensorRT Engine Using trtexec
For instructions on generating a TensorRT engine using the trtexec command, refer to the trtexec guide for PoseClassificationNet.
Running the Triton Inference Sample
You can generate the TensorRT engine when starting the Triton server using the following command:
bash scripts/start_server.sh
When the server is running, you can use the client to get results from a NumPy array of test data with the following command:
python tao_client.py <path_to_test_data> \
-m pose_classification_tao \
-x 1 \
-b 1 \
--mode Pose_classification \
-i https \
-u localhost:8000 \
--async \
--output_path <path_to_output_directory>
The server performs inference on the input test data. The results are saved as a text file where each line is formatted as [sequence_index], [rank1_pred_score]([rank1_class_index])=[rank1_class_name], [rank2_pred_score]([rank2_class_index])=[rank2_class_name], ..., [rankN_pred_score]([rankN_class_index])=[rankN_class_name].
The expected output for the NVIDIA test data would be as follows:
0, 27.6388(2)=sitting, 12.0806(3)=standing, 7.0409(1)=getting_up, -3.4164(0)=sitting_down, -16.4449(4)=walking, -26.9046(5)=jumping
1, 21.5809(2)=sitting, 8.4994(3)=standing, 5.1917(1)=getting_up, -2.3813(0)=sitting_down, -12.4322(4)=walking, -20.4436(5)=jumping
2, 5.6206(0)=sitting_down, 4.7264(4)=walking, -1.0996(5)=jumping, -2.3501(1)=getting_up, -3.2933(3)=standing, -3.5337(2)=sitting
....
You can also get inference results from the JSON output of the deepstream-bodypose-3d app using the following command:
python tao_client.py <path_to_json_file> \
--dataset_convert_config ../dataset_convert_specs/dataset_convert_config_pose_classification.yaml \
-m pose_classification_tao \
-x 1 \
-b 1 \
--mode Pose_classification \
-i https \
-u localhost:8000 \
--async \
--output_path <path_to_output_directory>
The server performs inference on the input JSON file. The results are also saved as a JSON file, which follows the same format as the input and adds the predicted "action" to each object at each frame. A sample of the JSON output would be as follows:
[ ...,
  {
    "batches": [{
      "batch_id": 0,
      "frame_num": 120,
      "ntp_timestamp": 1651865934597373000,
      "objects": [{
        "action": "sitting",
        "bbox": [1058.529785, 566.782471, 223.130005, 341.585083],
        "object_id": 3,
        "pose25d": [1179.673828, 815.848633, -8.2e-05, 0.48291, 219.287964, 814.737305, -0.016891, 0.357422, ...],
        "pose3d": [692.748474, 869.897461, 3784.238281, 0.48291, 815.966187, 864.584229, 3776.338867, 0.357422, ...]
      }, {
        "action": "standing",
        "bbox": [1652.608154, 29.364517, 151.506958, 285.322723],
        "object_id": 5,
        "pose25d": [1730.824219, 166.724609, -0.000231, 0.827148, 1745.931641, 171.605469, -0.092529, 0.803711, ...],
        "pose3d": [4327.349121, -2095.539795, 6736.708984, 0.827148, 4384.155762, -2055.012207, 6693.950195, 0.803711, ...]
      }, ...]
    }],
    "num_frames_in_batch": 1
  }, ...
]
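As a minimal sketch (assuming the structure shown above and a hypothetical output file name), the predicted actions can be pulled out of this JSON with a few lines of Python:

import json

# "results.json" is a hypothetical name for the client's JSON output
with open("results.json") as f:
    entries = json.load(f)

for entry in entries:
    for batch in entry["batches"]:
        for obj in batch["objects"]:
            print(batch["frame_num"], obj["object_id"], obj["action"])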
The skeleton sequence of each object is broken into segments by a dataset converter (refer to the figure below). The sequence_length and sequence_overlap are configurable in dataset_convert_config_pose_classification.yaml. The output labels are assigned to frames after a certain period of time.
End-to-End Inference Using Triton
A sample for end-to-end inference from video is also provided in the TAO Triton Apps. The sample runs deepstream-bodypose-3d to generate metadata of bounding boxes, tracked IDs, and 2D/3D poses that are saved in JSON format. The client implicitly converts the metadata into arrays of skeleton sequences and sends them to the Triton server. The predicted action for each sequence is returned and appended to the JSON metadata at corresponding frames. A video with overlaid metadata is also generated for visualization.
You can start the Triton server using the following command (only the Pose Classification model will be downloaded and converted into a TensorRT engine):
bash scripts/pose_cls_e2e_inference/start_server.sh
Once the Triton server has started, open another terminal and run the following command to perform body pose estimation with DeepStream and classify the poses in the DeepStream output using the Triton server instance you previously started:
bash scripts/pose_cls_e2e_inference/start_client.sh