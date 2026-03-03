ActionRecognitionNet takes a sequence of images as network input and predicts the people’s action in those images. TAO provides the network backbones in 2D/3D with the following input options: RGB-only input, optical flow (OF) only input, and two-stream joint input (RGB+OF).

Action Recognition Net Architecture#

The spec format is YAML for TAO Launcher, and JSON for FTMS Client.

File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher, not for FTMS Client.

Preparing the Dataset#

ActionRecognitionNet requires RGB video frames for the RGB input stream and optical flow vectors for the OF input stream. The x-axis and y-axis components of the raw optical flow vectors should be mapped to grayscale images for training. A tool is provided to preprocess samples: it converts each video to frames and generates optical flow images based on the NVIDIA Optical Flow (NVOF) SDK.

Organize the data in the following structure:

    /Dataset
        /class_a
            /video_1
                /rgb
                    000000.png
                    000001.png
                    ...
                    N.png
                /u
                    000000.jpg
                    000001.jpg
                    ...
                    N-1.jpg
                /v
                    000000.jpg
                    000001.jpg
                    ...
                    N-1.jpg

The root directory of the dataset contains one sub-directory per class. Each class directory has a sub-folder for each video, and each video sub-folder contains rgb, u, and v folders that respectively hold the RGB frames, the optical flow x-axis grayscale images, and the optical flow y-axis grayscale images. The u and v folders can be empty if you want to train an RGB-only model. A script is provided to generate RGB frames only.

Note: The preprocess tool is released on GitHub under the MIT license. All-in-one scripts are also provided for processing the HMDB51 dataset.

The common data process pipeline can be depicted with the following diagrams:

RGB-Only data process pipeline#

OF-Only data process pipeline#
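Before training, it can help to confirm that a dataset follows the directory layout described above. The following is a minimal sketch, not part of the TAO tooling: the function name `summarize_dataset` and the `require_flow` flag are hypothetical, and the check assumes `.png` RGB frames and `.jpg` optical flow images as shown in the structure above.

```python
from pathlib import Path


def summarize_dataset(root, require_flow=False):
    """Count frames per stream for every /class/video folder under root.

    Hypothetical helper: returns {class_name: {video_name: {"rgb": n, "u": n, "v": n}}}.
    """
    summary = {}
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        videos = {}
        for video_dir in sorted(p for p in class_dir.iterdir() if p.is_dir()):
            counts = {}
            # rgb holds .png frames; u and v hold .jpg optical flow images.
            for stream, suffix in (("rgb", ".png"), ("u", ".jpg"), ("v", ".jpg")):
                stream_dir = video_dir / stream
                if stream_dir.is_dir():
                    counts[stream] = sum(
                        1 for f in stream_dir.iterdir() if f.suffix == suffix
                    )
                else:
                    counts[stream] = 0
            if counts["rgb"] == 0:
                raise ValueError(f"{video_dir}: no RGB frames found")
            # u and v may be empty for RGB-only training, so only check on request.
            if require_flow and (counts["u"] == 0 or counts["u"] != counts["v"]):
                raise ValueError(f"{video_dir}: u/v folders are empty or mismatched")
            videos[video_dir.name] = counts
        summary[class_dir.name] = videos
    return summary
```

Calling `summarize_dataset("/Dataset")` is enough for an RGB-only model; pass `require_flow=True` when preparing data for an OF-only or joint model.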

Creating an Experiment Spec File#

The spec file for ActionRecognitionNet includes model, train, and dataset parameters. Here is an example spec for training a 3D RGB-only model with a resnet18 backbone on a dataset that contains 5 classes: “walk”, “sits”, “squat”, “fall”, “bend”:

TAO Client (v2 API)

Use the following command to get an experiment spec file for ActionRecognitionNet:

    BASE_EXPERIMENT_ID=$(tao action_recognition list-base-experiments | jq -r '.[0].id')
    SPECS=$(tao action_recognition get-job-schema --action train --base-experiment-id $BASE_EXPERIMENT_ID | jq -r '.default')

TAO Launcher

    model:
      model_type: rgb
      backbone: resnet18
      rgb_seq_length: 3
      input_type: 3d
      sample_rate: 1
      dropout_ratio: 0.0
    dataset:
      train_dataset_dir: /data/train
      val_dataset_dir: /data/test
      label_map:
        walk: 0
        sits: 1
        squat: 2
        fall: 3
        bend: 4
      output_shape:
      - 224
      - 224
      batch_size: 32
      workers: 8
      clips_per_video: 15
      augmentation_config:
        train_crop_type: no_crop
        horizontal_flip_prob: 0.5
        rgb_input_mean: [0.5]
        rgb_input_std: [0.5]
        val_center_crop: False
    train:
      optim:
        lr: 0.0005
        momentum: 0.9
        weight_decay: 0.0005
        lr_scheduler: MultiStep
        lr_decay: 0.1
        lr_steps: [15, 25]
        patience: 1
        min_lr: 0.0001
      num_epochs: 10
      checkpoint_interval: 5
      validation_interval: 5
      clip_grad_norm: 0.0
      num_gpus: 1
      gpu_ids: [0]
      seed: 1234

| Parameter | Data Type | Default | Description | Supported Values |
| model | dict config | – | The configuration of the model architecture | |
| dataset | dict config | – | The configuration of the dataset | |
| train | dict config | – | The configuration of the training task | |
| evaluate | dict config | – | The configuration of the evaluation task | |
| inference | dict config | – | The configuration of the inference task | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
| export | dict config | – | The configuration of the ONNX export task | |

model#

The model parameter provides options to change the ActionRecognitionNet architecture.

    model:
      model_type: rgb
      backbone: resnet18
      rgb_seq_length: 3
      input_type: 3d
      sample_rate: 1
      dropout_ratio: 0.0

| Parameter | Datatype | Default | Description | Supported Values |
| model_type | string | joint | The type of model: rgb for the RGB-only model, of for the OF-only model, or joint for the RGB+OF model | rgb/of/joint |
| backbone | string | resnet18 | The backbone of the model. Currently supported backbones are ResNet18/34/50/101 | resnet18/34/50/101 |
| input_type | string | 2d | The type of input for the model. It can be 2d or 3d. | 2d/3d |
| rgb_seq_length | unsigned int | 3 | The number of RGB frames for a single inference | >0 |
| rgb_pretrained_model_path | string | None | The absolute path to pretrained weights for the RGB model | |
| rgb_pretrained_num_classes | unsigned int | 0 | The number of classes for the pretrained RGB model. Use 0 to specify the same number of classes as the current training. | >=0 |
| of_seq_length | unsigned int | 10 | The number of optical flow frames for a single inference | >0 |
| of_pretrained_model_path | string | None | The absolute path to pretrained weights for the OF model | |
| of_pretrained_num_classes | unsigned int | 0 | The number of classes for the pretrained OF model. Use 0 to specify the same number of classes as the current training. | >=0 |
| joint_pretrained_model_path | string | None | The absolute path to pretrained weights for the joint model | |
| num_fc | unsigned int | 64 | The number of hidden units for the joint model | >0 |
| sample_rate | unsigned int | 1 | The sample rate to pick consecutive frames. For example, if the sample_rate is 2, one frame is picked every 2 frames. | >0 |
| dropout_ratio | float | 0.5 | The probability to drop out hidden units | 0.0 ~ 1.0 |

train#

The train parameter defines the hyperparameters of the training process.
    train:
      optim:
        lr: 0.0005
        momentum: 0.9
        weight_decay: 0.0005
        lr_scheduler: MultiStep
        lr_decay: 0.1
        lr_steps: [15, 25]
        patience: 1
        min_lr: 0.0001
      num_epochs: 10
      checkpoint_interval: 5
      validation_interval: 5
      clip_grad_norm: 0.0
      num_gpus: 1
      gpu_ids: [0]
      seed: 1234

| Parameter | Datatype | Default | Description | Supported Values |
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which the checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which the validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optim | dict config | | The config for the SGD optimizer, including the learning rate, learning scheduler, and weight decay | >1 |
| clip_grad_norm | float | 0.0 | The amount to clip the gradient by the L2 norm. 0.0 means don’t clip | >=0 |

optim#

The optim parameter defines the config for the SGD optimizer in training, including the learning rate, learning scheduler, and weight decay.

    optim:
      lr: 0.0005
      momentum: 0.9
      weight_decay: 0.0005
      lr_scheduler: MultiStep
      lr_decay: 0.1
      lr_steps: [15, 25]
      patience: 1
      min_lr: 0.0001

| Parameter | Datatype | Default | Description | Supported Values |
| lr | float | 5e-4 | The initial learning rate for the training | >0.0 |
| momentum | float | 0.9 | The momentum for the SGD optimizer | >0.0 |
| weight_decay | float | 5e-4 | The weight decay coefficient | >0.0 |
| lr_scheduler | | | | |
| lr_decay | float | 0.1 | The decreasing factor for the learning rate scheduler | >0.0 |
| lr_steps | int list | [15, 25] | The steps to decrease the learning rate for the MultiStep scheduler | int list |
| lr_monitor | string | val_loss | The monitor value for the AutoReduce scheduler | val_loss/train_loss |
| patience | unsigned int | 1 | The number of epochs with no improvement, after which the learning rate will be reduced | >0 |
| min_lr | float | 1e-4 | The minimum learning rate in the training | >0.0 |

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

    dataset:
      train_dataset_dir: /data/train
      val_dataset_dir: /data/test
      label_map:
        walk: 0
        sits: 1
        squat: 2
        fall: 3
        bend: 4
      output_shape:
      - 224
      - 224
      batch_size: 32
      workers: 8
      clips_per_video: 15
      augmentation_config:
        train_crop_type: no_crop
        horizontal_flip_prob: 0.5
        rgb_input_mean: [0.5]
        rgb_input_std: [0.5]
        val_center_crop: False

| Parameter | Datatype | Default | Description | Supported Values |
| train_dataset_dir | string | | The path to the train dataset | |
| val_dataset_dir | string | | The path to the validation dataset | |
| label_map | dict | | A dict that maps the class names to indices | |
| output_shape | list | [224, 224] | The output shape after augmentation | unsigned int list with size=2 |
| batch_size | unsigned int | 32 | The batch size for training and validation | >0 |
| workers | unsigned int | 8 | The number of parallel workers processing data | >0 |
| clips_per_video | unsigned int | 1 | The number of clips sampled from a video in an epoch | >0 |
| augmentation_config | dict config | | The parameters to define the augmentation method | |

Note: For a 3D model, the input layout is NCDHW, where N is the batch size, C is the input channel, D is the depth or sequence length, H is the image height, and W is the image width. For a 2D model, the input layout is N[CxD]HW.

augmentation_config#

The augmentation_config parameter contains hyperparameters for augmentation.
    augmentation_config:
      train_crop_type: no_crop
      horizontal_flip_prob: 0.5
      rgb_input_mean: [0.5]
      rgb_input_std: [0.5]
      val_center_crop: False

| Parameter | Datatype | Default | Description | Supported Values |
| train_crop_type | | | | |
| scales | float list | [1.0] | The scales to generate the crop pattern in multi_scale_crop | float list / >0.0 |
| rgb_input_mean | float list | [0.485, 0.456, 0.406] | The input mean for RGB frames: (input - mean) / std | float list / size=1 or 3 |
| rgb_input_std | float list | [0.229, 0.224, 0.225] | The input std for RGB frames: (input - mean) / std | float list / size=1 or 3 |
| of_input_mean | float list | [0.5] | The input mean for OF frames: (input - mean) / std | float list / size=1 or 3 |
| of_input_std | float list | [0.5] | The input std for OF frames: (input - mean) / std | float list / size=1 or 3 |
| val_center_crop | bool | False | Specifies whether to center crop the images in validation. | |
| crop_smaller_edge | | | | |

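The normalization formula (input - mean) / std and the NCDHW versus N[CxD]HW input layouts can be illustrated with a short pure-Python sketch. The helper names `normalize` and `input_shape` are hypothetical, and the sketch assumes pixel values are already scaled to the [0, 1] range before normalization, which the text above does not state.

```python
def _expand(values, n_channels, name):
    """Broadcast a size-1 list to all channels, or validate a per-channel list."""
    values = list(values)
    if len(values) == 1:
        values = values * n_channels
    if len(values) != n_channels:
        raise ValueError(f"{name} must have size 1 or {n_channels}")
    return values


def normalize(channels, mean, std):
    """Apply (input - mean) / std to each channel.

    channels: one list of pixel values per channel (assumed scaled to [0, 1]).
    """
    mean = _expand(mean, len(channels), "mean")
    std = _expand(std, len(channels), "std")
    return [[(v - m) / s for v in ch] for ch, m, s in zip(channels, mean, std)]


def input_shape(batch_size, channels, seq_length, height, width, input_type):
    """Return the tensor shape implied by the documented input layouts."""
    if input_type == "3d":
        # NCDHW: the sequence length D is its own axis.
        return (batch_size, channels, seq_length, height, width)
    if input_type == "2d":
        # N[CxD]HW: the frames are stacked along the channel axis.
        return (batch_size, channels * seq_length, height, width)
    raise ValueError("input_type must be '2d' or '3d'")
```

For the example spec (batch_size 32, rgb_seq_length 3, output_shape 224x224, three RGB channels), `input_shape(32, 3, 3, 224, 224, "3d")` gives `(32, 3, 3, 224, 224)`, while the same settings with `"2d"` give `(32, 9, 224, 224)`.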


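To make the interaction of lr, lr_decay, and lr_steps concrete, the sketch below computes the learning rate per epoch for a MultiStep schedule. It assumes the scheduler follows the usual multi-step semantics (as in PyTorch's MultiStepLR), where the rate is multiplied by lr_decay once each step epoch is reached; the source does not spell this out, and the function name `multistep_lr` is hypothetical.

```python
def multistep_lr(epoch, lr=0.0005, lr_decay=0.1, lr_steps=(15, 25)):
    """Learning rate at a given epoch under an assumed multi-step schedule."""
    # Count how many step boundaries have been reached so far.
    n_decays = sum(1 for step in lr_steps if epoch >= step)
    return lr * (lr_decay ** n_decays)


if __name__ == "__main__":
    for epoch in (0, 14, 15, 24, 25):
        print(epoch, multistep_lr(epoch))
```

Under this reading, the example spec never decays the learning rate: num_epochs is 10, so training ends before the first step at epoch 15.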
Training the Model#

Use the following command to run ActionRecognitionNet training:

TAO Client (v2 API)

    TRAIN_JOB_ID=$(tao action_recognition create-job \
      --kind experiment \
      --name "action_recognition_train" \
      --action train \
      --workspace-id $WORKSPACE_ID \
      --specs "$TRAIN_SPECS" \
      --train-datasets '["'$DATASET_ID'"]' \
      --eval-dataset "$DATASET_ID" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model action_recognition train [-h] -e <experiment_spec_file>
        [results_dir=<global_results_dir>]
        [model.<model_option>=<model_option_value>]
        [dataset.<dataset_option>=<dataset_option_value>]
        [train.<train_option>=<train_option_value>]
        [train.gpu_ids=<gpu indices>]
        [train.num_gpus=<number of gpus>]

Required Arguments

The only required argument is the path to the experiment spec:

-e, --experiment_spec : The experiment specification file to set up the training experiment.

Optional Arguments

You can set optional arguments to override the option values in the experiment spec file.

-h, --help : Show this help message and exit.

model.<model_option> : The model options.

dataset.<dataset_option> : The dataset options.

train.<train_option> : The train options.

train.optim.<optim_option> : The optimizer options.

Note: For training, evaluation, and inference, we expose two variables for each task: num_gpus and gpu_ids, which default to 1 and [0], respectively. If both are passed but are inconsistent, for example num_gpus = 1 and gpu_ids = [0, 1], then they are modified to follow the setting that implies more GPUs; in the same example, num_gpus is modified from 1 to 2.

In some cases multi-GPU training may result in a segmentation fault. You can circumvent this by setting the environment variable OMP_NUM_THREADS to 1. Depending upon your mode of execution, you may use the following methods to set this variable:

CLI Launcher: You may set the environment variable by adding the following fields to the Envs field of your ~/.tao_mounts.json file, as mentioned in bullet 3 of the section Running the launcher.

    {
        "Envs": [
            {
                "variable": "OMP_NUM_THREADS",
                "value": "1"
            }
        ]
    }

Docker: You may set environment variables in Docker by setting the -e flag in the Docker command line.

    docker run -it --rm --gpus all \
        -e OMP_NUM_THREADS=1 \
        -v /path/to/local/mount:/path/to/docker/mount nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt \
        <model> train -e

Checkpointing and Resuming Training

At every train.checkpoint_interval, a PyTorch Lightning checkpoint is saved. It is called model_epoch_<epoch_num>.pth. Checkpoints are saved in train.results_dir, like this:

    $ ls /results/train
    'model_epoch_000.pth'
    'model_epoch_001.pth'
    'model_epoch_002.pth'
    'model_epoch_003.pth'
    'model_epoch_004.pth'

The latest checkpoint is also saved as ar_model_latest.pth. Training automatically resumes from ar_model_latest.pth if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path, if it is provided.

The major implication of this logic is that, if you wish to trigger fresh training from scratch, you must do one of the following:

Specify a new, empty results directory (Recommended)

Remove the latest checkpoint from the results directory
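The resume precedence described above can be sketched as a small function. This is an illustration of the documented rule only, not the TAO implementation; the function name `resolve_resume_checkpoint` is hypothetical.

```python
from pathlib import Path


def resolve_resume_checkpoint(results_dir, resume_training_checkpoint_path=None):
    """Return the checkpoint training would resume from, or None for a fresh run."""
    # An explicit train.resume_training_checkpoint_path supersedes everything else.
    if resume_training_checkpoint_path:
        return Path(resume_training_checkpoint_path)
    # Otherwise, resume from ar_model_latest.pth when it exists in train.results_dir.
    latest = Path(results_dir) / "ar_model_latest.pth"
    return latest if latest.is_file() else None
```

A return value of `None` corresponds to training from scratch, which is why pointing train.results_dir at a new, empty directory is the simplest way to start a fresh run.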

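The num_gpus and gpu_ids reconciliation described in the note earlier in this section can be sketched as follows. Only the case from the text (num_gpus = 1 with gpu_ids = [0, 1] becoming num_gpus = 2) is documented; how gpu_ids is extended when num_gpus implies more GPUs is an assumption made here for illustration, and the function name `reconcile_gpus` is hypothetical.

```python
def reconcile_gpus(num_gpus=1, gpu_ids=(0,)):
    """Return a consistent (num_gpus, gpu_ids) pair, following the larger setting."""
    gpu_ids = list(gpu_ids)
    if len(gpu_ids) >= num_gpus:
        # Documented case: gpu_ids implies more GPUs, so num_gpus follows it.
        return len(gpu_ids), gpu_ids
    # Assumption: when num_gpus implies more GPUs, use the first num_gpus indices.
    return num_gpus, list(range(num_gpus))
```

For example, `reconcile_gpus(1, [0, 1])` returns `(2, [0, 1])`, matching the example in the note.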
Evaluating the Model#

The evaluation metric of ActionRecognitionNet is recognition accuracy. Two video sampling strategies are provided for evaluation on a video: center and conv.

center: Evaluation inference is performed on the middle frames of a video clip. For example, if the model requires 32 frames as input and a video clip has 128 frames, then the frames from index 48 to index 79 are used to perform inference.

conv: Evaluation inference is performed on a number of segments taken from a video clip. For example, a video clip is divided uniformly into 10 parts; the center of each part is treated as a starting point from which 32 consecutive frames are chosen to form an inference segment. In this manner, an inference segment is generated for every part the video was divided into. The final label of the video is determined by the average score of those 10 segments.

Use the following command to run ActionRecognitionNet evaluation:

TAO Client (v2 API)

    EVAL_JOB_ID=$(tao action_recognition create-job \
      --kind experiment \
      --name "action_recognition_evaluate" \
      --action evaluate \
      --workspace-id $WORKSPACE_ID \
      --parent-job-id $TRAIN_JOB_ID \
      --eval-dataset "$DATASET_ID" \
      --specs "$EVALUATE_SPECS" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model action_recognition evaluate -e <experiment_spec_file>
        evaluate.checkpoint=<model to be evaluated>
        [evaluate.batch_size=<batch size>]
        [evaluate.test_dataset_dir=<path to test dataset>]
        [evaluate.video_eval_mode=<evaluation mode for the video>]
        [evaluate.video_num_segments=<number of segments for ``conv`` mode>]
        [evaluate.gpu_ids=<gpu indices>]
        [evaluate.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required.

-e, --experiment_spec_file : The experiment spec file to set up the evaluation experiment. This should be the same as a training spec file.

evaluate.checkpoint : The .pth model.

Optional Arguments

The following arguments are optional to run the command.

evaluate.batch_size : The batch size to perform inference in evaluation. The default value is 1.

evaluate.test_dataset_dir : The path to the test dataset. If not set, the validation dataset in the experiment spec is used.

evaluate.video_eval_mode : The evaluation mode for the video:

    center : Evaluation inference is performed on the middle part of frames in the video clip. This is the default mode.
    conv : Evaluation inference is performed on a number of segments out of a video clip. The final prediction is averaged among all the segments.

evaluate.video_num_segments : The number of segments sampled in a video clip for conv evaluation mode. The default value is 10.

evaluate.results_dir : The results directory. Defaults to /results/evaluate.

Multi-GPU evaluation is currently not supported for Action Recognition.
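The center and conv sampling strategies described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the TAO implementation: the function names are hypothetical, and clamping a segment's start so that the clip stays inside the video is an assumption, since the text does not say what happens when a segment center is too close to the end of the video.

```python
def center_clip(total_frames, seq_length):
    """Frame indices for center mode: the middle seq_length frames."""
    if total_frames < seq_length:
        raise ValueError("video is shorter than the required sequence length")
    start = (total_frames - seq_length) // 2
    return list(range(start, start + seq_length))


def conv_clips(total_frames, seq_length, num_segments=10):
    """Frame indices for conv mode: one clip starting at each part's center."""
    if total_frames < seq_length:
        raise ValueError("video is shorter than the required sequence length")
    part_length = total_frames / num_segments
    clips = []
    for i in range(num_segments):
        start = int((i + 0.5) * part_length)
        # Assumption: clamp so that the clip does not run past the last frame.
        start = min(start, total_frames - seq_length)
        clips.append(list(range(start, start + seq_length)))
    return clips


def average_scores(segment_scores):
    """Average per-class scores over all segments of one video."""
    n = len(segment_scores)
    return [sum(column) / n for column in zip(*segment_scores)]
```

With a 128-frame video and a 32-frame input, `center_clip(128, 32)` covers indices 48 to 79, which matches the example in the text. In conv mode, the predicted class would be the index of the largest value returned by `average_scores`.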

Running Inference on the Model#

Use the following command to run inference on ActionRecognitionNet with the .pth model.

TAO Client (v2 API)

    INFER_JOB_ID=$(tao action_recognition create-job \
      --kind experiment \
      --name "action_recognition_inference" \
      --action inference \
      --workspace-id $WORKSPACE_ID \
      --parent-job-id $TRAIN_JOB_ID \
      --inference-dataset "$DATASET_ID" \
      --specs "$INFERENCE_SPECS" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model action_recognition inference -e <experiment_spec>
        inference.checkpoint=<inference model>
        inference.inference_dataset_dir=<path to dataset to be inferenced>
        [inference.batch_size=<batch size>]
        [inference.video_inf_mode=<inference >]
        [inference.video_num_segments]
        [inference.gpu_ids=<gpu indices>]
        [inference.num_gpus=<number of gpus>]

Required Arguments

The following arguments are required.

-e, --experiment_spec : The experiment spec file to set up inference. This can be the same as the training spec.

inference.checkpoint : The .pth model to perform inference with.

inference.inference_dataset_dir : The path to the dataset to perform inference with. It should be a class-level directory, as described in the Preparing the Dataset section.

Optional Arguments

The following arguments are optional to run the command.

inference.batch_size : The batch size to perform inference with. The default value is 1.

inference.video_inf_mode : The inference mode for the video:

    center : Inference is performed on the middle part of frames in the video clip. This is the default mode.
    conv : Inference is performed on a number of segments in a video clip. All the segment predictions are kept in a label list.

inference.video_num_segments : The number of segments sampled in a video clip for the conv inference mode.

inference.results_dir : The results directory. Defaults to /results/inference.

The output is formatted as [video_sample_path] [labels list of inference segments in this video]. Multi-GPU inference is currently not supported for Action Recognition.

The expected output for the fall class would be as follows:

    /path/to/fall/video_1 [fall]
    /path/to/fall/video_2 [fall]
    ...
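Inference results in the format shown above can be post-processed with a short script. The sketch below is hedged: the exact separator and quoting used inside the label list are not specified in this section, so the parser accepts commas or whitespace and strips quotes, and the majority vote over segment labels is an illustrative choice for conv mode rather than something TAO performs. Both function names are hypothetical.

```python
import re
from collections import Counter


def parse_result_line(line):
    """Split '<video_sample_path> [label, label, ...]' into (path, labels)."""
    # The label list is taken to be the last bracketed group on the line.
    path, _, rest = line.strip().rpartition("[")
    inner = rest.rstrip("]").strip()
    tokens = (tok.strip("'\"") for tok in re.split(r"[,\s]+", inner))
    return path.strip(), [tok for tok in tokens if tok]


def majority_label(labels):
    """Most frequent label among the segment predictions of one video."""
    return Counter(labels).most_common(1)[0][0]
```

For example, `parse_result_line("/path/to/fall/video_1 [fall]")` yields the path and a single-element label list, and `majority_label` reduces a multi-segment list to one label per video.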

Exporting the Model#

Use the following command to export ActionRecognitionNet to .etlt format for deployment:

TAO Client (v2 API)

    EXPORT_JOB_ID=$(tao action_recognition create-job \
      --kind experiment \
      --name "action_recognition_export" \
      --action export \
      --workspace-id $WORKSPACE_ID \
      --parent-job-id $TRAIN_JOB_ID \
      --specs "$EXPORT_SPECS" \
      --base-experiment-ids '["'$BASE_EXPERIMENT_ID'"]' \
      --encryption-key "nvidia_tlt" | jq -r '.id')

TAO Launcher

    tao model action_recognition export -e <experiment_spec>
        export.checkpoint=<tlt checkpoint to be exported>
        [export.gpu_id=<gpu index>]
        [export.onnx_file=<path to exported file>]

Required Arguments

The following arguments are required.

-e, --experiment_spec : The experiment spec file to set up export. This can be the same as the training spec.

export.checkpoint : The .pth model to be exported.

Optional Arguments

The following arguments are optional to run the command.

export.gpu_id : The GPU index used to run the export. You can specify this index when the machine has multiple GPUs installed. Note that export can only run on a single GPU.

export.onnx_file : The path to save the exported model to. By default, the exported model is saved in the same directory as the *.pth model.