ActionRecognitionNet#

ActionRecognitionNet takes a sequence of images as network input and predicts the action of the people in those images. TAO provides 2D and 3D network backbones with the following input options: RGB-only input, optical-flow-only (OF) input, and two-stream joint input (RGB+OF).

[Figure: Action Recognition Net architecture]

Note

  • Throughout this documentation, you will see references to $EXPERIMENT_ID and $DATASET_ID in the FTMS Client sections.

    • For instructions on creating a dataset using the remote client, see the Creating a dataset section in the Remote Client documentation.

    • For instructions on creating an experiment using the remote client, see the Creating an experiment section in the Remote Client documentation.

  • The spec format is YAML for TAO Launcher and JSON for FTMS Client.

  • File-related parameters, such as dataset paths or pretrained model paths, are required only for TAO Launcher and not for FTMS Client.

Preparing the Dataset#

ActionRecognitionNet requires RGB video frames for the RGB input stream and optical flow vectors for the OF input stream. The x-axis and y-axis components of the raw optical flow vectors should be mapped to grayscale images for training. A preprocessing tool is provided that converts videos to frames and generates optical flow images based on the NVIDIA Optical Flow (NVOF) SDK.

Organize the data in the following structure:

/Dataset
    /class_a
        /video_1
            /rgb
                000000.png
                000001.png
                ...
                N.png
            /u
                000000.jpg
                000001.jpg
                ...
                N-1.jpg
            /v
                000000.jpg
                000001.jpg
                ...
                N-1.jpg

The root directory of the dataset contains a sub-directory for each class. Each class directory has sub-folders for the individual videos, and each video folder contains rgb, u, and v folders that hold the RGB frames, the optical flow x-axis grayscale images, and the optical flow y-axis grayscale images, respectively. The u and v folders can be empty if you want to train an RGB-only model; a script is provided to generate RGB frames only.
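For illustration, the following minimal Python sketch extracts the RGB frames of a single video into the rgb folder of the layout above. It assumes OpenCV (cv2) is available and is not the official preprocessing tool; generating the u and v optical flow images still requires the NVOF-based tool.

# Minimal sketch: extract RGB frames from one video into the expected layout.
# Assumes opencv-python; the optical flow (u/v) images are not generated here.
import os
import cv2

def extract_rgb_frames(video_path, class_dir, video_name):
    """Write frames as <class_dir>/<video_name>/rgb/000000.png, 000001.png, ..."""
    rgb_dir = os.path.join(class_dir, video_name, "rgb")
    os.makedirs(rgb_dir, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(rgb_dir, f"{index:06d}.png"), frame)
        index += 1
    cap.release()
    return index  # number of frames written

# Example (hypothetical paths):
# extract_rgb_frames("walk_001.mp4", "/Dataset/class_a", "video_1")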

Note

The preprocessing tool is released on GitHub under the MIT license, and all-in-one scripts are provided for processing the HMDB51 dataset.

The common data processing pipeline is depicted in the following diagrams:

[Figure: RGB-only data processing pipeline]

[Figure: OF-only data processing pipeline]

Creating an Experiment Spec File#

The spec file for ActionRecognitionNet includes model, train, and dataset parameters. The sections below show an example configuration for training a 3D RGB-only model with a resnet18 backbone on a dataset that contains 5 classes: “walk”, “sits”, “squat”, “fall”, and “bend”.

Use the following command to get an experiment spec file for ActionRecognitionNet:

SPECS=$(tao-client action_recognition get-spec --action train --job_type experiment --id $EXPERIMENT_ID)

| Parameter | Data Type | Default | Description | Supported Values |
|---|---|---|---|---|
| model | dict config | | The configuration of the model architecture | |
| dataset | dict config | | The configuration of the dataset | |
| train | dict config | | The configuration of the training task | |
| evaluate | dict config | | The configuration of the evaluation task | |
| inference | dict config | | The configuration of the inference task | |
| encryption_key | string | None | The encryption key to encrypt and decrypt model files | |
| results_dir | string | /results | The directory where experiment results are saved | |
| export | dict config | | The configuration of the ONNX export task | |

model#

The model parameter provides options to change the ActionRecognitionNet architecture.

model:
  model_type: rgb
  backbone: resnet18
  rgb_seq_length: 3
  input_type: 3d
  sample_rate: 1
  dropout_ratio: 0.0

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| model_type | string | joint | The type of model: rgb for the RGB-only model, of for the OF-only model, or joint for the RGB+OF model | rgb/of/joint |
| backbone | string | resnet18 | The backbone of the model. Currently supported backbones are ResNet18/34/50/101 | resnet18/34/50/101 |
| input_type | string | 2d | The type of input for the model: 2d or 3d | 2d/3d |
| rgb_seq_length | unsigned int | 3 | The number of RGB frames for a single inference | >0 |
| rgb_pretrained_model_path | string | None | The absolute path to pretrained weights for the RGB model | |
| rgb_pretrained_num_classes | unsigned int | 0 | The number of classes of the pretrained RGB model. Use 0 if it matches the number of classes in the current training. | >=0 |
| of_seq_length | unsigned int | 10 | The number of optical flow frames for a single inference | >0 |
| of_pretrained_model_path | string | None | The absolute path to pretrained weights for the OF model | |
| of_pretrained_num_classes | unsigned int | 0 | The number of classes of the pretrained OF model. Use 0 if it matches the number of classes in the current training. | >=0 |
| joint_pretrained_model_path | string | None | The absolute path to pretrained weights for the joint model | |
| num_fc | unsigned int | 64 | The number of hidden units for the joint model | >0 |
| sample_rate | unsigned int | 1 | The sampling interval for consecutive frames. For example, with a sample_rate of 2, every second frame is picked. | >0 |
| dropout_ratio | float | 0.5 | The probability of dropping out hidden units | 0.0 ~ 1.0 |
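To make the interaction of rgb_seq_length and sample_rate concrete, the following sketch (an illustration, not TAO source code) computes the frame indices that form a single input clip.

# Illustration: which frame indices form one clip for a given
# sequence length and sample rate.
def clip_frame_indices(start_index, seq_length, sample_rate):
    return [start_index + i * sample_rate for i in range(seq_length)]

print(clip_frame_indices(start_index=0, seq_length=3, sample_rate=1))  # [0, 1, 2]
print(clip_frame_indices(start_index=0, seq_length=3, sample_rate=2))  # [0, 2, 4]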

train#

The train parameter defines the hyperparameters of the training process.

train:
  optim:
    lr: 0.0005
    momentum: 0.9
    weight_decay: 0.0005
    lr_scheduler: MultiStep
    lr_decay: 0.1
    lr_steps: [15, 25]
    patience: 1
    min_lr: 0.0001
  num_epochs: 10
  checkpoint_interval: 5
  validation_interval: 5
  clip_grad_norm: 0.0
  num_gpus: 1
  gpu_ids: [0]
  seed: 1234

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| num_gpus | unsigned int | 1 | The number of GPUs to use for distributed training | >0 |
| gpu_ids | List[int] | [0] | The indices of the GPUs to use for distributed training | |
| seed | unsigned int | 1234 | The random seed for random, NumPy, and torch | >0 |
| num_epochs | unsigned int | 10 | The total number of epochs to run the experiment | >0 |
| checkpoint_interval | unsigned int | 1 | The epoch interval at which checkpoints are saved | >0 |
| validation_interval | unsigned int | 1 | The epoch interval at which validation is run | >0 |
| resume_training_checkpoint_path | string | | The intermediate PyTorch Lightning checkpoint to resume training from | |
| results_dir | string | /results/train | The directory to save training results | |
| optim | dict config | | The config for the SGD optimizer, including the learning rate, learning-rate scheduler, and weight decay | |
| clip_grad_norm | float | 0.0 | The amount to clip the gradient by the L2 norm. A value of 0.0 means no clipping. | >=0 |

optim#

The optim parameter defines the config for the SGD optimizer in training, including the learning rate, learning-rate scheduler, and weight decay.

optim:
  lr: 0.0005
  momentum: 0.9
  weight_decay: 0.0005
  lr_scheduler: MultiStep
  lr_decay: 0.1
  lr_steps: [15, 25]
  patience: 1
  min_lr: 0.0001

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| lr | float | 5e-4 | The initial learning rate for the training | >0.0 |
| momentum | float | 0.9 | The momentum for the SGD optimizer | >0.0 |
| weight_decay | float | 5e-4 | The weight decay coefficient | >0.0 |
| lr_scheduler | string | MultiStep | The learning-rate scheduler. Two schedulers are provided: MultiStep decreases the learning rate by lr_decay at the configured lr_steps; AutoReduce decreases the learning rate by lr_decay when lr_monitor has not improved by more than 0.1% of the previous value. | MultiStep/AutoReduce |
| lr_decay | float | 0.1 | The decay factor for the learning-rate scheduler | >0.0 |
| lr_steps | int list | [15, 25] | The epochs at which the learning rate is decreased for the MultiStep scheduler | int list |
| lr_monitor | string | val_loss | The metric monitored by the AutoReduce scheduler | val_loss/train_loss |
| patience | unsigned int | 1 | The number of epochs with no improvement after which the learning rate is reduced | >0 |
| min_lr | float | 1e-4 | The minimum learning rate during training | >0.0 |
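As a rough analogy (not the TAO implementation), the two scheduler options behave like the standard PyTorch schedulers in the sketch below: lr_steps and lr_decay correspond to MultiStepLR, while lr_decay, patience, and min_lr correspond to ReduceLROnPlateau driven by the lr_monitor metric.

# Rough PyTorch analogy for the optim options above (illustrative only).
import torch

model = torch.nn.Linear(10, 5)                       # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4,
                            momentum=0.9, weight_decay=5e-4)

# lr_scheduler: MultiStep -- multiply the lr by lr_decay at epochs lr_steps
multistep = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15, 25], gamma=0.1)

# lr_scheduler: AutoReduce -- reduce the lr by lr_decay when the monitored
# metric (lr_monitor, e.g. val_loss) stops improving for `patience` epochs
autoreduce = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=1, min_lr=1e-4)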

dataset#

The dataset parameter defines the dataset source, training batch size, and augmentation.

dataset:
  train_dataset_dir: /data/train
  val_dataset_dir: /data/test
  label_map:
    walk: 0
    sits: 1
    squat: 2
    fall: 3
    bend: 4
  output_shape:
  - 224
  - 224
  batch_size: 32
  workers: 8
  clips_per_video: 15
  augmentation_config:
    train_crop_type: no_crop
    horizontal_flip_prob: 0.5
    rgb_input_mean: [0.5]
    rgb_input_std: [0.5]
    val_center_crop: False

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| train_dataset_dir | string | | The path to the train dataset | |
| val_dataset_dir | string | | The path to the validation dataset | |
| label_map | dict | | A dict that maps the class names to indices | |
| output_shape | list | [224, 224] | The output shape after augmentation | unsigned int list with size=2 |
| batch_size | unsigned int | 32 | The batch size for training and validation | >0 |
| workers | unsigned int | 8 | The number of parallel workers processing data | >0 |
| clips_per_video | unsigned int | 1 | The number of clips sampled from a video in an epoch | >0 |
| augmentation_config | dict config | | The parameters to define the augmentation method | |

Note

For a 3D model, the input layout is NCDHW, where N is the batch size, C is the input channel, D is the depth or sequence length, H is the image height, and W is the image width.

For a 2D model, the input layout is N[CxD]HW.
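As a concrete illustration of these layouts, the sketch below builds dummy input tensors for a 3D and a 2D RGB model, assuming a batch size of 4, 3 input channels, a sequence length of 3, and 224x224 frames.

# Dummy tensors illustrating the 3D (NCDHW) and 2D (N[CxD]HW) input layouts.
import torch

N, C, D, H, W = 4, 3, 3, 224, 224   # batch, channels, sequence length, height, width

x_3d = torch.zeros(N, C, D, H, W)   # 3D model input: NCDHW
x_2d = torch.zeros(N, C * D, H, W)  # 2D model input: N[CxD]HW

print(x_3d.shape)  # torch.Size([4, 3, 3, 224, 224])
print(x_2d.shape)  # torch.Size([4, 9, 224, 224])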

augmentation_config#

The augmentation_config parameter contains hyperparameters for augmentation.

augmentation_config:
  train_crop_type: no_crop
  horizontal_flip_prob: 0.5
  rgb_input_mean: [0.5]
  rgb_input_std: [0.5]
  val_center_crop: False

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| train_crop_type | string | random_crop | The crop type used during training: random_crop randomly crops an output_shape area from the image; multi_scale_crop crops the four corners and the center of an image at multiple scales and randomly picks one of the crops; no_crop does not crop the training images | random_crop/multi_scale_crop/no_crop |
| scales | float list | [1.0] | The scales used to generate the crop pattern in multi_scale_crop | float list / >0.0 |
| rgb_input_mean | float list | [0.485, 0.456, 0.406] | The input mean for RGB frames: (input - mean) / std | float list / size=1 or 3 |
| rgb_input_std | float list | [0.229, 0.224, 0.225] | The input std for RGB frames: (input - mean) / std | float list / size=1 or 3 |
| of_input_mean | float list | [0.5] | The input mean for OF frames: (input - mean) / std | float list / size=1 or 3 |
| of_input_std | float list | [0.5] | The input std for OF frames: (input - mean) / std | float list / size=1 or 3 |
| val_center_crop | bool | False | Whether to center-crop the images during validation | |
| crop_smaller_edge | unsigned int | 256 | The length to which the short side of the image is resized before random_crop in training or center_crop in validation | >0 |
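For intuition on how crop_smaller_edge and val_center_crop combine during validation, the following torchvision-based sketch approximates the resize-then-center-crop behavior together with the default RGB normalization; it is an approximation, not the TAO data pipeline.

# Approximate validation preprocessing when val_center_crop is True:
# resize the short side to crop_smaller_edge, then center-crop output_shape,
# then normalize with rgb_input_mean / rgb_input_std.
from torchvision import transforms

crop_smaller_edge = 256
output_shape = (224, 224)

val_transform = transforms.Compose([
    transforms.Resize(crop_smaller_edge),    # short side -> 256
    transforms.CenterCrop(output_shape),     # 224x224 center crop
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])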

Training the Model#

Use the following command to run ActionRecognitionNet training:

TRAIN_JOB_ID=$(tao-client action_recognition experiment-run-action --action train --id $EXPERIMENT_ID --specs "$SPECS")

Checkpointing and Resuming Training

A PyTorch Lightning checkpoint named model_epoch_<epoch_num>.pth is saved every train.checkpoint_interval epochs. The checkpoints are stored in train.results_dir, like so:

$ ls /results/train

'model_epoch_000.pth'
'model_epoch_001.pth'
'model_epoch_002.pth'
'model_epoch_003.pth'
'model_epoch_004.pth'

The latest checkpoint is also saved as ar_model_latest.pth. Training automatically resumes from ar_model_latest.pth, if it exists in train.results_dir. This is superseded by train.resume_training_checkpoint_path, if it is provided.

The major implication of this logic is that, if you wish to trigger fresh training from scratch, either:

  • Specify a new, empty results directory (Recommended)

  • Remove the latest checkpoint from the results directory

Evaluating the Model#

The evaluation metric of ActionRecognitionNet is recognition accuracy. Two video sampling strategies are provided for evaluation: center and conv.

With center evaluation, inference is performed on the middle frames of a video clip. For example, if the model requires 32 frames as input and a video clip has 128 frames, the frames from index 48 to index 79 are used for inference.

With conv evaluation, inference is performed on several segments of a video clip. For example, a video clip is divided uniformly into 10 parts, and the center of each part is treated as the starting point from which 32 consecutive frames are chosen to form an inference segment. In this way, one inference segment is generated for every part of the video, and the final label of the video is determined by the average score of those 10 segments.
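The sketch below illustrates how the two strategies choose frames, assuming a 128-frame clip and a 32-frame model input; it is a simplified illustration of the sampling logic, not the TAO implementation.

# Simplified illustration of the "center" and "conv" sampling strategies.
def center_window(num_frames, seq_length):
    start = (num_frames - seq_length) // 2
    return list(range(start, start + seq_length))

def conv_start_indices(num_frames, seq_length, num_segments=10):
    part_len = num_frames / num_segments
    starts = [int(i * part_len + part_len / 2) for i in range(num_segments)]
    # Clamp so a full window fits in the clip (details may differ in TAO).
    return [min(s, num_frames - seq_length) for s in starts]

window = center_window(128, 32)
print(window[0], window[-1])          # 48 79
print(conv_start_indices(128, 32))    # starting frames of the 10 segments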

Use the following command to run ActionRecognitionNet evaluation:

EVAL_JOB_ID=$(tao-client action_recognition experiment-run-action --action evaluate --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

Multi-GPU evaluation is currently not supported for Action Recognition.

Running Inference on the Model#

Use the following command to run inference on ActionRecognitionNet with the .pth model.

INFER_JOB_ID=$(tao-client action_recognition experiment-run-action --action inference --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

The output is formatted as [video_sample_path] [labels of the inference segments in this video].

Multi-GPU inference is currently not supported for Action Recognition.

The expected output for the fall class would be as follows:

/path/to/fall/video_1 [fall]
/path/to/fall/video_2 [fall]
...

Exporting the Model#

Use the following command to export ActionRecognitionNet to .etlt format for deployment:

EXPORT_JOB_ID=$(tao-client action_recognition experiment-run-action --action export --id $EXPERIMENT_ID --parent_job_id $TRAIN_JOB_ID --specs "$SPECS")

Deploying the Model#

The deep learning and computer vision models that you trained can be deployed on edge devices, such as a Jetson Xavier, Jetson Nano, or Tesla, or in the cloud with NVIDIA GPUs. The exported *.etlt model can be used in a stand-alone TensorRT inference sample or in DeepStream.

DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. TAO is integrated with the DeepStream SDK, so models trained with TAO work out of the box with DeepStream.

Deploying the ActionRecognitionNet in the DeepStream Sample#

Once you have the .etlt ActionRecognitionNet model, you can deploy it in the DeepStream 3d-action-recognition sample app. Refer to the sample applications documentation for detailed steps to run action recognition in DeepStream.

Running ActionRecognitionNet Inference on the Stand-Alone Sample#

A stand-alone TensorRT inference sample is also provided. It consumes a TensorRT engine and supports running with 2D or 3D input on images. The sample can be found on GitHub.

To use this sample, you need to generate a TensorRT engine from a *.etlt model using trtexec.

Using trtexec#

For instructions on generating a TensorRT engine using the trtexec command, refer to the trtexec guide for ActionRecognitionNet.

Usage of Inference Sample#

After you get the TensorRT engine, you can deploy the engine in the stand-alone sample. Use the following command to run inference:

python ar_trt_inference.py --input_images_folder <path to input images folder> \
                           --trt_engine <path to tensorrt engine> \
                           [--center_crop] \
                           [--input_2d]
Required Arguments#
  • --input_images_folder: The path to the input images folder. It should be a video_<n>-level directory as described in the Preparing the Dataset section.

  • --trt_engine: The path to the TensorRT engine.

Optional Arguments#
  • --center_crop: Resizes the input images so that the short side is 256 and then center-crops a 224x224 area. If this flag is not set, the input images are resized directly to 224x224.

  • --input_2d: Set this flag if the engine is generated from a 2D model.

Note

The script runs inference on the images in the folder using a sliding window of length 32 with stride 1, which means inference is performed on sequences as follows:

[frame_0, frame_1, frame_2, ..., frame_31]
[frame_1, frame_2, frame_3, ..., frame_32]
....
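A minimal sketch of that sliding-window iteration (illustrative only, assuming the frame files are already sorted) follows:

# Illustrative sliding window over a sorted list of frame files:
# window length 32, stride 1.
def sliding_windows(frames, window=32, stride=1):
    for start in range(0, len(frames) - window + 1, stride):
        yield frames[start:start + window]

frames = [f"{i:06d}.png" for i in range(40)]
for clip in sliding_windows(frames):
    pass  # run TensorRT inference on `clip` here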