FasterRCNN is a public object detection model that is supported by NVIDIA TAO. FasterRCNN in TAO supports the following tasks:

These tasks can be invoked from the TAO Launcher using the following convention on the command-line:

where args_per_subtask are the command-line arguments required for a given subtask. Each subtask is explained in detail in the following sections.

--gpu_index : The GPU index to run this command on. We can specify the GPU index used to run this command if the machine has multiple GPUs installed. Note that this command can only run on a single GPU.

The dataset structure of FasterRCNN is identical to that of DetectNet_v2. The only difference is the command line used to generate the TFRecords from KITTI text labels. To generate TFRecords for FasterRCNN training, use this command:

In the above table, trt_evaluation has the same definition as the trt_inference parameter described earlier.

The parameter visualize_pr_curve, if set to True, produces an image of the precision-recall curve during the evaluate command. The exact path of the image is printed in the screen log. The image shows each class’s performance in terms of the tradeoff between precision and recall.

Object confidence score threshold in NMS. All objects whose confidence is lower than this number will be filtered out in NMS.

The maximum number of boxes (ROIs) to be retained after the NMS in the Proposal layer in evaluation.

The number of boxes (ROIs) to be retained before the NMS in the Proposal layer in evaluation.

Path to the model to run evaluation. Can be either a .tlt model or a TensorRT engine.

The parameter evaluation_config defines all the required parameters for running evaluation against a FasterRCNN model. This parameter is very similar to inference_config.

The parameter trt_inference defines all the parameters for TensorRT-based inference. When specified, inference will use the TensorRT engine instead of the .tlt model. The TensorRT engine is assumed to be generated by the tao-converter tool. All the parameters are summarized in the table below.

The number of bits used to represent the score values in the NMS plugin in TensorRT OSS. The valid range is integers in [1, 10]. Setting it to any other value makes it fall back to ordinary NMS. Currently this optimized NMS plugin is only available in FP16, but it is also selected for the INT8 data type: there is no INT8 NMS in TensorRT OSS, so this fastest FP16 implementation is used. When falling back to ordinary NMS, the data type used when building the engine decides the precision (FP16 or FP32) that NMS runs at.

The configuration for TensorRT-based inference. If this parameter is set, inference will use the TensorRT engine instead of the .tlt model.

A flag to display the class name and confidence for each detected object in an image.

The maximum number of boxes (ROIs) to be retained after the NMS in Proposal layer in inference.

The number of boxes (ROIs) to be retained before the NMS in Proposal layer in inference.

The minimum decrease of the monitored value that counts as an improvement. A decrease smaller than this delta is regarded as not decreasing.

The parameters for early stopping are described in the table below.
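The role of the delta can be illustrated with a small early-stopping check. This is a minimal sketch and not TAO's implementation: the names `min_delta` and `patience`, and the rule of stopping after `patience` consecutive non-improving values, are assumptions made for illustration.

```python
def should_stop(history, min_delta, patience):
    """Return True if the monitored value stops decreasing.

    history   -- monitored values (for example, loss) in training order
    min_delta -- a decrease must exceed this to count as an improvement
    patience  -- hypothetical: number of consecutive non-improving
                 values tolerated before stopping
    """
    best = float("inf")
    wait = 0
    for value in history:
        if best - value > min_delta:
            # Large enough decrease: record it and reset the counter.
            best = value
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return True
    return False


print(should_stop([1.0, 0.9, 0.895, 0.894, 0.893], min_delta=0.01, patience=3))
```

In this example the last three values each improve on the best value by less than 0.01, so they are treated as not decreasing.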

During training, visualization can be done from anywhere that can access the TensorBoard log directory. The TAO containers usually map volumes to the host machine, so TensorBoard can be run on the host machine. The command tensorboard --logdir=/path/to/logs opens the TensorBoard visualization GUI in a web browser. Make sure tensorboard is installed before running this command; if it is not installed in the environment, run pip3 install tensorboard to install it. The /path/to/logs argument is the path to the directory used to save the .tlt model, with the suffix /logs appended.

The parameter num_images is used to limit the maximum number of images to be visualized on the image tab in TensorBoard.

If the parameter enabled is set to True, then all of the above visualizations will be enabled. Otherwise, all visualizations will be disabled.

Visualization during training supports 3 types of visualization: scalar, image, and histogram. All of them use the TensorBoard tool, and each type has its own tab in the TensorBoard GUI. The scalar tab visualizes scalars such as loss, learning rate, and validation mAP over time (training step). The image tab visualizes augmented images during training, with bounding boxes drawn on them. The histogram tab visualizes histograms of the weights and biases of each layer of the model being trained.

Visualization during training is configured by the visualizer parameter. Its parameters are described in the table below.

The learning rate is automatically scaled with the number of GPUs used during training. That is, the effective learning rate is learning_rate * n_gpu.
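The scaling rule above is simple enough to check by hand before launching a multi-GPU job. The helper below is only an illustration of the stated formula; the function name is hypothetical.

```python
def effective_learning_rate(learning_rate, n_gpu):
    """Apply the documented rule: effective rate = learning_rate * n_gpu."""
    if n_gpu < 1:
        raise ValueError("n_gpu must be at least 1")
    return learning_rate * n_gpu


# A learning rate of 0.02 in the spec file, trained on 4 GPUs.
print(effective_learning_rate(0.02, 4))
```

When moving a spec file from a single-GPU run to a multi-GPU run, this suggests dividing the configured learning rate by the GPU count if the original effective rate should be preserved.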

The step size (in percentage of total epochs) at which the learning rate is multiplied by gamma.

The parameters of the step scheduler are described in the table below.
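A step schedule of this kind can be sketched as follows. The sketch assumes that the learning rate is multiplied by gamma once for every completed step interval, and the parameter names `base_lr` and `progress` are hypothetical; the exact boundary behavior in TAO may differ.

```python
def step_learning_rate(progress, base_lr, gamma, step_size):
    """Learning rate under a step schedule.

    progress  -- training progress in percent of total epochs (0 to 100)
    base_lr   -- hypothetical name for the initial learning rate
    gamma     -- multiplier applied at every step boundary
    step_size -- step interval in percent of total epochs
    """
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    completed_steps = int(progress // step_size)
    return base_lr * gamma ** completed_steps


for pct in (0, 29, 30, 65, 95):
    print(pct, step_learning_rate(pct, base_lr=0.02, gamma=0.1, step_size=30))
```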

List of time points (in percentage of total epochs) at which to decrease the learning rate.

The duration (in percentage of total epochs) of the soft start phase of the learning rate curve.

The parameters of the soft_start scheduler are described in the table below.
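The soft start phase and the later decreases can be pictured with a small function. This is a sketch under stated assumptions, not the TAO implementation: it assumes a linear ramp during the soft start phase, and the names `start_lr`, `base_lr`, and `annealing_divider` are hypothetical stand-ins for whatever the scheduler actually uses.

```python
def soft_start_learning_rate(progress, start_lr, base_lr, soft_start,
                             annealing_points, annealing_divider):
    """Learning rate with a warm-up phase followed by stepwise decreases.

    progress          -- training progress in percent of total epochs
    soft_start        -- duration of the soft start phase, in percent
    annealing_points  -- time points (percent) at which the rate decreases
    annealing_divider -- assumed factor the rate is divided by at each point
    """
    if progress < soft_start:
        # Assumed linear ramp from start_lr up to base_lr.
        return start_lr + (base_lr - start_lr) * progress / soft_start
    passed = sum(1 for point in annealing_points if progress >= point)
    return base_lr / (annealing_divider ** passed)


for pct in (0, 5, 10, 50, 90):
    print(pct, soft_start_learning_rate(pct, 0.0, 0.02, 10, [40, 80], 10))
```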

The parameter learning_rate defines the learning rate scheduler in a FasterRCNN training. Two types of learning rate schedulers are supported in FasterRCNN: soft_start and step. No matter which one is chosen, it will be wrapped in a learning_rate proto message. For example:

Learning rate. This value is overridden by the learning rate scheduler, so it has no effect.

Three types of optimizers are supported by FasterRCNN: Adam, SGD, and RMSProp. Only one of them should be specified in the spec file. No matter which one is chosen, it will be wrapped in an optimizer proto. For example:

List of fractions for model parallelism. Each number is a fraction that represents the share of model layers to be placed on one GPU. For example, two repeated model_parallelism: 0.5 entries indicate that the training will use 2 GPUs and that each GPU will hold half of the model layers.
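To make the meaning of the fractions concrete, the sketch below splits a layer count into contiguous per-GPU ranges. How TAO actually assigns layers to GPUs is not described here, so the contiguous split, the rounding, and the requirement that the fractions sum to 1 are all assumptions of this illustration.

```python
def split_layers(num_layers, fractions):
    """Return one (start, end) layer range per GPU, end exclusive."""
    if abs(sum(fractions) - 1.0) > 1e-6:
        raise ValueError("fractions are assumed to sum to 1")
    ranges = []
    start = 0
    cumulative = 0.0
    for index, fraction in enumerate(fractions):
        cumulative += fraction
        # The last GPU takes whatever remains, which avoids rounding gaps.
        is_last = index == len(fractions) - 1
        end = num_layers if is_last else round(num_layers * cumulative)
        ranges.append((start, end))
        start = end
    return ranges


# Two "model_parallelism: 0.5" entries on a hypothetical 10-layer model.
print(split_layers(10, [0.5, 0.5]))
```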

The maximum number of boxes (ROIs) to be retained after the NMS in Proposal layer.

The number of boxes (ROIs) to be retained before the NMS in Proposal layer.
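The two limits act at different points of proposal selection: one truncates the score-sorted candidates before NMS, the other caps the survivors after NMS. The pure-Python sketch below shows that order of operations. The function and argument names are hypothetical, the greedy NMS is a generic textbook version rather than TAO's implementation, and the default IoU threshold is an arbitrary choice for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(boxes, scores, pre_nms_top_n, post_nms_top_n,
                     iou_threshold=0.7):
    """Return indices of the boxes kept by top-N filtering and greedy NMS."""
    # Keep only the highest-scoring candidates before NMS.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_n]
    kept = []
    for i in order:
        if len(kept) >= post_nms_top_n:
            break  # cap on the number of boxes retained after NMS
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (0, 1, 10, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.6]
print(select_proposals(boxes, scores, pre_nms_top_n=3, post_nms_top_n=2))
```

In the example, the second box overlaps the first heavily and is suppressed, and the fourth box is never considered because it falls outside the pre-NMS limit.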

The interval, in epochs, at which checkpoints are saved. Setting this number to be greater than num_epochs essentially disables checkpointing.
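The effect of the interval can be checked with a short helper. It assumes that a checkpoint is written whenever the 1-based epoch number is a multiple of the interval; whether TAO additionally saves the final epoch is not stated here and is not modeled. The function name is hypothetical.

```python
def checkpoint_epochs(num_epochs, checkpoint_interval):
    """List the epochs at which a checkpoint would be saved (assumed rule)."""
    if checkpoint_interval < 1:
        raise ValueError("checkpoint_interval must be at least 1")
    return list(range(checkpoint_interval, num_epochs + 1, checkpoint_interval))


print(checkpoint_epochs(12, 5))   # checkpoints at epochs 5 and 10
print(checkpoint_epochs(12, 13))  # interval larger than num_epochs: none
```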

Scaling factors (denominators) for the RCNN regression loss. A map from ‘x’, ‘y’, ‘w’, ‘h’ to its corresponding scaling factor, respectively.

The higher IoU threshold used to generate the proposal target. If the IoU of a ROI and a groundtruth box is above this number, then this ROI is regarded as a positive ROI during training.

The lower IoU threshold used to generate the proposal target. If the IoU of a ROI and a groundtruth box is above this number and below classifier_max_overlap, then this ROI is regarded as a negative ROI (background) during training.

The higher IoU threshold used to match anchor boxes to groundtruth boxes. If the IoU of an anchor box and some groundtruth box is higher than this threshold, then this anchor box will be regarded as a positive anchor box.

The lower IoU threshold used to match anchor boxes to groundtruth boxes. If the IoU of an anchor box and any groundtruth box is below this threshold, then this anchor box will be regarded as a negative anchor box.
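The four IoU thresholds described above define two labeling rules, one for ROIs and one for anchor boxes. The sketch below encodes them as plain functions. The argument names `min_overlap` and `max_overlap` are hypothetical, and the "ignored" label for values that fall in neither range is an assumption borrowed from common Faster R-CNN practice, since the text above does not say what happens to those boxes. Behavior exactly at a threshold is also an assumption.

```python
def label_roi(best_iou, min_overlap, max_overlap):
    """Label an ROI from its best IoU with any groundtruth box."""
    if best_iou >= max_overlap:
        return "positive"
    if best_iou >= min_overlap:
        return "negative"  # background
    return "ignored"       # assumption: not used as a training sample


def label_anchor(best_iou, min_overlap, max_overlap):
    """Label an anchor box from its best IoU with any groundtruth box."""
    if best_iou > max_overlap:
        return "positive"
    if best_iou < min_overlap:
        return "negative"
    return "ignored"       # assumption: between the two thresholds


print(label_roi(0.6, min_overlap=0.0, max_overlap=0.5))
print(label_anchor(0.5, min_overlap=0.3, max_overlap=0.7))
```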

The path to the model from which to resume an interrupted training.

The path to the pruned model that we are going to retrain.

The proto message training_config defines all the necessary parameters required for a FasterRCNN training experiment. Each parameter is described in the table below.

The parameter activation defines the type and parameter for the activation function in a FasterRCNN model. This parameter is only valid for EfficientNet.

A flag to double the pooled ROIs’ size. If this is set to True, CropAndResize will produce ROIs of size 2*pool_size, and in RCNN they will be downsampled 2x to get back to pool_size.

The output spatial size (height and width) of the pooled ROIs. Only square ROIs are supported, so this parameter is for both height and width.

The roi_pooling_config parameter defines the parameters required by the ROIPooling (CropAndResize) layer in the model. They are described in the table below.

The parameter anchor_box_config defines the anchor box sizes and aspect ratios in the FasterRCNN model. It has two sub-parameters: scale and ratio. Each of them is a list of floats as below.
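One common way to turn a list of scales and a list of ratios into anchor shapes is to pair every scale with every ratio while keeping the area fixed per scale. The sketch below follows that convention. It assumes that ratio means width divided by height and that a scale is a side length in pixels; neither convention is confirmed by the text above, so treat the output as illustrative only.

```python
import math


def anchor_shapes(scales, ratios):
    """Return (width, height) for every scale and ratio combination."""
    shapes = []
    for scale in scales:
        for ratio in ratios:
            # Assumed convention: ratio = width / height, area = scale ** 2.
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            shapes.append((width, height))
    return shapes


for width, height in anchor_shapes([64.0, 128.0], [1.0, 0.5, 2.0]):
    print(round(width, 1), round(height, 1))
```

With this convention the number of anchor shapes per location is the product of the two list lengths.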

The maximum number of objects in an image depends on the dataset. It is important to set the parameter max_objects_num_per_image to be no less than this number. Otherwise, training will fail.

The image scaling factor to scale the images. Each pixel value will be divided by this number.
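The preprocessing implied by the per-channel mean values and the scaling factor can be written out for a single pixel. The sketch below subtracts the mean and then divides by the scaling factor, in that order. The function name and the sample mean values are hypothetical and chosen only for the example.

```python
def preprocess_pixel(pixel, channel_means, scaling_factor):
    """Subtract the per-channel mean, then divide by the scaling factor."""
    if len(pixel) != len(channel_means):
        raise ValueError("pixel and channel_means must have the same length")
    return [(value - mean) / scaling_factor
            for value, mean in zip(pixel, channel_means)]


# Hypothetical 3-channel pixel with illustrative mean values.
print(preprocess_pixel([120.0, 130.0, 140.0], [100.0, 110.0, 120.0], 1.0))
# Zero means and a scaling factor of 255 map pixel values into [0, 1].
print(preprocess_pixel([255.0, 0.0, 51.0], [0.0, 0.0, 0.0], 255.0))
```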

Proto message with only one min parameter to specify the smaller side size in pixels.

The input_image_config defines the supported format of images by FasterRCNN model. We can customize the input image size, the per-channel mean values and scaling factor for image preprocessing. We can also specify the image type (RGB or grayscale) for our training/validation dataset, and the order of the channel if we are going to use RGB images during training. This is described in the table below in detail.

Each of the above proto message parameters will be described in detail below.

Defines the activation function used in the model. Only valid for EfficientNet. For INT8 deployment, EfficientNet with relu activation will produce much better accuracy (mAP) than the original swish activation.

A flag to use pooling layers in the model or not. This parameter is valid only for VGG and ResNet. If set to True, pooling layers will be used in the model (producing the same model structures as in the papers). Otherwise, strided convolutional layers will be used and pooling layers will be omitted.

A flag to replace all the shortcut layers with projection layers in the model. Only valid for ResNet and MobileNet V2.

A flag to use bias for convolutional layers in the model. If the model has BatchNormalization layers, we usually set it to False.

List of ints. For ResNet, the valid block IDs for freezing are any subset of {0, 1, 2, 3} (inclusive). For VGG, the valid block IDs are any subset of {1, 2, 3, 4, 5} (inclusive). For GoogLeNet, the valid block IDs are any subset of {0, 1, 2, 3, 4, 5, 6, 7} (inclusive). For MobileNet V1, the valid block IDs are any subset of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11} (inclusive). For MobileNet V2, the valid block IDs are any subset of {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13} (inclusive). For DarkNet, the valid block IDs are any subset of {0, 1, 2, 3, 4, 5} (inclusive). For EfficientNet, the valid block IDs are any subset of {0, 1, 2, 3, 4, 5, 6, 7} (inclusive).

The list of block IDs to freeze during training. Sometimes we want to freeze some blocks in the model after loading the pretrained weights, for example to save GPU memory or to make the training process more stable.

The dropout rate applies to the Dropout layers in the model (if there are any). Currently only VGG 16/19 and EfficientNet have Dropout layers.

A flag to freeze all the BatchNormalization layers in the model. Freezing a BatchNormalization layer means freezing its moving mean and moving variance while its gamma and beta parameters are still trainable. This is usually used in FasterRCNN training with a small batch size so the moving means and moving variances are initialized from the pretrained model and fixed during training.

Here a notational convention can be used, i.e., for models that can have different numbers of layers, use a colon followed by the layer number as the suffix of the model name. E.g., resnet:<layer_number>
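The colon convention is easy to handle in tooling that reads or generates spec files. The sketch below splits such a string into a model name and an optional layer number. It only parses the notation and does not check which names or layer numbers TAO accepts; the function name is hypothetical.

```python
def parse_arch(arch):
    """Split 'name:<layer_number>' into (name, layer_number or None)."""
    name, separator, suffix = arch.partition(":")
    if not separator:
        return name, None
    # int() raises ValueError if the suffix is not a whole number.
    return name, int(suffix)


print(parse_arch("resnet:18"))
print(parse_arch("googlenet"))
```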

str type. The architecture can be ResNet, VGG, GoogLeNet, MobileNet or DarkNet. For each specific architecture, it can have different layers or versions. Details listed below.

Defines the input image format, including the image channel number, channel order, width and height, and the preprocessing applied (subtracting the per-channel mean and dividing by a scaling factor) before the image is fed into the model. See below for details.

The model_config defines the FasterRCNN model architecture. In this parameter, we can choose the backbone of the FasterRCNN model, whether or not to enable BatchNormalization layers, whether or not to freeze the BatchNormalization layers during training, and whether or not to freeze some blocks in the model during training. With this parameter, we can define a specialized FasterRCNN model architecture from the general FasterRCNN application, according to the use case. This parameter is described in detail in the table below.

The augmentation_config defines the data augmentation during the training of a FasterRCNN model. The definition of FasterRCNN data augmentation is identical to that of DetectNet_v2. Check the DetectNet_v2 augmentation_config documentation for the details of this parameter.

The dataset_config defines the dataset of a FasterRCNN experiment (including the training dataset and the validation dataset). The definition of the FasterRCNN dataset is identical to that of DetectNet_v2. Check the DetectNet_v2 dataset_config documentation for the details of this parameter.

The configuration of the dataset. This is the same as dataset_config in DetectNet_v2.

The encoding and decoding key for the TAO models. It can be overridden by the command-line arguments of tao model faster_rcnn train, tao model faster_rcnn evaluate, and tao model faster_rcnn inference.

The experiment specification (spec file for short) defines all the parameters required in the entire workflow of a FasterRCNN model, from training to export. Below is a sample of the FasterRCNN spec file. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested proto message. The top-level structure of the spec file is summarized in the table below. From the table, we can see the spec file has 9 components: random_seed, verbose, enc_key, dataset_config, augmentation_config, model_config, training_config, inference_config, and evaluation_config.

Training the model

To run training of a FasterRCNN model, use this command:

tao model faster_rcnn train [ -h ] -e <experiment_spec> -r <results_dir> [ -k <enc_key> ] [ --gpus <num_gpus> ] [ --num_processes <number_of_processes> ] [ --gpu_index <gpu_index> ] [ --use_amp ] [ --log_file <log_file_path> ]

Required Arguments

-e, --experiment_spec_file : Experiment specification file to set up the training experiment.

-r, --results_dir : Output directory of the training experiment.

Optional Arguments

-h, --help : Show this help message and exit.

-k, --enc_key : TAO encoding key. If given, it overrides the one in the spec file.

--gpus : The number of GPUs to be used in the training in a multi-GPU scenario (default: 1).

--num_processes, -np : Number of processes to be spawned for training. It defaults to -1 (equal to --gpus, for the data parallelism use case). In the case of model parallelism, this argument should be explicitly set to 1 or more, depending on the actual scenario. Setting --gpus larger than 1 and --num_processes to 1 corresponds to the model parallelism use case, while setting both --gpus and --num_processes larger than 1 corresponds to enabling both model parallelism and data parallelism. For example, --gpus=4 and --num_processes=2 means 2 Horovod processes will be spawned and each of them will occupy 2 GPUs for model parallelism.
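The interaction between --gpus and --num_processes can be summarized in a few lines of code. The helper below is a hypothetical illustration, not part of TAO. It assumes that the GPUs are divided evenly among the processes and derives the mode from the number of GPUs each process ends up with.

```python
def describe_parallelism(gpus, num_processes=-1):
    """Return (processes, gpus_per_process, mode) for a training launch."""
    if num_processes == -1:
        num_processes = gpus  # documented default: equal to --gpus
    if num_processes < 1 or gpus % num_processes != 0:
        # Assumption: GPUs are shared evenly among the processes.
        raise ValueError("gpus must be a positive multiple of num_processes")
    gpus_per_process = gpus // num_processes
    data_parallel = num_processes > 1
    model_parallel = gpus_per_process > 1
    if data_parallel and model_parallel:
        mode = "model+data"
    elif model_parallel:
        mode = "model"
    elif data_parallel:
        mode = "data"
    else:
        mode = "single"
    return num_processes, gpus_per_process, mode


print(describe_parallelism(4, 2))  # the --gpus=4 --num_processes=2 example
print(describe_parallelism(4))     # default: one process per GPU
print(describe_parallelism(4, 1))  # one process spanning all GPUs
```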

--gpu_index : The GPU indices used to run the training. We can specify the GPU indices used to run training when the machine has multiple GPUs installed.

--use_amp : A flag to enable AMP training.

--log_file : Path to the log file. Defaults to stdout.

Input Requirement

Input size : C * W * H (where C = 1 or 3, W >= 128, H >= 128)

Image format : JPG, JPEG, PNG

Label format: KITTI detection

Sample Usage

Here’s an example of using the FasterRCNN training command:

tao model faster_rcnn train --gpu_index 0 -e <experiment_spec> -r <results_dir>

Using a Pretrained Model

Usually, using a pretrained model (weights) file for the initial training of FasterRCNN helps achieve better accuracy. NVIDIA recommends using the pretrained weights provided in NVIDIA GPU Cloud (NGC).

FasterRCNN loads the pretrained weights by name. That is, layer by layer, if TAO finds a layer in the pretrained weights file whose name and weights (bias) shape match a layer in the TAO model, it loads that layer’s weights (and bias, if any) into the model. If a layer in the TAO model has no matching layer in the pretrained weights, TAO skips that layer and uses random initialization for it instead. If TAO finds a layer with a matching name but the shape of its pretrained weights (or bias, if any) does not match the shape of the weights (bias) of the corresponding layer in the TAO model, it also skips that layer. Layers that have no weights (bias) are skipped as well, because there is nothing to load.

So, in total, there are three possible statuses that indicate how the pretrained weights were loaded for a layer:

"Yes" means a layer has weights (bias) and they were loaded successfully from the pretrained weights file for initialization.

"No" means a layer has weights (bias) but, due to a mismatched weights (bias) shape (or possibly something else), the weights (bias) could not be loaded, and random initialization is used instead.

"None" means a layer has no weights (bias) at all and does not load any weights.

In the FasterRCNN training log, there is a table that shows the pretrained weights loading status for each layer in the model.

To use a pretrained model in FasterRCNN training, set the pretrained_weights path to point to a pretrained .tlt model (generated with the same encryption key as the FasterRCNN training), a Keras .hdf5 model, or a Keras .h5 weights file.

Note: At the start of the training, FasterRCNN prints the pretrained model loading status (per layer). If the model gives bad mAP, double-check this log to see whether the pretrained model was loaded properly.

Note: FasterRCNN does not support loading a non-QAT pruned model and retraining it with QAT enabled. To make the retrained model a QAT model, the initial training must also be done with QAT enabled.
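The by-name loading rule and its three statuses can be mimicked with plain dictionaries. The sketch below is an illustration of the rule as described in this section, not TAO code: layers are represented by hypothetical layer names mapped to lists of weight shapes, and a layer counts as matched only when both the name and all shapes agree.

```python
def loading_status(model_layers, pretrained_layers):
    """Return a per-layer status of "Yes", "No" or "None".

    Both arguments map a layer name to a list of weight shapes,
    for example {"conv1": [(3, 3, 3, 64), (64,)]}.
    """
    status = {}
    for name, shapes in model_layers.items():
        if not shapes:
            status[name] = "None"  # layer has no weights to load
        elif pretrained_layers.get(name) == shapes:
            status[name] = "Yes"   # same name and same shapes
        else:
            status[name] = "No"    # missing or mismatched: random init
    return status


model = {
    "conv1": [(3, 3, 3, 64), (64,)],
    "relu1": [],
    "dense_class": [(4096, 4)],
    "dense_regress": [(4096, 12)],
}
pretrained = {
    "conv1": [(3, 3, 3, 64), (64,)],
    "dense_class": [(4096, 1000)],
}
for layer, state in loading_status(model, pretrained).items():
    print(layer, state)
```

A long run of "No" entries for backbone layers in the real training log is the kind of pattern worth investigating when mAP is unexpectedly low.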

Re-training a pruned model

A FasterRCNN model can be retrained one or more times. The typical use case is retraining a pruned model. To retrain an existing FasterRCNN model, set the retrain_pruned_model path to point to an existing FasterRCNN model.

Resuming an interrupted training

Sometimes a training job is interrupted for some reason (e.g., a system crash). In these cases, there is no need to redo the training from the start. We can resume the interrupted training from the last checkpoint (a .tlt model saved during training). To do so, set the resume_from_model path in the spec file to point to the last checkpoint and re-run the training.

Input shape: static and dynamic

FasterRCNN training supports both static input shape and dynamic input shape.

Static input shape means the input’s width and height are constant numbers, like 960 x 544. Static shape is the most commonly used case in practice. To enable static input shape, specify it in both input_image_config and augmentation_config. Use size_height_width in input_image_config to specify the input height and width, and specify the same two numbers as output_image_height and output_image_width in augmentation_config.

With static input shape, we can either resize the images offline to the target resolution or enable automatic resize during training. Setting enable_auto_resize in augmentation_config to True enables automatic resize during training. Automatic resize removes the effort of manually resizing the images each time we want to train the model at a different resolution. But since the resize happens during training, it can increase the training time. Users should weigh this tradeoff between offline resize and automatic (online) resize.

Dynamic input shape means the input’s height and width are not constant but can change during training from image to image. This kind of input shape was originally proposed in the literature (such as the FasterRCNN paper), where the image is resized, keeping the aspect ratio, such that the smaller side of the resulting image is a given number. Besides the limit on the smaller side, there is also a limit on the larger side. If resizing with the aspect ratio kept makes the larger side of the resulting image exceed this limit, then the image is instead resized, again keeping the aspect ratio, such that the larger side equals the limit. In that case, the smaller side will also be no more than its limit. FasterRCNN supports this kind of dynamic input shape.

To enable this feature, specify size_min in input_image_config and specify output_image_min and output_image_max in augmentation_config. size_min and output_image_min indicate the limit on the smaller side’s size, while output_image_max indicates the limit on the larger side’s size.

Note that there are some limitations regarding the dynamic shape of FasterRCNN:

TAO FasterRCNN training/evaluation/inference can only work with batch size 1.

TAO FasterRCNN export & DeepStream(TensorRT) inference/evaluation does not support dynamic shape for now.
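The dynamic-shape resize rule described in this section can be expressed as a short function. The sketch below follows the rule as written: scale so that the smaller side reaches its limit, and fall back to the larger-side limit if that would be exceeded. The function and argument names are hypothetical (min_side plays the role of size_min and output_image_min, max_side the role of output_image_max), and rounding to whole pixels is an assumption.

```python
def dynamic_resize(width, height, min_side, max_side):
    """Return the resized (width, height) with the aspect ratio kept."""
    # First try to bring the smaller side to min_side.
    scale = min_side / min(width, height)
    if max(width, height) * scale > max_side:
        # The larger side would exceed its limit, so pin it to max_side.
        scale = max_side / max(width, height)
    return round(width * scale), round(height * scale)


print(dynamic_resize(800, 600, min_side=600, max_side=1000))
print(dynamic_resize(2000, 1000, min_side=600, max_side=1000))
print(dynamic_resize(1000, 500, min_side=600, max_side=1500))
```

In the second call the first rule would give a 1200-pixel larger side, so the larger-side limit takes over and the smaller side ends up below 600.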