Creating an Experiment Spec File
This chapter describes how to create a specification file for model training, inference and evaluation.
Here is an example of a specification file for a classification model:
model_config {
# Model architecture can be chosen from:
# ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet', 'darknet']
arch: "resnet"
# for resnet --> n_layers can be [10, 18, 34, 50, 101]
# for vgg --> n_layers can be [16, 19]
# for darknet --> n_layers can be [19, 53]
n_layers: 18
use_bias: True
use_batch_norm: True
all_projections: True
use_pooling: False
freeze_bn: False
freeze_blocks: 0
freeze_blocks: 1
# image size should be "3, X, Y", where X,Y >= 16
input_image_size: "3,224,224"
}
eval_config {
eval_dataset_path: "/path/to/your/eval/data"
model_path: "/path/to/your/model"
top_k: 3
batch_size: 256
n_workers: 8
}
train_config {
train_dataset_path: "/path/to/your/train/data"
val_dataset_path: "/path/to/your/val/data"
pretrained_model_path: "/path/to/your/pretrained/model"
# optimizer can be chosen from ['adam', 'sgd']
optimizer: "sgd"
batch_size_per_gpu: 256
n_epochs: 80
n_workers: 16
# regularizer
reg_config {
type: "L2"
scope: "Conv2D,Dense"
weight_decay: 0.00005
}
# learning_rate
lr_config {
# "step" and "soft_anneal" are supported.
scheduler: "soft_anneal"
# "soft_anneal" stands for soft annealing learning rate scheduler.
# the following 4 parameters should be specified if "soft_anneal" is used.
learning_rate: 0.005
soft_start: 0.056
annealing_points: "0.3, 0.6, 0.8"
annealing_divider: 10
# "step" stands for step learning rate scheduler.
# the following 3 parameters should be specified if "step" is used.
# learning_rate: 0.006
# step_size: 10
# gamma: 0.1
# "cosine" stands for soft start cosine learning rate scheduler.
# the following 2 parameters should be specified if "cosine" is used.
# learning_rate: 0.05
# soft_start: 0.01
}
}
The classification experiment specification can be used with the tlt-train and tlt-evaluate commands. It consists of three main components:
- model_config
- eval_config
- train_config
Model Config
The table below describes the configurable parameters in the model_config.

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| all_projections | bool | False | For templates with shortcut connections, this parameter defines whether or not all shortcuts should be instantiated with 1x1 projection layers, irrespective of whether there is a change in stride across the input and output. | True/False (only to be used in resnet templates) |
| arch | string | resnet | This defines the architecture of the backbone feature extractor to be used for training. | resnet, vgg, googlenet, alexnet, mobilenet_v1, mobilenet_v2, squeezenet, darknet |
| n_layers | int | 18 | Depth of the feature extractor for scalable templates. | resnet: 10, 18, 34, 50, 101; vgg: 16, 19; darknet: 19, 53 |
| use_pooling | Boolean | False | Choose between using strided convolutions or MaxPooling while downsampling. When True, MaxPooling is used to downsample; however, for object detection networks, NVIDIA recommends setting this to False and using strided convolutions. | True/False |
| use_batch_norm | Boolean | False | Boolean variable to use batch normalization layers or not. | True/False |
| freeze_blocks | float (repeated) | | This parameter defines which blocks of the instantiated feature extractor template may be frozen, and differs between feature extractor templates. | |
| freeze_bn | Boolean | False | You can choose to freeze the Batch Normalization layers in the model during training. | True/False |
| input_image_size | String | "3,224,224" | The dimension of the input layer of the model. Images in the dataset will be resized to this shape by the dataloader when fed to the model for training. | "C,X,Y", where C=1 or C=3 and X,Y are integers >= 16. |
Eval Config
The table below defines the configurable parameters for evaluating a classification model.
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| eval_dataset_path | string | | UNIX format path to the root directory of the evaluation dataset. | UNIX format path. |
| model_path | string | | UNIX format path to the model file you would like to evaluate. | UNIX format path. |
| top_k | int | 5 | The number of elements to look at when calculating the top-K classification categorical accuracy metric. | 1, 3, 5 |
| conf_threshold | float | 0.5 | The confidence threshold on the argmax of the classifier output for a prediction to be counted as a true positive. | >0.0 |
| batch_size | int | 256 | Number of images per batch when evaluating the model. | >1 (bound by the number of images that can fit in GPU memory) |
| n_workers | int | 8 | Number of workers fetching batches of images in the evaluation dataloader. | >1 |
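As a concrete reference for the top_k metric, here is a minimal NumPy sketch of top-K categorical accuracy (an illustration of the metric, not the TLT implementation):

```python
import numpy as np

def top_k_accuracy(probs, labels, k=3):
    # Fraction of samples whose true label is among the k highest-scoring classes.
    # probs:  (N, num_classes) classifier scores; labels: (N,) integer class ids.
    top_k = np.argsort(probs, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Example: 2 samples, 4 classes; both true labels land in the top 3.
probs = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.7, 0.1, 0.1, 0.1]])
labels = np.array([1, 0])
print(top_k_accuracy(probs, labels, k=3))  # 1.0
```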
Training Config
This section defines the configurable parameters for the classification model trainer.
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| val_dataset_path | string | | UNIX format path to the root directory of the validation dataset. | UNIX format path. |
| train_dataset_path | string | | UNIX format path to the root directory of the training dataset. | UNIX format path. |
| pretrained_model_path | string | | UNIX format path to the model file containing the pretrained weights to initialize the model from. | UNIX format path. |
| batch_size_per_gpu | int | 32 | This parameter defines the number of images per batch per GPU. | >1 |
| n_epochs | int | 120 | This parameter defines the total number of epochs to run the experiment for. | |
| n_workers | int | | Number of workers fetching batches of images in the training dataloader. | >1 |
| lr_config | learning rate scheduler proto | | This nested protobuf parameter defines the learning rate schedule to be used with the trainer. Three schedulers are supported: soft_anneal (parameters: learning_rate, soft_start, annealing_points, annealing_divider), step (parameters: learning_rate, step_size, gamma) and cosine (parameters: learning_rate, soft_start). | soft_start: 0.0 - 1.0; annealing_divider: > 1.0; annealing_points: 0.0 - 1.0; gamma: 0.0 - 1.0 |
| reg_config | regularizer proto config | | This parameter configures the type and the weight of the regularizer to be used during training. The three parameters are type, scope and weight_decay. | The supported values for type are "L1", "L2" and "None". |
| optimizer | string | sgd | This parameter defines which optimizer to use for training. | "adam", "sgd" |
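The scheduler parameters above map onto fairly standard schedules. The sketch below is one plausible interpretation of the three schedulers, using the parameter names and values from the sample spec; the exact warm-up and interpolation used by TLT may differ:

```python
import math

def step_lr(epoch, learning_rate=0.006, step_size=10, gamma=0.1):
    # Step: multiply the base rate by gamma every step_size epochs
    # (step_size interpreted here as an epoch count).
    return learning_rate * gamma ** (epoch // step_size)

def soft_anneal_lr(progress, learning_rate=0.005, soft_start=0.056,
                   annealing_points=(0.3, 0.6, 0.8), annealing_divider=10.0):
    # Soft annealing: warm up over the first soft_start fraction of training,
    # then divide the rate by annealing_divider at each annealing point.
    # progress is the fraction of training completed, in [0, 1].
    if progress < soft_start:
        return learning_rate * progress / soft_start
    drops = sum(progress >= p for p in annealing_points)
    return learning_rate / annealing_divider ** drops

def cosine_lr(progress, learning_rate=0.05, soft_start=0.01):
    # Soft-start cosine: warm up over soft_start, then cosine-decay toward zero.
    if progress < soft_start:
        return learning_rate * progress / soft_start
    t = (progress - soft_start) / (1.0 - soft_start)
    return 0.5 * learning_rate * (1.0 + math.cos(math.pi * t))
```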
Specification File for DetectNet_v2
To perform training, evaluation and inference for DetectNet_v2, several components need to be configured, each with their own parameters. The tlt-train and tlt-evaluate commands for a DetectNet_v2 experiment share the same configuration file. The tlt-infer command uses a separate configuration file.
The training and inference tools use a specification file for object detection. The specification file for detection training configures these components of the training pipeline:
- Model
- BBox ground truth generation
- Post processing module
- Cost function configuration
- Trainer
- Augmentation module
- Evaluator
- Dataloader
Model Config
Core object detection can be configured using the model_config option in the spec file.
Here’s a sample model config to instantiate a resnet18 model with pretrained weights and freeze blocks 0 and 1, with all shortcuts set to projection layers.
# Sample model config to instantiate a resnet18 model with pretrained weights and freeze blocks 0, 1
# with all shortcuts having projection layers.
model_config {
arch: "resnet"
pretrained_model_file: <path_to_model_file>
freeze_blocks: 0
freeze_blocks: 1
all_projections: True
num_layers: 18
use_pooling: False
use_batch_norm: True
dropout_rate: 0.0
training_precision: {
backend_floatx: FLOAT32
}
objective_set: {
cov {}
bbox {
scale: 35.0
offset: 0.5
}
}
}
The following table describes the model_config parameters:

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| all_projections | bool | False | For templates with shortcut connections, this parameter defines whether or not all shortcuts should be instantiated with 1x1 projection layers, irrespective of whether there is a change in stride across the input and output. | True/False (only to be used in resnet templates) |
| arch | string | resnet | This defines the architecture of the backbone feature extractor to be used for training. | resnet, vgg, mobilenet_v1, mobilenet_v2, googlenet |
| num_layers | int | 18 | Depth of the feature extractor for scalable templates. | resnet: 10, 18, 34, 50, 101; vgg: 16, 19 |
| pretrained_model_file | string | | This parameter defines the path to a pretrained TLT model file. If load_graph is False, only the weights from this file are loaded; if load_graph is True, the graph is imported as well. | Unix path |
| use_pooling | Boolean | False | Choose between using strided convolutions or MaxPooling while downsampling. When True, MaxPooling is used to downsample; however, for the object detection network, NVIDIA recommends setting this to False and using strided convolutions. | True/False |
| use_batch_norm | Boolean | False | Boolean variable to use batch normalization layers or not. | True/False |
| objective_set | Proto Dictionary | | This defines the objectives this network is being trained for. For object detection networks, set it to learn cov and bbox. These parameters should not be altered for the current training pipeline. | cov {} bbox { scale: 35.0 offset: 0.5 } |
| dropout_rate | Float | 0.0 | Probability for dropout. | 0.0 - 0.1 |
| training_precision | Proto Dictionary | | Contains a nested parameter that sets the precision of the back-end training framework. | backend_floatx: FLOAT32 |
| load_graph | Boolean | False | Flag to define whether to load the graph from the pretrained model file, or just the weights. For a pruned model, remember to set this parameter to True. Pruning modifies the original graph, so the pruned model graph and the weights need to be imported. | True/False |
| freeze_blocks | float (repeated) | | This parameter defines which blocks of the instantiated feature extractor template may be frozen, and differs between feature extractor templates. | |
| freeze_bn | Boolean | False | You can choose to freeze the Batch Normalization layers in the model during training. | True/False |
BBox Ground Truth Generator
DetectNet_v2 generates 2 tensors, cov and bbox. The image is divided into 16x16 grid cells. The cov tensor (short for coverage tensor) defines the grid cells that are covered by an object. The bbox tensor defines the normalized image coordinates of the object: (x1, y1) for the top left and (x2, y2) for the bottom right, with respect to the grid cell. For best results, you can assume the coverage area to be an ellipse within the bbox label, with the maximum confidence assigned to the cells in the center and coverage reducing outwards. Each class has its own coverage and bbox tensor, so the shapes of the tensors are:
- cov: (batch_size, num_classes, image_height/16, image_width/16)
- bbox: (batch_size, num_classes * 4, image_height/16, image_width/16), where 4 is the number of coordinates per cell
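For a concrete feel for these shapes, the following sketch computes them for a given input resolution (the stride of 16 comes from the 16x16 grid-cell size above):

```python
def detectnet_output_shapes(batch_size, num_classes, image_height, image_width,
                            stride=16):
    # Grid dimensions follow from dividing the image into 16x16 cells.
    grid_h, grid_w = image_height // stride, image_width // stride
    cov_shape = (batch_size, num_classes, grid_h, grid_w)
    bbox_shape = (batch_size, num_classes * 4, grid_h, grid_w)  # 4 coords per cell
    return cov_shape, bbox_shape

# A 3-class detector at 960x544 input resolution:
print(detectnet_output_shapes(4, 3, 544, 960))
# ((4, 3, 34, 60), (4, 12, 34, 60))
```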
Here is a sample rasterizer config for a 3 class detector:
# Sample rasterizer configs to instantiate a 3 class bbox rasterizer
bbox_rasterizer_config {
target_class_config {
key: "car"
value: {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4
cov_radius_y: 0.4
bbox_min_radius: 1.0
}
}
target_class_config {
key: "cyclist"
value: {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4
cov_radius_y: 0.4
bbox_min_radius: 1.0
}
}
target_class_config {
key: "pedestrian"
value: {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4
cov_radius_y: 0.4
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.67
}
The bbox_rasterizer has the following configurable parameters:

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| deadzone_radius | float | 0.67 | The area around the ellipse of an object to be considered dormant (i.e., containing no bboxes). This is particularly useful in cases of overlapping objects, so that foreground objects and background objects are not confused. | 0 - 1.0 |
| target_class_config | proto dictionary | | This nested configuration field defines the coverage region for an object of a given class, and is repeated for each class. Its configurable parameters are cov_center_x, cov_center_y, cov_radius_x, cov_radius_y and bbox_min_radius. | |
Post processor
The post processor module generates renderable bounding boxes from the raw detection output. The process includes the following steps, sketched in code after the list:
- Filtering out valid detections by thresholding objects using the confidence value in the coverage tensor
- Clustering the raw filtered predictions using DBSCAN to produce the final rendered bounding boxes
- Filtering out weaker clusters based on the final confidence threshold derived from the candidate boxes that get grouped into a cluster
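Here is a small, self-contained sketch of that flow. It uses a greedy IoU grouping as a simplified stand-in for DBSCAN, with hypothetical inputs; it is not the TLT implementation:

```python
def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def cluster_detections(boxes, cov, coverage_threshold=0.005, dbscan_eps=0.15,
                       dbscan_min_samples=0.05, minimum_bounding_box_height=20):
    # Stage 1: keep only candidates above the coverage threshold.
    cands = [(b, c) for b, c in zip(boxes, cov) if c > coverage_threshold]
    cands.sort(key=lambda bc: -bc[1])
    # Stage 2: greedily group candidates that lie within dbscan_eps of a
    # cluster seed in (1 - IoU) distance (a stand-in for true DBSCAN).
    clusters = []
    for b, c in cands:
        for cl in clusters:
            if 1.0 - iou(b, cl["box"]) <= dbscan_eps:
                cl["weight"] += c
                break
        else:
            clusters.append({"box": b, "weight": c})
    # Stage 3: drop weak clusters and boxes that are too short.
    return [(cl["box"], cl["weight"]) for cl in clusters
            if cl["weight"] >= dbscan_min_samples
            and cl["box"][3] - cl["box"][1] >= minimum_bounding_box_height]

boxes = [[100, 100, 200, 300], [102, 98, 198, 305], [400, 50, 420, 80]]
print(cluster_detections(boxes, [0.6, 0.5, 0.004]))
# [([100, 100, 200, 300], 1.1)] -- the two overlapping boxes merge into one detection
```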
Here is an example of the definition of the postprocessor for a 3 class network learning for car, cyclist, and pedestrian:
postprocessing_config {
target_class_config {
key: "car"
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "cyclist"
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 20
}
}
}
target_class_config {
key: "pedestrian"
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 20
}
}
}
}
This section defines the parameters that configure the post processor. For each class you train for, the postprocessing_config has a target_class_config element that defines the clustering parameters for that class. The parameters for each target class include:

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| key | string | | The name of the class for which the post processor module is being configured. | The network object class names, as mentioned in the cost_function_config. |
| value | clustering_config proto | | The nested clustering config proto parameter that configures the postprocessor module. The parameters for this module are defined in the next table. | Encapsulated object with the parameters defined below. |
The clustering_config element configures the clustering block for this class. Here are its parameters:

| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| coverage_threshold | float | | The minimum threshold on the coverage tensor output for a candidate box to be considered valid for clustering. The 4 coordinates from the bbox tensor at the corresponding indices are passed for clustering. | 0.0 - 1.0 |
| dbscan_eps | float | | The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. The greater the eps, the more boxes are grouped together. | 0.0 - 1.0 |
| dbscan_min_samples | float | | The total weight in a neighborhood for a point to be considered a core point. This includes the point itself. | 0.0 - 1.0 |
| minimum_bounding_box_height | int | | Minimum height in pixels for a valid detection after clustering. | 0 - input image height |
Cost Function
This section helps you configure the cost function to include the classes that you are training for. For each class you want to train, add a new entry of the target classes to the spec file. NVIDIA recommends keeping the remaining parameters in this section unchanged for best performance with these classes.
cost_function_config {
target_classes {
name: "car"
class_weight: 1.0
coverage_foreground_weight: 0.05
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: "cyclist"
class_weight: 1.0
coverage_foreground_weight: 0.05
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 1.0
}
}
target_classes {
name: "pedestrian"
class_weight: 1.0
coverage_foreground_weight: 0.05
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: True
max_objective_weight: 0.9999
min_objective_weight: 0.0001
}
Trainer
Here’s a sample training_config block to configure a detectnet_v2 trainer:
training_config {
batch_size_per_gpu: 16
num_epochs: 80
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-6
max_learning_rate: 5e-4
soft_start: 0.1
annealing: 0.7
}
}
regularizer {
type: L1
weight: 3e-9
}
optimizer {
adam {
epsilon: 1e-08
beta1: 0.9
beta2: 0.999
}
}
cost_scaling {
enabled: False
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
}
The following table describes the parameters used to configure the trainer:
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| batch_size_per_gpu | int | 32 | This parameter defines the number of images per batch per GPU. | >1 |
| num_epochs | int | 120 | This parameter defines the total number of epochs to run the experiment for. | |
| enable_qat | bool | False | This parameter enables training a model using Quantization Aware Training (QAT). For more information about QAT, see Quantization Aware Training. | True, False |
| learning_rate | learning rate scheduler proto | soft_start_annealing_schedule | This parameter configures the learning rate schedule for the trainer. Currently, detectnet_v2 only supports the soft-start annealing learning rate schedule, which may be configured using the parameters min_learning_rate, max_learning_rate, soft_start and annealing. | soft_start: 0.0 - 1.0; annealing: 0.0 - 1.0 and greater than soft_start. A sample lr plot for a soft_start of 0.3 and an annealing of 0.7 is shown in the figure below. |
| regularizer | regularizer proto config | | This parameter configures the type and the weight of the regularizer to be used during training. The two parameters are type and weight. | The supported values for type are NO_REG, L1 and L2. |
| optimizer | optimizer proto config | | This parameter defines which optimizer to use for training, and the parameters to configure it: for the adam optimizer, epsilon, beta1 and beta2. | |
| cost_scaling | costscaling_config | | This parameter enables cost scaling during training. Leave this parameter untouched currently for the detectnet_v2 training pipeline. | cost_scaling { enabled: False initial_exponent: 20.0 increment: 0.005 decrement: 1.0 } |
| checkpoint_interval | float | 0/10 | The interval (in epochs) at which tlt-train saves intermediate models. | 0 to num_epochs |
Detectnet_v2 currently supports the soft-start annealing learning rate schedule. When plotted as a function of training progress (0.0, 1.0), the learning rate results in the following curve. In this experiment, the soft start was set to 0.3 and annealing to 0.7, with a minimum learning rate of 5e-6 and a maximum learning rate (base_lr) of 5e-4.
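The curve can be reproduced with a short script. This is one common formulation (exponential ramp and decay) consistent with the description above; treat the exact interpolation as an assumption rather than the TLT implementation:

```python
import math

def soft_start_annealing_lr(progress, min_lr=5e-6, max_lr=5e-4,
                            soft_start=0.3, annealing=0.7):
    # Exponential ramp from min_lr to max_lr over [0, soft_start], hold at
    # max_lr until annealing, then exponential decay back toward min_lr.
    # progress is the fraction of training completed, in [0, 1].
    ratio = math.log(max_lr / min_lr)
    if progress < soft_start:
        return min_lr * math.exp(ratio * progress / soft_start)
    if progress < annealing:
        return max_lr
    t = (progress - annealing) / (1.0 - annealing)
    return max_lr * math.exp(-ratio * t)

for p in (0.0, 0.15, 0.3, 0.5, 0.7, 0.85, 1.0):
    print(f"{p:.2f} -> {soft_start_annealing_lr(p):.2e}")
```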
NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more easily pruned. After pruning, when retraining the networks, NVIDIA recommends turning regularization off by setting the regularization type to NO_REG.
Augmentation Module
The augmentation module provides some basic pre-processing and augmentation when training. Here is a sample augmentation_config element:
# Sample augmentation config
augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
output_image_channel: 3
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
color_shift_stddev: 0.0
hue_rotation_max: 25.0
saturation_shift_max: 0.2
contrast_scale_max: 0.1
contrast_center: 0.5
}
}
If the output image height and width of the preprocessing block do not match the dimensions of the input image, the dataloader either pads with zeros or crops to fit the output resolution. It does not resize the input images and labels to fit.
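A minimal sketch of that pad-or-crop behavior (illustrative, single channel-last image; the actual dataloader also transforms the labels):

```python
import numpy as np

def pad_or_crop(image, out_h, out_w):
    # Fit an (H, W, C) image to (out_h, out_w, C) by zero-padding or cropping.
    # No resizing is performed, mirroring the preprocessing block's behavior.
    h, w, c = image.shape
    out = np.zeros((out_h, out_w, c), dtype=image.dtype)
    copy_h, copy_w = min(h, out_h), min(w, out_w)
    out[:copy_h, :copy_w] = image[:copy_h, :copy_w]
    return out

img = np.ones((500, 1000, 3), dtype=np.uint8)
print(pad_or_crop(img, 544, 960).shape)  # (544, 960, 3): padded in height, cropped in width
```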
The augmentation_config contains three elements:
- preprocessing: This nested field configures the input image and ground truth label pre-processing module. It sets the shape of the input tensor to the network. The ground truth labels are pre-processed to meet the dimensions of the input image tensors.
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| output_image_width | int | -- | The width of the augmentation output. This is the same as the width of the network input and must be a multiple of 16. | >480 |
| output_image_height | int | -- | The height of the augmentation output. This is the same as the height of the network input and must be a multiple of 16. | >272 |
| output_image_channel | int | 1, 3 | The channel depth of the augmentation output. This is the same as the channel depth of the network input. Currently, 1-channel input is not recommended for datasets with JPG images; for PNG images, both 3-channel RGB and 1-channel monochrome images are supported. | 1, 3 |
| min_bbox_height | float | | The minimum height of the object labels to be considered for training. | 0 - output_image_height |
| min_bbox_width | float | | The minimum width of the object labels to be considered for training. | 0 - output_image_width |
| crop_right | int | | The right boundary of the crop to be extracted from the original image. | 0 - input image width |
| crop_left | int | | The left boundary of the crop to be extracted from the original image. | 0 - input image width |
| crop_top | int | | The top boundary of the crop to be extracted from the original image. | 0 - input image height |
| crop_bottom | int | | The bottom boundary of the crop to be extracted from the original image. | 0 - input image height |
| scale_height | float | | The floating point factor to scale the height of the cropped images by. | > 0.0 |
| scale_width | float | | The floating point factor to scale the width of the cropped images by. | > 0.0 |
- spatial_augmentation: This module supports basic spatial augmentation such as flip, zoom and translate, which may be configured.
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| hflip_probability | float | 0.5 | The probability to flip an input image horizontally. | 0.0 - 1.0 |
| vflip_probability | float | 0.0 | The probability to flip an input image vertically. | 0.0 - 1.0 |
| zoom_min | float | 1.0 | The minimum zoom scale of the input image. | > 0.0 |
| zoom_max | float | 1.0 | The maximum zoom scale of the input image. | > 0.0 |
| translate_max_x | int | 8.0 | The maximum translation to be added across the x axis. | 0.0 - output_image_width |
| translate_max_y | int | 8.0 | The maximum translation to be added across the y axis. | 0.0 - output_image_height |
| rotate_rad_max | float | 0.69 | The angle of rotation to be applied to the images and the training labels. The range is defined as [-rotate_rad_max, rotate_rad_max]. | > 0.0 (modulo 2*pi) |
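As an illustration of how a per-image spatial transform might be drawn from these ranges, here is a sketch that samples one 3x3 affine matrix. The mechanics (matrix form, sign conventions) are assumptions for illustration, not TLT code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spatial_transform(hflip_probability=0.5, zoom_min=1.0, zoom_max=1.0,
                             translate_max_x=8.0, translate_max_y=8.0,
                             image_width=960):
    # Each parameter is drawn uniformly from its configured range, per image.
    zoom = rng.uniform(zoom_min, zoom_max)
    tx = rng.uniform(-translate_max_x, translate_max_x)  # sign convention assumed
    ty = rng.uniform(-translate_max_y, translate_max_y)
    m = np.array([[zoom, 0.0, tx],
                  [0.0, zoom, ty],
                  [0.0, 0.0, 1.0]])
    if rng.random() < hflip_probability:
        # Horizontal flip: mirror x about the image width.
        flip = np.array([[-1.0, 0.0, image_width],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])
        m = flip @ m
    return m

print(sample_spatial_transform())
```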
- color_augmentation: This module configures the color space transformations, such as color shift, hue rotation, saturation shift, and contrast adjustment.
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| color_shift_stddev | float | 0.0 | The standard deviation value for the color shift. | 0.0 - 1.0 |
| hue_rotation_max | float | 25.0 | The maximum rotation angle for the hue rotation matrix. | 0.0 - 360.0 |
| saturation_shift_max | float | 0.2 | The maximum shift that changes the saturation. A value of 1.0 means no change in saturation shift. | 0.0 - 1.0 |
| contrast_scale_max | float | 0.1 | The slope of the contrast as rotated around the provided center. A value of 0.0 leaves the contrast unchanged. | 0.0 - 1.0 |
| contrast_center | float | 0.5 | The center around which the contrast is rotated. Ideally this is set to half of the maximum pixel value; since the input images are scaled between 0 and 1.0, set this value to 0.5. | 0.5 |
The dataloader online augmentation pipeline applies spatial and color-space augmentation transformations in the following order:
1. The dataloader first performs the pre-processing operations on the input data (image and labels) read from the tfrecords files. Here the images and labels are cropped and scaled based on the parameters in the preprocessing config. The boundaries for generating the cropped image and labels from the original image are defined by the crop_left, crop_right, crop_top and crop_bottom parameters. The cropped data is then scaled by the factors defined by scale_height and scale_width. The transformation matrices for these operations are computed globally and do not change per image.
2. The net tensors generated from the pre-processing block are then passed through a pipeline of random augmentations in the spatial and color domains. The spatial augmentations are applied to both the images and the label coordinates, while the color augmentations are applied only to the images. To apply color augmentations, the output_image_channel parameter must be set to 3; color augmentations are not applied to monochrome tensors. The spatial and color transformation matrices are computed per image, based on a uniform distribution along the max and min ranges defined by the spatial_augmentation and color_augmentation config parameters.
3. Once the spatially and color augmented net input tensors are generated, the output is padded with zeros or clipped along the right and bottom edges of the image to fit the output dimensions defined in the preprocessing config.
Configuring the Evaluator
The evaluator in the detection training pipeline can be configured using the evaluation_config parameters. The following is an example evaluation_config element:
# Sample evaluation config to run evaluation in integrate mode for the given 3 class model,
# at every 10th epoch starting from epoch 1.
evaluation_config {
average_precision_mode: INTEGRATE
validation_period_during_training: 10
first_validation_epoch: 1
minimum_detection_ground_truth_overlap {
key: "car"
value: 0.7
}
minimum_detection_ground_truth_overlap {
key: "person"
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: "bicycle"
value: 0.5
}
evaluation_box_config {
key: "car"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
evaluation_box_config {
key: "person"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
evaluation_box_config {
key: "bicycle"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
}
The following tables describe the parameters used to configure evaluation:
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| average_precision_mode | enum | SAMPLE | The mode in which the average precision for each class is calculated. | SAMPLE, INTEGRATE |
| validation_period_during_training | int | 10 | The interval at which evaluation is run during training, starting from the epoch given by the first_validation_epoch parameter below. | 1 - total number of epochs |
| first_validation_epoch | int | 30 | The first epoch at which to start running validation. Ideally, wait for at least 20-30% of the total number of epochs before starting evaluation, since the predictions in the initial epochs would be fairly inaccurate and too many candidate boxes sent to clustering can slow the evaluation down. | 1 - total number of epochs |
| minimum_detection_ground_truth_overlap | proto dictionary | | Minimum IOU between a ground truth box and a predicted box after clustering for the prediction to count as a valid detection. This parameter is a repeatable dictionary, and a separate one must be defined for every class. The members are key (the class name) and value (the float IOU threshold). | |
| evaluation_box_config | proto dictionary | | This nested configuration field configures the minimum and maximum box dimensions to be considered a valid ground truth and prediction for AP calculation. | |
The evaluation_box_config field has these configurable inputs:
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| minimum_height | float | 10 | Minimum height in pixels for a valid ground truth and prediction bbox. | |
| minimum_width | float | 10 | Minimum width in pixels for a valid ground truth and prediction bbox. | |
| maximum_height | float | 9999 | Maximum height in pixels for a valid ground truth and prediction bbox. | minimum_height - model image height |
| maximum_width | float | 9999 | Maximum width in pixels for a valid ground truth and prediction bbox. | minimum_width - model image width |
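In effect, each ground truth and predicted box is screened against these limits before it can contribute to AP. A one-function sketch of that check (illustrative only):

```python
def valid_for_ap(box, minimum_height=4, maximum_height=9999,
                 minimum_width=4, maximum_width=9999):
    # box is [x1, y1, x2, y2] in pixels.
    w, h = box[2] - box[0], box[3] - box[1]
    return minimum_width <= w <= maximum_width and minimum_height <= h <= maximum_height

print(valid_for_ap([10, 10, 60, 40]))   # True: a 50x30 box passes
print(valid_for_ap([10, 10, 12, 12]))   # False: a 2x2 box is below the minimums
```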
Dataloader
The dataloader defines the path to the data you want to train on and the class mapping for classes in the dataset that the network is to be trained for. The following is an example dataset_config element:
dataset_config {
data_sources: {
tfrecords_path: "<path to the training tfrecords root/tfrecords train pattern>"
image_directory_path: "<path to the training data source>"
}
image_extension: "jpg"
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "automobile"
value: "car"
}
target_class_mapping {
key: "heavy_truck"
value: "car"
}
target_class_mapping {
key: "person"
value: "pedestrian"
}
target_class_mapping {
key: "rider"
value: "cyclist"
}
validation_fold: 0
}
In this example the tfrecords are assumed to be multi-fold, and the fold number to validate on is defined. However, evaluation doesn't necessarily have to be run on a split of the training set. Many ML engineers choose to evaluate the model on a well-chosen evaluation dataset that is exclusive of the training dataset. If you prefer to run evaluation on a different validation dataset as opposed to a split of the training dataset, convert this dataset into tfrecords as well, using the tlt-dataset-convert tool as mentioned here, and use the validation_data_source field in the dataset_config to define it. In this case, do not forget to remove the validation_fold field from the spec. When generating the TFRecords for evaluation using the validation_data_source field, review the notes here.
validation_data_source: {
tfrecords_path: "<path to tfrecords to validate on>/<tfrecords validation pattern>"
image_directory_path: "<path to validation data source>"
}
The parameters in dataset_config are defined as follows:
- data_sources: Captures the path to the TFrecords to train on. This field contains 2 parameters:
  - tfrecords_path: Path to the individual TFrecords files. This path follows the UNIX style pathname pattern extension, so a common pathname pattern that captures all the tfrecords files in that directory can be used.
  - image_directory_path: Path to the training data root from which the tfrecords were generated.
- image_extension: Extension of the images to be used.
- target_class_mapping: This parameter maps the class names in the tfrecords to the target classes to be trained in the network. An element is defined for every source-class to target-class mapping. This field was included with the intention of grouping similar class objects under one umbrella; for example, car, van, heavy_truck, etc. may be grouped under automobile. The "key" field is the value of the class name in the tfrecords file, and the "value" field corresponds to the value that the network is expected to learn.
- validation_fold: In the case of an n-fold tfrecords, define the index of the fold to use for validation. For sequence-wise validation, choose the validation fold in the range [0, N-1]. For random split partitioning, force the validation fold index to 0, as the tfrecord is just 2-fold.
The class name keys in the target_class_mapping must be identical to those shown in the dataset converter log, so that the correct classes are picked up for training.
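As an illustration of the mapping mechanics (a hypothetical sketch; the real mapping is applied inside the TLT dataloader, and how unmapped classes are handled is an assumption here):

```python
# Hypothetical illustration of how target_class_mapping collapses source
# classes into the classes the network trains on (names from the sample above).
target_class_mapping = {
    "car": "car",
    "automobile": "car",
    "heavy_truck": "car",
    "person": "pedestrian",
    "rider": "cyclist",
}

def map_labels(frame_labels):
    # Rename mapped classes; in this sketch, unmapped classes are simply dropped.
    return [dict(obj, name=target_class_mapping[obj["name"]])
            for obj in frame_labels if obj["name"] in target_class_mapping]

print(map_labels([{"name": "automobile", "bbox": [0, 0, 10, 10]},
                  {"name": "tree", "bbox": [5, 5, 8, 9]}]))
# [{'name': 'car', 'bbox': [0, 0, 10, 10]}]
```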
Specification File for Inference
This spec file configures the tlt-infer tool of DetectNet_v2 to generate valid bbox predictions. The inference tool consists of 2 blocks, namely the inferencer and the bbox handler. The inferencer instantiates the model object and the preprocessing pipe, while the bbox handler handles the post-processing, the rendering of bounding boxes, and the serialization to KITTI format output labels.
Inferencer
The inferencer instantiates a model object that generates the raw predictions from the trained model. The model may be defined to run inference in the TLT backend or the TensorRT backend. A sample inferencer_config element for the inferencer spec is defined here:
inferencer_config{
# defining target class names for the experiment.
# Note: This must be mentioned in order of the networks classes.
target_classes: "car"
target_classes: "cyclist"
target_classes: "pedestrian"
# Inference dimensions.
image_width: 1248
image_height: 384
# Must match what the model was trained for.
image_channels: 3
batch_size: 16
gpu_index: 0
# model handler config
tensorrt_config{
parser: ETLT
etlt_model: "/path/to/model.etlt"
backend_data_type: INT8
save_engine: true
trt_engine: "/path/to/trt/engine/file"
calibrator_config{
calibration_cache: "/path/to/calibration/cache"
n_batches: 10
batch_size: 16
}
}
}
The inferencer_config parameters are explained in the table below.
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| target_classes | String (repeated) | None | The names of the target classes the model should output. For a multi-class model this parameter is repeated N times. The number of classes must match the model, and the order must be the same as that of the classes in the cost_function_config of the training config file. | For example, for the 3-class kitti model: car, cyclist, pedestrian |
| batch_size | int | 1 | The number of images per batch of inference. | Max number of images that can fit in 1 GPU |
| image_height | int | 384 | The height of the image in pixels at which the model will be inferred. | >16 |
| image_width | int | 1248 | The width of the image in pixels at which the model will be inferred. | >16 |
| image_channels | int | 3 | The number of channels per image. | 1, 3 |
| gpu_index | int | 0 | The index of the GPU to run inference on. This is useful only in TLT inference; for TensorRT inference, the GPU of choice defaults to 0. | |
| tensorrt_config | TensorRTConfig | None | Proto config to instantiate a TensorRT object. | |
| tlt_config | TLTConfig | None | Proto config to instantiate a TLT model object. | |
As mentioned earlier, the tlt-infer tool is capable of running inference using the native TLT backend or the TensorRT backend. These backends are configured using the tensorrt_config or the tlt_config proto element, respectively; you may use only one of the two in a single spec file. The definitions of the two model objects are:
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| parser | enum | ETLT | The TensorRT parser to be invoked. Only the ETLT parser is supported. | ETLT |
| etlt_model | string | None | Path to the exported etlt model file. | Any existing etlt file path. |
| backend_data_type | enum | FP32 | The data type of the backend TensorRT inference engine. For int8 mode, be sure to mention the calibration_cache. | FP32, FP16, INT8 |
| save_engine | bool | False | Flag to save a TensorRT engine from the input etlt file. This saves initialization time if inference needs to be run on the same etlt file again and no changes need to be made to the inferencer object. | True, False |
| trt_engine | string | None | Path to the TensorRT engine file. This acts as an I/O parameter. If the path defined here is not an engine file, the tlt-infer tool creates a new TensorRT engine from the etlt file. If an engine already exists, the tool re-instantiates the inferencer from the engine defined here. | UNIX path string |
| calibrator_config | CalibratorConfig proto | None | This is a required parameter when running in the int8 inference mode. This proto object contains the parameters used to define a calibrator object, namely calibration_cache (the path to the calibration cache file generated using tlt-export) and, as in the sample above, n_batches and batch_size. | |
TLT_Config
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| model | string | None | The path to the .tlt model file. | |
Since detectnet is a fully convolutional neural net, the model can be inferred at a different resolution than the one it was trained at. The input dims of the network will be overridden to run inference at this resolution if it differs from the training resolution. There may be some regression in accuracy when running inference at a different resolution, since the convolutional kernels don't see the object features at this shape.
Bbox Handler
The bbox handler takes care of post-processing the raw outputs from the inferencer. It performs the following steps:
- Thresholding the raw outputs to define the grid cells where detections may be present, per class.
- Reconstructing the image space coordinates from the raw coordinates of the inferencer.
- Clustering the raw thresholded predictions.
- Filtering the clustered predictions per class.
- Rendering the final bounding boxes on the image in its input dimensions, and serializing them to KITTI format metadata.
A sample bbox_handler_config element is defined below.
bbox_handler_config{
kitti_dump: true
disable_overlay: false
overlay_linewidth: 2
classwise_bbox_handler_config{
key:"car"
value: {
confidence_model: "aggregate_cov"
output_map: "car"
confidence_threshold: 0.9
bbox_color{
R: 0
G: 255
B: 0
}
clustering_config{
coverage_threshold: 0.00
dbscan_eps: 0.3
dbscan_min_samples: 0.05
minimum_bounding_box_height: 4
}
}
}
classwise_bbox_handler_config{
key:"default"
value: {
confidence_model: "aggregate_cov"
confidence_threshold: 0.9
bbox_color{
R: 255
G: 0
B: 0
}
clustering_config{
coverage_threshold: 0.00
dbscan_eps: 0.3
dbscan_min_samples: 0.05
minimum_bounding_box_height: 4
}
}
}
}
The parameters to configure the bbox handler are defined below.
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| kitti_dump | bool | false | Flag to enable saving the final output predictions per image in KITTI format. | true, false |
| disable_overlay | bool | true | Flag to disable bbox rendering per image. | true, false |
| overlay_linewidth | int | 1 | Thickness in pixels of the bbox boundaries. | >1 |
| classwise_bbox_handler_config | ClasswiseClusterConfig (repeated) | None | This is a repeated class-wise dictionary of post-processing parameters. DetectNet_v2 uses DBSCAN clustering to group raw bboxes into final predictions. For models with several output classes, it may be cumbersome to define a separate dictionary for each class; in such a situation, a default class may be used for all classes in the network. | |
The classwise_bbox_handler_config is a proto object containing several parameters to configure the clustering algorithm as well as the bbox renderer.
| Parameter | Datatype | Default/Suggested value | Description | Supported Values |
|---|---|---|---|---|
| confidence_model | string | aggregate_cov | Algorithm to compute the final confidence of the clustered bboxes. In aggregate_cov mode, the final confidence of a detection is the sum of the confidences of all candidate bboxes in a cluster. In mean_cov mode, the final confidence is the mean confidence of all bboxes in the cluster. | aggregate_cov, mean_cov |
| confidence_threshold | float | 0.9 in aggregate_cov mode; 0.1 in mean_cov mode | The threshold applied to the final aggregate confidence values to render the bboxes. | In aggregate_cov mode: may be tuned to any float value > 0.0. In mean_cov mode: 0.0 - 1.0 |
| bbox_color | BBoxColor proto object | None | RGB channel-wise color intensity per box. | R: 0 - 255, G: 0 - 255, B: 0 - 255 |
| clustering_config | ClusteringConfig | None | Proto object to configure the DBSCAN clustering algorithm, with the following sub-parameters. coverage_threshold: the threshold applied to the raw network confidence predictions as a first-stage filter. dbscan_eps (float): the search distance within which boxes are grouped into a single cluster; the smaller the number, the more individual boxes are detected, and an eps of 1.0 groups all boxes into a single cluster. dbscan_min_samples (float): the minimum total weight of the boxes in a cluster. minimum_bounding_box_height (int): the minimum height of a bbox to be clustered. | coverage_threshold: 0.005, dbscan_eps: 0.3, dbscan_min_samples: 0.05, minimum_bounding_box_height: 4 |
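The difference between the two confidence models can be shown in a few lines (illustrative only):

```python
def cluster_confidence(member_covs, mode="aggregate_cov"):
    # aggregate_cov: sum of member confidences (can exceed 1.0, which is why
    # its suggested threshold, 0.9, is higher than the mean_cov one).
    # mean_cov: mean member confidence, staying in [0, 1].
    if mode == "aggregate_cov":
        return sum(member_covs)
    if mode == "mean_cov":
        return sum(member_covs) / len(member_covs)
    raise ValueError(f"unknown confidence model: {mode}")

covs = [0.4, 0.35, 0.3]
print(cluster_confidence(covs, "aggregate_cov"))  # 1.05
print(cluster_confidence(covs, "mean_cov"))       # 0.35
```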
Specification File for FasterRCNN
The FasterRCNN spec file has two major components: network_config and training_config, explained below in detail. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message. The top-level structure of the spec file is summarized in the table that follows the sample.
Here’s a sample of the FasterRCNN spec file:
random_seed: 42
enc_key: 'tlt'
verbose: True
network_config {
input_image_config {
image_type: RGB
image_channel_order: 'bgr'
size_height_width {
height: 384
width: 1248
}
image_channel_mean {
key: 'b'
value: 103.939
}
image_channel_mean {
key: 'g'
value: 116.779
}
image_channel_mean {
key: 'r'
value: 123.68
}
image_scaling_factor: 1.0
max_objects_num_per_image: 100
}
feature_extractor: "resnet:18"
anchor_box_config {
scale: 64.0
scale: 128.0
scale: 256.0
ratio: 1.0
ratio: 0.5
ratio: 2.0
}
freeze_bn: True
freeze_blocks: 0
freeze_blocks: 1
roi_mini_batch: 256
rpn_stride: 16
conv_bn_share_bias: False
roi_pooling_config {
pool_size: 7
pool_size_2x: False
}
all_projections: True
use_pooling: False
}
training_config {
kitti_data_config {
data_sources: {
tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/kitti_trainval*"
image_directory_path: "/workspace/tlt-experiments/data/training"
}
image_extension: 'png'
target_class_mapping {
key: 'car'
value: 'car'
}
target_class_mapping {
key: 'van'
value: 'car'
}
target_class_mapping {
key: 'pedestrian'
value: 'person'
}
target_class_mapping {
key: 'person_sitting'
value: 'person'
}
target_class_mapping {
key: 'cyclist'
value: 'cyclist'
}
validation_fold: 0
}
data_augmentation {
preprocessing {
output_image_width: 1248
output_image_height: 384
output_image_channel: 3
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 0
translate_max_y: 0
}
color_augmentation {
hue_rotation_max: 0.0
saturation_shift_max: 0.0
contrast_scale_max: 0.0
contrast_center: 0.5
}
}
enable_augmentation: True
batch_size_per_gpu: 16
num_epochs: 12
pretrained_weights: "/workspace/tlt-experiments/data/faster_rcnn/resnet18.h5"
#resume_from_model: "/workspace/tlt-experiments/data/faster_rcnn/resnet18.epoch2.tlt"
#retrain_pruned_model: "/workspace/tlt-experiments/data/faster_rcnn/model_1_pruned.tlt"
output_model: "/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet18.tlt"
rpn_min_overlap: 0.3
rpn_max_overlap: 0.7
classifier_min_overlap: 0.0
classifier_max_overlap: 0.5
gt_as_roi: False
std_scaling: 1.0
classifier_regr_std {
key: 'x'
value: 10.0
}
classifier_regr_std {
key: 'y'
value: 10.0
}
classifier_regr_std {
key: 'w'
value: 5.0
}
classifier_regr_std {
key: 'h'
value: 5.0
}
rpn_mini_batch: 256
rpn_pre_nms_top_N: 12000
rpn_nms_max_boxes: 2000
rpn_nms_overlap_threshold: 0.7
reg_config {
reg_type: 'L2'
weight_decay: 1e-4
}
optimizer {
adam {
lr: 0.00001
beta_1: 0.9
beta_2: 0.999
decay: 0.0
}
}
lr_scheduler {
step {
base_lr: 0.00016
gamma: 1.0
step_size: 30
}
}
lambda_rpn_regr: 1.0
lambda_rpn_class: 1.0
lambda_cls_regr: 1.0
lambda_cls_class: 1.0
inference_config {
images_dir: '/workspace/tlt-experiments/data/testing/image_2'
model: '/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet18.epoch12.tlt'
detection_image_output_dir: '/workspace/tlt-experiments/data/faster_rcnn/inference_results_imgs'
labels_dump_dir: '/workspace/tlt-experiments/data/faster_rcnn/inference_dump_labels'
rpn_pre_nms_top_N: 6000
rpn_nms_max_boxes: 300
rpn_nms_overlap_threshold: 0.7
bbox_visualize_threshold: 0.6
classifier_nms_max_boxes: 300
classifier_nms_overlap_threshold: 0.3
}
evaluation_config {
model: '/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet18.epoch12.tlt'
labels_dump_dir: '/workspace/tlt-experiments/data/faster_rcnn/test_dump_labels'
rpn_pre_nms_top_N: 6000
rpn_nms_max_boxes: 300
rpn_nms_overlap_threshold: 0.7
classifier_nms_max_boxes: 300
classifier_nms_overlap_threshold: 0.3
object_confidence_thres: 0.0001
use_voc07_11point_metric: False
}
}
| Parameter | Datatype | Default | Description | Supported Values |
|---|---|---|---|---|
| random_seed | unsigned int | 42 | The random seed for the experiment. | |
| enc_key | str | | The encoding and decoding key for the TLT models; can be overridden by the command line arguments of tlt-train, tlt-evaluate and tlt-infer for FasterRCNN. | Should not be empty |
| verbose | Boolean (True or False) | False | Controls the logging level during the experiments; more logs are printed if True. | True/False |
| network_config | message | | The architecture of the model and its input format. | |
| training_config | message | | The configurations for the training, evaluation and inference for this experiment. | |
Network Config
The network configuration (network_config) defines the model structure and its input format. This model is used for training, evaluation and inference. A detailed description is summarized in the table below.
| Field | Description | Data Type and Constraints | Recommended/Typical Value |
|---|---|---|---|
| input_image_config | Defines the input image format, including the image channel number, channel order, width and height, and the preprocessing (subtract per-channel mean and divide by a scaling factor) applied before feeding the input to the model. See below for details. | message | |
| input_image_config.image_type | The image type; can be either RGB or gray-scale. | enum type; either RGB or GRAYSCALE | RGB |
| input_image_config.image_channel_order | The image channel order. | str type. If image_type is RGB, 'rgb' or 'bgr' is valid. If image_type is GRAYSCALE, only 'l' is valid. | 'bgr' |
| input_image_config.size_height_width | The height and width as the input dimensions of the model. | message | |
| input_image_config.image_channel_mean | Per-channel mean value to subtract for image preprocessing. | map (dict) type from channel names to the corresponding mean values. Each mean value should be non-negative. | |
| input_image_config.image_scaling_factor | Scaling factor to divide by for image preprocessing. | float type; should be a positive scalar. | 1.0 |
| input_image_config.max_objects_num_per_image | The maximum number of objects in an image for the dataset. The number of objects usually differs between images, but there is a maximum; set this field to be no less than that maximum. This field is used to pad the object count to a common value, which makes multi-batch and multi-GPU training of FasterRCNN possible. | unsigned int; should be positive. | 100 |
| feature_extractor | The feature extractor (backbone) for the FasterRCNN model. FasterRCNN supports 12 backbones. Note: FasterRCNN actually supports another backbone: vgg. This is a VGG16 backbone exactly the same as in Keras applications. Layer names matter when loading pretrained weights: if you want to load pretrained weights with the same names as VGG16 in Keras applications, use this backbone. Since this duplicates the vgg:16 backbone, consider using vgg:16 for production. The only use case for the vgg backbone is reproducing the original Caffe implementation of VGG16 FasterRCNN that uses ImageNet weights as pretrained weights. | str type. The architecture can be ResNet, VGG, GoogLeNet, MobileNet or DarkNet. Each architecture can have different numbers of layers or versions: ResNet series: resnet:10, resnet:18, resnet:34, resnet:50, resnet:101. VGG series: vgg:16, vgg:19. GoogLeNet: googlenet. MobileNet series: mobilenet_v1, mobilenet_v2. DarkNet: darknet:19, darknet:53. A notational convention applies: for models that can have different numbers of layers, use a colon followed by the layer number as the suffix of the model name, e.g., resnet:<layer_number>. | |
| anchor_box_config | Defines the set of anchor box sizes and aspect ratios in the FasterRCNN model. | Message type that contains two sub-fields: scale and ratio. Each is a list of floating point numbers. The scale field defines the absolute anchor sizes in pixels (at input image resolution); the ratio field defines the aspect ratios of each anchor. | |
| freeze_bn | Whether or not to freeze all the BatchNormalization layers in the model during training; a common trick when training a FasterRCNN model. Note: freezing a BatchNormalization layer only freezes its moving mean and moving variance; the gamma and beta parameters remain trainable. | Boolean (True or False) | If you train with a small batch size, usually you need to set this field to True and use good pretrained weights to make the training converge well. If you train with a large batch size (e.g., >=16), you can set it to False and let the BatchNormalization layers calculate the moving mean and moving variance themselves. |
| freeze_blocks | The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks to make the training more stable and/or easier to converge. The definition of a block is heuristic for a specific architecture (for example, by stride or by logical blocks in the model). However, the block ID numbers identify the blocks in the model in sequential order, so you don't have to know their exact locations when training. A general principle: the smaller the block ID, the closer it is to the model input; the larger the block ID, the closer it is to the model output. You can divide the whole model into several blocks and optionally freeze a subset of them. Note that for FasterRCNN you can only freeze the blocks before the ROI pooling layer; any layer after the ROI pooling layer will not be frozen in any way. The number of blocks and the block IDs differ between backbones. | list (repeated integers). ResNet series: any subset of [0, 1, 2, 3] (inclusive). VGG series: any subset of [1, 2, 3, 4, 5] (inclusive). GoogLeNet: any subset of [0, 1, 2, 3, 4, 5, 6, 7] (inclusive). MobileNet V1: any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] (inclusive). MobileNet V2: any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] (inclusive). DarkNet 19 and DarkNet 53: any subset of [0, 1, 2, 3, 4, 5] (inclusive). | Leave it empty ([]) |
| roi_mini_batch | The batch size used to train the RCNN after ROI pooling. | A positive integer; usually 128, 256, etc. | 256 |
| rpn_stride | The cumulative stride from the model input to the RPN. This value is fixed (16) in the current implementation. | positive integer | 16 |
| conv_bn_share_bias | A Boolean value to indicate whether or not to share the bias of a convolution layer and the BatchNormalization (BN) layer immediately after it. Usually the bias is shared to reduce the model size and avoid parameter redundancy. When using pretrained weights, make sure the value of this parameter matches the actual configuration of the pretrained weights, otherwise an error is raised when loading them. | Boolean (True or False) | True |
| roi_pooling_config | The configuration for the ROI pooling layer. | Message type that contains two sub-fields: pool_size and pool_size_2x. See below for details. | |
| roi_pooling_config.pool_size | The output spatial size (height and width) of ROIs. Only square spatial sizes are supported currently, i.e., height = width. | unsigned int; should be positive. | 7 |
| roi_pooling_config.pool_size_2x | A Boolean value to indicate whether to do ROI pooling at 2 * pool_size followed by a 2 x 2 pooling operation, or ROI pooling directly at pool_size with no pooling operation. E.g., if pool_size = 7 and pool_size_2x = True, ROI pooling produces an output with a spatial size of 14 x 14, followed by a 2 x 2 pooling operation that gives the final output tensor. | Boolean (True or False) | |
| all_projections | Only useful for models that have shortcuts in them: the ResNet series and MobileNet V2. If all_projections = True, all pass-through shortcuts are replaced by a projection layer with the same number of output channels. | Boolean (True or False) | True |
| use_pooling | Only useful for the VGG and ResNet series. When use_pooling = True, pooling is used in the model as in the original implementation; otherwise strided convolutions replace the pooling operations. To improve the inference FPS (frames per second), try setting use_pooling = False. | Boolean (True or False) | False |
Training Configuration
The training configuration (training_config) defines the parameters needed for training, evaluation and inference. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
kitti_data_config |
The dataset used for training, evaluation and inference. |
Message type. It has the same structure as the dataset_config message in DetectNet_v2 spec file. Refer to the DetectNet_v2 dataset_config documentation for the details. |
|
data_augmentation |
Defines the data augmentation pipeline during training. |
Message type. It has the same structure as the data_augmentation message in the DetectNet_v2 spec file. Refer to the DetectNet_v2 data_augmentation documentation for the details. |
|
enable_augmentation |
Whether or not to enable the data augmentation during training. If this parameter is False, the training will not have any data augmentation operation even if you have already defined the data augmentation pipeline in the data_augmentation field in spec file. This feature is mostly used for debugging of the data augmentation pipeline. |
Boolean(True or False) |
True |
batch_size_pe_gpu |
The training batch size on each GPU device. The actual total batch size will be batch_size_per_gpu multiplied by the number of GPUs in a multi-gpu training scenario. |
unsigned int, positive. |
Change the batch_size_per _gpu to adapt the capability of your GPU device. |
num_epochs |
The number of epochs for the training. |
unsigned int, positive. |
20 |
pretrained_weights |
Absolute path to the pretrained weights file used to initialize the training model. The pretrained weights file can be either a Keras weights file (with .h5 suffix), a Keras model file (with .hdf5 suffix) or a TLT model (with .tlt suffix, trained by TLT). If the file is a model file (.tlt or .hdf5), TLT will extract the weights from it and then load the weights for initialization. Files with any other formats are not supported as pretrained weights. Note that the pretrained weights file is agnostic to the input dimensions of the FasterRCNN model so the model you are training can have different input dimensions from the input dimensions specified in the pretrained weights. Normally, the pretrained weights file is only useful during the initial training phase in a TLT workflow. |
Str type. Can be left empty. In that case, the FasterRCNN model will use random initialization for its weights. Usually, FasterRCNN model needs a pretrained weights for good convergence of training. |
|
resume_from_model |
Absolute path to the checkpoint .tlt model that you want to resume the training from. This is useful in some cases when the training process is interrupted for some reason and you don’t want to redo the training from epoch 0(or 1 in 1-based indexing). In that case, you can use the last checkpoint as the model you will resume from, to save the training time. |
Str type. Leave it empty when you are not resuming the training, i.e., train from epoch 0. |
|
retrain_pruned_model |
Path to the pruned model that you can load and do the retraining. This is used in the retraining phase in a TLT workflow. The model is the output model of the pruning phase. |
Str type. Leave it empty when you are not in the retraining phase. |
|
output_model |
Absolute path to the output .tlt model that the training/retraining will save. Note that this path is not the actual path of the .tlt models. For example, if the output_model is ‘/workspace/tlt_training/resnet18.tlt’, then the actual output model path will be ‘/workspace/tlt_training/resnet18 .epoch<k>.tlt’where <k> denotes the epoch number of during training. In this way, you can distinguish the output models for different epochs. Here, the epoch number <k> is a 1-based index. |
Str type. Cannot be empty. |
|
checkpoint_interval |
The epoch interval that controls how frequently TLT saves checkpoints during training. TLT saves a checkpoint every checkpoint_interval epochs (1-based index). For example, if num_epochs is 12 and checkpoint_interval is 3, then TLT will save a checkpoint at the end of epochs 3, 6, 9, and 12. If this parameter is not specified, it defaults to checkpoint_interval=1. |
unsigned int, can be omitted (defaults to 1). |
|
rpn_min_overlap |
The lower IoU threshold used to map the anchor boxes to ground truth boxes. If the IoU of an anchor box with every ground truth box is below this threshold, the anchor box is treated as a negative anchor box. |
Float type, scalar. Should be in the interval (0, 1). |
0.3 |
rpn_max_overlap |
The upper IoU threshold used to map the anchor boxes to ground truth boxes. If the IoU of an anchor box with at least one ground truth box is above this threshold, the anchor box is treated as a positive anchor box. |
Float type, scalar. Should be in the interval (0, 1) and greater than rpn_min_overlap. |
0.7 |
classifier_min_overlap |
The lower IoU threshold used to generate proposal targets. If the IoU of an ROI and a ground truth box is above this threshold and below classifier_max_overlap, this ROI is regarded as a negative ROI (background) when training the RCNN. |
floating-point number, scalar. Should be in the interval [0, 1). |
0.0 |
classifier_max_overlap |
Similar to classifier_min_overlap. If the IoU of an ROI and a ground truth box is above this threshold, the ROI is regarded as a positive ROI, and this ground truth box is treated as the target (ground truth) of the ROI when training the RCNN. |
Float type, scalar. Should be in the interval (0, 1) and greater than classifier_min _overlap. |
0.5 |
gt_as_roi |
A Boolean value to specify whether or not to include the ground truth boxes into the positive ROI to train the RCNN. |
Boolean(True or False) |
False |
std_scaling |
The scaling factor to multiply the RPN regression loss by when training the RPN. |
Float type, should be positive. |
1.0 |
classifier_regr_std |
The scaling factor to divide the RCNN regression loss by when training the RCNN. |
map(dict) type. Map from ‘x’, ‘y’, ‘w’, ‘h’ to its corresponding scaling factor. Each of the scaling factors should be a positive float number. |
|
rpn_mini_batch |
The anchor batch size used to train the RPN. |
unsigned int, positive. |
256 |
rpn_pre_nms_top_N |
The number of boxes to be retained before the NMS in Proposal layer. |
unsigned int, positive. |
|
rpn_nms_max_boxes |
The number of boxes to be retained after the NMS in Proposal layer. |
unsigned int, positive and should be no greater than the rpn_pre_nms_top_N |
|
rpn_nms_overlap_threshold |
The IoU threshold for the NMS in Proposal layer. |
Float type, should be in the interval (0, 1). |
0.7 |
reg_config |
Regularizer configuration of the model weights, including the regularizer type and weight decay. |
message that contains two sub-fields: reg_type and weight_decay. See below for details. |
|
reg_config.reg_type |
The regularizer type. Can be either ‘L1’ (L1 regularizer), ‘L2’ (L2 regularizer), or ‘none’ (no regularizer). |
Str type. Should be one of the below: ‘L1’, ‘L2’, or ‘none’. |
|
reg_config.weight_decay |
The weight decay for the regularizer. |
Float type, should be a positive scalar. Usually this number should be smaller than 1.0 |
|
optimizer |
The optimizer used for training. Can be either SGD, RMSProp, or Adam. |
oneof message type that can be one of sgd message, rmsprop message or adam message. See below for the details of each message type. |
|
adam |
Adam optimizer. |
message type that contains the 4 sub-fields: lr, beta_1, beta_2, and epsilon. See the Keras 2.2.4 documentation for the meaning of each field. Note: When the learning rate scheduler is enabled, the learning rate in the optimizer is overridden by the learning rate scheduler and the one specified in the optimizer (lr) is irrelevant. |
|
sgd |
SGD optimizer |
message type that contains the following fields: lr, momentum, decay and nesterov. See the Keras 2.2.4 documentation for the meaning of each field. Note: When the learning rate scheduler is enabled, the learning rate in the optimizer is overridden by the learning rate scheduler and the one specified in the optimizer (lr) is irrelevant. |
|
rmsprop |
RMSProp optimizer |
message type that contains only one field: lr (learning rate). Note: When the learning rate scheduler is enabled, the learning rate in the optimizer is overridden by the learning rate scheduler and the one specified in the optimizer (lr) is irrelevant. |
|
lr_scheduler |
The learning rate scheduler. |
message type that can be either step or soft_start. The step scheduler is the same as the step scheduler in classification, while soft_start is the same as soft_anneal in classification. Refer to the classification spec file documentation for details. |
|
lambda_rpn_regr |
The loss scaling factor for RPN deltas regression loss. |
Float type. Should be a positive scalar. |
1.0 |
lambda_rpn_class |
The loss scaling factor for RPN classification loss. |
Float type. Should be a positive scalar. |
1.0 |
lambda_cls_regr |
The loss scaling factor for RCNN deltas regression loss. |
Float type. Should be a positive scalar. |
1.0 |
lambda_cls_class |
The loss scaling factor for RCNN classification loss. |
Float type. Should be a positive scalar. |
1.0 |
inference_config |
The inference configuration for tlt-infer. |
message type. See below for details. |
|
inference_config.images_dir |
The absolute path to the image directory that tlt-infer will do inference on. |
Str type. Should be a valid Unix path. |
|
inference_config.model |
The absolute path to the .tlt model that tlt-infer will run inference on. |
Str type. Should be a valid Unix path. |
|
inference_config.detection _image_output_dir |
The absolute path to the output image directory for the detection results. If the path doesn’t exist, tlt-infer will create it. If the directory already contains images, tlt-infer will overwrite them. |
Str type. Should be a valid Unix path. |
|
inference_config.labels _dump_dir |
The absolute path to the directory where the detected labels are saved in KITTI format. tlt-infer will create it if it doesn’t exist beforehand. If it already contains label files, tlt-infer will overwrite them. |
Str type. Should be a valid Unix path. |
|
inference_config.rpn _pre_nms_top_N |
The number of top ROIs to be retained before the NMS in Proposal layer. |
unsigned int, positive. |
|
inference_config.rpn _nms_max_boxes |
The number of top ROIs to be retained after the NMS in Proposal layer. |
unsigned int, positive. |
|
inference_config.rpn_nms _overlap_threshold |
The IoU threshold for the NMS in Proposal layer. |
Float type, should be in the interval (0, 1). |
0.7 |
inference_config.bbox _visualize_threshold |
The confidence threshold for the bounding boxes to be regarded as valid detected objects in the images. |
Float type, should be in the interval (0, 1). |
0.6 |
inference_config.classifier _nms_max_boxes |
The number of bounding boxes to be retained after the NMS in RCNN. |
unsigned int, positive. |
300 |
inference_config.classifier _nms_overlap _threshold |
The IoU threshold for the NMS in RCNN. |
Float type. Should be in the interval (0, 1). |
0.3 |
inference_config.bbox _caption_on |
Whether or not to show captions for each bounding box in the detected images. The captions include the class name and confidence probability value for each detected object. |
Boolean(True or False) |
False |
inference_config.trt _inference |
The TensorRT inference configuration for tlt-infer in TensorRT backend mode. |
Message type. This field can be omitted, in which case tlt-infer will use TLT as the backend for inference. See below for details. |
|
inference_config.trt _inference.trt_infer_model |
The model configuration for tlt-infer in TensorRT backend mode. It is a oneof wrapper of the two possible model configurations: trt_engine and etlt_model. Only one of them can be specified when running tlt-infer with the TensorRT backend. If trt_engine is provided, tlt-infer will run TensorRT inference on the TensorRT engine file. If an .etlt model is provided, tlt-infer will run TensorRT inference on the .etlt model. In INT8 mode, a calibration cache file should also be provided along with the .etlt model. |
message type, oneof wrapper of trt_engine and etlt_model. See below for details. |
|
inference_config.trt _inference.trt_engine |
The absolute path to the TensorRT engine file for tlt-infer in TensorRT backend mode. The engine should be generated via the tlt-export or tlt-converter command line tools. |
Str type. |
|
inference_config.trt_inference .etlt_model |
The configuration for the .etlt model and the calibration cache (only needed in INT8 mode) for tlt-infer in TensorRT backend mode. The .etlt model (and calibration cache, if needed) should be generated via the tlt-export command line tool. |
message type that contains two string type sub-fields: model and calibration_cache. See below for details. |
|
inference _config.trt _inference.etlt_model.model |
The absolute path to the .etlt model that tlt-infer will use to run TensorRT based inference. |
Str type. |
|
inference_config.trt _inference.etlt _model.calibration _cache |
The path to the TensorRT INT8 calibration cache file when tlt-infer runs with an .etlt model in INT8 mode. |
Str type. |
|
inference_config.trt _inference.trt_data_type |
The TensorRT inference data type when tlt-infer runs with the TensorRT backend. The data type is only relevant when running on an .etlt model; in that case, if the data type is ‘int8’, a calibration cache file should also be provided as mentioned above. When running on a TensorRT engine file directly, this field is ignored since the engine file already contains the data type information. |
String type. Valid values are ‘fp32’, ‘fp16’ and ‘int8’. |
‘fp32’ |
evaluation_config |
The configuration for the tlt-evaluate in FasterRCNN. |
message type that contains the below fields. See below for details. |
|
evaluation_config.model |
The absolute path to the .tlt model that tlt-evaluate will do evaluation for. |
Str type. Should be a valid Unix path. |
|
evaluation_config.labels _dump_dir |
The absolute path to the directory where tlt-evaluate will save the detected labels. If it doesn’t exist, tlt-evaluate will create it. If it already contains label files, tlt-evaluate will overwrite them. |
Str type. Should be a valid Unix path. |
|
evaluation_config.rpn _pre_nms_top_N |
The number of top ROIs to be retained before the NMS in Proposal layer in tlt-evaluate. |
unsigned int, positive. |
|
evaluation _config.rpn _nms_max_boxes |
The number of top ROIs to be retained after the NMS in Proposal layer in tlt-evaluate. |
unsigned int, positive. Should be no greater than the evaluation_config.rpn _pre_nms_top_N. |
|
evaluation_config.rpn _nms_iou_threshold |
The IoU threshold for the NMS in Proposal layer in tlt-evaluate. |
Float type in the interval (0, 1). |
0.7 |
evaluation_config .classifier_nms_max _boxes |
The number of top bounding boxes to be retained after the NMS in RCNN in tlt-evaluate. |
Unsigned int, positive. |
|
evaluation_config.classifier _nms_overlap_threshold |
The IoU threshold for the NMS in RCNN in tlt-evaluate. |
Float type in the interval (0, 1). |
0.3 |
evaluation_config.object _confidence_thres |
The confidence threshold above which a bounding box is regarded as a valid object detected by FasterRCNN. Usually you can use a small threshold to improve recall and mAP, as in many object detection challenges. |
Float type in the interval (0, 1). |
0.0001 |
evaluation_config.use_voc07 _11point_metric |
Whether to use the VOC2007 11-point mAP calculation method when computing the mAP of the FasterRCNN model on a specific dataset. If this is False, the VOC2012 metric is used instead. |
Boolean (True or False) |
False |
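To show how the fields in this table fit together, below is a minimal, illustrative sketch of the FasterRCNN training-related configuration assembled from the field names documented above. All paths and numeric values are placeholders rather than recommendations, the enclosing message and required companion sections are omitted, and this is not a complete spec file.
# Sketch only: field names are taken from the table above; values are placeholders.
batch_size_per_gpu: 8
num_epochs: 12
pretrained_weights: "/path/to/pretrained_weights.hdf5"
output_model: "/path/to/output/frcnn_resnet18.tlt"
checkpoint_interval: 1
rpn_min_overlap: 0.3
rpn_max_overlap: 0.7
classifier_min_overlap: 0.0
classifier_max_overlap: 0.5
gt_as_roi: False
std_scaling: 1.0
rpn_mini_batch: 256
rpn_pre_nms_top_N: 12000
rpn_nms_max_boxes: 2000
rpn_nms_overlap_threshold: 0.7
lambda_rpn_regr: 1.0
lambda_rpn_class: 1.0
lambda_cls_regr: 1.0
lambda_cls_class: 1.0
reg_config {
  reg_type: 'L2'
  weight_decay: 1e-5
}
optimizer {
  sgd {
    lr: 0.02
    momentum: 0.9
    decay: 0.0
    nesterov: False
  }
}
inference_config {
  images_dir: "/path/to/inference/images"
  model: "/path/to/model.tlt"
  detection_image_output_dir: "/path/to/output/images"
  labels_dump_dir: "/path/to/output/labels"
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_overlap_threshold: 0.7
  bbox_visualize_threshold: 0.6
  classifier_nms_max_boxes: 300
  classifier_nms_overlap_threshold: 0.3
  bbox_caption_on: False
}
evaluation_config {
  model: "/path/to/model.tlt"
  labels_dump_dir: "/path/to/eval/labels"
  rpn_pre_nms_top_N: 6000
  rpn_nms_max_boxes: 300
  rpn_nms_iou_threshold: 0.7
  classifier_nms_max_boxes: 300
  classifier_nms_overlap_threshold: 0.3
  object_confidence_thres: 0.0001
  use_voc07_11point_metric: False
}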
Here is a sample of the SSD spec file. It has 6 major components: ssd_config
,
training_config
, eval_config
, nms_config
, augmentation_config
,
and dataset_config
. The format of the spec file is a protobuf text (prototxt) message
and each of its fields can be either a basic data type or a nested message.
random_seed: 42
ssd_config {
aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
two_boxes_for_ar1: true
clip_boxes: false
loss_loc_weight: 0.8
focal_loss_alpha: 0.25
focal_loss_gamma: 2.0
variances: "[0.1, 0.1, 0.2, 0.2]"
arch: "resnet"
nlayers: 18
freeze_bn: false
freeze_blocks: 0
}
training_config {
batch_size_per_gpu: 16
num_epochs: 80
enable_qat: false
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-5
max_learning_rate: 2e-2
soft_start: 0.15
annealing: 0.8
}
}
regularizer {
type: L1
weight: 3e-5
}
}
eval_config {
validation_period_during_training: 10
average_precision_mode: SAMPLE
batch_size: 16
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.01
clustering_iou_threshold: 0.6
top_k: 200
}
augmentation_config {
preprocessing {
output_image_width: 1248
output_image_height: 384
output_image_channel: 3
crop_right: 1248
crop_bottom: 384
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 0.7
zoom_max: 1.8
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*"
image_directory_path: "/workspace/tlt-experiments/data/training"
}
image_extension: "png"
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "pedestrian"
value: "pedestrian"
}
target_class_mapping {
key: "cyclist"
value: "cyclist"
}
target_class_mapping {
key: "van"
value: "car"
}
target_class_mapping {
key: "person_sitting"
value: "pedestrian"
}
validation_fold: 0
}
The top level structure of the spec file is summarized in the table below.
Training Config
The training configuration(training_config
) defines the parameters needed for the training,
evaluation and inference. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
batch_size_per_gpu |
The batch size for each GPU, so the effective batch size is batch_size_per_gpu * num_gpus |
Unsigned int, positive |
|
num_epochs |
The number of epochs to train the network. |
Unsigned int, positive |
|
enable_qat |
Whether to use quantization aware training |
Boolean |
|
learning_rate |
Only soft_start_annealing_schedule is supported, with the nested parameters min_learning_rate, max_learning_rate, soft_start, and annealing, as shown in the sample above.
|
Message type |
|
regularizer |
This parameter configures the regularizer to be used while training and contains the nested parameters type and weight, as shown in the sample above.
|
Message type |
L1 (Note: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more prunable.) |
Evaluation Config
The evaluation configuration (eval_config
) defines the parameters needed for the evaluation
either during training or standalone. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
validation_period_during_training |
The interval, in training epochs, at which validation is run. |
Unsigned int, positive |
10 |
average_precision_mode |
Average Precision (AP) calculation mode can be either SAMPLE or INTEGRATE. SAMPLE is used as VOC metrics for VOC 2009 or before. INTEGRATE is used for VOC 2010 or after that. |
ENUM type ( SAMPLE or INTEGRATE) |
SAMPLE |
matching_iou_threshold |
The lowest IoU between a predicted box and a ground truth box for them to be considered a match. |
float |
0.5 |
NMS Config
The NMS configuration (nms_config) defines the parameters needed for the NMS postprocessing. NMS config applies to the NMS layer of the model in training, validation, evaluation, inference and export. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
confidence_threshold |
Boxes with a confidence score less than confidence_threshold are discarded before applying NMS |
float |
0.01 |
clustering_iou_threshold |
IOU threshold below which boxes will go through NMS process |
float |
0.6 |
top_k |
top_k boxes will be output after the NMS Keras layer. If the number of valid boxes is less than top_k, the returned array will be padded with boxes whose confidence score is 0. |
Unsigned int |
200 |
Augmentation Config
The augmentation configuration (augmentation_config
) defines the parameters needed
for data augmentation. The configuration is shared with DetectNet_v2. See
Augmentation Module for more information.
Dataset Config
The dataset configuration (dataset_config
) defines the parameters needed for the data
loader. The configuration is shared with DetectNet_v2. See Dataloader for
more information.
SSD config
The SSD configuration (ssd_config
) defines the parameters needed for building the SSD model.
Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
aspect_ratios_global |
Anchor boxes of aspect ratios defined in aspect_ratios_global will be generated for each feature layer used for prediction. Note: Only one of aspect_ratios_global or aspect_ratios is required. |
string |
“[1.0, 2.0, 0.5, 3.0, 0.33]” |
aspect_ratios |
The length of the outer list must be equivalent to the number of feature layers used for anchor box generation. And the i-th layer will have anchor boxes with aspect ratios defined in aspect_ratios[i]. Note: Only one of aspect_ratios_global or aspect_ratios is required. |
string |
“[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]” |
two_boxes_for_ar1 |
This setting is only relevant for layers that have 1.0 as the aspect ratio. If two_boxes_for_ar1 is true, two boxes will be generated with an aspect ratio of 1. One whose scale is the scale for this layer and the other one whose scale is the geometric mean of the scale for this layer and the scale for the next layer. |
Boolean |
True |
clip_boxes |
If true, all corner anchor boxes will be truncated so they are fully inside the feature images. |
Boolean |
False |
scales |
scales is a list of positive floats containing scaling factors per convolutional predictor layer. This list must be one element longer than the number of predictor layers, so if two_boxes_for_ar1 is true, the second aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last element in this list, each positive float is the scaling factor for boxes in that layer. For example, if for one layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer (the first aspect ratio 1 box if two_boxes_for_ar1 is true) will have its height and width as 0.1*min(img_h, img_w). min_scale and max_scale are two positive floats. If both of them appear in the config, the program can automatically generate the scales by evenly splitting the space between min_scale and max_scale. |
string |
“[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]” |
min_scale/max_scale |
If both appear in the config, scales will be generated evenly by splitting the space between min_scale and max_scale. |
float |
|
loss_loc_weight |
This is a positive float controlling how much location regression loss should contribute to the final loss. The final loss is calculated as classification_loss + loss_loc_weight * loc_loss |
float |
1.0 |
focal_loss_alpha |
Alpha in the focal loss equation. |
float |
0.25 |
focal_loss_gamma |
Gamma in the focal loss equation. |
float |
2.0 |
variances |
Variances should be a list of 4 positive floats. The four floats, in order, represent variances for box center x, box center y, log box height, log box width. The box offset for box center (cx, cy) and log box size (height/width) w.r.t. anchor will be divided by their respective variance value. Therefore, larger variances result in less significant differences between two different boxes on encoded offsets. |
string |
“[0.1, 0.1, 0.2, 0.2]” |
steps |
An optional list inside quotation marks whose length is the number of feature layers for prediction. The elements should be floats or tuples/lists of two floats. Steps define how many pixels apart the anchor box center points should be. If the element is a float, both vertical and horizontal margin is the same. Otherwise, the first value is step_vertical and the second value is step_horizontal. If steps are not provided, anchor boxes will be distributed uniformly inside the image. |
string |
|
offsets |
An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. The first anchor box will have offsets[i]*steps[i] pixels margin from the left and top borders. If offsets are not provided, 0.5 will be used as default value. |
string |
|
arch |
Backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported. |
string |
resnet |
nlayers |
Number of conv layers in specific arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration and users should just delete this config from the config file. |
Unsigned int |
|
freeze_bn |
Whether to freeze all batch normalization layers during training. |
boolean |
False |
freeze_blocks |
The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks in the model to make the training more stable and/or easier to converge. The definition of a block is heuristic for a specific architecture (for example, by stride or by logical blocks in the model). However, the block ID numbers identify the blocks in the model in sequential order, so you don’t have to know the exact locations of the blocks when you train. A general principle to keep in mind: the smaller the block ID, the closer the block is to the model input; the larger the block ID, the closer it is to the model output. You can divide the whole model into several blocks and optionally freeze a subset of them. Note that for FasterRCNN you can only freeze the blocks that are before the ROI pooling layer; any layer after the ROI pooling layer will not be frozen in any case. For different backbones, the number of blocks and the block ID of each block are different, so specifying the block IDs for each backbone deserves a detailed explanation. |
list(repeated integers)
|
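The SSD sample above uses aspect_ratios_global and an explicit scales list. As the table notes, the per-layer aspect_ratios field and the min_scale/max_scale pair are the respective alternatives. Below is a hedged sketch of an ssd_config using those alternatives; the ratios and scale bounds are placeholders, not tuned values.
ssd_config {
  # Per-layer aspect ratios instead of aspect_ratios_global (placeholder values).
  aspect_ratios: "[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"
  # min_scale/max_scale instead of an explicit scales list (placeholder bounds).
  min_scale: 0.05
  max_scale: 0.95
  two_boxes_for_ar1: true
  clip_boxes: false
  loss_loc_weight: 0.8
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
}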
Below is a sample for the DSSD spec file. It has 6 major components: dssd_config
,
training_config
, eval_config
, nms_config
, augmentation_config
,
and dataset_config
. The format of the spec file is a protobuf text (prototxt) message and
each of its fields can be either a basic data type or a nested message. The top level structure
of the spec file is summarized in the table below.
random_seed: 42
dssd_config {
aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
two_boxes_for_ar1: true
clip_boxes: false
loss_loc_weight: 0.8
focal_loss_alpha: 0.25
focal_loss_gamma: 2.0
variances: "[0.1, 0.1, 0.2, 0.2]"
arch: "resnet"
nlayers: 18
pred_num_channels: 512
freeze_bn: false
freeze_blocks: 0
}
training_config {
batch_size_per_gpu: 16
num_epochs: 80
enable_qat: false
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-5
max_learning_rate: 2e-2
soft_start: 0.15
annealing: 0.8
}
}
regularizer {
type: L1
weight: 3e-5
}
}
eval_config {
validation_period_during_training: 10
average_precision_mode: SAMPLE
batch_size: 16
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.01
clustering_iou_threshold: 0.6
top_k: 200
}
augmentation_config {
preprocessing {
output_image_width: 1248
output_image_height: 384
output_image_channel: 3
crop_right: 1248
crop_bottom: 384
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 0.7
zoom_max: 1.8
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*"
image_directory_path: "/workspace/tlt-experiments/data/training"
}
image_extension: "png"
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "pedestrian"
value: "pedestrian"
}
target_class_mapping {
key: "cyclist"
value: "cyclist"
}
target_class_mapping {
key: "van"
value: "car"
}
target_class_mapping {
key: "person_sitting"
value: "pedestrian"
}
validation_fold: 0
}
Training Config
The training configuration (training_config
) defines the parameters needed for the
training, evaluation and inference. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
batch_size_per_gpu |
The batch size for each GPU, so the effective batch size is batch_size_per_gpu * num_gpus |
Unsigned int, positive |
|
num_epochs |
The number of epochs to train the network. |
Unsigned int, positive. |
|
enable_qat |
Whether to use quantization aware training |
Boolean |
|
learning_rate |
Only soft_start_annealing_schedule is supported, with the nested parameters min_learning_rate, max_learning_rate, soft_start, and annealing, as shown in the sample above.
|
Message type. |
|
regularizer |
This parameter configures the regularizer to be used while training and contains the nested parameters type and weight, as shown in the sample above.
|
Message type. |
L1 (Note: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more prunable.) |
Evaluation Config
The evaluation configuration (eval_config
) defines the parameters needed for the evaluation
either during training or standalone. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
validation_period_during_training |
The interval, in training epochs, at which validation is run. |
Unsigned int, positive |
10 |
average_precision_mode |
Average Precision (AP) calculation mode can be either SAMPLE or INTEGRATE. SAMPLE is used as VOC metrics for VOC 2009 or before. INTEGRATE is used for VOC 2010 or after that. |
ENUM type ( SAMPLE or INTEGRATE) |
SAMPLE |
matching_iou_threshold |
The lowest IoU between a predicted box and a ground truth box for them to be considered a match. |
float |
0.5 |
NMS Config
The NMS configuration (nms_config
) defines the parameters needed for the NMS postprocessing.
NMS config applies to the NMS layer of the model in training, validation, evaluation, inference
and export. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
confidence_threshold |
Boxes with a confidence score less than confidence_threshold are discarded before applying NMS |
float |
0.01 |
clustering_iou_threshold |
IOU threshold below which boxes will go through NMS process |
float |
0.6 |
top_k |
top_k boxes will be output after the NMS Keras layer. If the number of valid boxes is less than top_k, the returned array will be padded with boxes whose confidence score is 0. |
Unsigned int |
200 |
Augmentation Config
The augmentation configuration (augmentation_config
) defines the parameters needed for data
augmentation. The configuration is shared with DetectNet_v2. See Augmentation module for more
information.
Dataset Config
The dataset configuration (dataset_config
) defines the parameters needed for the data
loader. The configuration is shared with DetectNet_v2. See Dataloader for more information.
DSSD Config
The DSSD configuration (dssd_config
) defines the parameters needed for building the DSSD
model. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
aspect_ratios_global |
Anchor boxes of aspect ratios defined in aspect_ratios_global will be generated for each feature layer used for prediction. Note: Only one of aspect_ratios_global or aspect_ratios is required. |
string |
“[1.0, 2.0, 0.5, 3.0, 0.33]” |
aspect_ratios |
The length of the outer list must be equivalent to the number of feature layers used for anchor box generation. And the i-th layer will have anchor boxes with aspect ratios defined in aspect_ratios[i]. Note: Only one of aspect_ratios_global or aspect_ratios is required. |
string |
“[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]” |
two_boxes_for_ar1 |
This setting is only relevant for layers that have 1.0 as the aspect ratio. If two_boxes_for_ar1 is true, two boxes will be generated with an aspect ratio of 1. One whose scale is the scale for this layer and the other one whose scale is the geometric mean of the scale for this layer and the scale for the next layer. |
Boolean |
True |
clip_boxes |
If true, all corner anchor boxes will be truncated so they are fully inside the feature images. |
Boolean |
False |
scales |
scales is a list of positive floats containing scaling factors per convolutional predictor layer. This list must be one element longer than the number of predictor layers, so if two_boxes_for_ar1 is true, the second aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last element in this list, each positive float is the scaling factor for boxes in that layer. For example, if for one layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer (the first aspect ratio 1 box if two_boxes_for_ar1 is true) will have its height and width as 0.1*min(img_h, img_w). min_scale and max_scale are two positive floats. If both of them appear in the config, the program can automatically generate the scales by evenly splitting the space between min_scale and max_scale. |
string |
“[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]” |
min_scale/max_scale |
If both appear in the config, scales will be generated evenly by splitting the space between min_scale and max_scale. |
float |
|
loss_loc_weight |
This is a positive float controlling how much location regression loss should contribute to the final loss. The final loss is calculated as classification_loss + loss_loc_weight * loc_loss |
float |
1.0 |
focal_loss_alpha |
Alpha in the focal loss equation. |
float |
0.25 |
focal_loss_gamma |
Gamma in the focal loss equation. |
float |
2.0 |
variances |
Variances should be a list of 4 positive floats. The four floats, in order, represent variances for box center x, box center y, log box height, log box width. The box offset for box center (cx, cy) and log box size (height/width) w.r.t. anchor will be divided by their respective variance value. Therefore, larger variances result in less significant differences between two different boxes on encoded offsets. |
string |
“[0.1, 0.1, 0.2, 0.2]” |
steps |
An optional list inside quotation marks whose length is the number of feature layers for prediction. The elements should be floats or tuples/lists of two floats. Steps define how many pixels apart the anchor box center points should be. If the element is a float, both vertical and horizontal margin is the same. Otherwise, the first value is step_vertical and the second value is step_horizontal. If steps are not provided, anchor boxes will be distributed uniformly inside the image. |
string |
|
offsets |
An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. The first anchor box will have offsets[i]*steps[i] pixels margin from the left and top borders. If offsets are not provided, 0.5 will be used as default value. |
string |
|
arch |
Backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported. |
string |
resnet |
nlayers |
Number of conv layers in specific arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration and users should just delete this config from the config file. |
Unsigned int |
|
pred_num_channels |
This setting controls the number of channels of the convolutional layers in the DSSD prediction module. Setting this value to 0 will disable the DSSD prediction module. Supported values for this setting are 0, 256, 512 and 1024. A larger value gives a larger network and usually means the network is harder to train. |
Unsigned int |
512 |
freeze_bn |
Whether to freeze all batch normalization layers during training. |
boolean |
False |
freeze_blocks |
The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks in the model to make the training more stable and/or easier to converge. The definition of a block is heuristic for a specific architecture (for example, by stride or by logical blocks in the model). However, the block ID numbers identify the blocks in the model in sequential order, so you don’t have to know the exact locations of the blocks when you train. A general principle to keep in mind: the smaller the block ID, the closer the block is to the model input; the larger the block ID, the closer it is to the model output. You can divide the whole model into several blocks and optionally freeze a subset of them. Note that for FasterRCNN you can only freeze the blocks that are before the ROI pooling layer; any layer after the ROI pooling layer will not be frozen in any case. For different backbones, the number of blocks and the block ID of each block are different, so specifying the block IDs for each backbone deserves a detailed explanation. |
list(repeated integers)
|
dssd_config {
aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 0.33]"
scales: "[0.1, 0.24166667, 0.38333333, 0.525, 0.66666667, 0.80833333, 0.95]"
two_boxes_for_ar1: true
clip_boxes: false
loss_loc_weight: 1.0
focal_loss_alpha: 0.25
focal_loss_gamma: 2.0
variances: "[0.1, 0.1, 0.2, 0.2]"
pred_num_channels: 0
arch: "resnet"
nlayers: 18
freeze_bn: True
freeze_blocks: 0
freeze_blocks: 1
}
Using aspect_ratios_global or aspect_ratios
Only aspect_ratios_global
or aspect_ratios
is required.
aspect_ratios_global
should be a 1-d array inside quotation marks. Anchor boxes of aspect
ratios defined in aspect_ratios_global
will be generated for each feature layer used for
prediction. Example: “[1.0, 2.0, 0.5, 3.0, 0.33]”
aspect_ratios
should be a list of lists inside quotation marks. The length of the outer
list must be equivalent to the number of feature layers used for anchor box generation. And the
i-th layer will have anchor boxes with aspect ratios defined in aspect_ratios[i]. Here’s an
example:
"[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"
two_boxes_for_ar1
This setting is only relevant for layers that have 1.0 as the aspect ratio. If
two_boxes_for_ar1
is true, two boxes will be generated with an aspect ratio of 1. One
whose scale is the scale for this layer and the other one whose scale is the geometric mean of
the scale for this layer and the scale for the next layer.
scales or Combination of min_scale and max_scale
Only scales
or the combination of min_scale
and max_scale
is required.
scales
should be a 1-d array inside quotation marks. It is a list of positive floats
containing scaling factors per convolutional predictor layer. This list must be one element
longer than the number of predictor layers, so if two_boxes_for_ar1
is true, the second
aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last element in
this list, each positive float is the scaling factor for boxes in that layer. For example, if for
one layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer
(the first aspect ratio 1 box if two_boxes_for_ar1 is true) will have its height and width
as 0.1*min(img_h, img_w).
min_scale
and max_scale
are two positive floats. If both of them appear in the
config, the program can automatically generate the scales by evenly splitting the space between
min_scale
and max_scale
.
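As a sketch with placeholder bounds, the explicit scales list in the sample above could be replaced by the min_scale/max_scale pair:
dssd_config {
  # Scales are generated by evenly splitting the space between these two bounds
  # (placeholder values); do not also specify an explicit scales list.
  min_scale: 0.05
  max_scale: 0.95
  # ... other dssd_config fields as in the sample above
}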
clip_boxes
If true, all corner anchor boxes will be truncated so they are fully inside the feature images.
loss_loc_weight
This is a positive float controlling how much location regression loss should contribute to the final loss. The final loss is calculated as classification_loss + loss_loc_weight * loc_loss.
focal_loss_alpha and focal_loss_gamma
Focal loss is calculated as:
FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the ground truth class.
focal_loss_alpha
defines α and focal_loss_gamma
defines γ in the formula.
NVIDIA recommends α=0.25 and γ=2.0 if you don’t know what values to use.
variances
Variances should be a list of 4 positive floats. The four floats, in order, represent variances for box center x, box center y, log box height, log box width. The box offset for box center (cx, cy) and log box size (height/width) w.r.t. anchor will be divided by their respective variance value. Therefore, larger variances result in less significant differences between two different boxes on encoded offsets. The formula for offset calculation is:
t_cx = (cx_gt - cx_anchor) / (w_anchor * var_x), t_cy = (cy_gt - cy_anchor) / (h_anchor * var_y), t_h = log(h_gt / h_anchor) / var_h, t_w = log(w_gt / w_anchor) / var_w
steps
An optional list inside quotation marks whose length is the number of feature layers for
prediction. The elements should be floats or tuples/lists of two floats. Steps define how
many pixels apart the anchor box center points should be. If the element is a float, both
vertical and horizontal margin is the same. Otherwise, the first value is step_vertical
and the second value is step_horizontal
. If steps are not provided, anchor boxes will be
distributed uniformly inside the image.
offsets
An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. The first anchor box will have offsets[i]*steps[i] pixels margin from the left and top borders. If offsets are not provided, 0.5 will be used as default value.
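As a sketch for a model with six prediction layers (the step and offset values below are illustrative placeholders, not recommendations):
dssd_config {
  # One step per prediction layer; a single float means the same vertical and
  # horizontal spacing between anchor box centers.
  steps: "[8, 16, 32, 64, 100, 300]"
  # One offset per prediction layer; 0.5 is also the default when omitted.
  offsets: "[0.5, 0.5, 0.5, 0.5, 0.5, 0.5]"
  # ... other dssd_config fields as in the sample above
}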
arch
A string indicating which feature extraction architecture you want to use. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported.
nlayers
An integer specifying the number of layers of the selected arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration and users should just delete this config from the config file.
freeze_bn
Whether to freeze all batch normalization layers during training.
freeze_blocks
Optionally, you can have more than 1 freeze_blocks
field. Weights of layers in those blocks
will be frozen during training. See Model config for more information.
Below is a sample for the RetinaNet spec file. It has 6 major components:
retinanet_config
, training_config
, eval_config
, nms_config
,
augmentation_config
and dataset_config
. The format of the spec file is a protobuf
text (prototxt) message and each of its fields can be either a basic data type or a nested
message. The top level structure of the spec file is summarized in the table below:
random_seed: 42
retinanet_config {
aspect_ratios_global: "[1.0, 2.0, 0.5]"
scales: "[0.045, 0.09, 0.2, 0.4, 0.55, 0.7]"
two_boxes_for_ar1: false
clip_boxes: false
loss_loc_weight: 0.8
focal_loss_alpha: 0.25
focal_loss_gamma: 2.0
variances: "[0.1, 0.1, 0.2, 0.2]"
arch: "resnet"
nlayers: 18
n_kernels: 1
feature_size: 256
freeze_bn: false
freeze_blocks: 0
}
training_config {
enable_qat: False
batch_size_per_gpu: 24
num_epochs: 100
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 4e-5
max_learning_rate: 1.5e-2
soft_start: 0.15
annealing: 0.5
}
}
regularizer {
type: L1
weight: 2e-5
}
}
eval_config {
validation_period_during_training: 10
average_precision_mode: SAMPLE
batch_size: 32
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.01
clustering_iou_threshold: 0.6
top_k: 200
}
augmentation_config {
preprocessing {
output_image_width: 1248
output_image_height: 384
output_image_channel: 3
crop_right: 1248
crop_bottom: 384
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 0.7
zoom_max: 1.8
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.2
contrast_scale_max: 0.1
contrast_center: 0.5
}
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*"
image_directory_path: "/workspace/tlt-experiments/data/training"
}
image_extension: "png"
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "pedestrian"
value: "pedestrian"
}
target_class_mapping {
key: "cyclist"
value: "cyclist"
}
target_class_mapping {
key: "van"
value: "car"
}
target_class_mapping {
key: "person_sitting"
value: "pedestrian"
}
validation_fold: 0
}
Training Config
The training configuration(training_config
) defines the parameters needed for the training,
evaluation and inference. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
batch_size_per_gpu |
The batch size for each GPU, so the effective batch size is batch_size_per_gpu * num_gpus |
Unsigned int, positive |
|
num_epochs |
The number of epochs to train the network. |
Unsigned int, positive. |
|
enable_qat |
Whether to use quantization aware training |
Boolean |
|
learning_rate |
Only soft_start_annealing_schedule is supported, with the nested parameters min_learning_rate, max_learning_rate, soft_start, and annealing, as shown in the sample above.
|
Message type. |
|
regularizer |
This parameter configures the regularizer to be used while training and contains the nested parameters type and weight, as shown in the sample above.
|
Message type. |
L1 (Note: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more prunable.) |
Evaluation Config
The evaluation configuration (eval_config
) defines the parameters needed for the evaluation
either during training or standalone. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
validation_period_during_training |
The interval, in training epochs, at which validation is run. |
Unsigned int, positive |
10 |
average_precision_mode |
Average Precision (AP) calculation mode can be either SAMPLE or INTEGRATE. SAMPLE is used as VOC metrics for VOC 2009 or before. INTEGRATE is used for VOC 2010 or after that. |
ENUM type ( SAMPLE or INTEGRATE) |
SAMPLE |
matching_iou_threshold |
The lowest IoU between a predicted box and a ground truth box for them to be considered a match. |
float |
0.5 |
NMS Config
The NMS configuration (nms_config
) defines the parameters needed for the NMS postprocessing.
NMS config applies to the NMS layer of the model in training, validation, evaluation, inference
and export. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
confidence_threshold |
Boxes with a confidence score less than confidence_threshold are discarded before applying NMS |
float |
0.01 |
clustering_iou_threshold |
IOU threshold below which boxes will go through NMS process |
float |
0.6 |
top_k |
top_k boxes will be output after the NMS Keras layer. If the number of valid boxes is less than top_k, the returned array will be padded with boxes whose confidence score is 0. |
Unsigned int |
200 |
Augmentation Config
The augmentation configuration (augmentation_config
) defines the parameters needed for data
augmentation. The configuration is shared with DetectNet_v2. See Augmentation Module for more information.
Dataset Config
The dataset configuration (dataset_config
) defines the parameters needed for the data
loader. The configuration is shared with DetectNet_v2. See Dataloader for more
information.
RetinaNet Config
The RetinaNet configuration (retinanet_config
) defines the parameters needed for
building the RetinaNet model. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
aspect_ratios_global |
Anchor boxes of aspect ratios defined in aspect_ratios_global will be generated for each feature layer used for prediction. Note: Only one of aspect_ratios_global or aspect_ratios is required. |
string |
“[1.0, 2.0, 0.5]” |
aspect_ratios |
The length of the outer list must be equivalent to the number of feature layers used for anchor box generation. And the i-th layer will have anchor boxes with aspect ratios defined in aspect_ratios[i]. Note: Only one of aspect_ratios_global or aspect_ratios is required. |
string |
“[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]” |
two_boxes_for_ar1 |
This setting is only relevant for layers that have 1.0 as the aspect ratio. If two_boxes_for_ar1 is true, two boxes will be generated with an aspect ratio of 1. One whose scale is the scale for this layer and the other one whose scale is the geometric mean of the scale for this layer and the scale for the next layer. |
Boolean |
True |
clip_boxes |
If true, all corner anchor boxes will be truncated so they are fully inside the feature images. |
Boolean |
False |
scales |
scales is a list of positive floats containing scaling factors per convolutional predictor layer. This list must be one element longer than the number of predictor layers, so if two_boxes_for_ar1 is true, the second aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last element in this list, each positive float is the scaling factor for boxes in that layer. For example, if for one layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer (the first aspect ratio 1 box if two_boxes_for_ar1 is true) will have its height and width as 0.1*min(img_h, img_w). min_scale and max_scale are two positive floats. If both of them appear in the config, the program can automatically generate the scales by evenly splitting the space between min_scale and max_scale. |
string |
“[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]” |
min_scale/max_scale |
If both appear in the config, scales will be generated evenly by splitting the space between min_scale and max_scale. |
float |
|
loss_loc_weight |
This is a positive float controlling how much location regression loss should contribute to the final loss. The final loss is calculated as classification_loss + loss_loc_weight * loc_loss |
float |
1.0 |
focal_loss_alpha |
Alpha in the focal loss equation. |
float |
0.25 |
focal_loss_gamma |
Gamma in the focal loss equation. |
float |
2.0 |
variances |
Variances should be a list of 4 positive floats. The four floats, in order, represent variances for box center x, box center y, log box height, log box width. The box offset for box center (cx, cy) and log box size (height/width) w.r.t. anchor will be divided by their respective variance value. Therefore, larger variances result in less significant differences between two different boxes on encoded offsets. |
string |
“[0.1, 0.1, 0.2, 0.2]” |
steps |
An optional list inside quotation marks whose length is the number of feature layers for prediction. The elements should be floats or tuples/lists of two floats. Steps define how many pixels apart the anchor box center points should be. If the element is a float, both vertical and horizontal margin is the same. Otherwise, the first value is step_vertical and the second value is step_horizontal. If steps are not provided, anchor boxes will be distributed uniformly inside the image. |
string |
|
offsets |
An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. The first anchor box will have offsets[i]*steps[i] pixels margin from the left and top borders. If offsets are not provided, 0.5 will be used as default value. |
string |
|
arch |
Backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported. |
string |
resnet |
nlayers |
Number of conv layers in specific arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration and users should just delete this config from the config file. |
Unsigned int |
|
freeze_bn |
Whether to freeze all batch normalization layers during training. |
boolean |
False |
freeze_blocks |
The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks in the model to make the training more stable and/or easier to converge. The definition of a block is heuristic for a specific architecture (for example, by stride or by logical blocks in the model). However, the block ID numbers identify the blocks in the model in sequential order, so you don’t have to know the exact locations of the blocks when you train. A general principle to keep in mind: the smaller the block ID, the closer the block is to the model input; the larger the block ID, the closer it is to the model output. You can divide the whole model into several blocks and optionally freeze a subset of them. Note that for FasterRCNN you can only freeze the blocks that are before the ROI pooling layer; any layer after the ROI pooling layer will not be frozen in any case. For different backbones, the number of blocks and the block ID of each block are different, so specifying the block IDs for each backbone deserves a detailed explanation. |
list(repeated integers)
|
|
n_kernels |
This setting controls the number of convolutional layers in the RetinaNet subnets for classification and anchor box regression. A larger value generates a larger network and usually means the network is harder to train. |
Unsigned int |
2 |
feature_size |
This setting controls the number of channels of the convolutional layers in the RetinaNet subnets for classification and anchor box regression. A larger value gives a larger network and usually means the network is harder to train. Note that RetinaNet FPN generates 5 feature maps, thus the scales field requires a list of 6 scaling factors. The last number is not used if two_boxes_for_ar1 is set to False. There are also three underlying scaling factors at each feature map level (2^0, 2^⅓, 2^⅔ ). |
Unsigned int |
256 |
Focal loss is calculated as FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the ground truth class; focal_loss_alpha defines α and focal_loss_gamma defines γ.
Variances: the box offsets for the box center (cx, cy) and the log box size (height/width) w.r.t. the anchor are divided by their respective variance values, as described in the variances field above.
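As a hedged sketch tying the notes above together (placeholder values; see the full RetinaNet sample earlier in this section), a retinanet_config using the recommended n_kernels and a 6-element scales list for the 5 FPN feature maps might look like:
retinanet_config {
  # 6 scaling factors for the 5 FPN feature maps; the last one is only used
  # when two_boxes_for_ar1 is True (placeholder values).
  scales: "[0.045, 0.09, 0.2, 0.4, 0.55, 0.7]"
  n_kernels: 2
  feature_size: 256
  # ... other retinanet_config fields as in the sample above
}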
Below is a sample for the YOLOv3 spec file. It has 6 major components: yolo_config
,
training_config
, eval_config
, nms_config
, augmentation_config
, and
dataset_config
. The format of the spec file is a protobuf text (prototxt) message and each
of its fields can be either a basic data type or a nested message. The top level structure of the
spec file is summarized in the table below.
random_seed: 42
yolo_config {
big_anchor_shape: "[(116,90), (156,198), (373,326)]"
mid_anchor_shape: "[(30,61), (62,45), (59,119)]"
small_anchor_shape: "[(10,13), (16,30), (33,23)]"
matching_neutral_box_iou: 0.5
arch: "darknet"
nlayers: 53
arch_conv_blocks: 2
loss_loc_weight: 5.0
loss_neg_obj_weights: 50.0
loss_class_weights: 1.0
freeze_bn: True
freeze_blocks: 0
freeze_blocks: 1
}
training_config {
batch_size_per_gpu: 16
num_epochs: 80
enable_qat: false
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-5
max_learning_rate: 2e-2
soft_start: 0.15
annealing: 0.8
}
}
regularizer {
type: L1
weight: 3e-5
}
}
eval_config {
validation_period_during_training: 10
average_precision_mode: SAMPLE
batch_size: 16
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.01
clustering_iou_threshold: 0.6
top_k: 200
}
augmentation_config {
preprocessing {
output_image_width: 1248
output_image_height: 384
output_image_channel: 3
crop_right: 1248
crop_bottom: 384
min_bbox_width: 1.0
min_bbox_height: 1.0
}
spatial_augmentation {
hflip_probability: 0.5
vflip_probability: 0.0
zoom_min: 0.7
zoom_max: 1.8
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
dataset_config {
data_sources: {
tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*"
image_directory_path: "/workspace/tlt-experiments/data/training"
}
image_extension: "png"
target_class_mapping {
key: "car"
value: "car"
}
target_class_mapping {
key: "pedestrian"
value: "pedestrian"
}
target_class_mapping {
key: "cyclist"
value: "cyclist"
}
target_class_mapping {
key: "van"
value: "car"
}
target_class_mapping {
key: "person_sitting"
value: "pedestrian"
}
validation_fold: 0
}
Training Config
The training configuration(training_config
) defines the parameters needed for the
training, evaluation and inference. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
batch_size_per_gpu |
The batch size for each GPU, so the effective batch size is batch_size_per_gpu * num_gpus |
Unsigned int, positive |
|
num_epochs |
The number of epochs to train the network. |
Unsigned int, positive. |
|
enable_qat |
Whether to use quantization aware training |
Boolean |
|
learning_rate |
Only soft_start_annealing_schedule is supported, with the nested parameters min_learning_rate, max_learning_rate, soft_start, and annealing, as shown in the sample above.
|
Message type. |
|
regularizer |
This parameter configures the regularizer to be used while training and contains the nested parameters type and weight, as shown in the sample above.
|
Message type. |
L1 (Note: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more prunable.) |
Evaluation Config
The evaluation configuration (eval_config
) defines the parameters needed for the
evaluation either during training or standalone. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
validation_period_during_training |
The interval, in training epochs, at which validation is run. |
Unsigned int, positive |
10 |
average_precision_mode |
Average Precision (AP) calculation mode can be either SAMPLE or INTEGRATE. SAMPLE is used as VOC metrics for VOC 2009 or before. INTEGRATE is used for VOC 2010 or after that. |
ENUM type ( SAMPLE or INTEGRATE) |
SAMPLE |
matching_iou_threshold |
The lowest IoU between a predicted box and a ground truth box for them to be considered a match. |
float |
0.5 |
NMS Config
The NMS configuration (nms_config
) defines the parameters needed for the NMS postprocessing.
NMS config applies to the NMS layer of the model in training, validation, evaluation, inference
and export. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
confidence_threshold |
Boxes with a confidence score less than confidence_threshold are discarded before applying NMS |
float |
0.01 |
clustering_iou_threshold |
IOU threshold below which boxes will go through NMS process |
float |
0.6 |
top_k |
top_k boxes will be output after the NMS Keras layer. If the number of valid boxes is less than top_k, the returned array will be padded with boxes whose confidence score is 0. |
Unsigned int |
200 |
Augmentation Config
The augmentation configuration (augmentation_config
) defines the parameters needed for data
augmentation. The configuration is shared with DetectNet_v2. See Augmentation module for more information.
Dataset Config
The dataset configuration (dataset_config
) defines the parameters needed for the data
loader. The configuration is shared with DetectNet_v2. See Dataloader for
more information.
YOLOv3 Config
The YOLOv3 configuration (yolo_config
) defines the parameters needed for building the
YOLOv3 model. Details are summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
big_anchor_shape, mid_anchor_shape, and small_anchor_shape |
Those settings should be 1-d arrays inside quotation marks. The elements of those arrays are tuples representing the pre-defined anchor shape in the order of width, height. The default YOLOv3 has 9 predefined anchor shapes. They are divided into 3 groups corresponding to big, medium and small objects. The detection output corresponding to different groups are from different depths in the network. Users should run the kmeans.py file attached with the example notebook to determine the best anchor shapes for their own dataset and put those anchor shapes in the spec file. It is worth noting that the number of anchor shapes for any field is not limited to 3. Users only need to specify at least 1 anchor shape in each of those three fields. |
string |
Use the kmeans.py script in examples/yolo inside the Docker container to generate these shapes |
matching_neutral_box_iou |
This field should be a float between 0 and 1. Any anchor that is not matched to a ground truth box, but whose IOU with some ground truth box is higher than this value, will not have its objectness loss back-propagated during training. This reduces false negatives. |
float |
0.5 |
arch_conv_blocks |
Supported values are 0, 1 and 2. This value controls how many convolutional blocks are present among the detection output layers. Set this value to 2 to reproduce the meta architecture of the original YOLOv3 model that comes with DarkNet 53. Note that this setting only controls the size of the YOLO meta architecture; the size of the feature extractor is unrelated to this field. |
0, 1 or 2 |
2 |
loss_loc_weight, loss_neg_obj_weights, and loss_class_weights |
These loss weights can be configured as float numbers. The YOLOv3 loss is the sum of the localization loss, negative objectness loss, positive objectness loss, and classification loss. The weight of the positive objectness loss is fixed to 1, while the weights of the other losses are read from the config file. |
float |
loss_loc_weight: 5.0 loss_neg_obj_weights: 50.0 loss_class_weights: 1.0 |
arch |
Backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported. |
string |
resnet |
nlayers |
The number of conv layers in a specific arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration, in which case you should delete this field from the config file. |
Unsigned int |
|
freeze_bn |
Whether to freeze all batch normalization layers during training. |
boolean |
False |
freeze_blocks |
The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks in the model to make the training more stable and/or easier to converge. The definition of a block is heuristic for a specific architecture (for example, by stride or by logical blocks in the model). However, the block ID numbers identify the blocks in the model in sequential order, so you don’t have to know the exact locations of the blocks when you train. A general principle to keep in mind: the smaller the block ID, the closer the block is to the model input; the larger the block ID, the closer it is to the model output. The whole model can be divided into several blocks, and you can optionally freeze a subset of them. Note that for FasterRCNN you can only freeze the blocks before the ROI pooling layer; any layer after the ROI pooling layer is never frozen. The number of blocks and the block IDs differ across backbones, so specifying block IDs for each backbone deserves a detailed explanation. |
list(repeated integers)
|
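Assembling the fields above, a yolo_config might look like the sketch below. The anchor shapes shown here are the well-known YOLOv3 defaults for COCO and are placeholders only; generate shapes for your own dataset with kmeans.py as described above, and treat the remaining values as illustrative:
yolo_config {
big_anchor_shape: "[(116, 90), (156, 198), (373, 326)]"
mid_anchor_shape: "[(30, 61), (62, 45), (59, 119)]"
small_anchor_shape: "[(10, 13), (16, 30), (33, 23)]"
matching_neutral_box_iou: 0.5
arch: "resnet"
nlayers: 18
arch_conv_blocks: 2
loss_loc_weight: 5.0
loss_neg_obj_weights: 50.0
loss_class_weights: 1.0
freeze_bn: False
freeze_blocks: 0
freeze_blocks: 1
}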
MaskRCNN
The MaskRCNN spec file has three major components: the top-level experiment configs, data_config, and maskrcnn_config, explained below in detail. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message. The top-level structure of the spec file is summarized in the table below.
Here’s a sample MaskRCNN spec file:
seed: 123
use_amp: False
warmup_steps: 0
checkpoint: "/workspace/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
learning_rate_steps: "[60000, 80000, 100000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.002]"
total_steps: 120000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 10000
momentum: 0.9
l2_weight_decay: 0.0001
warmup_learning_rate: 0.0001
init_learning_rate: 0.02
data_config{
image_size: "(832, 1344)"
augment_input_data: True
eval_samples: 500
training_file_pattern: "/workspace/tlt-experiments/data/train*.tfrecord"
validation_file_pattern: "/workspace/tlt-experiments/data/val*.tfrecord"
val_json_file: "/workspace/tlt-experiments/data/annotations/instances_val2017.json"
# dataset specific parameters
num_classes: 91
skip_crowd_during_training: True
}
maskrcnn_config {
nlayers: 50
arch: "resnet"
freeze_bn: True
freeze_blocks: "[0,1]"
gt_mask_size: 112
# Region Proposal Network
rpn_positive_overlap: 0.7
rpn_negative_overlap: 0.3
rpn_batch_size_per_im: 256
rpn_fg_fraction: 0.5
rpn_min_size: 0.
# Proposal layer.
batch_size_per_im: 512
fg_fraction: 0.25
fg_thresh: 0.5
bg_thresh_hi: 0.5
bg_thresh_lo: 0.
# Faster-RCNN heads.
fast_rcnn_mlp_head_dim: 1024
bbox_reg_weights: "(10., 10., 5., 5.)"
# Mask-RCNN heads.
include_mask: True
mrcnn_resolution: 28
# training
train_rpn_pre_nms_topn: 2000
train_rpn_post_nms_topn: 1000
train_rpn_nms_threshold: 0.7
# evaluation
test_detections_per_image: 100
test_nms: 0.5
test_rpn_pre_nms_topn: 1000
test_rpn_post_nms_topn: 1000
test_rpn_nms_thresh: 0.7
# model architecture
min_level: 2
max_level: 6
num_scales: 1
aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
anchor_scale: 8
# localization loss
rpn_box_loss_weight: 1.0
fast_rcnn_box_loss_weight: 1.0
mrcnn_weight_loss_mask: 1.0
}
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
seed |
The random seed for the experiment. |
Unsigned int |
123 |
warmup_steps |
The number of steps taken for the learning rate to ramp up to init_learning_rate. |
Unsigned int |
|
warmup_learning_rate |
The initial learning rate during the warmup phase. |
float |
|
learning_rate_steps |
A list of steps at which the learning rate decays by the factor specified in learning_rate_decay_levels. |
string |
|
learning_rate_decay_levels |
List of decay factors. The length should match the length of learning_rate_steps. |
string |
|
total_steps |
Total number of training iterations. |
Unsigned int |
|
train_batch_size |
Batch size during training. |
Unsigned int |
4 |
eval_batch_size |
Batch size during validation or evaluation. |
Unsigned int |
8 |
num_steps_per_eval |
Save a checkpoint and run evaluation every N steps. |
Unsigned int |
|
momentum |
Momentum of SGD optimizer. |
float |
0.9 |
l2_weight_decay |
L2 weight decay |
float |
0.0001 |
use_amp |
Whether to use Automatic Mixed Precision training. |
boolean |
False |
checkpoint |
Path to a pretrained model. |
string |
|
maskrcnn_config |
The architecture of the model. |
message |
|
data_config |
Input data configuration. |
message |
|
skip_checkpoint_variables |
If specified, the weights of layers whose names match this regular expression are not loaded from the checkpoint. This is especially helpful for transfer learning. |
string |
When using skip_checkpoint_variables, you can first find the model structure in the training log (part of the MaskRCNN+ResNet50 model structure is shown below). If, for example, you want to retrain all prediction heads, you can set skip_checkpoint_variables to “head”. TLT uses the Python re library to check whether the pattern matches any layer name, i.e. re.search($skip_checkpoint_variables, $layer_name).
[MaskRCNN] INFO : ================ TRAINABLE VARIABLES ==================
[MaskRCNN] INFO : [#0001] conv1/kernel:0 => (7, 7, 3, 64)
[MaskRCNN] INFO : [#0002] bn_conv1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0003] bn_conv1/beta:0 => (64,)
[MaskRCNN] INFO : [#0004] block_1a_conv_1/kernel:0 => (1, 1, 64, 64)
[MaskRCNN] INFO : [#0005] block_1a_bn_1/gamma:0 => (64,)
[MaskRCNN] INFO : [#0006] block_1a_bn_1/beta:0 => (64,)
[MaskRCNN] INFO : [#0007] block_1a_conv_2/kernel:0 => (3, 3, 64, 64)
[MaskRCNN] INFO : [#0008] block_1a_bn_2/gamma:0 => (64,)
[MaskRCNN] INFO : [#0009] block_1a_bn_2/beta:0 => (64,)
[MaskRCNN] INFO : [#0010] block_1a_conv_3/kernel:0 => (1, 1, 64, 256)
[MaskRCNN] INFO : [#0011] block_1a_bn_3/gamma:0 => (256,)
[MaskRCNN] INFO : [#0012] block_1a_bn_3/beta:0 => (256,)
… … … … ...
[MaskRCNN] INFO : [#0110] block_3d_bn_3/gamma:0 => (1024,)
[MaskRCNN] INFO : [#0111] block_3d_bn_3/beta:0 => (1024,)
[MaskRCNN] INFO : [#0112] block_3e_conv_1/kernel:0 => (1, 1, 1024, …
… … … … ...
[MaskRCNN] INFO : [#0144] block_4b_bn_1/beta:0 => (512,)
… … … … ...
[MaskRCNN] INFO : [#0174] fpn/post_hoc_d5/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0175] fpn/post_hoc_d5/bias:0 => (256,)
[MaskRCNN] INFO : [#0176] rpn_head/rpn/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0177] rpn_head/rpn/bias:0 => (256,)
[MaskRCNN] INFO : [#0178] rpn_head/rpn-class/kernel:0 => (1, 1, 256, 3)
[MaskRCNN] INFO : [#0179] rpn_head/rpn-class/bias:0 => (3,)
[MaskRCNN] INFO : [#0180] rpn_head/rpn-box/kernel:0 => (1, 1, 256, 12)
[MaskRCNN] INFO : [#0181] rpn_head/rpn-box/bias:0 => (12,)
[MaskRCNN] INFO : [#0182] box_head/fc6/kernel:0 => (12544, 1024)
[MaskRCNN] INFO : [#0183] box_head/fc6/bias:0 => (1024,)
[MaskRCNN] INFO : [#0184] box_head/fc7/kernel:0 => (1024, 1024)
[MaskRCNN] INFO : [#0185] box_head/fc7/bias:0 => (1024,)
[MaskRCNN] INFO : [#0186] box_head/class-predict/kernel:0 => (1024, 91)
[MaskRCNN] INFO : [#0187] box_head/class-predict/bias:0 => (91,)
[MaskRCNN] INFO : [#0188] box_head/box-predict/kernel:0 => (1024, 364)
[MaskRCNN] INFO : [#0189] box_head/box-predict/bias:0 => (364,)
[MaskRCNN] INFO : [#0190] mask_head/mask-conv-l0/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0191] mask_head/mask-conv-l0/bias:0 => (256,)
[MaskRCNN] INFO : [#0192] mask_head/mask-conv-l1/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0193] mask_head/mask-conv-l1/bias:0 => (256,)
[MaskRCNN] INFO : [#0194] mask_head/mask-conv-l2/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0195] mask_head/mask-conv-l2/bias:0 => (256,)
[MaskRCNN] INFO : [#0196] mask_head/mask-conv-l3/kernel:0 => (3, 3, 256, 256)
[MaskRCNN] INFO : [#0197] mask_head/mask-conv-l3/bias:0 => (256,)
[MaskRCNN] INFO : [#0198] mask_head/conv5-mask/kernel:0 => (2, 2, 256, 256)
[MaskRCNN] INFO : [#0199] mask_head/conv5-mask/bias:0 => (256,)
[MaskRCNN] INFO : [#0200] mask_head/mask_fcn_logits/kernel:0 => (1, 1, 256, 91)
[MaskRCNN] INFO : [#0201] mask_head/mask_fcn_logits/bias:0 => (91,)
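Given the listing above, setting skip_checkpoint_variables to “head” at the top level of the spec would skip loading every rpn_head, box_head, and mask_head variable, since re.search("head", layer_name) matches all of those names:
skip_checkpoint_variables: "head"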
MaskRCNN Config
The maskrcnn configuration (maskrcnn_config
) defines the model structure. This model
is used for training, evaluation, and inference. Details are summarized in the table below.
Currently, MaskRCNN supports only ResNet10/18/34/50/101 as its backbone.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
nlayers |
The number of layers in the ResNet backbone |
Unsigned int |
50 |
arch |
The backbone feature extractor name |
string |
resnet |
freeze_bn |
Whether to freeze all BatchNorm layers in the backbone |
boolean |
False |
freeze_blocks |
List of conv blocks in the backbone to freeze |
string ResNet: For the ResNet series, the block IDs valid for freezing is any subset of [0, 1, 2, 3] (inclusive) |
|
gt_mask_size |
Groundtruth mask size |
Unsigned int |
112 |
rpn_positive_overlap |
Lower bound threshold to assign positive labels for anchors |
float |
0.7 |
rpn_negative_overlap |
Upper bound threshold to assign negative labels for anchors |
float |
0.3 |
rpn_batch_size_per_im |
The number of sampled anchors per image in RPN |
Unsigned int |
256 |
rpn_fg_fraction |
Desired fraction of positive anchors in a batch |
float |
0.5 |
rpn_min_size |
Minimum proposal height and width |
float |
0 |
batch_size_per_im |
RoI minibatch size per image |
Unsigned int |
512 |
fg_fraction |
The target fraction of RoI minibatch that is labeled as foreground |
float |
0.25 |
fast_rcnn_mlp_head_dim |
The dimension of the FastRCNN classification head |
Unsigned int |
1024 |
bbox_reg_weights |
Bounding box regression weights |
string |
“(10, 10, 5, 5)” |
include_mask |
Whether to include mask head |
boolean |
True (currently only True is supported) |
mrcnn_resolution |
Mask head resolution |
Unsigned int |
28 |
train_rpn_pre_nms_topn |
Number of top scoring RPN proposals to keep before applying NMS (per FPN level) |
Unsigned int |
2000 |
train_rpn_post_nms_topn |
Number of top scoring RPN proposals to keep after applying NMS (total number produced) |
Unsigned int |
1000 |
train_rpn_nms_threshold |
NMS IOU threshold in RPN during training |
float |
0.7 |
test_detections_per_image |
Number of bounding box candidates after NMS |
Unsigned int |
100 |
test_nms |
NMS IOU threshold during test |
float |
0.5 |
test_rpn_pre_nms_topn |
Number of top scoring RPN proposals to keep before applying NMS (per FPN level) during test |
Unsigned int |
1000 |
test_rpn_post_nms_topn |
Number of top scoring RPN proposals to keep after applying NMS (total number produced) during test |
Unsigned int |
1000 |
test_rpn_nms_threshold |
NMS IOU threshold in RPN during test |
float |
0.7 |
min_level |
Minimum level of the output feature pyramid |
Unsigned int |
2 |
max_level |
Maximum level of the output feature pyramid |
Unsigned int |
6 |
num_scales |
Number of anchor octave scales on each pyramid level (e.g. if it’s set to 3, the anchor scales are [2^0, 2^(1/3), 2^(2/3)]) |
Unsigned int |
1 |
aspect_ratios |
List of tuples representing the aspect ratios of anchors on each pyramid level |
string |
“[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]” |
anchor_scale |
Scale of base anchor size to the feature pyramid stride |
Unsigned int |
8 |
rpn_box_loss_weight |
Weight for adjusting RPN box loss in the total loss |
float |
1.0 |
fast_rcnn_box_loss_weight |
Weight for adjusting FastRCNN box regression loss in the total loss |
float |
1.0 |
mrcnn_weight_loss_mask |
Weight for adjusting mask loss in the total loss |
float |
1.0 |
The min_level, max_level, num_scales, aspect_ratios and anchor_scale fields determine MaskRCNN’s
anchor generation. anchor_scale is the scale of the base anchor, while min_level and max_level
set the range of pyramid levels on which anchors are generated. For example, the actual anchor
scale for the feature map at min_level is anchor_scale * 2^min_level, and the actual anchor
scale for the feature map at max_level is anchor_scale * 2^max_level. Anchors for each of the
aspect_ratios are then generated from the actual anchor scale at every level.
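As a concrete example, with the values from the sample spec above (anchor_scale: 8, min_level: 2, max_level: 6), the anchors on the finest feature map have a base size of 8 * 2^2 = 32 pixels and those on the coarsest feature map have a base size of 8 * 2^6 = 512 pixels; each base size is then combined with every tuple in aspect_ratios to produce the final anchor set for that level.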
Data Config
The data configuration (data_config
) specifies the input data source and format. This is
used for training, evaluation and inference. Detailed description is summarized in the table below.
Field |
Description |
Data Type and Constraints |
Recommended/Typical Value |
image_size |
The image dimension as a tuple inside quotation marks. “(height, width)” indicates the dimension of the resized and padded input. |
string |
“(832, 1344)” |
augment_input_data |
Whether to augment data |
boolean |
True |
eval_samples |
Number of samples for evaluation |
Unsigned int |
|
training_file_pattern |
TFRecord path for training |
string |
|
validation_file_pattern |
TFRecord path for validation |
string |
|
val_json_file |
The annotation file path for validation |
string |
|
num_classes |
Number of classes |
Unsigned int |
|
skip_crowd_during_training |
Whether to skip crowd during training |
boolean |
True |