Creating an Experiment Spec File
================================

.. _creating_an_experiment_spec_file:

This chapter describes how to create a specification file for model training, inference, and evaluation.

Specification File for Classification
-------------------------------------

.. _specification_file_for_classification:

Here is an example of a specification file for classification:

.. code::

    model_config {
      # Model architecture can be chosen from:
      # ['resnet', 'vgg', 'googlenet', 'alexnet', 'mobilenet_v1', 'mobilenet_v2', 'squeezenet', 'darknet']
      arch: "resnet"
      # for resnet --> n_layers can be [10, 18, 34, 50, 101]
      # for vgg --> n_layers can be [16, 19]
      # for darknet --> n_layers can be [19, 53]
      n_layers: 18
      use_bias: True
      use_batch_norm: True
      all_projections: True
      use_pooling: False
      freeze_bn: False
      freeze_blocks: 0
      freeze_blocks: 1
      # image size should be "3, X, Y", where X,Y >= 16
      input_image_size: "3,224,224"
    }
    eval_config {
      eval_dataset_path: "/path/to/your/eval/data"
      model_path: "/path/to/your/model"
      top_k: 3
      batch_size: 256
      n_workers: 8
    }
    train_config {
      train_dataset_path: "/path/to/your/train/data"
      val_dataset_path: "/path/to/your/val/data"
      pretrained_model_path: "/path/to/your/pretrained/model"
      # optimizer can be chosen from ['adam', 'sgd']
      optimizer: "sgd"
      batch_size_per_gpu: 256
      n_epochs: 80
      n_workers: 16
      # regularizer
      reg_config {
        type: "L2"
        scope: "Conv2D,Dense"
        weight_decay: 0.00005
      }
      # learning_rate
      lr_config {
        # "step", "soft_anneal", and "cosine" are supported.
        scheduler: "soft_anneal"
        # "soft_anneal" stands for the soft annealing learning rate scheduler.
        # the following 4 parameters should be specified if "soft_anneal" is used.
        learning_rate: 0.005
        soft_start: 0.056
        annealing_points: "0.3, 0.6, 0.8"
        annealing_divider: 10
        # "step" stands for the step learning rate scheduler.
        # the following 3 parameters should be specified if "step" is used.
        # learning_rate: 0.006
        # step_size: 10
        # gamma: 0.1
        # "cosine" stands for the soft start cosine learning rate scheduler.
        # the following 2 parameters should be specified if "cosine" is used.
        # learning_rate: 0.05
        # soft_start: 0.01
      }
    }

The classification experiment specification can be used with the :code:`tlt-train` and :code:`tlt-evaluate` commands. It consists of three main components:

* :code:`model_config`
* :code:`eval_config`
* :code:`train_config`

Model Config
^^^^^^^^^^^^

.. _spec_file_model_config:

The table below describes the configurable parameters in the :code:`model_config`.
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| **Parameter**      | **Datatype** | **Default**  | **Description**                                              | **Supported Values**                           |
+====================+==============+==============+==============================================================+================================================+
| all_projections    | bool         | False        | For templates with shortcut connections, this parameter     | True/False (only to be used in resnet          |
|                    |              |              | defines whether or not all shortcuts are instantiated with  | templates)                                     |
|                    |              |              | 1x1 projection layers, irrespective of whether there is a   |                                                |
|                    |              |              | change in stride across the input and output.               |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| arch               | string       | resnet       | This defines the architecture of the backbone feature       | * resnet                                       |
|                    |              |              | extractor to be used to train.                              | * vgg                                          |
|                    |              |              |                                                              | * mobilenet_v1                                 |
|                    |              |              |                                                              | * mobilenet_v2                                 |
|                    |              |              |                                                              | * googlenet                                    |
|                    |              |              |                                                              | * alexnet                                      |
|                    |              |              |                                                              | * squeezenet                                   |
|                    |              |              |                                                              | * darknet                                      |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| num_layers         | int          | 18           | Depth of the feature extractor for scalable templates.      | * resnet: 10, 18, 34, 50, 101                  |
|                    |              |              |                                                              | * vgg: 16, 19                                  |
|                    |              |              |                                                              | * darknet: 19, 53                              |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| use_pooling        | Boolean      | False        | Choose between using strided convolutions or MaxPooling     | True/False                                     |
|                    |              |              | while downsampling. When True, MaxPooling is used to        |                                                |
|                    |              |              | downsample; however, for object detection networks, NVIDIA  |                                                |
|                    |              |              | recommends setting this to False and using strided          |                                                |
|                    |              |              | convolutions.                                               |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| use_batch_norm     | Boolean      | False        | Boolean variable to use batch normalization layers or not.  | True/False                                     |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| freeze_blocks      | float        | -            | This parameter defines which blocks may be frozen from the  | * ResNet: any subset of [0, 1, 2, 3]           |
|                    | (repeated)   |              | instantiated feature extractor template; the valid block    | * VGG: any subset of [1, 2, 3, 4, 5]           |
|                    |              |              | IDs differ across templates.                                | * MobileNet V1: any subset of [0, ..., 11]     |
|                    |              |              |                                                              | * MobileNet V2: any subset of [0, ..., 13]     |
|                    |              |              |                                                              | * GoogLeNet: any subset of [0, ..., 7]         |
|                    |              |              |                                                              |                                                |
|                    |              |              |                                                              | (all ranges inclusive)                         |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| freeze_bn          | Boolean      | False        | You can choose to freeze the Batch Normalization layers in  | True/False                                     |
|                    |              |              | the model during training.                                  |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| input_image_size   | String       | "3,224,224"  | The dimension of the input layer of the model. Images in    | "C,X,Y", where C=1 or C=3 and X,Y are          |
|                    |              |              | the dataset are resized to this shape by the dataloader     | integers >= 16.                                |
|                    |              |              | when fed to the model for training.                         |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
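For example, to train a VGG16 backbone with its first two blocks frozen, the :code:`model_config` might look like the following sketch. The backbone choice and the frozen block IDs here are illustrative, not recommendations:

.. code::

    # Illustrative sketch: a vgg16 backbone with blocks 1 and 2 frozen.
    # Per the table above, the VGG block IDs valid for freezing are [1, 2, 3, 4, 5].
    model_config {
      arch: "vgg"
      n_layers: 16
      use_batch_norm: True
      freeze_blocks: 1
      freeze_blocks: 2
      input_image_size: "3,224,224"
    }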
Eval Config
^^^^^^^^^^^

.. _spec_file_eval_config:

The table below defines the configurable parameters for evaluating a classification model.

+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| **Parameter**      | **Datatype** | **Default**  | **Description**                                              | **Supported Values**                           |
+====================+==============+==============+==============================================================+================================================+
| eval_dataset_path  | string       |              | UNIX format path to the root directory of the evaluation    | UNIX format path.                              |
|                    |              |              | dataset.                                                    |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| model_path         | string       |              | UNIX format path to the model file you would like to        | UNIX format path.                              |
|                    |              |              | evaluate.                                                   |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| top_k              | int          | 5            | The number of elements to look at when calculating the      | 1, 3, 5                                        |
|                    |              |              | top-K classification categorical accuracy metric.           |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| conf_threshold     | float        | 0.5          | The confidence threshold on the argmax of the classifier    | >0.0                                           |
|                    |              |              | output for a prediction to be considered a true positive.   |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| batch_size         | int          | 256          | Number of images per batch when evaluating the model.       | >1 (bound by the number of images that can     |
|                    |              |              |                                                              | fit in the GPU memory)                         |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| n_workers          | int          | 8            | Number of workers fetching batches of images in the         | >1                                             |
|                    |              |              | evaluation dataloader.                                      |                                                |
+--------------------+--------------+--------------+--------------------------------------------------------------+------------------------------------------------+
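Note that the sample spec at the top of this chapter does not set :code:`conf_threshold`; a hedged sketch of an :code:`eval_config` that sets it explicitly, with placeholder paths:

.. code::

    # Illustrative sketch: an eval_config measuring top-1 accuracy with an
    # explicit confidence threshold. The paths are placeholders.
    eval_config {
      eval_dataset_path: "/path/to/your/eval/data"
      model_path: "/path/to/your/model"
      top_k: 1
      conf_threshold: 0.5
      batch_size: 256
      n_workers: 8
    }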
Training Config
^^^^^^^^^^^^^^^

.. _spec_file_training_config:

This section defines the configurable parameters for the classification model trainer.

+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| **Parameter**          | **Datatype**     | **Default**  | **Description**                                              | **Supported Values**                           |
+========================+==================+==============+==============================================================+================================================+
| val_dataset_path       | string           |              | UNIX format path to the root directory of the validation    | UNIX format path.                              |
|                        |                  |              | dataset.                                                    |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| train_dataset_path     | string           |              | UNIX format path to the root directory of the training      | UNIX format path.                              |
|                        |                  |              | dataset.                                                    |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| pretrained_model_path  | string           |              | UNIX format path to the model file containing the           | UNIX format path.                              |
|                        |                  |              | pretrained weights to initialize the model from.            |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| batch_size_per_gpu     | int              | 32           | This parameter defines the number of images per batch per   | >1                                             |
|                        |                  |              | GPU.                                                        |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| num_epochs             | int              | 120          | This parameter defines the total number of epochs to run    |                                                |
|                        |                  |              | the experiment.                                             |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| n_workers              | int              | -            | Number of workers fetching batches of images in the         | >1                                             |
|                        |                  |              | training dataloader.                                        |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| learning rate          | learning rate    |              | This nested protobuf parameter defines the learning rate    | * soft_start: 0.0 - 1.0                        |
|                        | scheduler proto  |              | schedule to be used with the trainer when training a        | * annealing_divider: > 1.0                     |
|                        |                  |              | classification model. The following parameters are          | * step: 0.0 - 1.0                              |
|                        |                  |              | required to configure a valid learning rate scheduler:      | * gamma: 0.0 - 1.0                             |
|                        |                  |              |                                                              |                                                |
|                        |                  |              | * scheduler (str): The type of learning rate scheduler to   |                                                |
|                        |                  |              |   be used. The supported types are "cosine",                |                                                |
|                        |                  |              |   "soft_anneal", and "step".                                |                                                |
|                        |                  |              | * learning_rate (float): The starting learning rate of the  |                                                |
|                        |                  |              |   scheduler.                                                |                                                |
|                        |                  |              | * soft_start (float): The time (as a ratio of the total     |                                                |
|                        |                  |              |   number of epochs) taken to reach the maximum learning     |                                                |
|                        |                  |              |   rate (learning_rate * num_gpus). Use this parameter if    |                                                |
|                        |                  |              |   the scheduler is set to "cosine" or "soft_anneal".        |                                                |
|                        |                  |              | * annealing_points (string): The times (as ratios of the    |                                                |
|                        |                  |              |   total number of epochs) at which the learning rate is     |                                                |
|                        |                  |              |   divided by the annealing divider. To be used only if the  |                                                |
|                        |                  |              |   scheduler is set to "soft_anneal".                        |                                                |
|                        |                  |              | * annealing_divider (float): A divider applied to the       |                                                |
|                        |                  |              |   learning rate at each annealing point. To be used only    |                                                |
|                        |                  |              |   if the scheduler is set to "soft_anneal".                 |                                                |
|                        |                  |              | * step (float): The time (as a ratio of the total number    |                                                |
|                        |                  |              |   of epochs) to step the learning rate from lr to           |                                                |
|                        |                  |              |   lr * gamma. To be used only if the scheduler is set to    |                                                |
|                        |                  |              |   "step".                                                   |                                                |
|                        |                  |              | * gamma (float): The scale factor applied to the learning   |                                                |
|                        |                  |              |   rate after every step. To be used only if the scheduler   |                                                |
|                        |                  |              |   is set to "step".                                         |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| regularizer            | regularizer      |              | This parameter configures the type and the weight of the    | * type: L1, L2, None                           |
|                        | proto config     |              | regularizer to be used during training. The three           | * weight_decay: > 0.0                          |
|                        |                  |              | parameters include:                                         | * scope: "Conv2D,Dense"                        |
|                        |                  |              |                                                              |                                                |
|                        |                  |              | * type (string): The type of the regularizer being used.    |                                                |
|                        |                  |              | * weight_decay (float): The floating point weight of the    |                                                |
|                        |                  |              |   regularizer.                                              |                                                |
|                        |                  |              | * scope (str): Comma separated list of layer types to       |                                                |
|                        |                  |              |   which regularization must be applied. TLT recommends      |                                                |
|                        |                  |              |   using the regularizer with the Conv2D and Dense layers    |                                                |
|                        |                  |              |   of a Deep Neural Network.                                 |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| optimizer              | string           | sgd          | This parameter defines which optimizer to use for           | adam, sgd                                      |
|                        |                  |              | training.                                                   |                                                |
+------------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
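As a concrete illustration of the scheduler parameters above, the commented-out "step" block from the sample spec expands to the following :code:`lr_config`. The values are the sample's, not tuned recommendations:

.. code::

    # Illustrative sketch: a "step" learning rate schedule, using the values
    # suggested in the commented-out portion of the sample spec above.
    lr_config {
      scheduler: "step"
      learning_rate: 0.006
      step_size: 10
      gamma: 0.1
    }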
Specification File for DetectNet_v2
-----------------------------------

.. _specifcation_file_for_detectnet_v2:

To run training, evaluation, and inference for DetectNet_v2, several components need to be configured, each with their own parameters. The :code:`tlt-train` and :code:`tlt-evaluate` commands for a DetectNet_v2 experiment share the same configuration file. The :code:`tlt-infer` command uses a separate configuration file.

The training and inference tools use a specification file for object detection. The specification file for detection training configures these components of the training pipe:

* Model
* BBox ground truth generation
* Post processing module
* Cost function configuration
* Trainer
* Augmentation module
* Evaluator
* Dataloader

Model Config
^^^^^^^^^^^^

.. _model_config:

Core object detection can be configured using the :code:`model_config` option in the spec file. Here's a sample model config to instantiate a resnet18 model with pretrained weights, freeze blocks 0 and 1, and set all shortcuts to projection layers.

.. code::

    # Sample model config to instantiate a resnet18 model with pretrained weights and freeze blocks 0, 1
    # with all shortcuts having projection layers.
    model_config {
      arch: "resnet"
      pretrained_model_file: "/path/to/your/pretrained/model"
      freeze_blocks: 0
      freeze_blocks: 1
      all_projections: True
      num_layers: 18
      use_pooling: False
      use_batch_norm: True
      dropout_rate: 0.0
      training_precision: {
        backend_floatx: FLOAT32
      }
      objective_set: {
        cov {}
        bbox {
          scale: 35.0
          offset: 0.5
        }
      }
    }

The following table describes the :code:`model_config` parameters:
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| **Parameter**          | **Datatype**       | **Default**  | **Description**                                              | **Supported Values**                           |
+========================+====================+==============+==============================================================+================================================+
| all_projections        | bool               | False        | For templates with shortcut connections, this parameter     | True/False (only to be used in resnet          |
|                        |                    |              | defines whether or not all shortcuts are instantiated with  | templates)                                     |
|                        |                    |              | 1x1 projection layers, irrespective of whether there is a   |                                                |
|                        |                    |              | change in stride across the input and output.               |                                                |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| arch                   | string             | resnet       | This defines the architecture of the backbone feature       | * resnet                                       |
|                        |                    |              | extractor to be used to train.                              | * vgg                                          |
|                        |                    |              |                                                              | * mobilenet_v1                                 |
|                        |                    |              |                                                              | * mobilenet_v2                                 |
|                        |                    |              |                                                              | * googlenet                                    |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| num_layers             | int                | 18           | Depth of the feature extractor for scalable templates.      | * resnet: 10, 18, 34, 50, 101                  |
|                        |                    |              |                                                              | * vgg: 16, 19                                  |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| pretrained_model_file  | string             | -            | This parameter defines the path to a pretrained TLT model   | Unix path                                      |
|                        |                    |              | file. If the :code:`load_graph` flag is set to              |                                                |
|                        |                    |              | :code:`false`, it is assumed that only the weights of the   |                                                |
|                        |                    |              | pretrained model file are to be used. In this case, TLT     |                                                |
|                        |                    |              | train constructs the feature extractor graph in the         |                                                |
|                        |                    |              | experiment and loads the weights from the pretrained model  |                                                |
|                        |                    |              | file whose layer names match. Thus, transfer learning       |                                                |
|                        |                    |              | across different resolutions and domains is supported.      |                                                |
|                        |                    |              | Layers that are absent in the pretrained model are          |                                                |
|                        |                    |              | initialized with random weights, and the weight import is   |                                                |
|                        |                    |              | skipped for those layers.                                   |                                                |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| use_pooling            | Boolean            | False        | Choose between using strided convolutions or MaxPooling     | True/False                                     |
|                        |                    |              | while downsampling. When True, MaxPooling is used to        |                                                |
|                        |                    |              | downsample; however, for the object detection network,      |                                                |
|                        |                    |              | NVIDIA recommends setting this to False and using strided   |                                                |
|                        |                    |              | convolutions.                                               |                                                |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| use_batch_norm         | Boolean            | False        | Boolean variable to use batch normalization layers or not.  | True/False                                     |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| objective_set          | Proto Dictionary   | -            | This defines the objectives this network is being trained   | cov {} bbox { scale: 35.0 offset: 0.5 }        |
|                        |                    |              | for. For object detection networks, set it to learn cov     |                                                |
|                        |                    |              | and bbox. These parameters should not be altered for the    |                                                |
|                        |                    |              | current training pipeline.                                  |                                                |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| dropout_rate           | Float              | 0.0          | Probability for dropout.                                    | 0.0 - 0.1                                      |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| training_precision     | Proto Dictionary   | -            | Contains a nested parameter that sets the precision of the  | backend_floatx: FLOAT32                        |
|                        |                    |              | back-end training framework.                                |                                                |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| load_graph             | Boolean            | False        | Flag to define whether to load the graph from the           | True/False                                     |
|                        |                    |              | pretrained model file, or just the weights. For a pruned    |                                                |
|                        |                    |              | model, remember to set this parameter to True. Pruning      |                                                |
|                        |                    |              | modifies the original graph, so the pruned model graph and  |                                                |
|                        |                    |              | the weights need to be imported.                            |                                                |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| freeze_blocks          | float              | -            | This parameter defines which blocks may be frozen from the  | * ResNet: any subset of [0, 1, 2, 3]           |
|                        | (repeated)         |              | instantiated feature extractor template; the valid block    | * VGG: any subset of [1, 2, 3, 4, 5]           |
|                        |                    |              | IDs differ across templates.                                | * MobileNet V1: any subset of [0, ..., 11]     |
|                        |                    |              |                                                              | * MobileNet V2: any subset of [0, ..., 13]     |
|                        |                    |              |                                                              | * GoogLeNet: any subset of [0, ..., 7]         |
|                        |                    |              |                                                              |                                                |
|                        |                    |              |                                                              | (all ranges inclusive)                         |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| freeze_bn              | Boolean            | False        | You can choose to freeze the Batch Normalization layers in  | True/False                                     |
|                        |                    |              | the model during training.                                  |                                                |
+------------------------+--------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
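For example, when retraining a pruned model, the pruned graph must be imported along with its weights, so :code:`load_graph` must be set to True; a minimal sketch, with a placeholder path:

.. code::

    # Illustrative sketch: retraining a pruned model. load_graph is set to True
    # so that the pruned graph, and not just the weights, is imported.
    model_config {
      arch: "resnet"
      num_layers: 18
      load_graph: True
      pretrained_model_file: "/path/to/your/pruned/model"
    }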
BBox Ground Truth Generator
^^^^^^^^^^^^^^^^^^^^^^^^^^^

DetectNet_v2 generates two tensors, **cov** and **bbox**. The image is divided into 16x16 grid cells. The cov tensor (short for coverage tensor) defines the number of grid cells that are covered by an object. The bbox tensor defines the normalized image coordinates of the object, top left (x1, y1) and bottom right (x2, y2), with respect to the grid cell. For best results, you can assume the coverage area to be an ellipse within the bbox label, with the maximum confidence assigned to the cells in the center and coverage decreasing outwards. Each class has its own coverage and bbox tensor, so the shapes of the tensors are:

* cov: Batch_size, Num_classes, image_height/16, image_width/16
* bbox: Batch_size, Num_classes * 4, image_height/16, image_width/16 (where 4 is the number of coordinates per cell)

For example, for a 3-class detector with a 960x544 input, the cov tensor has shape (Batch_size, 3, 34, 60) and the bbox tensor has shape (Batch_size, 12, 34, 60).

Here is a sample rasterizer config for a 3 class detector:
.. code::

    # Sample rasterizer configs to instantiate a 3 class bbox rasterizer
    bbox_rasterizer_config {
      target_class_config {
        key: "car"
        value: {
          cov_center_x: 0.5
          cov_center_y: 0.5
          cov_radius_x: 0.4
          cov_radius_y: 0.4
          bbox_min_radius: 1.0
        }
      }
      target_class_config {
        key: "cyclist"
        value: {
          cov_center_x: 0.5
          cov_center_y: 0.5
          cov_radius_x: 0.4
          cov_radius_y: 0.4
          bbox_min_radius: 1.0
        }
      }
      target_class_config {
        key: "pedestrian"
        value: {
          cov_center_x: 0.5
          cov_center_y: 0.5
          cov_radius_x: 0.4
          cov_radius_y: 0.4
          bbox_min_radius: 1.0
        }
      }
      deadzone_radius: 0.67
    }

The bbox_rasterizer has the following configurable parameters:

+----------------------+--------------------+--------------+--------------------------------------------------------------+--------------------------------+
| **Parameter**        | **Datatype**       | **Default**  | **Description**                                              | **Supported Values**           |
+======================+====================+==============+==============================================================+================================+
| deadzone_radius      | float              | 0.67         | The area to be considered dormant (or an area of no         | 0 - 1.0                        |
|                      |                    |              | bboxes) around the ellipse of an object. This is            |                                |
|                      |                    |              | particularly useful in cases of overlapping objects, so     |                                |
|                      |                    |              | that foreground objects and background objects are not      |                                |
|                      |                    |              | confused.                                                   |                                |
+----------------------+--------------------+--------------+--------------------------------------------------------------+--------------------------------+
| target_class_config  | proto dictionary   |              | This is a nested configuration field that defines the       | * cov_center_x: 0.0 - 1.0      |
|                      |                    |              | coverage region for an object of a given class. This field  | * cov_center_y: 0.0 - 1.0      |
|                      |                    |              | is repeated for each class. The configurable parameters of  | * cov_radius_x: 0.0 - 1.0      |
|                      |                    |              | the target_class_config include:                            | * cov_radius_y: 0.0 - 1.0      |
|                      |                    |              |                                                              | * bbox_min_radius: 0.0 - 1.0   |
|                      |                    |              | * cov_center_x (float): x-coordinate of the center of the   |                                |
|                      |                    |              |   object.                                                   |                                |
|                      |                    |              | * cov_center_y (float): y-coordinate of the center of the   |                                |
|                      |                    |              |   object.                                                   |                                |
|                      |                    |              | * cov_radius_x (float): x-radius of the coverage ellipse.   |                                |
|                      |                    |              | * cov_radius_y (float): y-radius of the coverage ellipse.   |                                |
|                      |                    |              | * bbox_min_radius (float): minimum radius of the coverage   |                                |
|                      |                    |              |   region to be drawn for boxes.                             |                                |
+----------------------+--------------------+--------------+--------------------------------------------------------------+--------------------------------+

Post processor
^^^^^^^^^^^^^^

The post processor module generates renderable bounding boxes from the raw detection output. The process includes:

* Filtering valid detections by thresholding objects using the confidence value in the coverage tensor
* Clustering the raw filtered predictions using DBSCAN to produce the final rendered bounding boxes
* Filtering out weaker clusters based on the final confidence threshold derived from the candidate boxes that get grouped into a cluster

Here is an example of the definition of the postprocessor for a 3 class network learning for **car**, **cyclist**, and **pedestrian**:
.. code::

    postprocessing_config {
      target_class_config {
        key: "car"
        value: {
          clustering_config {
            coverage_threshold: 0.005
            dbscan_eps: 0.15
            dbscan_min_samples: 0.05
            minimum_bounding_box_height: 20
          }
        }
      }
      target_class_config {
        key: "cyclist"
        value: {
          clustering_config {
            coverage_threshold: 0.005
            dbscan_eps: 0.15
            dbscan_min_samples: 0.05
            minimum_bounding_box_height: 20
          }
        }
      }
      target_class_config {
        key: "pedestrian"
        value: {
          clustering_config {
            coverage_threshold: 0.005
            dbscan_eps: 0.15
            dbscan_min_samples: 0.05
            minimum_bounding_box_height: 20
          }
        }
      }
    }

This section defines parameters that configure the post processor. For each class you can train for, the :code:`postprocessing_config` has a :code:`target_class_config` element, which defines the clustering parameters for this class. The parameters for each target class include:

+----------------+--------------------------+--------------+--------------------------------------------------------------+----------------------------------------------------+
| **Parameter**  | **Datatype**             | **Default**  | **Description**                                              | **Supported Values**                               |
+================+==========================+==============+==============================================================+====================================================+
| key            | string                   | -            | The name of the class for which the post processor module   | The network object class name, as mentioned        |
|                |                          |              | is being configured.                                        | in the cost_function_config.                       |
+----------------+--------------------------+--------------+--------------------------------------------------------------+----------------------------------------------------+
| value          | clustering_config proto  | -            | The nested clustering_config proto parameter that           | Encapsulated object with parameters defined        |
|                |                          |              | configures the postprocessor module. The parameters for     | below.                                             |
|                |                          |              | this module are defined in the next table.                  |                                                    |
+----------------+--------------------------+--------------+--------------------------------------------------------------+----------------------------------------------------+

The :code:`clustering_config` element configures the clustering block for this class. Here are the parameters for this element:

+------------------------------+--------------+--------------+--------------------------------------------------------------+--------------------------+
| **Parameter**                | **Datatype** | **Default**  | **Description**                                              | **Supported Values**     |
+==============================+==============+==============+==============================================================+==========================+
| coverage_threshold           | float        | -            | The minimum threshold of the coverage tensor output to be   | 0.0 - 1.0                |
|                              |              |              | considered a valid candidate box for clustering. The four   |                          |
|                              |              |              | coordinates from the bbox tensor at the corresponding       |                          |
|                              |              |              | indices are passed for clustering.                          |                          |
+------------------------------+--------------+--------------+--------------------------------------------------------------+--------------------------+
| dbscan_eps                   | float        | -            | The maximum distance between two samples for one to be      | 0.0 - 1.0                |
|                              |              |              | considered as in the neighborhood of the other. This is     |                          |
|                              |              |              | not a maximum bound on the distances of points within a     |                          |
|                              |              |              | cluster. The greater the eps, the more boxes are grouped    |                          |
|                              |              |              | together.                                                   |                          |
+------------------------------+--------------+--------------+--------------------------------------------------------------+--------------------------+
| dbscan_min_samples           | float        | -            | The total weight in a neighborhood for a point to be        | 0.0 - 1.0                |
|                              |              |              | considered a core point. This includes the point itself.    |                          |
+------------------------------+--------------+--------------+--------------------------------------------------------------+--------------------------+
| minimum_bounding_box_height  | int          | -            | Minimum height in pixels to be considered a valid           | 0 - input image height   |
|                              |              |              | detection post clustering.                                  |                          |
+------------------------------+--------------+--------------+--------------------------------------------------------------+--------------------------+
Cost Function
^^^^^^^^^^^^^

This section helps you configure the cost function to include the classes that you are training for. For each class you want to train, add a new entry of the target classes to the spec file. NVIDIA recommends not changing the parameters within the spec file for best performance with these classes. The other parameters remain unchanged here.

.. code::

    cost_function_config {
      target_classes {
        name: "car"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
          name: "cov"
          initial_weight: 1.0
          weight_target: 1.0
        }
        objectives {
          name: "bbox"
          initial_weight: 10.0
          weight_target: 10.0
        }
      }
      target_classes {
        name: "cyclist"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
          name: "cov"
          initial_weight: 1.0
          weight_target: 1.0
        }
        objectives {
          name: "bbox"
          initial_weight: 10.0
          weight_target: 1.0
        }
      }
      target_classes {
        name: "pedestrian"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
          name: "cov"
          initial_weight: 1.0
          weight_target: 1.0
        }
        objectives {
          name: "bbox"
          initial_weight: 10.0
          weight_target: 10.0
        }
      }
      enable_autoweighting: True
      max_objective_weight: 0.9999
      min_objective_weight: 0.0001
    }
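To extend the sample above with a fourth class, replicate a :code:`target_classes` block with the new class name, keeping the recommended weights unchanged; a sketch for a hypothetical "truck" class:

.. code::

    # Illustrative sketch: adding a hypothetical "truck" class, keeping the
    # recommended weighting parameters unchanged.
    target_classes {
      name: "truck"
      class_weight: 1.0
      coverage_foreground_weight: 0.05
      objectives {
        name: "cov"
        initial_weight: 1.0
        weight_target: 1.0
      }
      objectives {
        name: "bbox"
        initial_weight: 10.0
        weight_target: 10.0
      }
    }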
Trainer
^^^^^^^

.. _trainer:

Here's a sample training_config block to configure a detectnet_v2 trainer:

.. code::

    training_config {
      batch_size_per_gpu: 16
      num_epochs: 80
      learning_rate {
        soft_start_annealing_schedule {
          min_learning_rate: 5e-6
          max_learning_rate: 5e-4
          soft_start: 0.1
          annealing: 0.7
        }
      }
      regularizer {
        type: L1
        weight: 3e-9
      }
      optimizer {
        adam {
          epsilon: 1e-08
          beta1: 0.9
          beta2: 0.999
        }
      }
      cost_scaling {
        enabled: False
        initial_exponent: 20.0
        increment: 0.005
        decrement: 1.0
      }
    }

The following table describes the parameters used to configure the trainer:

+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| **Parameter**        | **Datatype**     | **Default**  | **Description**                                              | **Supported Values**                           |
+======================+==================+==============+==============================================================+================================================+
| batch_size_per_gpu   | int              | 32           | This parameter defines the number of images per batch per   | >1                                             |
|                      |                  |              | GPU.                                                        |                                                |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| num_epochs           | int              | 120          | This parameter defines the total number of epochs to run    |                                                |
|                      |                  |              | the experiment.                                             |                                                |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| enable_qat           | bool             | False        | This parameter enables training a model using               | True, False                                    |
|                      |                  |              | Quantization Aware Training (QAT). For more information     |                                                |
|                      |                  |              | about QAT, see Quantization Aware Training.                 |                                                |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| learning rate        | learning rate    | soft_start   | This parameter configures the learning rate schedule for    | soft_start: 0.0 - 1.0                          |
|                      | scheduler proto  | _annealing   | the trainer. Currently, detectnet_v2 only supports the      |                                                |
|                      |                  | _schedule    | soft-start annealing learning rate schedule, which may be   | annealing: 0.0 - 1.0, and greater than         |
|                      |                  |              | configured using the following parameters:                  | soft_start                                     |
|                      |                  |              |                                                              |                                                |
|                      |                  |              | * soft_start (float): Defines the time to ramp up the       | A sample lr plot for a soft start of 0.3       |
|                      |                  |              |   learning rate from the minimum learning rate to the       | and an annealing of 0.7 is shown in the        |
|                      |                  |              |   maximum learning rate.                                    | figure below.                                  |
|                      |                  |              | * annealing (float): Defines the time to cool down the      |                                                |
|                      |                  |              |   learning rate from the maximum learning rate to the       |                                                |
|                      |                  |              |   minimum learning rate.                                    |                                                |
|                      |                  |              | * minimum_learning_rate (float): Minimum learning rate in   |                                                |
|                      |                  |              |   the learning rate schedule.                               |                                                |
|                      |                  |              | * maximum_learning_rate (float): Maximum learning rate in   |                                                |
|                      |                  |              |   the learning rate schedule.                               |                                                |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| regularizer          | regularizer      |              | This parameter configures the type and the weight of the    | The supported values for type are:            |
|                      | proto config     |              | regularizer to be used during training. The two             |                                                |
|                      |                  |              | parameters include:                                         | * NO_REG                                       |
|                      |                  |              |                                                              | * L1                                           |
|                      |                  |              | * type: The type of the regularizer being used.             | * L2                                           |
|                      |                  |              | * weight: The floating point weight of the regularizer.     |                                                |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| optimizer            | optimizer        |              | This parameter defines which optimizer to use for           |                                                |
|                      | proto config     |              | training, and the parameters to configure it, namely:       |                                                |
|                      |                  |              |                                                              |                                                |
|                      |                  |              | * epsilon (float): A very small number to prevent any       |                                                |
|                      |                  |              |   division by zero in the implementation.                   |                                                |
|                      |                  |              | * beta1 (float)                                             |                                                |
|                      |                  |              | * beta2 (float)                                             |                                                |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| cost_scaling         | costscaling      |              | This parameter enables cost scaling during training.        | cost_scaling { enabled: False                  |
|                      | _config          |              | Please leave this parameter untouched currently for the     | initial_exponent: 20.0 increment: 0.005        |
|                      |                  |              | detectnet_v2 training pipe.                                 | decrement: 1.0 }                               |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+
| checkpoint_interval  | float            | 0/10         | The interval (in epochs) at which tlt-train saves           | 0 to num_epochs                                |
|                      |                  |              | intermediate models.                                        |                                                |
+----------------------+------------------+--------------+--------------------------------------------------------------+------------------------------------------------+

Detectnet_v2 currently supports the soft-start annealing learning rate schedule. The learning rate, when plotted as a function of the training progress (0.0, 1.0), results in the following curve.

.. image:: ../content/learning_rate.png

In this experiment, the soft start was set as 0.3 and annealing as 0.7, with the minimum learning rate as 5e-6 and the maximum learning rate, or base_lr, as 5e-4.

.. Note:: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization makes the network weights easier to prune. After pruning, when retraining the networks, NVIDIA recommends turning regularization off by setting the regularization type to :code:`NO_REG`.
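Following that note, a retraining spec would switch the regularizer type; a minimal sketch (whether the weight field may be dropped entirely when type is NO_REG is an assumption here):

.. code::

    # Illustrative sketch: turning regularization off when retraining a pruned
    # model, per the note above. (Assumption: weight may be omitted for NO_REG.)
    regularizer {
      type: NO_REG
    }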
Augmentation Module
^^^^^^^^^^^^^^^^^^^

.. _augmentation_module:

The augmentation module provides some basic pre-processing and augmentation when training. Here is a sample :code:`augmentation_config` element:

.. code::

    # Sample augmentation config
    augmentation_config {
      preprocessing {
        output_image_width: 960
        output_image_height: 544
        output_image_channel: 3
        min_bbox_width: 1.0
        min_bbox_height: 1.0
      }
      spatial_augmentation {
        hflip_probability: 0.5
        vflip_probability: 0.0
        zoom_min: 1.0
        zoom_max: 1.0
        translate_max_x: 8.0
        translate_max_y: 8.0
      }
      color_augmentation {
        color_shift_stddev: 0.0
        hue_rotation_max: 25.0
        saturation_shift_max: 0.2
        contrast_scale_max: 0.1
        contrast_center: 0.5
      }
    }

.. Note:: If the output image height and the output image width of the preprocessing block do not match the dimensions of the input image, the dataloader either pads with zeros or crops to fit the output resolution. It does not resize the input images and labels to fit.

The :code:`augmentation_config` contains three elements:

:code:`preprocessing`: This nested field configures the input image and ground truth label pre-processing module. It sets the shape of the input tensor to the network. The ground truth labels are pre-processed to meet the dimensions of the input image tensors.
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| **Parameter**          | **Datatype** | **Default/Suggested value**  | **Description**                                              | **Supported Values**       |
+========================+==============+==============================+==============================================================+============================+
| output_image_width     | int          | --                           | The width of the augmentation output. This is the same as   | >480                       |
|                        |              |                              | the width of the network input and must be a multiple of    |                            |
|                        |              |                              | 16.                                                         |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| output_image_height    | int          | --                           | The height of the augmentation output. This is the same as  | >272                       |
|                        |              |                              | the height of the network input and must be a multiple of   |                            |
|                        |              |                              | 16.                                                         |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| output_image_channel   | int          | 1, 3                         | The channel depth of the augmentation output. This is the   | 1,3                        |
|                        |              |                              | same as the channel depth of the network input. Currently,  |                            |
|                        |              |                              | 1-channel input is not recommended for datasets with jpg    |                            |
|                        |              |                              | images. For png images, both 3 channel RGB and 1 channel    |                            |
|                        |              |                              | monochrome images are supported.                            |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| min_bbox_height        | float        |                              | The minimum height of the object labels to be considered    | 0 - output_image_height    |
|                        |              |                              | for training.                                               |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| min_bbox_width         | float        |                              | The minimum width of the object labels to be considered     | 0 - output_image_width     |
|                        |              |                              | for training.                                               |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| crop_right             | int          |                              | The right boundary of the crop to be extracted from the     | 0 - input image width      |
|                        |              |                              | original image.                                             |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| crop_left              | int          |                              | The left boundary of the crop to be extracted from the      | 0 - input image width      |
|                        |              |                              | original image.                                             |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| crop_top               | int          |                              | The top boundary of the crop to be extracted from the       | 0 - input image height     |
|                        |              |                              | original image.                                             |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| crop_bottom            | int          |                              | The bottom boundary of the crop to be extracted from the    | 0 - input image height     |
|                        |              |                              | original image.                                             |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| scale_height           | float        |                              | The floating point factor to scale the height of the        | > 0.0                      |
|                        |              |                              | cropped images.                                             |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
| scale_width            | float        |                              | The floating point factor to scale the width of the         | > 0.0                      |
|                        |              |                              | cropped images.                                             |                            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+----------------------------+
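As an illustration of the crop and scale parameters, the following sketch crops a 1248x384 region from the top-left corner of the original image and scales it to a 960x544 network input; all values here are hypothetical:

.. code::

    # Illustrative sketch: crop a 1248x384 region, then scale it to 960x544.
    # All values are hypothetical.
    preprocessing {
      output_image_width: 960
      output_image_height: 544
      output_image_channel: 3
      crop_left: 0
      crop_top: 0
      crop_right: 1248
      crop_bottom: 384
      scale_width: 0.769    # approximately 960 / 1248
      scale_height: 1.417   # approximately 544 / 384
      min_bbox_width: 1.0
      min_bbox_height: 1.0
    }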
:code:`spatial_augmentation`: This module supports basic spatial augmentation such as flip, zoom, and translate, which may be configured.

+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+
| **Parameter**      | **Datatype** | **Default/Suggested value**  | **Description**                                              | **Supported Values**         |
+====================+==============+==============================+==============================================================+==============================+
| hflip_probability  | float        | 0.5                          | The probability to flip an input image horizontally.        | 0.0 - 1.0                    |
+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+
| vflip_probability  | float        | 0.0                          | The probability to flip an input image vertically.          | 0.0 - 1.0                    |
+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+
| zoom_min           | float        | 1.0                          | The minimum zoom scale of the input image.                  | > 0.0                        |
+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+
| zoom_max           | float        | 1.0                          | The maximum zoom scale of the input image.                  | > 0.0                        |
+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+
| translate_max_x    | float        | 8.0                          | The maximum translation to be added across the x axis.      | 0.0 - output_image_width     |
+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+
| translate_max_y    | float        | 8.0                          | The maximum translation to be added across the y axis.      | 0.0 - output_image_height    |
+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+
| rotate_rad_max     | float        | 0.69                         | The angle of rotation to be applied to the images and the   | > 0.0 (modulo 2*pi)          |
|                    |              |                              | training labels. The range is defined as                    |                              |
|                    |              |                              | [-rotate_rad_max, rotate_rad_max].                          |                              |
+--------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------------+

:code:`color_augmentation`: This module configures the color space transformations, such as color shift, hue rotation, saturation shift, and contrast adjustment.

+------------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------+
| **Parameter**          | **Datatype** | **Default/Suggested value**  | **Description**                                              | **Supported Values**   |
+========================+==============+==============================+==============================================================+========================+
| color_shift_stddev     | float        | 0.0                          | The standard deviation value for the color shift.           | 0.0 - 1.0              |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------+
| hue_rotation_max       | float        | 25.0                         | The maximum rotation angle for the hue rotation matrix.     | 0.0 - 360.0            |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------+
| saturation_shift_max   | float        | 0.2                          | The maximum shift that changes the saturation. A value of   | 0.0 - 1.0              |
|                        |              |                              | 1.0 means no change in saturation shift.                    |                        |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------+
| contrast_scale_max     | float        | 0.1                          | The slope of the contrast as rotated around the provided    | 0.0 - 1.0              |
|                        |              |                              | center. A value of 0.0 leaves the contrast unchanged.       |                        |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------+
| contrast_center        | float        | 0.5                          | The center around which the contrast is rotated. Ideally,   | 0.5                    |
|                        |              |                              | this is set to half of the maximum pixel value. (Since our  |                        |
|                        |              |                              | input images are scaled between 0 and 1.0, you can set      |                        |
|                        |              |                              | this value to 0.5.)                                         |                        |
+------------------------+--------------+------------------------------+--------------------------------------------------------------+------------------------+
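Taken together, the value semantics in the table above suggest settings that leave colors untouched; a hedged sketch of a :code:`color_augmentation` block that effectively disables color augmentation (this reading of the per-field semantics is an assumption based on the table, not a documented recipe):

.. code::

    # Illustrative sketch: color augmentation settings intended to leave the
    # image unchanged, per the value semantics in the table above.
    color_augmentation {
      color_shift_stddev: 0.0    # no color shift
      hue_rotation_max: 0.0      # no hue rotation
      saturation_shift_max: 1.0  # per the table, 1.0 means no saturation change
      contrast_scale_max: 0.0    # per the table, 0.0 leaves contrast unchanged
      contrast_center: 0.5
    }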
   Here the images and labels are cropped and scaled based on the parameters in the :code:`preprocessing` config. The boundaries of the crop extracted from the original image and labels are defined by the :code:`crop_left`, :code:`crop_right`, :code:`crop_top` and :code:`crop_bottom` parameters. This cropped data is then scaled by the scale factors defined by :code:`scale_height` and :code:`scale_width`. The transformation matrices for these operations are computed globally and do not change per image.

2. The tensors generated by the pre-processing block are then passed through a pipeline of random augmentations in the spatial and color domains. The spatial augmentations are applied to both the images and the label coordinates, while the color augmentations are applied only to the images. To apply color augmentations, the :code:`output_image_channel` parameter must be set to 3; color augmentations are not applied to monochrome tensors. The spatial and color transformation matrices are computed per image, based on a uniform distribution over the min and max ranges defined by the :code:`spatial_augmentation` and :code:`color_augmentation` config parameters.

3. Once the spatially and color augmented net input tensors are generated, the output is padded with zeros or clipped along the right and bottom edges of the image to fit the output dimensions defined in the :code:`preprocessing` config.
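For reference, the sketch below shows the three modules side by side with purely illustrative values taken from the parameter tables above. The exact enclosing element varies by network (the FasterRCNN sample later in this chapter nests these modules under :code:`data_augmentation`), so treat this as a shape reference rather than a complete spec:

.. code::

   # Illustrative values only -- tune these for your dataset.
   preprocessing {
     output_image_width: 1248
     output_image_height: 384
     output_image_channel: 3
   }
   spatial_augmentation {
     hflip_probability: 0.5
     zoom_min: 0.7
     zoom_max: 1.0
     translate_max_x: 8
     translate_max_y: 8
     rotate_rad_max: 0.69
   }
   color_augmentation {
     color_shift_stddev: 0.0
     hue_rotation_max: 25.0
     saturation_shift_max: 0.2
     contrast_scale_max: 0.1
     contrast_center: 0.5
   }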
Configuring the Evaluator
^^^^^^^^^^^^^^^^^^^^^^^^^

The evaluator in the detection training pipeline can be configured using the :code:`evaluation_config` parameters. The following is an example :code:`evaluation_config` element:

.. code::

   # Sample evaluation config to run evaluation in integrate mode for the given 3 class model,
   # at every 10th epoch starting from epoch 1.
   evaluation_config {
     average_precision_mode: INTEGRATE
     validation_period_during_training: 10
     first_validation_epoch: 1
     minimum_detection_ground_truth_overlap {
       key: "car"
       value: 0.7
     }
     minimum_detection_ground_truth_overlap {
       key: "person"
       value: 0.5
     }
     minimum_detection_ground_truth_overlap {
       key: "bicycle"
       value: 0.5
     }
     evaluation_box_config {
       key: "car"
       value {
         minimum_height: 4
         maximum_height: 9999
         minimum_width: 4
         maximum_width: 9999
       }
     }
     evaluation_box_config {
       key: "person"
       value {
         minimum_height: 4
         maximum_height: 9999
         minimum_width: 4
         maximum_width: 9999
       }
     }
     evaluation_box_config {
       key: "bicycle"
       value {
         minimum_height: 4
         maximum_height: 9999
         minimum_width: 4
         maximum_width: 9999
       }
     }
   }

The following tables describe the parameters used to configure evaluation:

+----------------------------------------+------------------+-----------------------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | **Parameter** | **Datatype** | **Default/Suggested value** | **Description** | **Supported Values** | +----------------------------------------+------------------+-----------------------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | average_precision_mode | | SAMPLE | The mode in which the average precision for each class is calculated. | * SAMPLE: The AP calculation mode using 11 evenly spaced recall points, as used in the Pascal VOC 2007 challenge. * INTEGRATE: The AP calculation mode as used in the Pascal VOC 2011 challenge. | +----------------------------------------+------------------+-----------------------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | validation_period_during_training | int | 10 | The interval, in epochs, at which evaluation is run during training. Evaluation is run at this interval starting from the value of the first validation epoch parameter specified below. | 1 - total number of epochs | +----------------------------------------+------------------+-----------------------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | first_validation_epoch | int | 30 | The first epoch at which to start running validation. Ideally, wait for at least 20-30% of the total number of epochs before starting evaluation, since the predictions in the initial epochs will be fairly inaccurate; too many candidate boxes may be sent to clustering, which can slow the evaluation down. | 1 - total number of epochs | +----------------------------------------+------------------+-----------------------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | minimum_detection_ground_truth_overlap | proto dictionary | | The minimum IOU between a ground truth box and a predicted box, after clustering, for the prediction to count as a valid detection. This parameter is a repeatable dictionary, and a separate one must be defined for every class. The members include: * key (string): class name * value (float): intersection over union value | | +----------------------------------------+------------------+-----------------------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_box_config | proto dictionary | | This nested configuration field configures the minimum and maximum box dimensions for a box to be considered a valid ground truth or prediction for the AP calculation. | | +----------------------------------------+------------------+-----------------------------+---------------------------------------------------------------------------------+----------------------------------------------------------------------------+
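For comparison, switching to the 11-point Pascal VOC 2007 metric only requires changing the mode. A hypothetical variation of the sample above:

.. code::

   # Hypothetical variant: 11-point (SAMPLE) AP metric, later first validation.
   evaluation_config {
     average_precision_mode: SAMPLE
     validation_period_during_training: 10
     first_validation_epoch: 30
     # per-class minimum_detection_ground_truth_overlap and
     # evaluation_box_config elements as in the sample above
   }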
The :code:`evaluation_box_config` field has these configurable inputs.

+----------------+--------------+-----------------------------+------------------------------------------------------------------------+-------------------------------------+ | **Parameter** | **Datatype** | **Default/Suggested value** | **Description** | **Supported Values** | +----------------+--------------+-----------------------------+------------------------------------------------------------------------+-------------------------------------+ | minimum_height | float | 10 | Minimum height in pixels for a valid ground truth and prediction bbox. | 0 - model image height | +----------------+--------------+-----------------------------+------------------------------------------------------------------------+-------------------------------------+ | minimum_width | float | 10 | Minimum width in pixels for a valid ground truth and prediction bbox. | 0 - model image width | +----------------+--------------+-----------------------------+------------------------------------------------------------------------+-------------------------------------+ | maximum_height | float | 9999 | Maximum height in pixels for a valid ground truth and prediction bbox. | minimum_height - model image height | +----------------+--------------+-----------------------------+------------------------------------------------------------------------+-------------------------------------+ | maximum_width | float | 9999 | Maximum width in pixels for a valid ground truth and prediction bbox. | minimum_width - model image width | +----------------+--------------+-----------------------------+------------------------------------------------------------------------+-------------------------------------+

Dataloader
^^^^^^^^^^

.. _dataloader:

The dataloader defines the path to the data you want to train on and the class mapping for the classes in the dataset that the network is to be trained for. The following is an example :code:`dataset_config` element:

.. code::

   dataset_config {
     data_sources: {
       tfrecords_path: ""
       image_directory_path: ""
     }
     image_extension: "jpg"
     target_class_mapping {
       key: "car"
       value: "car"
     }
     target_class_mapping {
       key: "automobile"
       value: "car"
     }
     target_class_mapping {
       key: "heavy_truck"
       value: "car"
     }
     target_class_mapping {
       key: "person"
       value: "pedestrian"
     }
     target_class_mapping {
       key: "rider"
       value: "cyclist"
     }
     validation_fold: 0
   }

In this example, the tfrecords are assumed to be multi-fold, and the fold number to validate on is defined. However, evaluation doesn't necessarily have to be run on a split of the training set. Many ML engineers choose to evaluate the model on a well-chosen evaluation dataset that is exclusive of the training dataset. If you prefer to run evaluation on a different validation dataset rather than a split of the training dataset, convert that dataset into tfrecords as well, using the tlt-dataset-convert tool as mentioned `here`_, and use the :code:`validation_data_source` field in the :code:`dataset_config` to define it. In this case, do not forget to remove the :code:`validation_fold` field from the spec. When generating the TFRecords for evaluation by using the :code:`validation_data_source` field, please review the notes :ref:`here`.

.. _here: https://docs.nvidia.com

.. code::

   validation_data_source: {
     tfrecords_path: "<path to tfrecords root>/<tfrecords validation pattern>"
     image_directory_path: "<path to validation data root>"
   }

The parameters in :code:`dataset_config` are defined as follows:

* :code:`data_sources`: Captures the path to the TFrecords to train on. This field contains two parameters:

  * :code:`tfrecords_path`: The path to the individual TFrecords files. This path supports UNIX-style pathname pattern expansion, so a common pathname pattern that captures all the tfrecords files in that directory can be used.
  * :code:`image_directory_path`: The path to the training data root from which the tfrecords were generated.

* :code:`image_extension`: The extension of the images to be used.
* :code:`target_class_mapping`: This parameter maps the class names in the tfrecords to the target classes to be trained in the network.
  An element is defined for every source-class to target-class mapping. This field is included with the intention of grouping similar class objects under one umbrella; for example, car, van, and heavy_truck may be grouped under automobile (see the sketch below). The "key" field is the value of the class name in the tfrecords file, and the "value" field corresponds to the value that the network is expected to learn.
* :code:`validation_fold`: In the case of an n-fold tfrecords, you define the index of the fold to use for validation. For *sequencewise* validation, choose the validation fold in the range [0, N-1]. For *random split* partitioning, force the validation fold index to 0, as the tfrecords are just 2-fold.

.. Note:: The class names key in the target_class_mapping must be identical to the one shown in the dataset converter log, so that the correct classes are picked up for training.
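As an illustration of the grouping described above, the hypothetical mappings below collapse car, van, and heavy_truck into a single automobile class, using the same syntax as the :code:`dataset_config` sample:

.. code::

   # Hypothetical grouping for illustration only.
   target_class_mapping {
     key: "car"
     value: "automobile"
   }
   target_class_mapping {
     key: "van"
     value: "automobile"
   }
   target_class_mapping {
     key: "heavy_truck"
     value: "automobile"
   }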
Specification File for Inference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This spec file configures the tlt-infer tool of detectnet to generate valid bbox predictions. The inference tool consists of two blocks, namely the inferencer and the bbox handler. The inferencer instantiates the model object and the pre-processing pipe, while the bbox handler handles the post-processing, the rendering of bounding boxes, and the serialization to KITTI format output labels.

Inferencer
**********

The inferencer instantiates a model object that generates the raw predictions from the trained model. The model may be defined to run inference in the TLT backend or the TensorRT backend. A sample :code:`inferencer_config` element for the inferencer spec is defined here:

.. code::

   inferencer_config{
     # Defining target class names for the experiment.
     # Note: This must be mentioned in order of the network's classes.
     target_classes: "car"
     target_classes: "cyclist"
     target_classes: "pedestrian"
     # Inference dimensions.
     image_width: 1248
     image_height: 384
     # Must match what the model was trained for.
     image_channels: 3
     batch_size: 16
     gpu_index: 0
     # model handler config
     tensorrt_config{
       parser: ETLT
       etlt_model: "/path/to/model.etlt"
       backend_data_type: INT8
       save_engine: true
       trt_engine: "/path/to/trt/engine/file"
       calibrator_config{
         calibration_cache: "/path/to/calibration/cache"
         n_batches: 10
         batch_size: 16
       }
     }
   }

The :code:`inferencer_config` parameters are explained in the table below.

+-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | **Parameter** | **Datatype** | **Default/Suggested value** | **Description** | **Supported Values** | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | target_classes | String (repeated) | None | The names of the target classes the model should output. For a multi-class model this parameter is repeated N times. The number of classes must equal the number of classes in the costfunction_config of the training config file, and the order must be the same. | For example, for the 3 class kitti model it will be: * car * cyclist * pedestrian | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | batch_size | int | 1 | The number of images per batch of inference. | Max number of images that can fit in 1 GPU | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | image_height | int | 384 | The height of the image in pixels at which the model will be inferred. | >16 | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | image_width | int | 1248 | The width of the image in pixels at which the model will be inferred. | >16 | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | image_channels | int | 3 | The number of channels per image. | 1,3 | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | gpu_index | int | 0 | The index of the GPU to run inference on. This applies only to TLT inference; for TensorRT inference, the GPU of choice defaults to 0. | | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | tensorrt_config | TensorRTConfig | None | Proto config to instantiate a TensorRT object. | | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+ | tlt_config | TLTConfig | None | Proto config to instantiate a TLT model object. | | +-----------------+-------------------+-----------------------------+-------------------------------------------------------------------------------------------------+-----------------------------------------------+

As mentioned earlier, the tlt-infer tool is capable of running inference using the native TLT backend and the TensorRT backend. They can be configured by using the tensorrt_config proto element or the tlt_config proto element, respectively. You may use only one of the two in a single spec file.
The definitions of the two model objects are:

+---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+ | **Parameter** | **Datatype** | **Default/Suggested value** | **Description** | **Supported Values** | +---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+ | parser | enum | ETLT | The TensorRT parser to be invoked. Only the ETLT parser is supported. | ETLT | +---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+ | etlt_model | string | None | The path to the exported etlt model file. | Any existing etlt file path. | +---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+ | backend_data_type | enum | FP32 | The data type of the backend TensorRT inference engine. For int8 mode, be sure to specify the calibration_cache. | FP32 FP16 INT8 | +---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+ | save_engine | bool | False | Flag to save a TensorRT engine from the input etlt file. This saves initialization time if inference needs to be run on the same etlt file again and no changes need to be made to the inferencer object. | True, False | +---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+ | trt_engine | string | None | The path to the TensorRT engine file. This acts as an I/O parameter. If the path defined here is not an engine file, the tlt-infer tool creates a new TensorRT engine from the etlt file. If an engine already exists there, the tool re-instantiates the inferencer from the engine defined here. | UNIX path string | +---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+ | calibrator_config | CalibratorConfig Proto | None | This is a required parameter when running in the int8 inference mode. This proto object contains parameters used to define a calibrator object, namely calibration_cache: the path to the calibration cache file generated using tlt-export. | | +---------------------+------------------------+-----------------------------+-----------------------------------------------------------------------------------------+------------------------------+

TLT_Config
**********

+---------------+--------------+-----------------------------+----------------------------------+----------------------+ | **Parameter** | **Datatype** | **Default/Suggested value** | **Description** | **Supported Values** | +---------------+--------------+-----------------------------+----------------------------------+----------------------+ | model | string | None | The path to the .tlt model file. | | +---------------+--------------+-----------------------------+----------------------------------+----------------------+

.. Note:: Since detectnet is a fully convolutional neural net, the model can be inferred at a different resolution than the one it was trained at. The input dims of the network will be overridden to run inference at this resolution if it differs from the training resolution. There may be some regression in accuracy when running inference at a different resolution, since the convolutional kernels don't see the object features at this shape.
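For the TLT backend, the :code:`tlt_config` element only needs the path to the .tlt model; a minimal sketch (with a placeholder path) would replace the :code:`tensorrt_config` element in the :code:`inferencer_config` sample above:

.. code::

   # Minimal sketch of the TLT backend model handler; the path is a placeholder.
   tlt_config{
     model: "/path/to/model.tlt"
   }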
Bbox Handler
************

The bbox handler takes care of post-processing the raw outputs from the inferencer. It performs the following steps:

1. Thresholding the raw outputs to define the grid cells where detections may be present, per class.
2. Reconstructing the image space coordinates from the raw coordinates of the inferencer.
3. Clustering the raw thresholded predictions.
4. Filtering the clustered predictions per class.
5. Rendering the final bounding boxes on the image in its input dimensions and serializing them to KITTI format metadata.

A sample :code:`bbox_handler_config` element is defined below.

.. code::

   bbox_handler_config{
     kitti_dump: true
     disable_overlay: false
     overlay_linewidth: 2
     classwise_bbox_handler_config{
       key:"car"
       value: {
         confidence_model: "aggregate_cov"
         output_map: "car"
         confidence_threshold: 0.9
         bbox_color{
           R: 0
           G: 255
           B: 0
         }
         clustering_config{
           coverage_threshold: 0.00
           dbscan_eps: 0.3
           dbscan_min_samples: 0.05
           minimum_bounding_box_height: 4
         }
       }
     }
     classwise_bbox_handler_config{
       key:"default"
       value: {
         confidence_model: "aggregate_cov"
         confidence_threshold: 0.9
         bbox_color{
           R: 255
           G: 0
           B: 0
         }
         clustering_config{
           coverage_threshold: 0.00
           dbscan_eps: 0.3
           dbscan_min_samples: 0.05
           minimum_bounding_box_height: 4
         }
       }
     }
   }
The parameters to configure the bbox handler are defined below.

+--------------------------------+------------------------------------+-----------------------------+------------------------------------------------------------------------------------------+----------------------+ | **Parameter** | **Datatype** | **Default/Suggested value** | **Description** | **Supported Values** | +--------------------------------+------------------------------------+-----------------------------+------------------------------------------------------------------------------------------+----------------------+ | kitti_dump | bool | false | Flag to enable saving the final output predictions per image in KITTI format. | true, false | +--------------------------------+------------------------------------+-----------------------------+------------------------------------------------------------------------------------------+----------------------+ | disable_overlay | bool | true | Flag to disable bbox rendering per image. | true, false | +--------------------------------+------------------------------------+-----------------------------+------------------------------------------------------------------------------------------+----------------------+ | overlay_linewidth | int | 1 | The thickness, in pixels, of the bbox boundaries. | >1 | +--------------------------------+------------------------------------+-----------------------------+------------------------------------------------------------------------------------------+----------------------+ | classwise_bbox_handler_config | ClasswiseClusterConfig (repeated) | None | This is a repeated class-wise dictionary of post-processing parameters. DetectNet_v2 uses DBSCAN clustering to group raw bboxes into final predictions. For models with several output classes, it may be cumbersome to define a separate dictionary for each class; in such a situation, a default class may be used for all classes in the network. | | +--------------------------------+------------------------------------+-----------------------------+------------------------------------------------------------------------------------------+----------------------+

The :code:`classwise_bbox_handler_config` is a proto object containing several parameters to configure the clustering algorithm as well as the bbox renderer.

+-----------------------+------------------------+-------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------+ | **Parameter** | **Datatype** | **Default / Suggested value** | **Description** | **Supported Values** | +-----------------------+------------------------+-------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------+ | confidence_model | string | aggregate_cov | The algorithm used to compute the final confidence of the clustered bboxes. In aggregate_cov mode, the final confidence of a detection is the sum of the confidences of all the candidate bboxes in a cluster. In mean_cov mode, the final confidence is the mean confidence of all the bboxes in the cluster. | aggregate_cov, mean_cov | +-----------------------+------------------------+-------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------+ | confidence_threshold | float | 0.9 in aggregate_cov mode, 0.1 in mean_cov mode | The threshold applied to the final aggregate confidence values to render the bboxes. | In aggregate_cov: may be tuned to any float value > 0.0. In mean_cov: 0.0 - 1.0 | +-----------------------+------------------------+-------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------+ | bbox_color | BBoxColor Proto Object | None | The RGB channel-wise color intensity per box. | R: 0 - 255 G: 0 - 255 B: 0 - 255 | +-----------------------+------------------------+-------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------+ | clustering_config | ClusteringConfig | None | Proto object to configure the DBSCAN clustering algorithm. It contains the following sub-parameters. coverage_threshold: The threshold applied to the raw network confidence predictions as a first-stage filtering technique. dbscan_eps: (float) The search distance used to group boxes together into a single cluster; the smaller the value, the more boxes are detected, and an eps of 1.0 groups all boxes into a single cluster. dbscan_min_samples: (float) The weight of the boxes in a cluster. minimum_bounding_box_height: (int) The minimum height of a bbox to be clustered. | coverage_threshold: 0.005 dbscan_eps: 0.3 dbscan_min_samples: 0.05 minimum_bounding_box_height: 4 | +-----------------------+------------------------+-------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------+
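Putting the two blocks together, a complete tlt-infer spec file is simply the two elements in one file. The sketch below reuses the samples shown above (the contents are elided here for brevity):

.. code::

   # Sketch: a tlt-infer spec file combines the two sample elements above.
   inferencer_config{
     # ... contents of the inferencer_config sample above ...
   }
   bbox_handler_config{
     # ... contents of the bbox_handler_config sample above ...
   }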
Specification File for FasterRCNN
---------------------------------

Below is a sample of the FasterRCNN spec file. It has two major components: :code:`network_config` and :code:`training_config`, explained below in detail. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message. The top-level structure of the spec file is summarized in the table below.

Here's a sample of the FasterRCNN spec file:

.. code::

   random_seed: 42
   enc_key: 'tlt'
   verbose: True
   network_config {
     input_image_config {
       image_type: RGB
       image_channel_order: 'bgr'
       size_height_width {
         height: 384
         width: 1248
       }
       image_channel_mean {
         key: 'b'
         value: 103.939
       }
       image_channel_mean {
         key: 'g'
         value: 116.779
       }
       image_channel_mean {
         key: 'r'
         value: 123.68
       }
       image_scaling_factor: 1.0
       max_objects_num_per_image: 100
     }
     feature_extractor: "resnet:18"
     anchor_box_config {
       scale: 64.0
       scale: 128.0
       scale: 256.0
       ratio: 1.0
       ratio: 0.5
       ratio: 2.0
     }
     freeze_bn: True
     freeze_blocks: 0
     freeze_blocks: 1
     roi_mini_batch: 256
     rpn_stride: 16
     conv_bn_share_bias: False
     roi_pooling_config {
       pool_size: 7
       pool_size_2x: False
     }
     all_projections: True
     use_pooling: False
   }
   training_config {
     kitti_data_config {
       data_sources: {
         tfrecords_path: "/workspace/tlt-experiments/tfrecords/kitti_trainval/kitti_trainval*"
         image_directory_path: "/workspace/tlt-experiments/data/training"
       }
       image_extension: 'png'
       target_class_mapping {
         key: 'car'
         value: 'car'
       }
       target_class_mapping {
         key: 'van'
         value: 'car'
       }
       target_class_mapping {
         key: 'pedestrian'
         value: 'person'
       }
       target_class_mapping {
         key: 'person_sitting'
         value: 'person'
       }
       target_class_mapping {
         key: 'cyclist'
         value: 'cyclist'
       }
       validation_fold: 0
     }
     data_augmentation {
       preprocessing {
         output_image_width: 1248
         output_image_height: 384
         output_image_channel: 3
         min_bbox_width: 1.0
         min_bbox_height: 1.0
       }
       spatial_augmentation {
         hflip_probability: 0.5
         vflip_probability: 0.0
         zoom_min: 1.0
         zoom_max: 1.0
         translate_max_x: 0
         translate_max_y: 0
       }
       color_augmentation {
         hue_rotation_max: 0.0
         saturation_shift_max: 0.0
         contrast_scale_max: 0.0
         contrast_center: 0.5
       }
     }
     enable_augmentation: True
     batch_size_per_gpu: 16
     num_epochs: 12
     pretrained_weights: "/workspace/tlt-experiments/data/faster_rcnn/resnet18.h5"
     #resume_from_model: "/workspace/tlt-experiments/data/faster_rcnn/resnet18.epoch2.tlt"
     #retrain_pruned_model: "/workspace/tlt-experiments/data/faster_rcnn/model_1_pruned.tlt"
     output_model: "/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet18.tlt"
     rpn_min_overlap: 0.3
     rpn_max_overlap: 0.7
     classifier_min_overlap: 0.0
     classifier_max_overlap: 0.5
     gt_as_roi: False
     std_scaling: 1.0
     classifier_regr_std {
       key: 'x'
       value: 10.0
     }
     classifier_regr_std {
       key: 'y'
       value: 10.0
     }
     classifier_regr_std {
       key: 'w'
       value: 5.0
     }
     classifier_regr_std {
       key: 'h'
       value: 5.0
     }
     rpn_mini_batch: 256
     rpn_pre_nms_top_N: 12000
     rpn_nms_max_boxes: 2000
     rpn_nms_overlap_threshold: 0.7
     reg_config {
       reg_type: 'L2'
       weight_decay: 1e-4
     }
     optimizer {
       adam {
         lr: 0.00001
         beta_1: 0.9
         beta_2: 0.999
         decay: 0.0
       }
     }
     lr_scheduler {
       step {
         base_lr: 0.00016
         gamma: 1.0
         step_size: 30
       }
     }
     lambda_rpn_regr: 1.0
     lambda_rpn_class: 1.0
     lambda_cls_regr: 1.0
     lambda_cls_class: 1.0
     inference_config {
       images_dir: '/workspace/tlt-experiments/data/testing/image_2'
       model: '/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet18.epoch12.tlt'
       detection_image_output_dir: '/workspace/tlt-experiments/data/faster_rcnn/inference_results_imgs'
       labels_dump_dir: '/workspace/tlt-experiments/data/faster_rcnn/inference_dump_labels'
       rpn_pre_nms_top_N: 6000
       rpn_nms_max_boxes: 300
       rpn_nms_overlap_threshold: 0.7
       bbox_visualize_threshold: 0.6
       classifier_nms_max_boxes: 300
       classifier_nms_overlap_threshold: 0.3
     }
     evaluation_config {
       model: '/workspace/tlt-experiments/data/faster_rcnn/frcnn_kitti_resnet18.epoch12.tlt'
       labels_dump_dir: '/workspace/tlt-experiments/data/faster_rcnn/test_dump_labels'
       rpn_pre_nms_top_N: 6000
       rpn_nms_max_boxes: 300
       rpn_nms_overlap_threshold: 0.7
       classifier_nms_max_boxes: 300
       classifier_nms_overlap_threshold: 0.3
       object_confidence_thres: 0.0001
       use_voc07_11point_metric: False
     }
   }

+-----------------+------------------------------------------------------------------------------------+--------------------------+-------------------+----------------------+ | **Parameter** | **Description** | **Datatype** | **Default Value** | **Supported Values** | +-----------------+------------------------------------------------------------------------------------+--------------------------+-------------------+----------------------+ | random_seed | The random seed for the experiment. | Unsigned int | 42 | | +-----------------+------------------------------------------------------------------------------------+--------------------------+-------------------+----------------------+ | enc_key | The encoding and decoding key for the TLT models. It can be overridden by the command line arguments of tlt-train, tlt-evaluate and tlt-infer for FasterRCNN. | Str, should not be empty | - | | +-----------------+------------------------------------------------------------------------------------+--------------------------+-------------------+----------------------+ | verbose | Controls the logging level during the experiments. More logs are printed if True. | Boolean (True or False) | False | | +-----------------+------------------------------------------------------------------------------------+--------------------------+-------------------+----------------------+ | network_config | The architecture of the model and its input format. | message | - | | +-----------------+------------------------------------------------------------------------------------+--------------------------+-------------------+----------------------+ | training_config | The configurations for the training, evaluation and inference for this experiment. | message | - | | +-----------------+------------------------------------------------------------------------------------+--------------------------+-------------------+----------------------+

Network Config
^^^^^^^^^^^^^^

The network config (:code:`network_config`) defines the model structure and its input format. This model is used for training, evaluation and inference. A detailed description is summarized in the table below.
+-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | input_image_config | Defines the input image format, including the image channel number, channel order, width and height, and the preprocessing (subtracting the per-channel mean and dividing by a scaling factor) applied before feeding the input to the model. See below for details. | message | - | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | input_image_config. image_type | The image type, which can be either an RGB or a grayscale image. | enum type. Either RGB or GRAYSCALE | RGB | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | input_image_config. image_channel_order | The image channel order. | str type. If image_type is RGB, 'rgb' or 'bgr' is valid. If image_type is GRAYSCALE, only 'l' is valid. | 'bgr' | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | input_image_config. size_height_width | The height and width of the input dimension of the model. | message | - | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
| input_image_config. image_channel_mean | The per-channel mean values to subtract for the image preprocessing. | map (dict) type from channel names to the corresponding mean values. Each of the mean values should be non-negative. | .. code:: image_channel_mean { key: 'b' value: 103.939 } image_channel_mean { key: 'g' value: 116.779 } image_channel_mean { key: 'r' value: 123.68 } | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | input_image_config. image_scaling_factor | The scaling factor to divide by for the image preprocessing. | float type, should be a positive scalar. | 1.0 | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | input_image_config. max_objects_num_per_image | The maximum number of objects in an image for the dataset. The number of objects usually differs from image to image, but there is a maximum. Set this field to be no less than that maximum number. This field is used to pad the object count to the same value across images, which makes multi-batch and multi-GPU training of FasterRCNN possible. | unsigned int, should be positive. | 100 | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | feature_extractor | The feature extractor (backbone) for the FasterRCNN model. FasterRCNN supports 12 backbones. **Note**: FasterRCNN actually supports another backbone: vgg. This backbone is a VGG16 backbone exactly the same as in Keras applications. The layer names matter when loading pretrained weights; if you want to load pretrained weights that have the same names as VGG16 in the Keras applications, you should use this backbone. Since this is indeed duplicated by the vgg:16 backbone, you might consider using vgg:16 for production. The only use case for the vgg backbone is to reproduce the original Caffe implementation of VGG16 FasterRCNN that uses ImageNet weights as pretrained weights. | str type. The architecture can be ResNet, VGG, GoogLeNet, MobileNet or DarkNet. Each specific architecture can have different numbers of layers or versions. Details are listed below. ResNet series: resnet:10, resnet:18, resnet:34, resnet:50, resnet:101 VGG series: vgg:16, vgg:19 GoogLeNet: googlenet MobileNet series: mobilenet_v1, mobilenet_v2 DarkNet: darknet:19, darknet:53 Here a notational convention is used: for models that can have different numbers of layers, a colon followed by the layer number is used as the suffix of the model name,
e.g., resnet:18. | - | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | anchor_box_config | The anchor box configuration defines the set of anchor box sizes and aspect ratios in a FasterRCNN model. | Message type that contains two sub-fields: scale and ratio. Each of them is a list of floating point numbers. The scale field defines the absolute anchor sizes in pixels (at the input image resolution). The ratio field defines the aspect ratios of each anchor. | - | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | freeze_bn | Whether or not to freeze all the BatchNormalization layers in the model. You can choose to freeze the BatchNormalization layers in the model during training; this is a common trick when training a FasterRCNN model. **Note**: Freezing a BatchNormalization layer only freezes the moving mean and moving variance in it, while the gamma and beta parameters remain trainable. | Boolean (True or False) | If you train with a small batch size, you usually need to set this field to True and use good pretrained weights to make the training converge well. If you train with a large batch size (e.g., >=16), you can set it to False and let the BatchNormalization layers calculate the moving mean and moving variance by themselves. | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
| freeze_blocks | The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks in the model to make the training more stable and/or easier to converge. The definition of a block is heuristic for a specific architecture, e.g., by stride or by logical blocks in the model. However, the block ID numbers identify the blocks in the model in sequential order, so you don't have to know the exact locations of the blocks when you train. A general principle to keep in mind: the smaller the block ID, the closer the block is to the model input; the larger the block ID, the closer it is to the model output. You can divide the whole model into several blocks and optionally freeze a subset of them. Note that for FasterRCNN you can only freeze blocks that are before the ROI pooling layer; any layer after the ROI pooling layer will not be frozen in any case. For different backbones, the number of blocks and the ID of each block differ; the valid block IDs per backbone are listed in the next column. | list (repeated integers) ResNet series - the block IDs valid for freezing are any subset of [0, 1, 2, 3] (inclusive) VGG series - any subset of [1, 2, 3, 4, 5] (inclusive) GoogLeNet - any subset of [0, 1, 2, 3, 4, 5, 6, 7] (inclusive) MobileNet V1 - any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] (inclusive) MobileNet V2 - any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] (inclusive) DarkNet 19 and DarkNet 53 - any subset of [0, 1, 2, 3, 4, 5] (inclusive) | Leave it empty ([]) | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | roi_mini_batch | The batch size used to train the RCNN after ROI pooling. | A positive integer; 128 or 256 is typical. | 256 | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | rpn_stride | The cumulative stride from the model input to the RPN. This value is fixed (16) in the current implementation. | positive integer | 16 | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | conv_bn_share_bias | A Boolean value indicating whether to share the bias of a convolution layer with the BatchNormalization (BN) layer immediately after it. Usually you share the bias between them to reduce the model size and avoid parameter redundancy. When using pretrained weights, make sure the value of this parameter matches the actual configuration in the pretrained weights; otherwise an error will be raised when loading them. | Boolean (True or False) | True | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | roi_pooling_config | The configuration for the ROI pooling layer. | Message type that contains two sub-fields: pool_size and pool_size_2x. See below for details. | - | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
| roi_pooling_config. pool_size | The output spatial size (height and width) of the ROIs. Only square spatial sizes are supported currently, i.e., height = width. | unsigned int, should be positive. | 7 | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | roi_pooling_config. pool_size_2x | A Boolean value indicating whether to do the ROI pooling at 2 * pool_size followed by a 2 x 2 pooling operation, or to do ROI pooling directly at pool_size with no pooling operation. For example, if pool_size = 7 and pool_size_2x = True, ROI pooling produces an output with a spatial size of 14 x 14, followed by a 2 x 2 pooling operation to get the final output tensor. | Boolean (True or False) | - | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | all_projections | This field is only useful for models that have shortcuts in them; these include the ResNet series and MobileNet V2. If all_projections = True, all the pass-through shortcuts are replaced by a projection layer that has the same number of output channels. | Boolean (True or False) | True | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+ | use_pooling | This parameter is only useful for the VGG and ResNet series. When use_pooling = True, pooling is used in the model as in the original implementation; otherwise, strided convolutions replace the pooling operations. If you want to improve the inference FPS (frames per second), you can try setting use_pooling = False. | Boolean (True or False) | False | +-----------------------+------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------+
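To make the input options concrete, the following is a minimal, hypothetical :code:`input_image_config` for a grayscale model, assembled from the table above. The use of the 'l' key in :code:`image_channel_mean` is an assumption following the image_channel_order convention (it is not stated in the table), so verify it against your version of TLT before relying on it:

.. code::

   # Hypothetical grayscale input config; the 'l' mean key and the mean
   # value itself are illustrative assumptions, not documented values.
   input_image_config {
     image_type: GRAYSCALE
     image_channel_order: 'l'
     size_height_width {
       height: 384
       width: 1248
     }
     image_channel_mean {
       key: 'l'
       value: 117.0
     }
     image_scaling_factor: 1.0
     max_objects_num_per_image: 100
   }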
Training Configuration
^^^^^^^^^^^^^^^^^^^^^^

The training configuration (:code:`training_config`) defines the parameters needed for training, evaluation and inference. Details are summarized in the table below.

+--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | kitti_data_config | The dataset used for training, evaluation and inference. | Message type. It has the same structure as the dataset_config message in the DetectNet_v2 spec file. Refer to the DetectNet_v2 dataset_config documentation for the details. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | data_augmentation | Defines the data augmentation pipeline during training. | Message type. It has the same structure as the data_augmentation message in the DetectNet_v2 spec file. Refer to the DetectNet_v2 data_augmentation documentation for the details. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | enable_augmentation | Whether or not to enable data augmentation during training. If this parameter is False, training will not apply any data augmentation operations, even if a data augmentation pipeline has been defined in the data_augmentation field of the spec file. This feature is mostly used for debugging the data augmentation pipeline. | Boolean (True or False) | True | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | batch_size_per_gpu | The training batch size on each GPU device. The actual total batch size is batch_size_per_gpu multiplied by the number of GPUs in a multi-GPU training scenario. | unsigned int, positive. | Adjust batch_size_per_gpu to match the capacity of your GPU device. | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | num_epochs | The number of epochs for the training. | unsigned int, positive. | 20 | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+
| pretrained_weights | The absolute path to the pretrained weights file used to initialize the training model. The pretrained weights file can be a Keras weights file (with .h5 suffix), a Keras model file (with .hdf5 suffix), or a TLT model (with .tlt suffix, trained by TLT). If the file is a model file (.tlt or .hdf5), TLT will extract the weights from it and then load them for initialization. Files in any other format are not supported as pretrained weights. Note that the pretrained weights file is agnostic to the input dimensions of the FasterRCNN model, so the model you are training can have input dimensions different from those specified in the pretrained weights. Normally, the pretrained weights file is only useful during the initial training phase of a TLT workflow. | Str type. Can be left empty, in which case the FasterRCNN model uses random initialization for its weights. Usually, a FasterRCNN model needs pretrained weights for the training to converge well. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | resume_from_model | The absolute path to the checkpoint .tlt model from which to resume training. This is useful when the training process is interrupted for some reason and you don't want to redo the training from epoch 0 (or 1 in 1-based indexing). In that case, you can resume from the last checkpoint to save training time. | Str type. Leave it empty when you are not resuming training, i.e., when training from epoch 0. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | retrain_pruned_model | The path to the pruned model to load for retraining. This is used in the retraining phase of a TLT workflow; the model is the output model of the pruning phase. | Str type. Leave it empty when you are not in the retraining phase. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | output_model | The absolute path to the output .tlt model that the training/retraining will save. Note that this path is not the
For example, if the output_model is '/workspace/tlt_training/resnet18.tlt', then | | | | | the actual output model path will be '/workspace/tlt_training/resnet18 .epoch.tlt'where denotes the epoch | | | | | number of during training. In this way, you can distinguish the output models for different epochs. Here, the epoch | | | | | number is a 1-based index. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | checkpoint_interval | The epoch interval that controls how frequent TLT will save the checkpoint during training. TLT will save the checkpoint | unsigned int, can be omitted(defaults to 1). | - | | | at every checkpoint _interval epoch(1 based index). For example, if the num_epochs is 12 and checkpoint _interval is 3, | | | | | then TLT will save checkpoint at the end of epoch 3, 6, 9, and 12. If this parameter is not specified, then it defaults | | | | | to checkpoint _interval=1. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | rpn_min_overlap | The lower IoU threshold used to map the anchor boxes to ground truth boxes. If the IoU of an anchor box and any ground | Float type, scalar. Should be in the interval (0, 1). | 0.3 | | | truth box is below this threshold, you can treat this anchor box as a negative anchor box. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | rpn_max_overlap | The upper IoU threshold used to map the anchor boxes to ground truth boxes. If the IoU of an anchor box and at least one | Float type, scalar. Should be in the interval (0, 1) and greater than rpn_min_overlap. | 0.7 | | | ground truth box is above this threshold, you can treat this anchor box as a positive anchor box. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | classifier_min_overlap | The lower IoU threshold to generate the proposal target. If the IoU of an ROI and a ground truth box is above the threshold | floating-point number, scalar. Should be in the interval [0, 1). | 0.0 | | | and below the classifier_max _overlap, then this ROI is regarded as a negative ROI(background) when training the RCNN. 
| | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | classifier_max_overlap | Similar to the classifier_min _overlap. If the IoU of a ROI and a ground truth box is above this threshold, then this ROI is | Float type, scalar. Should be in the interval (0, 1) and greater than classifier_min _overlap. | 0.5 | | | regarded as a positive ROI and this ground truth box is treated as the target(ground truth) of this ROI when training the RCNN. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | gt_as_roi | A Boolean value to specify whether or not to include the ground truth boxes into the positive ROI to train the RCNN. | Boolean(True or False) | False | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | std_scaling | The scaling factor to multiply by for the RPN regression loss when training the RPN. | Float type, should be positive. | 1.0 | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | classifier_regr_std | The scaling factor to divide by for the RCNN regression loss when training the RCNN. | map(dict) type. Map from 'x', 'y', 'w', 'h' to its corresponding scaling factor. Each of the scaling | .. code:: | | | | factors should be a positive float number. | | | | | | classifier_regr _std { | | | | | key: 'x' | | | | | value: 10.0 | | | | | } | | | | | classifier_regr _std { | | | | | key: 'y' | | | | | value: 10.0 | | | | | } | | | | | classifier_regr _std { | | | | | key: 'w' | | | | | value: 5.0 | | | | | } | | | | | classifier_regr _std { | | | | | key: 'h' | | | | | value: 5.0 | | | | | } | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | rpn_mini_batc | The anchor batch size used to train the RPN. | unsigned int, positive. 
| 256 | | h | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | rpn_pre_nms_top_N | The number of boxes to be retained before the NMS in Proposal layer. | unsigned int, positive. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | rpn_nms_max_boxes | The number of boxes to be retained after the NMS in Proposal layer. | unsigned int, positive and should be no greater than the rpn_pre_nms_top_N | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | rpn_nms_overlap_threshold | The IoU threshold for the NMS in Proposal layer. | Float type, should be in the interval (0, 1). | 0.7 | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | reg_config | Regularizer configuration of the model weights, including the regularizer type and weight decay. | message that contains two sub-fields: reg_type and weight_decay. See below for details. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | reg_config.reg_type | The regularizer type. Can be either 'L1'(L1 regularizer), 'L2'(L2 regularizer), or 'none'(No regularizer). | Str type. Should be one of the below: 'L1', 'L2', or 'none'. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | reg_config.weight_decay | The weight decay for the regularizer. | Float type, should be a positive scalar. Usually this number should be smaller than 1.0 | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | optimizer | The Optimizer used for the training. 
Can be either SGD, RMSProp or Adam. | oneof message type that can be one of sgd message, rmsprop message or adam message. See below | - | | | | for the details of each message type. | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | adam | Adam optimizer. | message type that contains the 4 sub-fields: lr, beta_1, beta_2, and epsilon. See the Keras | - | | | | 2.2.4 documentation for the meaning of each field. | | | | | | | | | | **Note**: When the learning rate scheduler is enabled, the learning rate in the optimizer | | | | | is overridden by the learning rate scheduler and the one specified in the optimizer(lr) is irrelevant. | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | sgd | SGD optimizer | message type that contains the following fields: lr, momentum, decayand nesterov. See the Keras | - | | | | 2.2.4 documentation for the meaning of each field. | | | | | | | | | | **Note**: When the learning rate scheduler is enabled, the learning rate in the optimizer is | | | | | overridden by the learning rate scheduler and the one specified in the optimizer(lr) is irrelevant. | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | rmsprop | RMSProp optimizer | message type that contains only one field: lr(learning rate). | - | | | | | | | | | **Note**: When learning rate scheduler is enabled, the learning rate in the optimizer is overridden | | | | | by the learning rate scheduler and the one specified in the optimizer(lr) is irrelevant. | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | lr_scheduler | The learning rate scheduler. | message type that can be stepor soft_start. stepscheduler is the same as stepscheduler in classification, | - | | | | while soft_startis the same as soft_annealin classification. Refer to the classification spec file | | | | | documentation for details. | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | lambda_rpn_regr | The loss scaling factor for RPN deltas regression loss. | Float typer. Should be a positive scalar. 
| 1.0 | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | lambda_rpn_class | The loss scaling factor for RPN classification loss. | Float type. Should be a positive scalar. | 1.0 | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | lambda_cls_regr | The loss scaling factor for RCNN deltas regression loss. | Float type. Should be a positive scalar. | 1.0 | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | lambda_cls_class | The loss scaling factor for RCNN classification loss. | Float type. Should be a positive scalar. | 1.0 | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config | The inference configuration for tlt-infer. | message type. See below for details. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.images_dir | The absolute path to the image directory that tlt-infer will do inference on. | Str type. Should be a valid Unix path. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.model | The absolute path to the the .tlt model that tlt-infer will do inference for. | Str type. Should be a valid Unix path. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.detection | The absolute path to the output image directory for the detection result. If the path doesn't exist tlt-infer will | Str type. Should be a valid Unix path. | - | | _image_output_dir | create it. 
If the directory already contains images tlt-inferwill overwrite them. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.labels | The absolute path to the directory to save the detected labels in KITTI format. tlt-infer will create it if it doesn't | Str type. Should be a valid Unix path. | - | | _dump_dir | xist beforehand. If it already contains label files, tlt-infer will overwrite them. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.rpn | The number of top ROI's to be retained before the NMS in Proposal layer. | unsigned int, positive. | - | | _pre_nms_top_N | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.rpn | The number of top ROI's to be retained after the NMS in Proposal layer. | unsigned int, positive. | - | | _nms_max_boxes | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.rpn_nms | The IoU threshold for the NMS in Proposal layer. | Float type, should be in the interval (0, 1). | 0.7 | | _overlap_threshold | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.bbox | The confidence threshold for the bounding boxes to be regarded as valid detected objects in the images. | Float type, should be in the interval (0, 1). | 0.6 | | _visualize_threshold | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.classifier | The number of bounding boxes to be retained after the NMS in RCNN. | unsigned int, positive. 
| 300 | | _nms_max_boxes | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.classifier | The IoU threshold for the NMS in RCNN. | Float type. Should be in the interval (0, 1). | 0.3 | | _nms_overlap _threshold | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.bbox | Whether or not to show captions for each bounding box in the detected images. The captions include the class name and | Boolean(True or False) | False | | _caption_on | confidence probability value for each detected object. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.trt | The TensorRT inference configuration for tlt-inferin TensorRT backend mode. | Message type. This can be not present, and in this case, tlt-inferwill use TLT as a backend for | - | | _inference | | inference. See below for details. | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.trt | The model configuration for the tlt-inferin TensorRT backend mode. It is a oneof wrapper of the two possible model | message type, oneof wrapper of trt_engineand etlt_model. See below for details. | - | | _inference.trt_infer_model | configurations: trt_engine and etlt_model. Only one of them can be specified if run tlt-infer in TensorRT backend. If | | | | | trt_engine is provided, tlt-infer will run TensorRT inference on the TensorRT engine file. If .etlt model is provided, | | | | | tlt-infer will run TensorRT inference on the .etlt model. If in INT8 mode a calibration cache file should also be | | | | | provided along with the .etlt model. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.trt | The absolute path to the TensorRT engine file for tlt-infer in TensorRT backend mode. The engine should be generated via | Str type. | - | | _inference.trt_engine | the tlt-exportor tlt-converter command line tools. 
| | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.trt_inference | The configuration for the .etlt model and the calibration cache(only needed in INT8 mode) for tlt-infer in TensorRT backend | message type that contains two string type sub-fields: model and calibration_cache. See below | - | | .etlt_model | mode. The .etlt model(and calibration cache, if needed) should be generated via the tlt-export command line tool. | for details. | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference _config.trt | The absolute path to the .etlt model that tlt-infer will use to run TensorRT based inference. | Str type. | - | | _inference.etlt_model.model | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.trt | The path to the TensorRT INT8 calibration cache file in the case of tlt-infer run with.etlt model in INT8 mode. | Str type. | - | | _inference.etlt | | | | | _model.calibration _cache | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | inference_config.trt | The TensorRT inference data type if tlt-infer runs with TensorRT backend. The data type is only useful when running on a .etlt | String type. Valid values are 'fp32', 'fp16' and'int8'. | 'fp32' | | _inference.trt_data_type | model. In that case, if the data type is 'int8', a calibration cache file should also be provided as mentioned above. If running | | | | | on a TensorRT engine file directly, this field will be ignored since the engine file already contains the data type information. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config | The configuration for the tlt-evaluate in FasterRCNN. | message type that contains the below fields. See below for details. 
| - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config.model | The absolute path to the .tlt model that tlt-evaluate will do evaluation for. | Str type. Should be a valid Unix path. | - | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config.labels | The absolute path to the directory of detected labels that tlt-evaluate will save. If it doesn't exist, tlt-evaluate will create | Str type. Should be a valid Unix path. | - | | _dump_dir | it. If it already contains label files, tlt-evaluate will overwrite them. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config.rpn | The number of top ROIs to be retained before the NMS in Proposal layer in tlt-evaluate. | unsigned int, positive. | - | | _pre_nms_top_N | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation _config.rpn | The number of top ROIs to be retained after the NMS in Proposal layer in tlt-evaluate. | unsigned int, positive. Should be no greater than the evaluation_config.rpn _pre_nms_top_N. | - | | _nms_max_boxes | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config.rpn | The IoU threshold for the NMS in Proposal layer in tlt-evaluate. | Float type in the interval (0, 1). | 0.7 | | _nms_iou_threshold | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config | The number of top bounding boxes to be retained after the NMS in RCNN in tlt-evaluate. | Unsigned int, positive. 
| - | | .classifier_nms_max | | | | | _boxes | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config.classifier | The IoU threshold for the NMS in RCNN in tlt-evaluate. | Float typer in the interval (0, 1). | 0.3 | | _nms_overlap_threshold | | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config.object | The confidence threshold above which a bounding box can be regarded as a valid object detected by FasterRCNN. Usually you can use | Float type in the interval (0, 1). | 0.0001 | | _confidence_thres | a small threshold to improve the recall and mAP as in many object detection challenges. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ | evaluation_config.use_voc07 | Whether to use the VOC2007 mAP calculation method when computing the mAP of the FasterRCNN model on a specific dataset. If this is | Boolean (True or False) | False | | _11point_metric | False, you can use VOC2012 metric instead. | | | +--------------------------------+------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------+ Specification File for SSD -------------------------- Here is a sample of the SSD spec file. It has 6 major components: :code:`ssd_config`, :code:`training_config`, :code:`eval_config`, :code:`nms_config`, :code:`augmentation_config`, and :code:`dataset_config`. The format of the spec file is a protobuf text(prototxt) message and each of its fields can be either a basic data type or a nested message. .. 
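Taken together, the training-side fields above can be assembled as in the following sketch. The field names come from the table above, but the paths and numeric values are illustrative placeholders rather than recommendations, and the learning rate scheduler body is left as a comment because its nested fields mirror the classification schedulers:

.. code::

    training_config {
      # kitti_data_config and data_augmentation follow the DetectNet_v2
      # dataset_config and data_augmentation message structures.
      enable_augmentation: True
      batch_size_per_gpu: 8          # illustrative; adapt to your GPU
      num_epochs: 20
      checkpoint_interval: 1
      pretrained_weights: "/path/to/your/pretrained/weights.h5"
      output_model: "/path/to/your/output/model.tlt"
      rpn_min_overlap: 0.3
      rpn_max_overlap: 0.7
      classifier_min_overlap: 0.0
      classifier_max_overlap: 0.5
      gt_as_roi: False
      std_scaling: 1.0
      rpn_mini_batch: 256
      rpn_pre_nms_top_N: 12000       # illustrative value
      rpn_nms_max_boxes: 2000        # illustrative; must not exceed rpn_pre_nms_top_N
      rpn_nms_overlap_threshold: 0.7
      reg_config {
        reg_type: 'L2'
        weight_decay: 0.00001
      }
      optimizer {
        adam {
          lr: 0.00001
          beta_1: 0.9
          beta_2: 0.999
          epsilon: 0.00000001
        }
      }
      # lr_scheduler { ... }: the step and soft_start messages mirror the
      # classification schedulers; refer to the classification documentation.
    }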
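The inference-related fields might look like the following when tlt-infer is run with the TensorRT backend on an .etlt model in INT8 mode. Again, this is a hedged sketch: every path and threshold below is a placeholder.

.. code::

    inference_config {
      images_dir: "/path/to/your/test/images"
      model: "/path/to/your/model.tlt"
      detection_image_output_dir: "/path/to/your/output/images"
      labels_dump_dir: "/path/to/your/output/labels"
      rpn_pre_nms_top_N: 6000        # illustrative value
      rpn_nms_max_boxes: 300         # illustrative value
      rpn_nms_overlap_threshold: 0.7
      bbox_visualize_threshold: 0.6
      classifier_nms_max_boxes: 300
      classifier_nms_overlap_threshold: 0.3
      bbox_caption_on: False
      # trt_inference may be omitted entirely, in which case tlt-infer
      # falls back to the TLT backend.
      trt_inference {
        etlt_model {
          model: "/path/to/your/model.etlt"
          calibration_cache: "/path/to/your/calibration.cache"
        }
        trt_data_type: 'int8'        # 'int8' requires the calibration cache above
      }
    }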
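Likewise, here is a minimal sketch of the evaluation fields for tlt-evaluate, using the typical values from the table where they are given and placeholders elsewhere:

.. code::

    evaluation_config {
      model: "/path/to/your/model.tlt"
      labels_dump_dir: "/path/to/your/eval/labels"
      rpn_pre_nms_top_N: 6000        # illustrative value
      rpn_nms_max_boxes: 300         # illustrative value
      rpn_nms_iou_threshold: 0.7
      classifier_nms_max_boxes: 300  # illustrative value
      classifier_nms_overlap_threshold: 0.3
      object_confidence_thres: 0.0001
      use_voc07_11point_metric: False
    }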
Specification File for SSD
--------------------------

Here is a sample of the SSD spec file. It has 6 major components: :code:`ssd_config`, :code:`training_config`, :code:`eval_config`, :code:`nms_config`, :code:`augmentation_config`, and :code:`dataset_config`. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message.

.. code::

    random_seed: 42
    ssd_config {
      aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]"
      scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]"
      two_boxes_for_ar1: true
      clip_boxes: false
      loss_loc_weight: 0.8
      focal_loss_alpha: 0.25
      focal_loss_gamma: 2.0
      variances: "[0.1, 0.1, 0.2, 0.2]"
      arch: "resnet"
      nlayers: 18
      freeze_bn: false
      freeze_blocks: 0
    }
    training_config {
      batch_size_per_gpu: 16
      num_epochs: 80
      enable_qat: false
      learning_rate {
        soft_start_annealing_schedule {
          min_learning_rate: 5e-5
          max_learning_rate: 2e-2
          soft_start: 0.15
          annealing: 0.8
        }
      }
      regularizer {
        type: L1
        weight: 3e-5
      }
    }
    eval_config {
      validation_period_during_training: 10
      average_precision_mode: SAMPLE
      batch_size: 16
      matching_iou_threshold: 0.5
    }
    nms_config {
      confidence_threshold: 0.01
      clustering_iou_threshold: 0.6
      top_k: 200
    }
    augmentation_config {
      preprocessing {
        output_image_width: 1248
        output_image_height: 384
        output_image_channel: 3
        crop_right: 1248
        crop_bottom: 384
        min_bbox_width: 1.0
        min_bbox_height: 1.0
      }
      spatial_augmentation {
        hflip_probability: 0.5
        vflip_probability: 0.0
        zoom_min: 0.7
        zoom_max: 1.8
        translate_max_x: 8.0
        translate_max_y: 8.0
      }
      color_augmentation {
        hue_rotation_max: 25.0
        saturation_shift_max: 0.20000000298
        contrast_scale_max: 0.10000000149
        contrast_center: 0.5
      }
    }
    dataset_config {
      data_sources: {
        tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*"
        image_directory_path: "/workspace/tlt-experiments/data/training"
      }
      image_extension: "png"
      target_class_mapping {
        key: "car"
        value: "car"
      }
      target_class_mapping {
        key: "pedestrian"
        value: "pedestrian"
      }
      target_class_mapping {
        key: "cyclist"
        value: "cyclist"
      }
      target_class_mapping {
        key: "van"
        value: "car"
      }
      target_class_mapping {
        key: "person_sitting"
        value: "pedestrian"
      }
      validation_fold: 0
    }

The top level structure of the spec file is summarized in the sections below.

Training Config
^^^^^^^^^^^^^^^

The training configuration (:code:`training_config`) defines the parameters needed for training, evaluation and inference. Details are summarized in the table below.

+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| batch_size_per_gpu | The batch size for each GPU, so the effective batch size is batch_size_per_gpu * num_gpus. | Unsigned int, positive | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| num_epochs | The number of epochs to train the network. | Unsigned int, positive | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| enable_qat | Whether to use quantization aware training. | Boolean | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| learning_rate | Only soft_start_annealing_schedule with these nested parameters is supported: 1. min_learning_rate: the minimum learning rate seen during the entire experiment; 2. max_learning_rate: the maximum learning rate seen during the entire experiment; 3. soft_start: the time to lapse before warm up (expressed as a percentage of progress between 0 and 1); 4. annealing: the time to start annealing the learning rate. | Message type | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| regularizer | This parameter configures the regularizer to be used while training and contains the following nested parameters: 1. type: the type of regularizer to use; NVIDIA supports NO_REG, L1 or L2; 2. weight: the floating point value for the regularizer weight. | Message type | L1. **Note**: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more prunable. |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+

Evaluation Config
^^^^^^^^^^^^^^^^^

The evaluation configuration (:code:`eval_config`) defines the parameters needed for evaluation, either during training or standalone. Details are summarized in the table below.

+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| validation_period_during_training | The interval, in epochs, at which a validation run is performed during training. | Unsigned int, positive | 10 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| average_precision_mode | The Average Precision (AP) calculation mode, either SAMPLE or INTEGRATE. SAMPLE is used as the VOC metric for VOC 2009 and before; INTEGRATE is used for VOC 2010 and after. | ENUM type (SAMPLE or INTEGRATE) | SAMPLE |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| matching_iou_threshold | The lowest IoU between a predicted box and a ground truth box that can be considered a match. | float | 0.5 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+

NMS Config
^^^^^^^^^^

The NMS configuration (:code:`nms_config`) defines the parameters needed for NMS postprocessing. The NMS config applies to the NMS layer of the model in training, validation, evaluation, inference and export. Details are summarized in the table below.

+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| confidence_threshold | Boxes with a confidence score less than confidence_threshold are discarded before applying NMS. | float | 0.01 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| clustering_iou_threshold | The IoU threshold below which boxes will go through the NMS process. | float | 0.6 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| top_k | top_k boxes are output after the NMS keras layer. If the number of valid boxes is less than k, the returned array is padded with boxes whose confidence score is 0. | Unsigned int | 200 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+

Augmentation Config
^^^^^^^^^^^^^^^^^^^

The augmentation configuration (:code:`augmentation_config`) defines the parameters needed for data augmentation. The configuration is shared with DetectNet_v2. See :ref:`Augmentation Module` for more information.

Dataset Config
^^^^^^^^^^^^^^

The dataset configuration (:code:`dataset_config`) defines the parameters needed for the data loader. The configuration is shared with DetectNet_v2. See :ref:`Dataloader` for more information.

SSD config
^^^^^^^^^^

The SSD configuration (:code:`ssd_config`) defines the parameters needed for building the SSD model. Details are summarized in the table below.

+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| aspect_ratios_global | Anchor boxes with the aspect ratios defined in aspect_ratios_global will be generated for each feature layer used for prediction. **Note**: Only one of aspect_ratios_global or aspect_ratios is required. | string | “[1.0, 2.0, 0.5, 3.0, 0.33]” |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| aspect_ratios | The length of the outer list must be equal to the number of feature layers used for anchor box generation, and the i-th layer will have anchor boxes with the aspect ratios defined in aspect_ratios[i]. **Note**: Only one of aspect_ratios_global or aspect_ratios is required. | string | “[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]” |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| two_boxes_for_ar1 | This setting is only relevant for layers that have 1.0 as the aspect ratio. If two_boxes_for_ar1 is true, two boxes will be generated with an aspect ratio of 1: one whose scale is the scale for this layer, and one whose scale is the geometric mean of the scale for this layer and the scale for the next layer. | Boolean | True |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| clip_boxes | If true, all corner anchor boxes will be truncated so they are fully inside the feature images. | Boolean | False |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| scales | scales is a list of positive floats containing scaling factors per convolutional predictor layer. This list must be one element longer than the number of predictor layers, so that if two_boxes_for_ar1 is true, the second aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last element in this list, each positive float is the scaling factor for boxes in that layer. For example, if for one layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer (the first aspect ratio 1 box if two_boxes_for_ar1 is true) will have its height and width as 0.1*min(img_h, img_w). min_scale and max_scale are two positive floats; if both of them appear in the config, the program will generate the scales automatically by evenly splitting the space between min_scale and max_scale. | string | “[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]” |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| min_scale/max_scale | If both appear in the config, scales will be generated by evenly splitting the space between min_scale and max_scale (see the sketch after this table). | float | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| loss_loc_weight | A positive float controlling how much the location regression loss contributes to the final loss. The final loss is calculated as classification_loss + loss_loc_weight * loc_loss. | float | 1.0 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| focal_loss_alpha | Alpha is a parameter in the focal loss equation. | float | 0.25 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| focal_loss_gamma | Gamma is a parameter in the focal loss equation. | float | 2.0 |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| variances | variances should be a list of 4 positive floats. The four floats, in order, represent the variances for box center x, box center y, log box height and log box width. The box offsets for the box center (cx, cy) and the log box size (height/width) w.r.t. anchor will be divided by their respective variance values. Therefore, larger variances result in less significant differences between two different boxes on encoded offsets. | string | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| steps | An optional list inside quotation marks whose length is the number of feature layers for prediction. The elements should be floats or tuples/lists of two floats. Steps define how many pixels apart the anchor box center points should be. If the element is a float, the vertical and horizontal margins are the same; otherwise, the first value is step_vertical and the second value is step_horizontal. **If steps are not provided, anchor boxes will be distributed uniformly inside the image.** (See the sketch after this table.) | string | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| offsets | An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. The first anchor box will have a margin of offsets[i]*steps[i] pixels from the left and top borders. **If offsets are not provided, 0.5 will be used as the default value.** | string | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| arch | The backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported. | string | resnet |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| nlayers | The number of conv layers in a specific arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration, and users should just delete this config from the config file. | Unsigned int | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| freeze_bn | Whether to freeze all batch normalization layers during training. | boolean | False |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
| freeze_blocks | The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks in the model to make the training more stable and/or easier to converge. The definition of a block is heuristic for a specific architecture (for example, by stride or by logical blocks in the model). However, the block ID numbers identify the blocks in the model in sequential order, so you don't have to know the exact locations of the blocks when you do training. A general principle to keep in mind is: the smaller the block ID, the closer it is to the model input; the larger the block ID, the closer it is to the model output. You can divide the whole model into several blocks and optionally freeze a subset of them. Note that for FasterRCNN, you can only freeze the blocks that are before the ROI pooling layer; any layer after the ROI pooling layer will not be frozen anyway. For different backbones, the number of blocks and the block ID for each block are different; the valid block IDs for each backbone are listed in the next column. | list (repeated integers). ResNet series: any subset of [0, 1, 2, 3] (inclusive). VGG series: any subset of [1, 2, 3, 4, 5] (inclusive). GoogLeNet: any subset of [0, 1, 2, 3, 4, 5, 6, 7] (inclusive). MobileNet V1: any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] (inclusive). MobileNet V2: any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] (inclusive). DarkNet 19 and DarkNet 53: any subset of [0, 1, 2, 3, 4, 5] (inclusive). | - |
+--------------------------------+--------------------------------------------------+--------------------------------------------------+--------------------------------+
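The sample spec above uses aspect_ratios_global with an explicit scales list. The sketch below illustrates the per-layer alternatives described in this table: per-layer aspect_ratios, scale generation from min_scale/max_scale, and the optional steps and offsets lists. It assumes six predictor layers, as the typical values above do, and the numbers are placeholders rather than tuned values:

.. code::

    ssd_config {
      # Per-layer aspect ratios; use either this or aspect_ratios_global.
      aspect_ratios: "[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"
      # Scales generated evenly between min_scale and max_scale,
      # instead of an explicit scales list.
      min_scale: 0.05
      max_scale: 0.85
      two_boxes_for_ar1: true
      clip_boxes: false
      loss_loc_weight: 0.8
      focal_loss_alpha: 0.25
      focal_loss_gamma: 2.0
      variances: "[0.1, 0.1, 0.2, 0.2]"
      # Optional anchor placement; omit steps to distribute anchors
      # uniformly, and omit offsets to default to 0.5.
      steps: "[8.0, 16.0, 32.0, 64.0, 128.0, 256.0]"
      offsets: "[0.5, 0.5, 0.5, 0.5, 0.5, 0.5]"
      arch: "resnet"
      nlayers: 18
      freeze_bn: false
      freeze_blocks: 0
      freeze_blocks: 1
    }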
code:: random_seed: 42 dssd_config { aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 1.0/3.0]" scales: "[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]" two_boxes_for_ar1: true clip_boxes: false loss_loc_weight: 0.8 focal_loss_alpha: 0.25 focal_loss_gamma: 2.0 variances: "[0.1, 0.1, 0.2, 0.2]" arch: "resnet" nlayers: 18 pred_num_channels: 512 freeze_bn: false freeze_blocks: 0 } training_config { batch_size_per_gpu: 16 num_epochs: 80 enable_qat: false learning_rate { soft_start_annealing_schedule { min_learning_rate: 5e-5 max_learning_rate: 2e-2 soft_start: 0.15 annealing: 0.8 } } regularizer { type: L1 weight: 3e-5 } } eval_config { validation_period_during_training: 10 average_precision_mode: SAMPLE batch_size: 16 matching_iou_threshold: 0.5 } nms_config { confidence_threshold: 0.01 clustering_iou_threshold: 0.6 top_k: 200 } augmentation_config { preprocessing { output_image_width: 1248 output_image_height: 384 output_image_channel: 3 crop_right: 1248 crop_bottom: 384 min_bbox_width: 1.0 min_bbox_height: 1.0 } spatial_augmentation { hflip_probability: 0.5 vflip_probability: 0.0 zoom_min: 0.7 zoom_max: 1.8 translate_max_x: 8.0 translate_max_y: 8.0 } color_augmentation { hue_rotation_max: 25.0 saturation_shift_max: 0.20000000298 contrast_scale_max: 0.10000000149 contrast_center: 0.5 } } dataset_config { data_sources: { tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*" image_directory_path: "/workspace/tlt-experiments/data/training" } image_extension: "png" target_class_mapping { key: "car" value: "car" } target_class_mapping { key: "pedestrian" value: "pedestrian" } target_class_mapping { key: "cyclist" value: "cyclist" } target_class_mapping { key: "van" value: "car" } target_class_mapping { key: "person_sitting" value: "pedestrian" } validation_fold: 0 } Training Config ^^^^^^^^^^^^^^^ The training configuration (:code:`training_config`) defines the parameters needed for the training, evaluation and inference. Details are summarized in the table below. +--------------------+---------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------------------------------------------+ | **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** | +--------------------+---------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------------------------------------------+ | batch_size_per_gpu | The batch size for each GPU, so the effective batch size is | Unsigned int, positive | - | | | batch_size_per_gpu * num_gpus | | | +--------------------+---------------------------------------------------------------------------------------+-------------------------------+---------------------------------------------------------------------------------------+ | num_epochs | The anchor batch size used to train the RPN. | Unsigned int, positive. 

Evaluation Config
^^^^^^^^^^^^^^^^^

The evaluation configuration (:code:`eval_config`) defines the parameters needed for evaluation, either during training or standalone. Details are summarized in the table below.

+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| validation_period_during_training | The interval, in epochs, at which validation is run during training. | Unsigned int, positive | 10 |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| average_precision_mode | The Average Precision (AP) calculation mode, either SAMPLE or INTEGRATE. SAMPLE is the VOC metric for VOC 2009 and earlier; INTEGRATE is the VOC metric for VOC 2010 and later. | ENUM type (SAMPLE or INTEGRATE) | SAMPLE |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| matching_iou_threshold | The lowest IoU between a predicted box and a ground truth box that can be considered a match. | float | 0.5 |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
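For example, to score with the VOC 2010-style integrated AP instead of the sampled metric, only :code:`average_precision_mode` changes; the other values below simply repeat the sample spec at the top of this section.

.. code::

  eval_config {
    validation_period_during_training: 10
    average_precision_mode: INTEGRATE   # VOC 2010-and-later AP metric
    batch_size: 16
    matching_iou_threshold: 0.5
  }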

NMS Config
^^^^^^^^^^

The NMS configuration (:code:`nms_config`) defines the parameters needed for the NMS postprocessing. The NMS config applies to the NMS layer of the model in training, validation, evaluation, inference, and export. Details are summarized in the table below.

+---------------------------+-----------------+-------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| confidence_threshold | Boxes with a confidence score lower than confidence_threshold are discarded before NMS is applied. | float | 0.01 |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| clustering_iou_threshold | The IoU threshold below which boxes go through the NMS process. | float | 0.6 |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| top_k | top_k boxes are output after the NMS keras layer. If the number of valid boxes is less than top_k, the returned array is padded with boxes whose confidence score is 0. | Unsigned int | 200 |
+---------------------------+-----------------+-------------------------------+-------------------------------+

Augmentation Config
^^^^^^^^^^^^^^^^^^^

The augmentation configuration (:code:`augmentation_config`) defines the parameters needed for data augmentation. The configuration is shared with DetectNet_v2. See Augmentation module for more information.

Dataset Config
^^^^^^^^^^^^^^

The dataset configuration (:code:`dataset_config`) defines the parameters needed for the data loader. The configuration is shared with DetectNet_v2. See Dataloader for more information.

DSSD Config
^^^^^^^^^^^

The DSSD configuration (:code:`dssd_config`) defines the parameters needed for building the DSSD model. Details are summarized in the table below.
+----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | aspect_ratios_global | Anchor boxes of aspect ratios defined in aspect_ratios_global will be generated | string | “[1.0, 2.0, 0.5, 3.0, 0.33]” | | | for each feature layer used for prediction. Note: Only one of aspect_ratios_global | | | | | or aspect_ratios is required. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | aspect_ratios | The length of the outer list must be equivalent to the number of feature layers | string | “[[1.0,2.0,0.5], | | | used for anchor box generation. And the i-th layer will have anchor boxes with aspect | | [1.0,2.0,0.5], | | | ratios defined in aspect_ratios[i]. Note: Only one of aspect_ratios_global or | | [1.0,2.0,0.5], | | | aspect_ratios is required. | | [1.0,2.0,0.5], | | | | | [1.0,2.0,0.5], | | | | | [1.0, 2.0, 0.5, 3.0, 0.33]]” | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | two_boxes_for_ar1 | This setting is only relevant for layers that have 1.0 as the aspect ratio. If two_boxes_for_ar1 | Boolean | True | | | is true, two boxes will be generated with an aspect ratio of 1. One whose scale is the scale for | | | | | this layer and the other one whose scale is the geometric mean of the scale for this layer and the | | | | | scale for the next layer. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | clip_boxes | If true, all corner anchor boxes will be truncated so they are fully inside the feature images. | Boolean | False | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | scales | scales is a list of positive floats containing scaling factors per convolutional predictor layer. | string | “[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]” | | | This list must be one element longer than the number of predictor layers, so if two_boxes_for_ar1 is | | | | | true, the second aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last | | | | | element in this list, each positive float is the scaling factor for boxes in that layer. 
For example, | | | | | if for one layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer | | | | | (the first aspect ratio 1 box if two_boxes_for_ar1 is true) will have its height and width as | | | | | 0.1*min(img_h, img_w). | | | | | | | | | | min_scale and max_scale are two positive floats. If both of them appear in the config, the program | | | | | can automatically generate the scales by evenly splitting the space between min_scale and max_scale. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | min_scale/max_scale | If both appear in the config, scales will be generated evenly by splitting the space between min_scale | float | - | | | and max_scale. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | loss_loc_weight | This is a positive float controlling how much location regression loss should contribute to the final | float | 1.0 | | | loss. The final loss is calculated as classification_loss + loss_loc_weight * loc_loss | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | focal_loss_alpha | Alpha in the focal loss equation. | float | 0.25 | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | focal_loss_gamma | Gamma in the focal loss equation. | float | 2.0 | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | variances | Variances should be a list of 4 positive floats. The four floats, in order, represent variances for box | string | | | | center x, box center y, log box height, log box width. The box offset for box center (cx, cy) and log box | | | | | size (height/width) w.r.t. anchor will be divided by their respective variance value. Therefore, larger | | | | | variances result in less significant differences between two different boxes on encoded offsets. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | steps | An optional list inside quotation marks whose length is the number of feature layers for prediction. The | string | - | | | elements should be floats or tuples/lists of two floats. Steps define how many pixels apart the anchor box | | | | | center points should be. If the element is a float, the vertical and horizontal margins are the same.
Otherwise, | | | | | the first value is step_vertical and the second value is step_horizontal. If steps are not provided, anchor | | | | | boxes will be distributed uniformly inside the image. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | offsets | An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. | string | - | | | The first anchor box will have offsets[i]*steps[i] pixels margin from the left and top borders. If offsets are | | | | | not provided, 0.5 will be used as default value. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | arch | Backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, | string | resnet | | | “mobilenet_v2” and “squeezenet” are supported. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | nlayers | Number of conv layers in specific arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and | Unsigned int | - | | | 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration and | | | | | users should just delete this config from the config file. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | pred_num_channels | This setting controls the number of channels of the convolutional layers in the DSSD prediction module. Setting | Unsigned int | 512 | | | this value to 0 will disable the DSSD prediction module. Supported values for this setting are 0, 256, 512 and | | | | | 1024. A larger value gives a larger network and usually means the network is harder to train. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | freeze_bn | Whether to freeze all batch normalization layers during training. | boolean | False | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | freeze_blocks | The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks | list(repeated integers) | - | | | in the model to make the training more stable and/or easier to converge. The definition of a block is heuristic | | | | | for a specific architecture. For example, by stride or by logical blocks in the model, etc. However, the block ID | • ResNet series. 
For the ResNet series, the block IDs valid | | | | numbers identify the blocks in the model in a sequential order so you don't have to know the exact locations of the | for freezing are any subset of [0, 1, 2, 3] (inclusive) | | | | blocks when you do training. A general principle to keep in mind is: the smaller the block ID, the closer it is to | • VGG series. For the VGG series, the block IDs valid for | | | | the model input; the larger the block ID, the closer it is to the model output. | freezing are any subset of [1, 2, 3, 4, 5] (inclusive) | | | | | • GoogLeNet. For the GoogLeNet, the block IDs valid for freezing | | | | You can divide the whole model into several blocks and optionally freeze a subset of it. Note that for FasterRCNN | are any subset of [0, 1, 2, 3, 4, 5, 6, 7] (inclusive) | | | | you can only freeze the blocks that are before the ROI pooling layer. Any layer after the ROI pooling layer will | • MobileNet V1. For the MobileNet V1, the block IDs valid for freezing | | | | not be frozen anyway. For different backbones, the number of blocks and the ID of each block are different. | are any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] (inclusive) | | | | The valid block IDs for each backbone are therefore listed here in detail. | • MobileNet V2. For the MobileNet V2, the block IDs valid for freezing | | | | | are any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] (inclusive) | | | | | • DarkNet. For the DarkNet 19 and DarkNet 53, the block IDs valid for | | | | | freezing are any subset of [0, 1, 2, 3, 4, 5] (inclusive) | | +----------------------+---------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+

.. code::

  dssd_config {
    aspect_ratios_global: "[1.0, 2.0, 0.5, 3.0, 0.33]"
    scales: "[0.1, 0.24166667, 0.38333333, 0.525, 0.66666667, 0.80833333, 0.95]"
    two_boxes_for_ar1: true
    clip_boxes: false
    loss_loc_weight: 1.0
    focal_loss_alpha: 0.25
    focal_loss_gamma: 2.0
    variances: "[0.1, 0.1, 0.2, 0.2]"
    pred_num_channels: 0
    arch: "resnet"
    nlayers: 18
    freeze_bn: True
    freeze_blocks: 0
    freeze_blocks: 1
  }

Using aspect_ratios_global or aspect_ratios
*******************************************

.. Note:: Only :code:`aspect_ratios_global` or :code:`aspect_ratios` is required.

:code:`aspect_ratios_global` should be a 1-d array inside quotation marks. Anchor boxes of aspect ratios defined in :code:`aspect_ratios_global` will be generated for each feature layer used for prediction. Example: "[1.0, 2.0, 0.5, 3.0, 0.33]"

:code:`aspect_ratios` should be a list of lists inside quotation marks. The length of the outer list must be equal to the number of feature layers used for anchor box generation, and the i-th layer will have anchor boxes with aspect ratios defined in aspect_ratios[i]. Here's an example:

.. code::

  "[[1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0,2.0,0.5], [1.0, 2.0, 0.5, 3.0, 0.33]]"

two_boxes_for_ar1
*****************

This setting is only relevant for layers that have 1.0 as the aspect ratio. If :code:`two_boxes_for_ar1` is true, two boxes will be generated with an aspect ratio of 1: one whose scale is the scale for this layer, and another whose scale is the geometric mean of the scale for this layer and the scale for the next layer.
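As a worked illustration of the geometric-mean rule, using the first two entries of the sample spec's scales list (the numbers in the comments are arithmetic, not config fields):

.. code::

  # With scales: "[0.05, 0.1, ...]" and the setting below, the first predictor
  # layer generates two aspect-ratio-1.0 boxes:
  #   box 1 scale: 0.05                       (this layer's scale)
  #   box 2 scale: sqrt(0.05 * 0.1) ~ 0.0707  (geometric mean with the next layer)
  two_boxes_for_ar1: true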
scales or Combination of min_scale and max_scale
************************************************

.. Note:: Only :code:`scales` or the combination of :code:`min_scale` and :code:`max_scale` is required.

:code:`scales` should be a 1-d array inside quotation marks. It is a list of positive floats containing scaling factors per convolutional predictor layer. This list must be one element longer than the number of predictor layers, so that, if :code:`two_boxes_for_ar1` is true, the second aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last element in this list, each positive float is the scaling factor for boxes in that layer. For example, if for one layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer (the first aspect ratio 1 box if two_boxes_for_ar1 is true) will have its height and width as `0.1*min(img_h, img_w)`.

:code:`min_scale` and :code:`max_scale` are two positive floats. If both of them appear in the config, the program can automatically generate the scales by evenly splitting the space between :code:`min_scale` and :code:`max_scale`.

clip_boxes
**********

If true, all corner anchor boxes will be truncated so they are fully inside the feature images.

loss_loc_weight
***************

This is a positive float controlling how much the location regression loss should contribute to the final loss. The final loss is calculated as `classification_loss + loss_loc_weight * loc_loss`.

focal_loss_alpha and focal_loss_gamma
*************************************

Focal loss is calculated as:

.. image:: ../content/focal_loss_formula.png

:code:`focal_loss_alpha` defines `α` and :code:`focal_loss_gamma` defines `γ` in the formula. NVIDIA recommends `α=0.25` and `γ=2.0` if you don't know what values to use.

variances
*********

Variances should be a list of 4 positive floats. The four floats, in order, represent variances for box center x, box center y, log box height, and log box width. The box offset for box center (cx, cy) and log box size (height/width) w.r.t. anchor will be divided by their respective variance value. Therefore, larger variances result in less significant differences between two different boxes on encoded offsets. The formula for offset calculation is:

.. image:: ../content/variance_offset_calc.png

steps
*****

An optional list inside quotation marks whose length is the number of feature layers for prediction. The elements should be floats or tuples/lists of two floats. Steps define how many pixels apart the anchor box center points should be. If the element is a float, the vertical and horizontal margins are the same. Otherwise, the first value is :code:`step_vertical` and the second value is :code:`step_horizontal`. If steps are not provided, anchor boxes will be distributed uniformly inside the image.

offsets
*******

An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. The first anchor box will have a margin of offsets[i]*steps[i] pixels from the left and top borders. If offsets are not provided, 0.5 will be used as the default value.

arch
****

A string indicating which feature extraction architecture you want to use. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported.

nlayers
*******

An integer specifying the number of layers of the selected arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration, and users should simply delete this config from the config file.
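For instance, a ResNet-18 backbone is selected as below; for an architecture without a depth variant, such as squeezenet, the nlayers line is omitted entirely (other required dssd_config fields are left out of this sketch for brevity):

.. code::

  dssd_config {
    arch: "resnet"
    nlayers: 18
    # arch: "squeezenet"   # squeezenet has no nlayers variant, so omit nlayers
  }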
freeze_bn
*********

Whether to freeze all batch normalization layers during training.

freeze_blocks
*************

Optionally, you can have more than one :code:`freeze_blocks` field. Weights of layers in those blocks will be frozen during training. See :ref:`Model config` for more information.

Specification File for RetinaNet
--------------------------------

Below is a sample for the RetinaNet spec file. It has 6 major components: :code:`retinanet_config`, :code:`training_config`, :code:`eval_config`, :code:`nms_config`, :code:`augmentation_config` and :code:`dataset_config`. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message. The top level structure of the spec file is summarized in the table below:

.. code::

  random_seed: 42
  retinanet_config {
    aspect_ratios_global: "[1.0, 2.0, 0.5]"
    scales: "[0.045, 0.09, 0.2, 0.4, 0.55, 0.7]"
    two_boxes_for_ar1: false
    clip_boxes: false
    loss_loc_weight: 0.8
    focal_loss_alpha: 0.25
    focal_loss_gamma: 2.0
    variances: "[0.1, 0.1, 0.2, 0.2]"
    arch: "resnet"
    nlayers: 18
    n_kernels: 1
    feature_size: 256
    freeze_bn: false
    freeze_blocks: 0
  }
  training_config {
    enable_qat: False
    batch_size_per_gpu: 24
    num_epochs: 100
    learning_rate {
      soft_start_annealing_schedule {
        min_learning_rate: 4e-5
        max_learning_rate: 1.5e-2
        soft_start: 0.15
        annealing: 0.5
      }
    }
    regularizer {
      type: L1
      weight: 2e-5
    }
  }
  eval_config {
    validation_period_during_training: 10
    average_precision_mode: SAMPLE
    batch_size: 32
    matching_iou_threshold: 0.5
  }
  nms_config {
    confidence_threshold: 0.01
    clustering_iou_threshold: 0.6
    top_k: 200
  }
  augmentation_config {
    preprocessing {
      output_image_width: 1248
      output_image_height: 384
      output_image_channel: 3
      crop_right: 1248
      crop_bottom: 384
      min_bbox_width: 1.0
      min_bbox_height: 1.0
    }
    spatial_augmentation {
      hflip_probability: 0.5
      vflip_probability: 0.0
      zoom_min: 0.7
      zoom_max: 1.8
      translate_max_x: 8.0
      translate_max_y: 8.0
    }
    color_augmentation {
      hue_rotation_max: 25.0
      saturation_shift_max: 0.2
      contrast_scale_max: 0.1
      contrast_center: 0.5
    }
  }
  dataset_config {
    data_sources: {
      tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*"
      image_directory_path: "/workspace/tlt-experiments/data/training"
    }
    image_extension: "png"
    target_class_mapping {
      key: "car"
      value: "car"
    }
    target_class_mapping {
      key: "pedestrian"
      value: "pedestrian"
    }
    target_class_mapping {
      key: "cyclist"
      value: "cyclist"
    }
    target_class_mapping {
      key: "van"
      value: "car"
    }
    target_class_mapping {
      key: "person_sitting"
      value: "pedestrian"
    }
    validation_fold: 0
  }
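Compared with the sample above (:code:`n_kernels: 1`), the sketch below shows a heavier prediction head using the typical value from the RetinaNet Config table later in this section; it only illustrates which fields to touch, with all other fields unchanged:

.. code::

  retinanet_config {
    # deeper subnets for classification and anchor box regression
    n_kernels: 2        # typical value per the RetinaNet Config table below
    feature_size: 256
    # ... remaining fields as in the sample spec above
  }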
Training Config
^^^^^^^^^^^^^^^

The training configuration (:code:`training_config`) defines the parameters needed for training, evaluation and inference. Details are summarized in the table below.

+--------------------+-----------------+-------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+--------------------+-----------------+-------------------------------+-------------------------------+
| batch_size_per_gpu | The batch size for each GPU; the effective batch size is batch_size_per_gpu * num_gpus. | Unsigned int, positive | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| num_epochs | The number of epochs to train the network. | Unsigned int, positive | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| enable_qat | Whether to use quantization-aware training. | Boolean | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| learning_rate | Only soft_start_annealing_schedule with these nested parameters is supported: 1. min_learning_rate: the minimum learning rate seen during the entire experiment; 2. max_learning_rate: the maximum learning rate seen during the entire experiment; 3. soft_start: the time to lapse before warm-up completes (expressed as a fraction of total progress, between 0 and 1); 4. annealing: the time at which to start annealing the learning rate (same units as soft_start). | Message type | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| regularizer | Configures the regularizer to be used while training and contains the following nested parameters: 1. type: the type of regularizer to use; NVIDIA supports NO_REG, L1, and L2; 2. weight: the floating-point value of the regularizer weight. | Message type | L1 (Note: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more prunable.) |
+--------------------+-----------------+-------------------------------+-------------------------------+
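For example, following the pruning note above, a run intended for later pruning keeps the L1 type from the sample spec, while a final training run that will not be pruned could switch to L2; the weight shown is the sample's value:

.. code::

  regularizer {
    type: L1       # favors sparse, prunable weights
    weight: 2e-5
  }
  # alternative for a run that will not be pruned:
  # regularizer { type: L2 weight: 2e-5 }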
Evaluation Config
^^^^^^^^^^^^^^^^^

The evaluation configuration (:code:`eval_config`) defines the parameters needed for evaluation, either during training or standalone. Details are summarized in the table below.

+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| validation_period_during_training | The interval, in epochs, at which validation is run during training. | Unsigned int, positive | 10 |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| average_precision_mode | The Average Precision (AP) calculation mode, either SAMPLE or INTEGRATE. SAMPLE is the VOC metric for VOC 2009 and earlier; INTEGRATE is the VOC metric for VOC 2010 and later. | ENUM type (SAMPLE or INTEGRATE) | SAMPLE |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| matching_iou_threshold | The lowest IoU between a predicted box and a ground truth box that can be considered a match. | float | 0.5 |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+

NMS Config
^^^^^^^^^^

The NMS configuration (:code:`nms_config`) defines the parameters needed for the NMS postprocessing. The NMS config applies to the NMS layer of the model in training, validation, evaluation, inference, and export. Details are summarized in the table below.

+---------------------------+-----------------+-------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| confidence_threshold | Boxes with a confidence score lower than confidence_threshold are discarded before NMS is applied. | float | 0.01 |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| clustering_iou_threshold | The IoU threshold below which boxes go through the NMS process. | float | 0.6 |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| top_k | top_k boxes are output after the NMS keras layer. If the number of valid boxes is less than top_k, the returned array is padded with boxes whose confidence score is 0. | Unsigned int | 200 |
+---------------------------+-----------------+-------------------------------+-------------------------------+

Augmentation Config
^^^^^^^^^^^^^^^^^^^

The augmentation configuration (:code:`augmentation_config`) defines the parameters needed for data augmentation. The configuration is shared with DetectNet_v2.
See :ref:`Augmentation Module ` for more information. Dataset Config ^^^^^^^^^^^^^^ The dataset configuration (:code:`dataset_config`) defines the parameters needed for the data loader. The configuration is shared with DetectNet_v2. See :ref:`Dataloader` for more information. RetinaNet Config ^^^^^^^^^^^^^^^^ The RetinaNet configuration (:code:`retinanet_config`) defines the parameters needed for building the RetinaNet model. Details are summarized in the table below. +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | aspect_ratios_global | Anchor boxes of aspect ratios defined in aspect_ratios_global will be generated for each feature layer | string | “[1.0, 2.0, 0.5]” | | | used for prediction. Note: Only one of aspect_ratios_global or aspect_ratios is required. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | aspect_ratios | The length of the outer list must be equivalent to the number of feature layers used for anchor box | string | “[[1.0,2.0,0.5], | | | generation. And the i-th layer will have anchor boxes with aspect ratios defined in aspect_ratios[i]. | | [1.0,2.0,0.5], | | | | | [1.0,2.0,0.5], | | | **Note**: Only one of aspect_ratios_global or aspect_ratios is required. | | [1.0,2.0,0.5], | | | | | [1.0,2.0,0.5], | | | | | [1.0, 2.0, 0.5, 3.0, 0.33]]” | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | two_boxes_for_ar1 | This setting is only relevant for layers that have 1.0 as the aspect ratio. If two_boxes_for_ar1 is | Boolean | True | | | true, two boxes will be generated with an aspect ratio of 1. One whose scale is the scale for this | | | | | layer and the other one whose scale is the geometric mean of the scale for this layer and the scale for | | | | | the next layer. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | clip_boxes | If true, all corner anchor boxes will be truncated so they are fully inside the feature images. 
| Boolean | False | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | scales | scales is a list of positive floats containing scaling factors per convolutional predictor layer. This | string | “[0.05, 0.1, 0.25, 0.4, 0.55, 0.7, 0.85]” | | | list must be one element longer than the number of predictor layers, so if two_boxes_for_ar1 is true, | | | | | the second aspect ratio 1.0 box for the last layer can have a proper scale. Except for the last element | | | | | in this list, each positive float is the scaling factor for boxes in that layer. For example, if for one | | | | | layer the scale is 0.1, then the generated anchor box with aspect ratio 1 for that layer (the first aspect | | | | | ratio 1 box if two_boxes_for_ar1 is true) will have its height and width as 0.1*min(img_h, img_w). | | | | | | | | | | min_scale and max_scale are two positive floats. If both of them appear in the config, the program can | | | | | automatically generate the scales by evenly splitting the space between min_scale and max_scale. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | min_scale/max_scale | If both appear in the config, scales will be generated evenly by splitting the space between min_scale and | float | - | | | max_scale. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | loss_loc_weight | This is a positive float controlling how much location regression loss should contribute to the final loss. | float | 1.0 | | | The final loss is calculated as classification_loss + loss_loc_weight * loc_loss | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | focal_loss_alpha | Alpha in the focal loss equation. | float | 0.25 | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | focal_loss_gamma | Gamma in the focal loss equation. | float | 2.0 | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | variances | Variances should be a list of 4 positive floats. The four floats, in order, represent variances for box center x, | string | | | | box center y, log box height, log box width. The box offset for box center (cx, cy) and log box size (height/width) | | | | | w.r.t. anchor will be divided by their respective variance value.
Therefore, larger variances result in less | | | | | significant differences between two different boxes on encoded offsets. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | steps | An optional list inside quotation marks whose length is the number of feature layers for prediction. The elements | string | - | | | should be floats or tuples/lists of two floats. Steps define how many pixels apart the anchor box center points should | | | | | be. If the element is a float, both vertical and horizontal margin is the same. Otherwise, the first value is | | | | | step_vertical and the second value is step_horizontal. If steps are not provided, anchor boxes will be distributed | | | | | uniformly inside the image. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | offsets | An optional list of floats inside quotation marks whose length is the number of feature layers for prediction. The first | string | - | | | anchor box will have offsets[i]*steps[i] pixels margin from the left and top borders. If offsets are not provided, 0.5 | | | | | will be used as default value. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | arch | Backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”, “mobilenet_v1”, “mobilenet_v2” and | string | resnet | | | “squeezenet” are supported. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | nlayers | Number of conv layers in specific arch. For “resnet”, 10, 18, 34, 50 and 101 are supported. For “vgg”, 16 and 19 are | Unsigned int | - | | | supported. For “darknet”, 19 and 53 are supported. All other networks don’t have this configuration and users should | | | | | just delete this config from the config file. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | freeze_bn | Whether to freeze all batch normalization layers during training. | boolean | False | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | freeze_blocks | The list of block IDs to be frozen in the model during training. You can choose to freeze some of the CNN blocks in the model | list(repeated integers) | - | | | to make the training more stable and/or easier to converge. 
The definition of a block is heuristic for a specific architecture. | | | | | For example, by stride or by logical blocks in the model, etc. However, the block ID numbers identify the blocks in the model | • ResNet series. For the ResNet series, the block IDs valid | | | | in a sequential order so you don't have to know the exact locations of the blocks when you do training. A general principle to | for freezing is any subset of [0, 1, 2, 3] (inclusive) | | | | keep in mind is: the smaller the block ID, the closer it is to the model input; the larger the block ID, the closer it is to | • VGG series. For the VGG series, the block IDs valid for | | | | the model output. | freezing is any subset of[1, 2, 3, 4, 5] (inclusive) | | | | | • GoogLeNet. For the GoogLeNet, the block IDs valid for freezing | | | | You can divide the whole model into several blocks and optionally freeze a subset of it. Note that for FasterRCNN you can only | is any subset of[0, 1, 2, 3, 4, 5, 6, 7] (inclusive) | | | | freeze the blocks that are before the ROI pooling layer. Any layer after the ROI pooling layer will not be frozen any way. For | • MobileNet V1. For the MobileNet V1, the block IDs valid for | | | | different backbones, the number of blocks and the block ID for each block are different. It deserves some detailed explanations | freezing is any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11](inclusive) | | | | on how to specify the block ID's for each backbone. | • MobileNet V2. For the MobileNet V2, the block IDs valid for freezing | | | | | is any subset of [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13](inclusive) | | | | | • DarkNet. For the DarkNet 19 and DarkNet 53, the block IDs valid for freezing | | | | | is any subset of [0, 1, 2, 3, 4, 5](inclusive) | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | n_kernels | This setting controls the number of convolutional layers in the RetinaNet subnets for classification and anchor box regression. | Unsigned int | 2 | | | A larger value generates a larger network and usually means the network is harder to train. | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ | feature_size | This setting controls the number of channels of the convolutional layers in the RetinaNet subnets for classification and anchor | Unsigned int | 256 | | | box regression. A larger value gives a larger network and usually means the network is harder to train. | | | | | | | | | | Note that RetinaNet FPN generates 5 feature maps, thus the scales field requires a list of 6 scaling factors. The last number | | | | | is not used if two_boxes_for_ar1 is set to False. There are also three underlying scaling factors at each feature map level | | | | | (2^0, 2^⅓, 2^⅔ ). | | | +----------------------+---------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+ Focal loss is calculated as follows: .. 
image:: ../content/focal_loss_formula.png Variances: .. image:: ../content/variance_offset_calc.png Specification File for YOLOv3 ----------------------------- Below is a sample for the YOLOv3 spec file. It has 6 major components: :code:`yolo_config`, :code:`training_config`, :code:`eval_config`, :code:`nms_config`, :code:`augmentation_config`, and :code:`dataset_config`. The format of the spec file is a protobuf text (prototxt) message and each of its fields can be either a basic data type or a nested message. The top level structure of the spec file is summarized in the table below. .. code:: random_seed: 42 yolo_config { big_anchor_shape: "[(116,90), (156,198), (373,326)]" mid_anchor_shape: "[(30,61), (62,45), (59,119)]" small_anchor_shape: "[(10,13), (16,30), (33,23)]" matching_neutral_box_iou: 0.5 arch: "darknet" nlayers: 53 arch_conv_blocks: 2 loss_loc_weight: 5.0 loss_neg_obj_weights: 50.0 loss_class_weights: 1.0 freeze_bn: True freeze_blocks: 0 freeze_blocks: 1} training_config { batch_size_per_gpu: 16 num_epochs: 80 enable_qat: false learning_rate { soft_start_annealing_schedule { min_learning_rate: 5e-5 max_learning_rate: 2e-2 soft_start: 0.15 annealing: 0.8 } } regularizer { type: L1 weight: 3e-5 } } eval_config { validation_period_during_training: 10 average_precision_mode: SAMPLE batch_size: 16 matching_iou_threshold: 0.5 } nms_config { confidence_threshold: 0.01 clustering_iou_threshold: 0.6 top_k: 200 } augmentation_config { preprocessing { output_image_width: 1248 output_image_height: 384 output_image_channel: 3 crop_right: 1248 crop_bottom: 384 min_bbox_width: 1.0 min_bbox_height: 1.0 } spatial_augmentation { hflip_probability: 0.5 vflip_probability: 0.0 zoom_min: 0.7 zoom_max: 1.8 translate_max_x: 8.0 translate_max_y: 8.0 } color_augmentation { hue_rotation_max: 25.0 saturation_shift_max: 0.20000000298 contrast_scale_max: 0.10000000149 contrast_center: 0.5 } } dataset_config { data_sources: { tfrecords_path: "/workspace/tlt-experiments/data/tfrecords/kitti_trainval/kitti_trainval*" image_directory_path: "/workspace/tlt-experiments/data/training" } image_extension: "png" target_class_mapping { key: "car" value: "car" } target_class_mapping { key: "pedestrian" value: "pedestrian" } target_class_mapping { key: "cyclist" value: "cyclist" } target_class_mapping { key: "van" value: "car" } target_class_mapping { key: "person_sitting" value: "pedestrian" } validation_fold: 0 } Training Config ^^^^^^^^^^^^^^^ The training configuration(:code:`training_config`) defines the parameters needed for the training, evaluation and inference. Details are summarized in the table below. 
+--------------------+-----------------+-------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+--------------------+-----------------+-------------------------------+-------------------------------+
| batch_size_per_gpu | The batch size for each GPU; the effective batch size is batch_size_per_gpu * num_gpus. | Unsigned int, positive | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| num_epochs | The number of epochs to train the network. | Unsigned int, positive | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| enable_qat | Whether to use quantization-aware training. | Boolean | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| learning_rate | Only soft_start_annealing_schedule with these nested parameters is supported: 1. min_learning_rate: the minimum learning rate seen during the entire experiment; 2. max_learning_rate: the maximum learning rate seen during the entire experiment; 3. soft_start: the time to lapse before warm-up completes (expressed as a fraction of total progress, between 0 and 1); 4. annealing: the time at which to start annealing the learning rate (same units as soft_start). | Message type | - |
+--------------------+-----------------+-------------------------------+-------------------------------+
| regularizer | Configures the regularizer to be used while training and contains the following nested parameters: 1. type: the type of regularizer to use; NVIDIA supports NO_REG, L1, and L2; 2. weight: the floating-point value of the regularizer weight. | Message type | L1 (Note: NVIDIA suggests using the L1 regularizer when training a network before pruning, as L1 regularization helps make the network weights more prunable.) |
+--------------------+-----------------+-------------------------------+-------------------------------+
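Enabling quantization-aware training is a one-line change to the sample spec; the sketch below repeats the sample's other training_config values for context:

.. code::

  training_config {
    batch_size_per_gpu: 16
    num_epochs: 80
    enable_qat: true    # train with quantization-aware training enabled
    # learning_rate and regularizer blocks as in the sample spec above
  }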
Evaluation Config
^^^^^^^^^^^^^^^^^

The evaluation configuration (:code:`eval_config`) defines the parameters needed for evaluation, either during training or standalone. Details are summarized in the table below.

+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| validation_period_during_training | The interval, in epochs, at which validation is run during training. | Unsigned int, positive | 10 |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| average_precision_mode | The Average Precision (AP) calculation mode, either SAMPLE or INTEGRATE. SAMPLE is the VOC metric for VOC 2009 and earlier; INTEGRATE is the VOC metric for VOC 2010 and later. | ENUM type (SAMPLE or INTEGRATE) | SAMPLE |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+
| matching_iou_threshold | The lowest IoU between a predicted box and a ground truth box that can be considered a match. | float | 0.5 |
+-----------------------------------+-----------------+----------------------------------+-------------------------------+

NMS Config
^^^^^^^^^^

The NMS configuration (:code:`nms_config`) defines the parameters needed for the NMS postprocessing. The NMS config applies to the NMS layer of the model in training, validation, evaluation, inference, and export. Details are summarized in the table below.

+---------------------------+-----------------+-------------------------------+-------------------------------+
| **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| confidence_threshold | Boxes with a confidence score lower than confidence_threshold are discarded before NMS is applied. | float | 0.01 |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| clustering_iou_threshold | The IoU threshold below which boxes go through the NMS process. | float | 0.6 |
+---------------------------+-----------------+-------------------------------+-------------------------------+
| top_k | top_k boxes are output after the NMS keras layer. If the number of valid boxes is less than top_k, the returned array is padded with boxes whose confidence score is 0. | Unsigned int | 200 |
+---------------------------+-----------------+-------------------------------+-------------------------------+

Augmentation Config
^^^^^^^^^^^^^^^^^^^

The augmentation configuration (:code:`augmentation_config`) defines the parameters needed for data augmentation. The configuration is shared with DetectNet_v2. See :ref:`Augmentation module ` for more information.
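As one common adjustment, the sketch below turns off horizontal flips for data where left/right orientation matters, while keeping the sample's zoom range; only the fields shown are changed relative to the sample spec:

.. code::

  augmentation_config {
    spatial_augmentation {
      hflip_probability: 0.0   # orientation-sensitive labels: disable flips
      vflip_probability: 0.0
      zoom_min: 0.7
      zoom_max: 1.8
    }
    # preprocessing and color_augmentation as in the sample spec above
  }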
Dataset Config
^^^^^^^^^^^^^^

The dataset configuration (:code:`dataset_config`) defines the parameters needed for the data loader. The configuration is shared with DetectNet_v2. See :ref:`Dataloader ` for more information.

YOLOv3 Config
^^^^^^^^^^^^^

The YOLOv3 configuration (:code:`yolo_config`) defines the parameters needed for building the YOLOv3 model. Details are summarized in the table below.

+---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+ | **Field** | **Description** | **Data Type and Constraints** | **Recommended/Typical Value** | +---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+ | big_anchor_shape, mid_anchor_shape, and small_anchor_shape | Those settings should be 1-d arrays inside quotation marks. The elements of those arrays | string | Use kmeans.py attached in examples/yolo inside docker | | | are tuples representing the pre-defined anchor shape in the order of width, height. | | to generate those shapes | | | | | | | | The default YOLOv3 has 9 predefined anchor shapes. They are divided into 3 groups | | | | | corresponding to big, medium and small objects. The detection outputs corresponding to | | | | | different groups come from different depths in the network. Users should run the kmeans.py | | | | | file attached with the example notebook to determine the best anchor shapes for their own | | | | | dataset and put those anchor shapes in the spec file. It is worth noting that the number of | | | | | anchor shapes for any field is not limited to 3. Users only need to specify at least 1 anchor | | | | | shape in each of those three fields. | | | +---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+ | matching_neutral_box_iou | This field should be a float number between 0 and 1. Any anchor not matched to ground truth | float | 0.5 | | | boxes, but with an IOU higher than this float value with any ground truth box, will not have its | | | | | objectiveness loss back-propagated during training. This is to reduce false negatives. | | | +---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+ | arch_conv_blocks | Supported values are 0, 1 and 2. This value controls how many convolutional blocks are present | 0, 1 or 2 | 2 | | | among detection output layers. Set this value to 2 if you want to reproduce the meta | | | | | architecture of the original YOLOv3 model coming with DarkNet 53. Please note that this config setting | | | | | only controls the size of the YOLO meta architecture, and the size of the feature extractor has | | | | | nothing to do with this config field.
| arch_conv_blocks                                              | Supported values are 0, 1 and 2. This value controls how many convolutional blocks are                    | 0, 1 or 2                                                                    | 2                                                     |
|                                                               | present among the detection output layers. Set this value to 2 to reproduce the meta                      |                                                                              |                                                       |
|                                                               | architecture of the original YOLOv3 model that comes with DarkNet 53. Note that this setting              |                                                                              |                                                       |
|                                                               | only controls the size of the YOLO meta architecture; the size of the feature extractor is                |                                                                              |                                                       |
|                                                               | unrelated to this field.                                                                                  |                                                                              |                                                       |
+---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+
| loss_loc_weight, loss_neg_obj_weights, and loss_class_weights | These loss weights can be configured as float numbers.                                                    | float                                                                        | loss_loc_weight: 5.0                                  |
|                                                               |                                                                                                           |                                                                              | loss_neg_obj_weights: 50.0                            |
|                                                               | The YOLOv3 loss is a summation of localization loss, negative objectness loss, positive                   |                                                                              | loss_class_weights: 1.0                               |
|                                                               | objectness loss and classification loss. The weight of the positive objectness loss is fixed              |                                                                              |                                                       |
|                                                               | to 1, while the weights of the other losses are read from the config file.                                |                                                                              |                                                       |
+---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+
| arch                                                          | The backbone for feature extraction. Currently, “resnet”, “vgg”, “darknet”, “googlenet”,                  | string                                                                       | resnet                                                |
|                                                               | “mobilenet_v1”, “mobilenet_v2” and “squeezenet” are supported.                                            |                                                                              |                                                       |
+---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+
| nlayers                                                       | The number of conv layers in a specific arch. For “resnet”, 10, 18, 34, 50 and 101 are                    | Unsigned int                                                                 | -                                                     |
|                                                               | supported. For “vgg”, 16 and 19 are supported. For “darknet”, 19 and 53 are supported. All                |                                                                              |                                                       |
|                                                               | other networks don't have this configuration; for them, delete this field from the config                 |                                                                              |                                                       |
|                                                               | file.                                                                                                     |                                                                              |                                                       |
+---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+
| freeze_bn                                                     | Whether to freeze all batch normalization layers during training.                                         | boolean                                                                      | False                                                 |
+---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+
| freeze_blocks                                                 | The list of block IDs to be frozen in the model during training. You can choose to freeze                 | list (repeated integers)                                                     | -                                                     |
|                                                               | some of the CNN blocks in the model to make the training more stable and/or easier to                     |                                                                              |                                                       |
|                                                               | converge. The definition of a block is heuristic for a specific architecture, for example by              | • ResNet series: the block IDs valid for freezing are any subset of          |                                                       |
|                                                               | stride or by logical blocks in the model. However, the block ID numbers identify the blocks               |   [0, 1, 2, 3] (inclusive)                                                   |                                                       |
|                                                               | in the model in sequential order, so you don't have to know the exact locations of the                    | • VGG series: the block IDs valid for freezing are any subset of             |                                                       |
|                                                               | blocks when you train. A general principle to keep in mind: the smaller the block ID, the                 |   [1, 2, 3, 4, 5] (inclusive)                                                |                                                       |
|                                                               | closer it is to the model input; the larger the block ID, the closer it is to the model                   | • GoogLeNet: the block IDs valid for freezing are any subset of              |                                                       |
|                                                               | output.                                                                                                   |   [0, 1, 2, 3, 4, 5, 6, 7] (inclusive)                                       |                                                       |
|                                                               |                                                                                                           | • MobileNet V1: the block IDs valid for freezing are any subset of           |                                                       |
|                                                               | You can divide the whole model into several blocks and optionally freeze a subset of them.                |   [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] (inclusive)                         |                                                       |
|                                                               | Note that for FasterRCNN you can only freeze the blocks before the ROI pooling layer; any                 | • MobileNet V2: the block IDs valid for freezing are any subset of           |                                                       |
|                                                               | layer after the ROI pooling layer will not be frozen in any case. For different backbones,                |   [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] (inclusive)                 |                                                       |
|                                                               | the number of blocks and the block ID of each block are different, so the valid block IDs                 | • DarkNet: for DarkNet 19 and DarkNet 53, the block IDs valid for            |                                                       |
|                                                               | are listed per backbone on the right.                                                                     |   freezing are any subset of [0, 1, 2, 3, 4, 5] (inclusive)                  |                                                       |
+---------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------+-------------------------------------------------------+
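Putting the fields above together, a :code:`yolo_config` might look like the following sketch. The anchor shapes below are generic placeholders (the widely used YOLOv3 COCO anchors), not values tuned for your data; generate your own with kmeans.py as described above:

.. code::

    yolo_config {
      # Placeholder anchors; replace with kmeans.py output for your dataset.
      big_anchor_shape: "[(116, 90), (156, 198), (373, 326)]"
      mid_anchor_shape: "[(30, 61), (62, 45), (59, 119)]"
      small_anchor_shape: "[(10, 13), (16, 30), (33, 23)]"
      matching_neutral_box_iou: 0.5
      arch_conv_blocks: 2
      loss_loc_weight: 5.0
      loss_neg_obj_weights: 50.0
      loss_class_weights: 1.0
      arch: "resnet"
      nlayers: 18
      freeze_bn: False
      freeze_blocks: 0
      freeze_blocks: 1
    }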
Specification File for MaskRCNN
-------------------------------

Below is a sample MaskRCNN spec file. It has three major components: the top-level experiment configs, :code:`data_config`, and :code:`maskrcnn_config`, each explained in detail below. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message. The top-level structure of the spec file is summarized in the table below.

Here's a sample of the MaskRCNN spec file:

.. code::

    seed: 123
    use_amp: False
    warmup_steps: 0
    checkpoint: "/workspace/tlt-experiments/maskrcnn/pretrained_resnet50/tlt_instance_segmentation_vresnet50/resnet50.hdf5"
    learning_rate_steps: "[60000, 80000, 100000]"
    learning_rate_decay_levels: "[0.1, 0.02, 0.002]"
    total_steps: 120000
    train_batch_size: 2
    eval_batch_size: 4
    num_steps_per_eval: 10000
    momentum: 0.9
    l2_weight_decay: 0.0001
    warmup_learning_rate: 0.0001
    init_learning_rate: 0.02

    data_config{
        image_size: "(832, 1344)"
        augment_input_data: True
        eval_samples: 500
        training_file_pattern: "/workspace/tlt-experiments/data/train*.tfrecord"
        validation_file_pattern: "/workspace/tlt-experiments/data/val*.tfrecord"
        val_json_file: "/workspace/tlt-experiments/data/annotations/instances_val2017.json"

        # dataset specific parameters
        num_classes: 91
        skip_crowd_during_training: True
    }

    maskrcnn_config {
        nlayers: 50
        arch: "resnet"
        freeze_bn: True
        freeze_blocks: "[0,1]"
        gt_mask_size: 112

        # Region Proposal Network
        rpn_positive_overlap: 0.7
        rpn_negative_overlap: 0.3
        rpn_batch_size_per_im: 256
        rpn_fg_fraction: 0.5
        rpn_min_size: 0.

        # Proposal layer.
        batch_size_per_im: 512
        fg_fraction: 0.25
        fg_thresh: 0.5
        bg_thresh_hi: 0.5
        bg_thresh_lo: 0.

        # Faster-RCNN heads.
        fast_rcnn_mlp_head_dim: 1024
        bbox_reg_weights: "(10., 10., 5., 5.)"

        # Mask-RCNN heads.
        include_mask: True
        mrcnn_resolution: 28

        # training
        train_rpn_pre_nms_topn: 2000
        train_rpn_post_nms_topn: 1000
        train_rpn_nms_threshold: 0.7

        # evaluation
        test_detections_per_image: 100
        test_nms: 0.5
        test_rpn_pre_nms_topn: 1000
        test_rpn_post_nms_topn: 1000
        test_rpn_nms_thresh: 0.7

        # model architecture
        min_level: 2
        max_level: 6
        num_scales: 1
        aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
        anchor_scale: 8

        # localization loss
        rpn_box_loss_weight: 1.0
        fast_rcnn_box_loss_weight: 1.0
        mrcnn_weight_loss_mask: 1.0
    }

+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| **Field**                  | **Description**                                                                | **Data Type and Constraints** | **Recommended/Typical Value** |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| seed                       | The random seed for the experiment.                                            | Unsigned int                  | 123                           |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| warmup_steps               | The number of steps taken for the learning rate to ramp up to                  | Unsigned int                  | -                             |
|                            | init_learning_rate.                                                            |                               |                               |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| warmup_learning_rate       | The initial learning rate during the warmup phase.                             | float                         | -                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| learning_rate_steps        | A list of steps at which the learning rate decays by the factor specified      | string                        | -                             |
|                            | in learning_rate_decay_levels.                                                 |                               |                               |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| learning_rate_decay_levels | A list of decay factors. The length should match the length of                 | string                        | -                             |
|                            | learning_rate_steps.                                                           |                               |                               |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| total_steps                | The total number of training iterations.                                       | Unsigned int                  | -                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| train_batch_size           | Batch size during training.                                                    | Unsigned int                  | 4                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| eval_batch_size            | Batch size during validation or evaluation.                                    | Unsigned int                  | 8                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| num_steps_per_eval         | Save a checkpoint and run evaluation every N steps.                            | Unsigned int                  | -                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| momentum                   | Momentum of the SGD optimizer.                                                 | float                         | 0.9                           |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| l2_weight_decay            | L2 weight decay.                                                               | float                         | 0.0001                        |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| use_amp                    | Whether to use Automatic Mixed Precision training.                             | boolean                       | False                         |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| checkpoint                 | Path to a pretrained model.                                                    | string                        | -                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| maskrcnn_config            | The architecture of the model.                                                 | message                       | -                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| data_config                | Input data configuration.                                                      | message                       | -                             |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| skip_checkpoint_variables  | If specified, the weights of layers whose names match this regular expression  | string                        | -                             |
|                            | will not be loaded. This is especially helpful for transfer learning.          |                               |                               |
+----------------------------+--------------------------------------------------------------------------------+-------------------------------+-------------------------------+
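To make the learning-rate fields concrete, consider the sample values above (:code:`init_learning_rate: 0.02`, :code:`learning_rate_steps: "[60000, 80000, 100000]"`, :code:`learning_rate_decay_levels: "[0.1, 0.02, 0.002]"`). Assuming each decay level is applied as a factor on :code:`init_learning_rate`, the learning rate ramps up from :code:`warmup_learning_rate` to 0.02 over the first :code:`warmup_steps` steps, then drops to 0.02 × 0.1 = 0.002 at step 60000, 0.02 × 0.02 = 0.0004 at step 80000, and 0.02 × 0.002 = 0.00004 at step 100000.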
.. Note:: When using skip_checkpoint_variables, you can first find the model structure in the training log (part of the MaskRCNN+ResNet50 model structure is shown below). If, for example, you want to retrain all prediction heads, you can set skip_checkpoint_variables to “head”. TLT uses the Python re library to check whether “head” matches any layer name, i.e. re.search($skip_checkpoint_variables, $layer_name).

.. code::

    [MaskRCNN] INFO    : ================ TRAINABLE VARIABLES ==================
    [MaskRCNN] INFO    : [#0001] conv1/kernel:0                               => (7, 7, 3, 64)
    [MaskRCNN] INFO    : [#0002] bn_conv1/gamma:0                             => (64,)
    [MaskRCNN] INFO    : [#0003] bn_conv1/beta:0                              => (64,)
    [MaskRCNN] INFO    : [#0004] block_1a_conv_1/kernel:0                     => (1, 1, 64, 64)
    [MaskRCNN] INFO    : [#0005] block_1a_bn_1/gamma:0                        => (64,)
    [MaskRCNN] INFO    : [#0006] block_1a_bn_1/beta:0                         => (64,)
    [MaskRCNN] INFO    : [#0007] block_1a_conv_2/kernel:0                     => (3, 3, 64, 64)
    [MaskRCNN] INFO    : [#0008] block_1a_bn_2/gamma:0                        => (64,)
    [MaskRCNN] INFO    : [#0009] block_1a_bn_2/beta:0                         => (64,)
    [MaskRCNN] INFO    : [#0010] block_1a_conv_3/kernel:0                     => (1, 1, 64, 256)
    [MaskRCNN] INFO    : [#0011] block_1a_bn_3/gamma:0                        => (256,)
    [MaskRCNN] INFO    : [#0012] block_1a_bn_3/beta:0                         => (256,)
    [MaskRCNN] INFO    : [#0110] block_3d_bn_3/gamma:0                        => (1024,)
    [MaskRCNN] INFO    : [#0111] block_3d_bn_3/beta:0                         => (1024,)
    [MaskRCNN] INFO    : [#0112] block_3e_conv_1/kernel:0                     => (1, 1, 1024, 256)
    [MaskRCNN] INFO    : [#0144] block_4b_bn_1/beta:0                         => (512,)
    … … … … ...
    [MaskRCNN] INFO    : [#0174] fpn/post_hoc_d5/kernel:0                     => (3, 3, 256, 256)
    [MaskRCNN] INFO    : [#0175] fpn/post_hoc_d5/bias:0                       => (256,)
    [MaskRCNN] INFO    : [#0176] rpn_head/rpn/kernel:0                        => (3, 3, 256, 256)
    [MaskRCNN] INFO    : [#0177] rpn_head/rpn/bias:0                          => (256,)
    [MaskRCNN] INFO    : [#0178] rpn_head/rpn-class/kernel:0                  => (1, 1, 256, 3)
    [MaskRCNN] INFO    : [#0179] rpn_head/rpn-class/bias:0                    => (3,)
    [MaskRCNN] INFO    : [#0180] rpn_head/rpn-box/kernel:0                    => (1, 1, 256, 12)
    [MaskRCNN] INFO    : [#0181] rpn_head/rpn-box/bias:0                      => (12,)
    [MaskRCNN] INFO    : [#0182] box_head/fc6/kernel:0                        => (12544, 1024)
    [MaskRCNN] INFO    : [#0183] box_head/fc6/bias:0                          => (1024,)
    [MaskRCNN] INFO    : [#0184] box_head/fc7/kernel:0                        => (1024, 1024)
    [MaskRCNN] INFO    : [#0185] box_head/fc7/bias:0                          => (1024,)
    [MaskRCNN] INFO    : [#0186] box_head/class-predict/kernel:0              => (1024, 91)
    [MaskRCNN] INFO    : [#0187] box_head/class-predict/bias:0                => (91,)
    [MaskRCNN] INFO    : [#0188] box_head/box-predict/kernel:0                => (1024, 364)
    [MaskRCNN] INFO    : [#0189] box_head/box-predict/bias:0                  => (364,)
    [MaskRCNN] INFO    : [#0190] mask_head/mask-conv-l0/kernel:0              => (3, 3, 256, 256)
    [MaskRCNN] INFO    : [#0191] mask_head/mask-conv-l0/bias:0                => (256,)
    [MaskRCNN] INFO    : [#0192] mask_head/mask-conv-l1/kernel:0              => (3, 3, 256, 256)
    [MaskRCNN] INFO    : [#0193] mask_head/mask-conv-l1/bias:0                => (256,)
    [MaskRCNN] INFO    : [#0194] mask_head/mask-conv-l2/kernel:0              => (3, 3, 256, 256)
    [MaskRCNN] INFO    : [#0195] mask_head/mask-conv-l2/bias:0                => (256,)
    [MaskRCNN] INFO    : [#0196] mask_head/mask-conv-l3/kernel:0              => (3, 3, 256, 256)
    [MaskRCNN] INFO    : [#0197] mask_head/mask-conv-l3/bias:0                => (256,)
    [MaskRCNN] INFO    : [#0198] mask_head/conv5-mask/kernel:0                => (2, 2, 256, 256)
    [MaskRCNN] INFO    : [#0199] mask_head/conv5-mask/bias:0                  => (256,)
    [MaskRCNN] INFO    : [#0200] mask_head/mask_fcn_logits/kernel:0           => (1, 1, 256, 91)
    [MaskRCNN] INFO    : [#0201] mask_head/mask_fcn_logits/bias:0             => (91,)
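The following minimal Python sketch illustrates that matching logic. It is not TLT's actual loading code; the pattern value is hypothetical and the layer names are copied from the log above:

.. code::

    import re

    # Hypothetical value of skip_checkpoint_variables.
    pattern = "head"

    # Layer names taken from the training log above.
    layer_names = [
        "conv1/kernel:0",                      # backbone layer
        "rpn_head/rpn-class/kernel:0",         # contains "head"
        "box_head/class-predict/kernel:0",     # contains "head"
        "mask_head/mask_fcn_logits/kernel:0",  # contains "head"
    ]

    # A layer is skipped (re-initialized) if re.search finds the pattern
    # anywhere in its name; otherwise its weights come from the checkpoint.
    for name in layer_names:
        if re.search(pattern, name):
            print(f"{name}: skipped (re-initialized)")
        else:
            print(f"{name}: loaded from checkpoint")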
MaskRCNN Config
^^^^^^^^^^^^^^^

The MaskRCNN configuration (:code:`maskrcnn_config`) defines the model structure. This model is used for training, evaluation and inference. A detailed description is summarized in the table below. Currently, MaskRCNN only supports ResNet10/18/34/50/101 as its backbone.

+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| **Field**                 | **Description**                                                     | **Data Type and Constraints**                    | **Recommended/Typical Value**      |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| nlayers                   | The number of layers in the ResNet arch                             | Unsigned int                                     | 50                                 |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| arch                      | The backbone feature extractor name                                 | string                                           | resnet                             |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| freeze_bn                 | Whether to freeze all BatchNorm layers in the backbone              | boolean                                          | False                              |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| freeze_blocks             | A list of conv blocks in the backbone to freeze                     | string                                           | -                                  |
|                           |                                                                     |                                                  |                                    |
|                           |                                                                     | ResNet: for the ResNet series, the block IDs     |                                    |
|                           |                                                                     | valid for freezing are any subset of             |                                    |
|                           |                                                                     | [0, 1, 2, 3] (inclusive)                         |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| gt_mask_size              | Groundtruth mask size                                               | Unsigned int                                     | 112                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| rpn_positive_overlap      | The lower-bound IOU threshold to assign positive labels to anchors  | float                                            | 0.7                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| rpn_negative_overlap      | The upper-bound IOU threshold to assign negative labels to anchors  | float                                            | 0.3                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| rpn_batch_size_per_im     | The number of sampled anchors per image in RPN                      | Unsigned int                                     | 256                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| rpn_fg_fraction           | The desired fraction of positive anchors in a batch                 | float                                            | 0.5                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| rpn_min_size              | Minimum proposal height and width                                   | float                                            | 0                                  |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| batch_size_per_im         | RoI minibatch size per image                                        | Unsigned int                                     | 512                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| fg_fraction               | The target fraction of the RoI minibatch that is labeled as         | float                                            | 0.25                               |
|                           | foreground                                                          |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| fast_rcnn_mlp_head_dim    | The Fast-RCNN classification head dimension                         | Unsigned int                                     | 1024                               |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| bbox_reg_weights          | Bounding box regression weights                                     | string                                           | “(10, 10, 5, 5)”                   |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| include_mask              | Whether to include the mask head                                    | boolean                                          | True                               |
|                           |                                                                     |                                                  |                                    |
|                           |                                                                     |                                                  | (currently only True is supported) |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| mrcnn_resolution          | The mask head resolution                                            | Unsigned int                                     | 28                                 |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| train_rpn_pre_nms_topn    | The number of top-scoring RPN proposals to keep before applying     | Unsigned int                                     | 2000                               |
|                           | NMS (per FPN level)                                                 |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| train_rpn_post_nms_topn   | The number of top-scoring RPN proposals to keep after applying NMS  | Unsigned int                                     | 1000                               |
|                           | (total number produced)                                             |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| train_rpn_nms_threshold   | The NMS IOU threshold in RPN during training                        | float                                            | 0.7                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| test_detections_per_image | The number of bounding box candidates after NMS                     | Unsigned int                                     | 100                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| test_nms                  | The NMS IOU threshold during test                                   | float                                            | 0.5                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| test_rpn_pre_nms_topn     | The number of top-scoring RPN proposals to keep before applying     | Unsigned int                                     | 1000                               |
|                           | NMS (per FPN level) during test                                     |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| test_rpn_post_nms_topn    | The number of top-scoring RPN proposals to keep after applying NMS  | Unsigned int                                     | 1000                               |
|                           | (total number produced) during test                                 |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| test_rpn_nms_thresh       | The NMS IOU threshold in RPN during test                            | float                                            | 0.7                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| min_level                 | The minimum level of the output feature pyramid                     | Unsigned int                                     | 2                                  |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| max_level                 | The maximum level of the output feature pyramid                     | Unsigned int                                     | 6                                  |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| num_scales                | The number of anchor octave scales on each pyramid level (e.g. if   | Unsigned int                                     | 1                                  |
|                           | set to 3, the anchor scales are [2^0, 2^(1/3), 2^(2/3)])            |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| aspect_ratios             | A list of tuples representing the aspect ratios of anchors on each  | string                                           | "[(1.0, 1.0),                      |
|                           | pyramid level                                                       |                                                  | (1.4, 0.7),                        |
|                           |                                                                     |                                                  | (0.7, 1.4)]"                       |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| anchor_scale              | Scale of the base anchor size relative to the feature pyramid       | Unsigned int                                     | 8                                  |
|                           | stride                                                              |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| rpn_box_loss_weight       | Weight for adjusting the RPN box loss in the total loss             | float                                            | 1.0                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| fast_rcnn_box_loss_weight | Weight for adjusting the FastRCNN box regression loss in the total  | float                                            | 1.0                                |
|                           | loss                                                                |                                                  |                                    |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+
| mrcnn_weight_loss_mask    | Weight for adjusting the mask loss in the total loss                | float                                            | 1.0                                |
+---------------------------+---------------------------------------------------------------------+--------------------------------------------------+------------------------------------+

.. Note:: The :code:`min_level`, :code:`max_level`, :code:`num_scales`, :code:`aspect_ratios` and :code:`anchor_scale` parameters determine MaskRCNN's anchor generation. :code:`anchor_scale` is the base anchor's scale, while :code:`min_level` and :code:`max_level` set the range of scales across the feature maps. For example, the actual anchor scale for the feature map at :code:`min_level` is :code:`anchor_scale * 2^min_level`, and the actual anchor scale for the feature map at :code:`max_level` is :code:`anchor_scale * 2^max_level`. Anchors of the different :code:`aspect_ratios` are then generated based on the actual anchor scale.
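As a worked example using the sample values above (:code:`anchor_scale: 8`, :code:`min_level: 2`, :code:`max_level: 6`): the actual anchor scale is 8 * 2^2 = 32 at level 2 and doubles at each level, up to 8 * 2^6 = 512 at level 6. At every level, anchors with each of the configured :code:`aspect_ratios` are generated at that level's scale.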
Data Config
^^^^^^^^^^^

The data configuration (:code:`data_config`) specifies the input data source and format. This is used for training, evaluation and inference. A detailed description is summarized in the table below.

+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| **Field**                  | **Description**                                                  | **Data Type and Constraints** | **Recommended/Typical Value** |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| image_size                 | The image dimension as a tuple within quote marks:               | string                        | “(832, 1344)”                 |
|                            | “(height, width)” is the dimension of the resized and padded     |                               |                               |
|                            | input                                                            |                               |                               |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| augment_input_data         | Whether to augment the data                                      | boolean                       | True                          |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| eval_samples               | The number of samples for evaluation                             | Unsigned int                  | -                             |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| training_file_pattern      | The TFRecord path for training                                   | string                        | -                             |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| validation_file_pattern    | The TFRecord path for validation                                 | string                        | -                             |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| val_json_file              | The annotation file path for validation                          | string                        | -                             |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| num_classes                | The number of classes                                            | Unsigned int                  | -                             |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+
| skip_crowd_during_training | Whether to skip crowd annotations during training                | boolean                       | True                          |
+----------------------------+------------------------------------------------------------------+-------------------------------+-------------------------------+