EfficientDet
============

.. _efficientdet:

With EfficientDet, the following tasks are supported:

* train
* evaluate
* prune
* inference
* export

These tasks may be invoked from the TAO Toolkit Launcher using the following convention on the command line:

.. code::

  tao efficientdet <sub_task> <args_per_subtask>

where :code:`args_per_subtask` are the command-line arguments required for a given subtask. Each of these subtasks is explained in detail below.

Data Input for EfficientDet
---------------------------

EfficientDet expects directories of images for training or validation and annotation files in COCO format. See the :ref:`Data Annotation Format <data_annotation_format>` page for more information about the data format for EfficientDet.

The naming convention for the train/val split can be different, because the path of each set is individually specified in the data preparation script in the IPython notebook example. Image data and the corresponding annotation file are then converted to TFRecords for training.
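For reference, here is a minimal sketch of the COCO annotation structure EfficientDet consumes; file names, IDs, and coordinates are illustrative, and :code:`bbox` follows the COCO convention of :code:`[x, y, width, height]` in pixels:

.. code::

  {
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 512, "height": 512}],
    "annotations": [{"id": 1, "image_id": 1, "category_id": 1,
                     "bbox": [100.0, 80.0, 120.0, 150.0], "area": 18000.0, "iscrowd": 0}],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}]
  }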
Creating a Configuration File
-----------------------------

.. _specification_file_efficientdet:

Below is a sample EfficientDet spec file. It has five major components: :code:`model_config`, :code:`training_config`, :code:`eval_config`, :code:`augmentation_config`, and :code:`dataset_config`. The format of the spec file is a protobuf text (prototxt) message, and each of its fields can be either a basic data type or a nested message. The top-level structure of the spec file is summarized in the sample below:

.. code::

  training_config {
    train_batch_size: 16
    iterations_per_loop: 10
    checkpoint_period: 10
    num_examples_per_epoch: 14700
    num_epochs: 300
    model_name: 'efficientdet-d0'
    profile_skip_steps: 100
    tf_random_seed: 42
    lr_warmup_epoch: 5
    lr_warmup_init: 0.00005
    learning_rate: 0.1
    amp: True
    moving_average_decay: 0.9999
    l2_weight_decay: 0.00004
    l1_weight_decay: 0.0
    checkpoint: "/path/to/your/pretrained_model"
    # pruned_model_path: "/path/to/your/pruned/model"
  }
  dataset_config {
    num_classes: 91
    image_size: "512,512"
    training_file_pattern: "/path/to/coco/train-*"
    validation_file_pattern: "/path/to/coco/val-*"
    validation_json_file: "/path/to/coco/annotations/instances_val2017.json"
  }
  eval_config {
    eval_batch_size: 16
    eval_epoch_cycle: 10
    eval_after_training: True
    eval_samples: 5000
    min_score_thresh: 0.4
    max_detections_per_image: 100
  }
  model_config {
    model_name: 'efficientdet-d0'
    min_level: 3
    max_level: 7
    num_scales: 3
  }
  augmentation_config {
    rand_hflip: True
    random_crop_min_scale: 0.1
    random_crop_max_scale: 2.0
  }

Training Config
^^^^^^^^^^^^^^^

.. _training_config_efficientdet:

The training configuration (:code:`training_config`) defines the parameters needed for training, evaluation, and inference. Details are summarized in the table below.

+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| **Field**                  | **Description**                                                                            | **Data Type and Constraints** | **Recommended/Typical Value** |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| train_batch_size           | The batch size for each GPU; the effective batch size is batch_size_per_gpu * num_gpus    | Unsigned int, positive        | 16                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| num_epochs                 | The number of epochs to train the network                                                  | Unsigned int, positive        | 300                           |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| num_examples_per_epoch     | The total number of images in the training set divided by the number of GPUs              | Unsigned int, positive        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| checkpoint                 | The path to the pretrained model, if any                                                   | String                        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| pruned_model_path          | The path to a TAO pruned model for re-training, if any                                     | String                        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| checkpoint_period          | The number of training epochs that should run per model checkpoint/validation             | Unsigned int, positive        | 10                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| amp                        | Whether to use mixed-precision training                                                    | Boolean                       | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| moving_average_decay       | Moving average decay                                                                       | Float                         | 0.9999                        |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| l2_weight_decay            | L2 weight decay                                                                            | Float                         | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| l1_weight_decay            | L1 weight decay                                                                            | Float                         | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| lr_warmup_epoch            | The number of warmup epochs in the learning-rate schedule                                  | Unsigned int, positive        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| lr_warmup_init             | The initial learning rate in the warmup period                                             | Float                         | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| learning_rate              | The maximum learning rate                                                                  | Float                         | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| tf_random_seed             | The random seed                                                                            | Unsigned int, positive        | 42                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| clip_gradients_norm        | Clip gradients by the norm value                                                           | Float                         | 5                             |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| skip_checkpoint_variables  | If specified, the weights of the layers with matching regular expressions will not be     | String                        | "-predict*"                   |
|                            | loaded; this is especially helpful for transfer learning                                   |                               |                               |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
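As a worked example of how these values interact: with 8 GPUs and the COCO 2017 training split (118,287 images), the effective batch size is 16 * 8 = 128, and :code:`num_examples_per_epoch` comes out to 118287 / 8 ≈ 14785 (the sample spec above rounds this to 14700). A sketch with these illustrative values:

.. code::

  # illustrative values for an 8-GPU COCO run
  train_batch_size: 16            # effective batch size = 16 * 8 GPUs = 128
  num_examples_per_epoch: 14785   # 118287 images / 8 GPUs, rounded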
Evaluation Config
^^^^^^^^^^^^^^^^^

The evaluation configuration (:code:`eval_config`) defines the parameters needed for evaluation, either during training or standalone. Details are summarized in the table below.

+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| **Field**                  | **Description**                                                                            | **Data Type and Constraints** | **Recommended/Typical Value** |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| eval_epoch_cycle           | The number of training epochs that should run per validation                               | Unsigned int, positive        | 10                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| max_detections_per_image   | The maximum number of detections to visualize                                              | Unsigned int, positive        | 100                           |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| min_score_thresh           | The confidence score below which a predicted box is discarded                              | Float                         | 0.5                           |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| eval_batch_size            | The batch size for each GPU; the effective batch size is batch_size_per_gpu * num_gpus    | Unsigned int, positive        | 16                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| eval_samples               | The number of samples for evaluation                                                       | Unsigned int                  | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
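Because a validation pass runs against saved checkpoints, it is usually convenient to keep :code:`eval_epoch_cycle` in step with :code:`checkpoint_period` from :code:`training_config` (both are 10 in the sample spec above), so that every checkpoint gets evaluated. A minimal sketch:

.. code::

  training_config {
    checkpoint_period: 10  # save a checkpoint every 10 epochs
  }
  eval_config {
    eval_epoch_cycle: 10   # validate every 10 epochs, i.e. at each checkpoint
  }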
Dataset Config
^^^^^^^^^^^^^^

The dataset configuration (:code:`dataset_config`) specifies the input data source and format. This is used for training, evaluation, and inference. A detailed description is summarized in the table below.

+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| **Field**                  | **Description**                                                                            | **Data Type and Constraints** | **Recommended/Typical Value** |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| image_size                 | The image dimension as a tuple within quote marks; "(height, width)" indicates the        | String                        | "(512, 512)"                  |
|                            | dimension of the resized and padded input                                                  |                               |                               |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| training_file_pattern      | The TFRecord path for training                                                             | String                        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| validation_file_pattern    | The TFRecord path for validation                                                           | String                        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| validation_json_file       | The annotation file path for validation                                                    | String                        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| num_classes                | The number of classes: if there are N categories in the annotation, num_classes           | Unsigned int                  | --                            |
|                            | should be N+1 to account for the background class                                          |                               |                               |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| max_instances_per_image    | The maximum number of object instances to parse (default: 100)                             | Unsigned int                  | 100                           |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| skip_crowd_during_training | Specifies whether to skip crowd annotations during training                                | Boolean                       | True                          |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+

.. Note:: When training with multiple GPUs, the number of TFRecord shards matched by the file pattern should be greater than or equal to the number of GPUs.
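For example, if the data preparation step wrote the training set as 32 shards using a conventional :code:`-<shard>-of-<total>` suffix (the exact naming depends on your conversion script), the pattern from the sample spec matches all of them, and 32 shards satisfies the note above for any GPU count up to 32:

.. code::

  # hypothetical shard layout produced by the data preparation script
  /path/to/coco/train-00000-of-00032
  /path/to/coco/train-00001-of-00032
  ...
  /path/to/coco/train-00031-of-00032

  # matched by this dataset_config entry
  training_file_pattern: "/path/to/coco/train-*"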
Model Config
^^^^^^^^^^^^

The model configuration (:code:`model_config`) specifies the model structure. A detailed description is summarized in the table below.

+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| **Field**                  | **Description**                                                                            | **Data Type and Constraints** | **Recommended/Typical Value** |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| model_name                 | The EfficientDet model name (e.g. :code:`efficientdet-d0`)                                 | String                        | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| min_level                  | The minimum level of the output feature pyramid                                            | Unsigned int                  | 3                             |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| max_level                  | The maximum level of the output feature pyramid                                            | Unsigned int                  | 7                             |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| num_scales                 | The number of anchor octave scales on each pyramid level (e.g. if set to 3, the           | Unsigned int                  | 3                             |
|                            | anchor scales are [2^0, 2^(1/3), 2^(2/3)])                                                 |                               |                               |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| max_instances_per_image    | The maximum number of object instances to parse (default: 100)                             | Unsigned int                  | 100                           |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| aspect_ratios              | A list of tuples representing the aspect ratios of anchors on each pyramid level          | String                        | "[(1.0, 1.0), (1.4, 0.7),     |
|                            |                                                                                            |                               | (0.7, 1.4)]"                  |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| anchor_scale               | The scale of the base-anchor size relative to the feature-pyramid stride                   | Unsigned int                  | 4                             |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
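As a worked example of how these anchor parameters combine (using the default values above and the standard convention that pyramid level i has a feature stride of 2^i):

.. code::

  stride at min_level 3      = 2^3 = 8 pixels
  base anchor size at P3     = anchor_scale * stride = 4 * 8 = 32 pixels
  anchors per grid location  = num_scales * number of aspect_ratios = 3 * 3 = 9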
Augmentation Config
^^^^^^^^^^^^^^^^^^^

The :code:`augmentation_config` parameter defines image augmentation after preprocessing.

+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| **Field**                  | **Description**                                                                            | **Data Type and Constraints** | **Recommended/Typical Value** |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| rand_hflip                 | Whether to perform random horizontal flips                                                 | Boolean                       | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| random_crop_min_scale      | The minimum scale of the RandomCrop augmentation (default: 0.1)                            | Float                         | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+
| random_crop_max_scale      | The maximum scale of the RandomCrop augmentation (default: 2.0)                            | Float                         | --                            |
+----------------------------+--------------------------------------------------------------------------------------------+-------------------------------+-------------------------------+

Training the Model
------------------

.. _training_the_model_efficientdet:

Train the EfficientDet model using this command:

.. code::

  tao efficientdet train [-h] -e <experiment_spec>
                              -d <output_dir>
                              -k <key>
                              [--gpus <num_gpus>]
                              [--gpu_index <gpu_index>]
                              [--log_file <log_file_path>]

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-d, --model_dir`: The path to the folder where the experiment output is written
* :code:`-k, --key`: The encryption key to decrypt the model
* :code:`-e, --experiment_spec_file`: The experiment specification file to set up the training experiment

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`--gpus`: The number of GPUs to use for training in a multi-GPU scenario. The default value is 1.
* :code:`--gpu_index`: The indices of the GPUs to use for training. This argument can be used when the machine has multiple GPUs installed.
* :code:`--log_file`: The path to the log file. The default value is :code:`stdout`.
* :code:`-h, --help`: Show this help message and exit.
Input Requirement
^^^^^^^^^^^^^^^^^

* **Input size**: C * W * H (where C = 1 or 3; W >= 128; H >= 128; W, H are multiples of 32)
* **Image format**: JPG
* **Label format**: COCO detection

Sample Usage
^^^^^^^^^^^^

Here's an example of the :code:`train` command:

.. code::

  tao efficientdet train --gpus 2 -e /path/to/spec.txt -d /path/to/result -k $KEY

Evaluating the Model
--------------------

To run evaluation with an EfficientDet model, use this command:

.. code::

  tao efficientdet evaluate [-h] -e <experiment_spec>
                                 -m <model_file>
                                 -k <key>
                                 [--gpu_index <gpu_index>]
                                 [--log_file <log_file_path>]

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-e, --experiment_spec_file`: The experiment spec file to set up the evaluation experiment. This should be the same as the training specification file.
* :code:`-m, --model_path`: The path to the model file to use for evaluation (only the TAO model is supported)
* :code:`-k, --key`: The key to load the TAO model

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`--gpu_index`: The index of the GPU to use for evaluation. This argument can be used when the machine has multiple GPUs installed. Note that evaluation can only run on a single GPU.
* :code:`--log_file`: The path to the log file. The default value is :code:`stdout`.
* :code:`-h, --help`: Show this help message and exit.

Sample Usage
^^^^^^^^^^^^

Here's an example of using the :code:`evaluate` command:

.. code::

  tao efficientdet evaluate -e /path/to/spec.txt -m /path/to/model.tlt -k $KEY

Running Inference with an EfficientDet Model
--------------------------------------------

The inference tool for EfficientDet models can be used to visualize bounding boxes and generate frame-by-frame KITTI-format labels on a directory of images.

.. code::

  tao efficientdet inference [-h] -i <input_directory>
                                  -o <output_annotated_images_directory>
                                  -e <experiment_spec>
                                  -m <model_file>
                                  -k <key>
                                  [-l <output_label_directory>]
                                  [--label_map <label_map>]
                                  [--gpu_index <gpu_index>]
                                  [--log_file <log_file_path>]

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-m, --model_path`: The path to the trained model (supports both the TAO model and a TensorRT engine)
* :code:`-i, --in_image_path`: The directory of input images for inference
* :code:`-o, --out_image_path`: The directory path for the output annotated images
* :code:`-k, --key`: The key to load a TAO model (not required if a TensorRT engine is used)
* :code:`-e, --experiment_spec_file`: The path to an experiment spec file for training

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-l, --out_label_path`: The directory for the output KITTI labels (an example label line is shown after the sample usage below)
* :code:`--label_map`: The path to a text file of training labels
* :code:`--gpu_index`: The index of the GPU to run inference on. This argument can be used when the machine has multiple GPUs installed. Note that inference can only run on a single GPU.
* :code:`--log_file`: The path to the log file. The default value is :code:`stdout`.
* :code:`-h, --help`: Show this help message and exit.

Sample Usage
^^^^^^^^^^^^

Here's an example of using the :code:`inference` command:

.. code::

  tao efficientdet inference -e /path/to/spec.txt -m /path/to/model.tlt -k $KEY -o /path/to/output_dir -i /path/to/input_dir
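When :code:`-l` is specified, one text file per image is written in KITTI format. A sketch of a single label line is shown below; the class name, the four box corner coordinates in pixels, and the trailing confidence score are the meaningful fields for detections, the remaining KITTI fields are zero-filled, and the values here are illustrative:

.. code::

  car 0.00 0 0.00 100.00 80.00 220.00 230.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.85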
Pruning the Model
-----------------

.. _pruning_the_model_efficientdet:

Pruning removes parameters from the model to reduce the model size without compromising the integrity of the model itself. Pruning is done with the :code:`tao efficientdet prune` command.

The :code:`tao efficientdet prune` command includes these parameters:

.. code::

  tao efficientdet prune [-h] -m <model>
                              -o <output_dir>
                              -k <key>
                              [-n <normalizer>]
                              [-eq <equalization_criterion>]
                              [-pg <pruning_granularity>]
                              [-pth <pruning_threshold>]
                              [-nf <min_num_filters>]
                              [-el <excluded_list>]
                              [--gpu_index <gpu_index>]
                              [--log_file <log_file_path>]

Required Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-m, --model`: The path to a pretrained EfficientDet model
* :code:`-o, --output_dir`: The path to the output checkpoints
* :code:`-k, --key`: The key to load a :code:`.tlt` model

Optional Arguments
^^^^^^^^^^^^^^^^^^

* :code:`-n, --normalizer`: Specify ``max`` to normalize by dividing each norm by the maximum norm within a layer; specify ``L2`` to normalize by dividing by the L2 norm of the vector comprising all kernel norms. The default value is ``max``.
* :code:`-eq, --equalization_criterion`: The criterion to equalize the stats of inputs to an element-wise op layer or depth-wise convolutional layer. This parameter is useful for ResNets and MobileNets. The options are :code:`arithmetic_mean`, :code:`geometric_mean`, :code:`union`, and :code:`intersection`. The default value is :code:`union`.
* :code:`-pg, --pruning_granularity`: The number of filters to remove at a time. The default value is 8.
* :code:`-pth`: The threshold to compare the normalized norm against. The default value is 0.1.

  .. Note:: NVIDIA recommends changing the threshold to keep the number of parameters in the model to within 10-20% of the original unpruned model.

* :code:`-nf, --min_num_filters`: The minimum number of filters to keep per layer. The default value is 16.
* :code:`-el, --excluded_layers`: A list of excluded layers (e.g. :code:`-el item1 item2`). The default value is :code:`[]`.
* :code:`--gpu_index`: The index of the GPU to run pruning on. This argument can be used when the machine has multiple GPUs installed. Note that pruning can only run on a single GPU.
* :code:`--log_file`: The path to the log file. The default value is :code:`stdout`.
* :code:`-h, --help`: Show this help message and exit.

After pruning, the model needs to be retrained. See :ref:`Re-training the Pruned Model <re-training_the_pruned_model_efficientdet>` for more details.

.. Note:: Due to the complexity of larger EfficientDet models, the pruning process will take significantly longer to finish. For example, pruning the EfficientDet-D5 model may take at least 25 minutes on a V100 server.

Using the Prune Command
^^^^^^^^^^^^^^^^^^^^^^^

Here's an example of using the :code:`tao efficientdet prune` command:

.. code::

  tao efficientdet prune -m /path/to/model.step-0.tlt \
                         -o /path/to/pruned_model/ \
                         -eq union \
                         -pth 0.7 -k $KEY

Re-training the Pruned Model
----------------------------

.. _re-training_the_pruned_model_efficientdet:

Once the model has been pruned, there might be a slight decrease in accuracy because some previously useful weights may have been removed. To regain the accuracy, we recommend retraining the pruned model over the same dataset. To do this, use the :code:`tao efficientdet train` command as documented in :ref:`Training the model <training_the_model_efficientdet>`, with an updated spec file that points to the newly pruned model as the :code:`pruned_model_path`.

We recommend turning off or reducing the regularizers in the :code:`training_config` for EfficientDet to recover the accuracy when retraining a pruned model: set :code:`l1_weight_decay` and :code:`l2_weight_decay` to 0 (or to smaller values), as described in the :ref:`Training config <training_config_efficientdet>` section. All other parameters may be retained in the spec file from the previous training.
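A minimal sketch of the :code:`training_config` changes for retraining a pruned model (paths are placeholders):

.. code::

  training_config {
    # ... other fields retained from the previous training
    pruned_model_path: "/path/to/pruned_model/model.tlt"
    l1_weight_decay: 0.0   # regularizers relaxed to help recover accuracy
    l2_weight_decay: 0.0
  }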
Exporting the Model
-------------------

.. _exporting_the_model_efficientdet:

Exporting the model decouples the training process from deployment and allows conversion to TensorRT engines outside the TAO environment. TensorRT engines are specific to each hardware configuration and should be generated for each unique inference environment. The exported model may be used universally across training and deployment hardware.

The exported model format is referred to as :code:`.etlt`. The :code:`.etlt` model format is also an encrypted model format, and it uses the same key as the :code:`.tlt` model that it is exported from. This key is required when deploying the model.

INT8 Mode Overview
^^^^^^^^^^^^^^^^^^

TensorRT engines can be generated in INT8 mode to improve performance, but they require a calibration cache at engine-creation time. The calibration cache is generated using a calibration tensorfile if :code:`tao efficientdet export` is run with the :code:`--data_type` flag set to :code:`int8`. Pre-generating the calibration information and caching it removes the need for calibrating the model on the inference machine. Moving the calibration cache is usually much more convenient than moving the calibration tensorfile, since it is a much smaller file and can be moved with the exported model. Using the calibration cache also speeds up engine creation, as building the cache can take several minutes depending on the size of the tensorfile and the model itself.

The export tool can generate an INT8 calibration cache by ingesting training data using either of these options:

* **Option 1**: Use the training data loader to load the training images for INT8 calibration. This option is now the recommended approach, as it supports multiple image directories by leveraging the training dataset loader. It also ensures two important aspects of the data during calibration:

  * Data pre-processing in the INT8 calibration step is the same as in the training process.
  * The data batches are sampled randomly across the entire training dataset, thereby improving the accuracy of the INT8 model.

* **Option 2**: Point the tool to a directory of images that you want to use to calibrate the model. For this option, make sure to create a sub-sampled directory of random images that best represent your training dataset.

FP16/FP32 Model
^^^^^^^^^^^^^^^

The :code:`calibration.bin` file is only required if you need to run inference at INT8 precision. For FP16/FP32-based inference, the export step is much simpler: it merely requires you to convert the :code:`.tlt` model from the training/retraining step to :code:`.etlt`.

Exporting the EfficientDet Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here's an example of the command-line arguments of the :code:`tao efficientdet export` command:

.. code::

  tao efficientdet export [-h] -m <model_path>
                               -e <experiment_spec>
                               -k <key>
                               [-o <output_path>]
                               [--cal_data_file <cal_data_file>]
                               [--cal_image_dir <cal_image_dir>]
                               [--cal_cache_file <cal_cache_file>]
                               [--data_type <data_type>]
                               [--batches <batches>]
                               [--batch_size <batch_size>]
                               [--max_batch_size <max_batch_size>]
                               [--max_workspace_size <max_workspace_size>]
                               [--engine_file <engine_file>]
                               [--gpu_index <gpu_index>]
                               [--log_file <log_file_path>]
                               [--verbose]

Required Arguments
******************

* :code:`-m, --model_path`: The path to the :code:`.tlt` model file to be exported
* :code:`-k, --key`: The key used to save the :code:`.tlt` model file
* :code:`-e, --experiment_spec`: The path to the spec file
* :code:`-o, --output_path`: The path to save the exported model to

Optional Arguments
******************

* :code:`--data_type`: The desired engine data type. The options are :code:`fp32`, :code:`fp16`, and :code:`int8`; a calibration cache is generated in INT8 mode. The default value is :code:`fp32`. If using INT8, the INT8 arguments below are required.
* :code:`--gpu_index`: The index of the (discrete) GPU to use for exporting the model. You can specify the GPU index to run export on if the machine has multiple GPUs installed. Note that export can only run on a single GPU.
* :code:`--log_file`: The path to the log file. The default value is :code:`stdout`.
* :code:`-h, --help`: Show this help message and exit.
INT8 Export Mode Required Arguments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* :code:`--cal_image_dir`: The directory of images to use for calibration
* :code:`--cal_cache_file`: The path where the calibration cache file should be saved

INT8 Export Optional Arguments
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* :code:`--batches`: The number of batches to use for calibration and inference testing. The default value is 10.
* :code:`--batch_size`: The batch size to use for calibration. The default value is 16.
* :code:`--max_batch_size`: The maximum batch size of the TensorRT engine. The default value is 1.
* :code:`--max_workspace_size`: The maximum workspace size of the TensorRT engine (in GB). The default value is 2.
* :code:`--engine_file`: The path to the serialized TensorRT engine file. Note that this file is hardware specific and cannot be generalized across GPUs. The engine file is useful for quickly testing your model accuracy using TensorRT on the host. Because the TensorRT engine file is hardware specific, you cannot use it for deployment unless the deployment GPU is identical to the training GPU.

.. Note:: Due to the complexity of EfficientDet models, the export process with TensorRT engine serialization will take some time to finish. For example, it may take several minutes on a V100 and more than an hour on a Xavier.

Sample usage
^^^^^^^^^^^^

Here's a sample command to export an EfficientDet model in INT8 mode:

.. code::

  tao efficientdet export -m /path/to/model.step-0.tlt \
                          -o /path/to/export/model.step-0.etlt \
                          -e /ws/spec.txt \
                          -k $KEY \
                          --cal_image_dir /ws/data/ \
                          --data_type int8 \
                          --batch_size 1 \
                          --batches 10 \
                          --cal_cache_file /path/to/export/cal.bin \
                          --cal_data_file /path/to/export/cal.tensorfile

Deploying to DeepStream
-----------------------

.. _deploying_to_deepstream_efficientdet:

.. include:: ../excerpts/deploying_to_deepstream.rst

.. _here: https://docs.nvidia.com/metropolis/deepstream/dev-guide/index.html

TensorRT Open Source Software (OSS)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A TensorRT OSS build is required for EfficientDet models because several TensorRT plugins that these models need are only available in the TensorRT open source repository, not in the general TensorRT release. Specifically, EfficientDet needs the :code:`batchTilePlugin` and :code:`NMSPlugin`.

If the deployment platform is x86 with an NVIDIA GPU, follow the instructions for x86; if your deployment is on the NVIDIA Jetson platform, follow the instructions for Jetson.

TensorRT OSS on x86
*******************

.. include:: ../excerpts/tensorrt_oss_on_x86.rst

TensorRT OSS on Jetson (ARM64)
******************************

.. include:: ../excerpts/tensorrt_oss_on_jetson_arm64.rst

Generating an Engine Using tao-converter
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _generating_an_engine_using_tao-converter_efficientdet:

.. include:: ../excerpts/generating_an_engine_using_tao-converter.rst

Instructions for x86
********************

.. include:: ../excerpts/instructions_for_x86_with_OSS.rst

Instructions for Jetson
***********************

.. include:: ../excerpts/instructions_for_jetson_with_OSS.rst

Using the tao-converter
***********************
.. code::

  tao-converter [-h] -k <encryption_key>
                     -d <input_dimensions>
                     -o <comma separated output nodes>
                     [-c <path to calibration cache file>]
                     [-e <path to output engine>]
                     [-b <calibration batch size>]
                     [-m <maximum batch size of the TRT engine>]
                     [-t <engine datatype>]
                     [-w <maximum workspace size of the TRT engine>]
                     [-i <input dimension ordering>]
                     [-p <optimization_profiles>]
                     [-s]
                     [-u <DLA_core>]
                     input_file

Required Arguments
~~~~~~~~~~~~~~~~~~

* :code:`input_file`: The path to the :code:`.etlt` model exported using :code:`tao efficientdet export`
* :code:`-k`: The key used to encode the :code:`.tlt` model when training
* :code:`-d`: A comma-separated list of input dimensions that should match the dimensions used for :code:`tao efficientdet export`
* :code:`-o`: A comma-separated list of output blob names that should match the output configuration used for :code:`tao efficientdet export`. For EfficientDet, set this argument to :code:`NMS`.

Optional Arguments
~~~~~~~~~~~~~~~~~~

* :code:`-e`: The path to save the engine to. The default path is :code:`./saved.engine`.
* :code:`-t`: The desired engine data type. The options are :code:`fp32`, :code:`fp16`, and :code:`int8`; a calibration cache is generated in INT8 mode. The default value is :code:`fp32`.
* :code:`-w`: The maximum workspace size for the TensorRT engine. The default value is :code:`1073741824(1<<30)`.
* :code:`-i`: The input dimension ordering; all other TAO commands use NCHW. The options are :code:`nchw`, :code:`nhwc`, and :code:`nc`. For EfficientDet, you can omit this argument since the default value is :code:`nchw`.
* :code:`-p`: Optimization profiles for :code:`.etlt` models with dynamic shape, provided as a comma-separated list of optimization profile shapes in the format :code:`<input_name>,<min_shape>,<opt_shape>,<max_shape>`, where each shape has the format :code:`<n>x<c>x<h>x<w>`. This argument can be specified multiple times if there are multiple input tensors for the model. This is only useful for new models introduced in version 3.0; it is not required for models that were already in version 2.0.
* :code:`-s`: A Boolean to apply TensorRT strict type constraints when building the TensorRT engine.
* :code:`-u`: Specifies the DLA core index to use when building the TensorRT engine on Jetson devices.

INT8 Mode Arguments
~~~~~~~~~~~~~~~~~~~

* :code:`-c`: The path to the calibration cache file; this argument is only used in INT8 mode. The default value is :code:`./cal.bin`.
* :code:`-b`: The batch size used during the export step for INT8 calibration cache generation. The default value is :code:`8`.
* :code:`-m`: The maximum batch size for the TensorRT engine. The default value is :code:`16`. If you encounter out-of-memory issues, decrease the batch size accordingly.

.. Note:: Due to the complexity of EfficientDet models, the conversion process will take some time to finish. For example, it may take several minutes on a V100 and more than an hour on a Xavier.

Sample Output Log
~~~~~~~~~~~~~~~~~

Here is a sample command for converting an EfficientDet model:

.. code::

  tao converter -k $KEY \
                -c /export/model.step-0.cal \
                -p Input,1x512x512x3,8x512x512x3,16x512x512x3 \
                -e /export/trt.int8.engine \
                -t int8 \
                -b 8 \
                /export/model.step-0.etlt
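For an FP16 engine, no calibration inputs are needed; a minimal sketch using the same placeholder paths as above:

.. code::

  tao converter -k $KEY \
                -p Input,1x512x512x3,8x512x512x3,16x512x512x3 \
                -e /export/trt.fp16.engine \
                -t fp16 \
                /export/model.step-0.etlt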
Integrating the model to DeepStream
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two options for integrating models from TAO with DeepStream:

* **Option 1**: Integrate the model (:code:`.etlt`) with the encrypted key directly in the DeepStream app. The model file is generated by :code:`tao efficientdet export`.
* **Option 2**: Generate a device-specific optimized TensorRT engine using tao-converter. The TensorRT engine file can also be ingested by DeepStream.

For EfficientDet, you will need to build the TensorRT open source plugins and a custom bounding-box parser. The instructions are provided in the TensorRT OSS section above, and the required code can be found in this `GitHub repo`_.

.. _GitHub repo: https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps

To integrate the models with DeepStream, you need the following:

* The DeepStream SDK (`download page`_). The installation instructions for DeepStream are provided in the `DeepStream Development Guide`_.
* An exported :code:`.etlt` model file and an optional calibration cache for INT8 precision.
* `TensorRT 8+ OSS Plugins`_.
* A :code:`labels.txt` file containing the labels for the classes in the order in which the network produces outputs.
* A sample :code:`config_infer_*.txt` file to configure the nvinfer element in DeepStream. The nvinfer element handles everything related to TensorRT optimization and engine creation in DeepStream.

.. _download page: https://developer.nvidia.com/deepstream-download
.. _DeepStream Development Guide: https://docs.nvidia.com/metropolis/deepstream/dev-guide/index.html
.. _TensorRT 8+ OSS Plugins : https://github.com/NVIDIA/TensorRT/tree/release/8.0

The DeepStream SDK ships with an end-to-end reference application that is fully configurable: you can configure the input sources, inference model, and output sinks. The app requires a primary object detection model, followed by an optional secondary classification model. The reference application is installed as :code:`deepstream-app`. The graphic below shows the architecture of the reference application.

.. figure:: ../../content/arch_ref_appl.png
   :class: with-shadow
   :align: center
   :width: 80 %

There are typically two or more configuration files that are used with this app. In the install directory, the config files are located in :code:`samples/configs/deepstream-app` or :code:`samples/configs/tlt_pretrained_models`. The main config file configures all the high-level parameters in the pipeline above: it sets the input source and resolution, the number of inferences, the tracker, and the output sinks. The other supporting config files are for each individual inference engine; they specify the models, inference resolution, batch size, number of classes, and other customization. The main config file references all the supporting config files. Here are some config files in :code:`samples/configs/deepstream-app` for reference:

* :code:`source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt`: The main config file
* :code:`config_infer_primary.txt`: The supporting config file for the primary detector in the pipeline above
* :code:`config_infer_secondary_*.txt`: The supporting config file for the secondary classifier in the pipeline above

The :code:`deepstream-app` will only work with the main config file. This file will most likely remain the same for all models and can be used directly from the DeepStream SDK with little to no change. You will only need to modify or create :code:`config_infer_primary.txt` and :code:`config_infer_secondary_*.txt`.

Integrating an EfficientDet Model
*********************************

To run an EfficientDet model in DeepStream, you need a label file and a DeepStream configuration file. In addition, you need to compile the TensorRT 8+ OSS plugins and the EfficientDet bounding-box parser for DeepStream.

A DeepStream sample with documentation on how to run inference using the trained EfficientDet models from TAO Toolkit is provided on GitHub here_.

Prerequisite for EfficientDet Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. EfficientDet requires the ResizeNearest_TRT and EfficientNMS_TRT plugins.
   These plugins are available in the TensorRT open source repo. Detailed instructions to build TensorRT OSS can be found in `TensorRT Open Source Software (OSS)`_ above.

2. EfficientDet requires custom bounding-box parsers that are not built into the DeepStream SDK. The source code to build custom bounding-box parsers for EfficientDet is available here_. The following instructions can be used to build the bounding-box parser:

   a. Install git-lfs_ (git >= 1.8.2):

      .. _git-lfs: https://github.com/git-lfs/git-lfs/wiki/Installation

      .. code::

        curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
        sudo apt-get install git-lfs
        git lfs install

   b. Download the source code with SSH or HTTPS:

      .. code::

        git clone -b release/tlt3.0 https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps

   c. Build the custom bounding-box parser:

      .. code::

        // or Path for DS installation
        export CUDA_VER=10.2    // CUDA version, e.g. 10.2
        make

      This generates :code:`libnvds_infercustomparser_tlt.so` in the :code:`post_processor` directory.

Label File
^^^^^^^^^^

If the COCO annotation file has the following in :code:`categories`:

.. code::

  [{'supercategory': 'person', 'id': 1, 'name': 'person'},
   {'supercategory': 'car', 'id': 2, 'name': 'car'}]

then the corresponding label file (:code:`efficientdet_d0_labels.txt` in the sample configuration below) will be as follows:

.. code::

  BG
  person
  car

DeepStream Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The detection model is typically used as a primary inference engine. It can also be used as a secondary inference engine. To run this model in the sample :code:`deepstream-app`, you must modify the existing :code:`config_infer_primary.txt` file to point to this model.

.. figure:: ../../content/dstream_deploy_options2.png
   :class: with-shadow
   :align: center
   :width: 62 %

**Option 1**: Integrate the model (:code:`.etlt`) directly in the DeepStream app. For this option, add the following parameters to the configuration file. The :code:`int8-calib-file` parameter is only required for INT8 precision.

.. code::

  tlt-encoded-model=<TLT exported .etlt>
  tlt-model-key=<Model export key>
  int8-calib-file=<Calibration cache file>

The :code:`tlt-encoded-model` parameter points to the exported model (:code:`.etlt`) from TLT. The :code:`tlt-model-key` is the encryption key used during model export.

**Option 2**: Integrate the TensorRT engine file with the DeepStream app.

1. Generate the TensorRT engine using tao-converter. Detailed instructions are provided in the :ref:`Generating an engine using tao-converter <generating_an_engine_using_tao-converter_efficientdet>` section above.

2. Once the engine file is generated successfully, modify the following parameter to use this engine with DeepStream:

   .. code::

     model-engine-file=<PATH to generated TensorRT engine>

All other parameters are common between the two approaches. To use the custom bounding-box parser instead of the default parsers in DeepStream, modify the following parameters in the :code:`[property]` section of the primary infer configuration file:

.. code::

  parse-bbox-func-name=NvDsInferParseCustomEfficientDetTAO
  custom-lib-path=<PATH to libnvds_infercustomparser_tlt.so>

Add the label file generated above using the following parameter:

.. code::

  labelfile-path=<EfficientDet label file>

For all the options, see the sample configuration file below. To learn more about the parameters, refer to the `DeepStream Development Guide`_.
.. code::

  [property]
  gpu-id=0
  net-scale-factor=1.0
  offsets=0;0;0
  model-color-format=0
  network-input-order=1
  labelfile-path=efficientdet_d0_labels.txt
  model-engine-file=./d0_avlp_bs1_int8.engine
  int8-calib-file=d0.cal
  tlt-encoded-model=d0_avlp.etlt
  tlt-model-key=nvidia_tlt
  infer-dims=3;512;512
  maintain-aspect-ratio=1
  uff-input-blob-name=image_arrays:0
  batch-size=1
  ## 0=FP32, 1=INT8, 2=FP16 mode
  network-mode=2
  num-detected-classes=1
  interval=0
  gie-unique-id=1
  is-classifier=0
  #network-type=0
  cluster-mode=4
  output-blob-names=num_detections;detection_boxes;detection_scores;detection_classes
  parse-bbox-func-name=NvDsInferParseCustomEfficientDetTAO
  custom-lib-path=nvdsinfer_custombboxparser_efficientdet_tao.so

  [class-attrs-all]
  pre-cluster-threshold=0.3
  roi-top-offset=0
  roi-bottom-offset=0
  detected-min-w=0
  detected-min-h=0
  detected-max-w=0
  detected-max-h=0
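Once a main :code:`deepstream-app` configuration file points to the infer configuration above, the reference application can be launched against it; a minimal sketch, where the main config file name is a placeholder:

.. code::

  deepstream-app -c deepstream_app_config_efficientdet.txt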