NVIDIA TAO Toolkit v5.2.0

MLRecogNet with TAO Deploy

To generate an optimized TensorRT engine, tao-deploy takes as input the MLRecogNet .onnx file produced by tao export. MLRecogNet currently supports the FP32, FP16, and INT8 data types.

For more information about training an MLRecogNet model, refer to the MLRecogNet training documentation.
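
The export step that produces this .onnx file happens before any of the commands on this page; a minimal sketch is shown below, where the export override keys and environment variables are assumptions rather than documented values (refer to the MLRecogNet training documentation for the exact export spec):

# Sketch only: export a trained MLRecogNet checkpoint to ONNX before running tao-deploy.
# The export.* override keys and the variables below are assumptions; see the training docs.
tao model metric_learning_recognition export -e $EXPORT_SPEC \
                  export.checkpoint=$TRAINED_CHECKPOINT \
                  export.onnx_file=$ONNX_FILE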

Here is an example spec file, $TRT_GEN_SPEC, for generating a TensorRT engine from the exported MLRecogNet ONNX model:

results_dir: /path/to/results/dir
dataset:
  val_dataset:
    reference: /path/to/reference/set
    query: /path/to/query/set
  pixel_mean: [0.485, 0.456, 0.406]
  pixel_std: [0.226, 0.226, 0.226]
model:
  input_channel: 3
  input_width: 224
  input_height: 224
gen_trt_engine:
  gpu_id: 0
  onnx_file: /path/to/exported/onnx/file
  trt_engine: /path/to/trt/engine/to/generate
  tensorrt:
    data_type: int8
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 10
    max_batch_size: 10
    calibration:
      cal_cache_file: /path/to/calibration/cache/file/to/generate
      cal_batch_size: 16
      cal_batches: 100
      cal_image_dir:
        - /path/to/calibration/image/folder

trt_config

The trt_config parameter provides options related to TensorRT engine generation.

  • data_type (string, default: FP32): The precision to be used for the TensorRT engine. Supported values: FP32, FP16, INT8.

  • workspace_size (unsigned int, default: 1024): The maximum workspace size for the TensorRT engine, in MB. Supported values: >1024.

  • min_batch_size (unsigned int, default: 1): The minimum batch size for the optimization profile shape. Supported values: >0.

  • opt_batch_size (unsigned int, default: 1): The optimal batch size for the optimization profile shape. Supported values: >0.

  • max_batch_size (unsigned int, default: 1): The maximum batch size for the optimization profile shape. Supported values: >0.

  • calibration (dict config, default: None): The configuration for INT8 calibration; only required when data_type is INT8 (see the sketch after this list).
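
The calibration block is only needed for INT8; for an FP32 or FP16 engine, the tensorrt section of the spec can be as small as the following sketch (the values shown are illustrative, not recommendations):

gen_trt_engine:
  tensorrt:
    data_type: fp16
    workspace_size: 1024
    min_batch_size: 1
    opt_batch_size: 10
    max_batch_size: 10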

Calibration Config

  • cal_cache_file (string, default: None): The path to the calibration cache file. If there is no calibration cache file at this path, a cache file is generated based on the other calibration config parameters.

  • cal_batch_size (unsigned int, default: 1): The batch size of the calibration dataset. Supported values: >0.

  • cal_batches (unsigned int, default: 1): The number of batches used for calibration. In total, cal_batches x cal_batch_size calibration images are used. Supported values: >0.

  • cal_image_dir (string, default: None): The directory containing the calibration images.
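
For example, with the calibration settings in the spec above (cal_batch_size: 16 and cal_batches: 100), calibration consumes 16 x 100 = 1,600 images, so the calibration image directory should contain at least that many images.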

Use the following command to run MLRecogNet engine generation:

tao deploy ml_recog gen_trt_engine -e /path/to/spec.yaml \
                                   gen_trt_engine.onnx_file=/path/to/onnx/file \
                                   gen_trt_engine.trt_engine=/path/to/engine/file \
                                   gen_trt_engine.tensorrt.data_type=<data_type>

Required Arguments

  • -e, --experiment_spec: The experiment spec file to set up the TensorRT engine generation. This should be the same as the export specification file.

  • gen_trt_engine.onnx_file: The .onnx model to be converted.

  • gen_trt_engine.trt_engine: The path where the generated engine will be stored.

  • gen_trt_engine.tensorrt.data_type: The precision of the generated engine. MLRecogNet supports FP32, FP16, and INT8 TensorRT engine generation. When using INT8, you must provide the calibration dataset or a calibration cache file (see the sketch below).
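
For instance, an INT8 build uses the same command with only the data-type override changed; the calibration inputs themselves (cache file, batch size, image directory) are read from the calibration section of the spec file, so this sketch assumes that section points at a valid calibration image folder or an existing cache file:

tao deploy ml_recog gen_trt_engine -e /path/to/spec.yaml \
                                   gen_trt_engine.onnx_file=/path/to/onnx/file \
                                   gen_trt_engine.trt_engine=/path/to/engine/file \
                                   gen_trt_engine.tensorrt.data_type=INT8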

Sample Usage

Here’s an example of using the gen_trt_engine command to generate an FP16 TensorRT engine:

tao model metric_learning_recognition gen_trt_engine -e $TRT_GEN_SPEC \
                   gen_trt_engine.onnx_file=$ONNX_FILE \
                   gen_trt_engine.trt_engine=$ENGINE_FILE \
                   gen_trt_engine.tensorrt.data_type=FP16

Here’s an example of output $RESULTS_DIR/status.json:

{"date": "6/22/2023", "time": "18:17:11", "status": "STARTED", "verbosity": "INFO", "message": "Starting ml_recog gen_trt_engine."}
{"date": "6/22/2023", "time": "18:17:30", "status": "SUCCESS", "verbosity": "INFO", "message": "Gen_trt_engine finished successfully."}

An example of the output log is shown below:

Starting ml_recog gen_trt_engine.
[06/22/2023-18:17:12] [TRT] [I] [MemUsageChange] Init CUDA: CPU +318, GPU +0, now: CPU 356, GPU 1003 (MiB)
[06/22/2023-18:17:14] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +443, GPU +116, now: CPU 853, GPU 1119 (MiB)
[06/22/2023-18:17:14] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
Parsing ONNX model
[06/22/2023-18:17:14] [TRT] [W] The NetworkDefinitionCreationFlag::kEXPLICIT_PRECISION flag has been deprecated and has no effect. Please do not use this flag when creating the network.
[06/22/2023-18:17:15] [TRT] [W] onnx2trt_utils.cpp:377: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
Network Description
Input 'input' with shape (-1, 3, 224, 224) and dtype DataType.FLOAT
Output 'fc_pred' with shape (-1, 256) and dtype DataType.FLOAT
dynamic batch size handling
TensorRT engine build configurations:
  OptimizationProfile:
    "input": (1, 3, 224, 224), (10, 3, 224, 224), (10, 3, 224, 224)
  BuilderFlag.TF32
  Note: max representabile value is 2,147,483,648 bytes or 2GB.
  MemoryPoolType.WORKSPACE = 1073741824 bytes
  MemoryPoolType.DLA_MANAGED_SRAM = 0 bytes
  MemoryPoolType.DLA_LOCAL_DRAM = 1073741824 bytes
  MemoryPoolType.DLA_GLOBAL_DRAM = 536870912 bytes
  Tactic Sources = 31
[06/22/2023-18:17:17] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +854, GPU +362, now: CPU 1800, GPU 1481 (MiB)
[06/22/2023-18:17:17] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +126, GPU +58, now: CPU 1926, GPU 1539 (MiB)
[06/22/2023-18:17:17] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[06/22/2023-18:17:22] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[06/22/2023-18:17:30] [TRT] [I] Total Activation Memory: 1565556736
[06/22/2023-18:17:30] [TRT] [I] Detected 1 inputs and 1 output network tensors.
[06/22/2023-18:17:30] [TRT] [I] Total Host Persistent Memory: 132192
[06/22/2023-18:17:30] [TRT] [I] Total Device Persistent Memory: 140288
[06/22/2023-18:17:30] [TRT] [I] Total Scratch Memory: 134217728
[06/22/2023-18:17:30] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 9 MiB, GPU 658 MiB
[06/22/2023-18:17:30] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 91 steps to complete.
[06/22/2023-18:17:30] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 1.66392ms to assign 5 blocks to 91 nodes requiring 184394240 bytes.
[06/22/2023-18:17:30] [TRT] [I] Total Activation Memory: 184394240
[06/22/2023-18:17:30] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 2491, GPU 1889 (MiB)
[06/22/2023-18:17:30] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +101, now: CPU 0, GPU 101 (MiB)
Export finished successfully.
Gen_trt_engine finished successfully.

Running Evaluation through a TensorRT Engine

You can reuse the TAO evaluation spec file for evaluation through a TensorRT engine. The following is a sample spec file, $EVAL_SPEC:

results_dir: /path/to/output_dir
evaluate:
  trt_engine: /path/to/generated/trt_engine
  batch_size: 8
  topk: 5
dataset:
  val_dataset:
    reference: /path/to/reference/set
    query: /path/to/query/set
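
In this spec, batch_size is the batch size used when running the TensorRT engine, and topk sets the k for the top-k accuracy that evaluation reports (the sample log below prints top-1 and top-5 scores).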

Use the following command to run MLRecogNet engine evaluation:

tao deploy ml_recog evaluate -e /path/to/spec.yaml \
                             evaluate.trt_engine=/path/to/engine/file \
                             results_dir=/path/to/outputs

Required Arguments

  • -e, --experiment_spec: The experiment spec file for evaluation. This should be the same as the tao evaluate specification file.

  • evaluate.trt_engine: The engine file to run evaluation

  • results_dir: The directory where the evaluation results will be stored. If this argument is not provided, the results are stored in evaluate.results_dir; at least one of the two must be set.

Sample Usage

In the following example, the evaluate command is used to run evaluation with the TensorRT engine:

tao deploy ml_recog evaluate -e $EVAL_SPEC \
                             evaluate.trt_engine=$ENGINE_FILE \
                             results_dir=$RESULTS_DIR

Here’s an example of output $RESULTS_DIR/status.json:

{"date": "3/30/2023", "time": "6:7:14", "status": "STARTED", "verbosity": "INFO", "message": "Starting ml_recog evaluation."}
{"date": "3/30/2023", "time": "6:7:24", "status": "SUCCESS", "verbosity": "INFO", "message": "Evaluation finished successfully."}

An example of the output log is shown below:

Starting ml_recog evaluation.
[06/22/2023-20:41:53] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[06/22/2023-20:41:53] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[06/22/2023-20:41:53] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
Loading gallery dataset...
...
Top 1 scores: 0.9958333333333333
Top 5 scores: 1.0
Confusion Matrix
[[ 34   0   0   0   0]
 [  0 106   0   0   0]
 [  0   0  29   0   0]
 [  0   0   0  31   0]
 [  0   0   0   1  47]]
Classification Report
              precision    recall  f1-score   support

     c000001       1.00      1.00      1.00        34
     c000002       1.00      1.00      1.00       106
     c000003       1.00      1.00      1.00        29
     c000004       0.97      1.00      0.98        31
     c000005       1.00      0.98      0.99        48

    accuracy                           1.00       248
   macro avg       0.99      1.00      0.99       248
weighted avg       1.00      1.00      1.00       248

Finished evaluation.
Evaluation finished successfully.

Running Inference through a TensorRT Engine

You can reuse the TAO inference spec file for inference through a TensorRT engine. The following is a sample spec file, $INFERENCE_SPEC:

results_dir: "/path/to/output_dir"
model:
  input_channels: 3
  input_width: 224
  input_height: 224
inference:
  trt_engine: "/path/to/generated/trt_engine"
  batch_size: 10
  inference_input_type: classification_folder
  topk: 5
dataset:
  val_dataset:
    reference: "/path/to/reference/set"
    query: ""

Use the following command to run MLRecogNet engine inference:

tao deploy ml_recog inference -e /path/to/spec.yaml \
                              inference.trt_engine=/path/to/engine/file \
                              results_dir=/path/to/outputs

Required Arguments

  • -e, --experiment_spec: The experiment spec file for inference. This should be the same as the tao inference specification file.

  • inference.trt_engine: The engine file to run inference.

  • results_dir: The directory where inference results will be stored.

Sample Usage

In the following example, the inference command is used to run inference with the TensorRT engine:

tao deploy ml_recog inference -e $INFERENCE_SPEC \
                              inference.trt_engine=$ENGINE_FILE \
                              results_dir=$RESULTS_DIR

The JSON format results will be stored under $RESULTS_DIR/trt_inference.

Here’s an example of output $RESULTS_DIR/status.json:

{"date": "6/22/2023", "time": "20:46:38", "status": "STARTED", "verbosity": "INFO", "message": "Starting ml_recog inference."}
{"date": "6/22/2023", "time": "20:46:53", "status": "SUCCESS", "verbosity": "INFO", "message": "Inference finished successfully."}

An example of the output log is shown below:

Starting ml_recog inference.
[06/22/2023-20:46:39] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[06/22/2023-20:46:39] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
[06/22/2023-20:46:39] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
Loading gallery dataset...
...
Finished inference.
Inference finished successfully.
