Performance
The DeepStream application is benchmarked across various NVIDIA TLT and open source models. The measured performance represents the end-to-end performance of the entire video analytics application, including video capture and decode, pre-processing, batching, inference, and post-processing to generate metadata. Output rendering was turned off to achieve peak inference performance. For information on disabling output rendering, see the DeepStream Reference Application - deepstream-app chapter.
TLT Pre-trained models
Transfer Learning Toolkit (TLT) has a set of pretrained models listed in the table below. If the models below satisfy your requirements, you should start with one of them. They can be used for various smart city or smart places applications. If your application is beyond the scope of these models, you can re-train one of the popular model architectures using TLT. The second table shows the expected performance of a few other TLT models.
The table below shows the end-to-end performance on highly accurate pre-trained models from TLT. All models are available on NGC. These models are natively integrated with DeepStream and the instructions to run these models are in /opt/nvidia/deepstream/deepstream-5.0/samples/configs/tlt_pretrained_models/
Model Arch | Inference resolution | Precision | Jetson Nano GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) |
| 960x544 | INT8 | 10* | 157 | 51 | 51 | 272 | 67 | 67 | 807 |
| 960x544 | INT8 | 14* | 218 | 72 | 72 | 384 | 94 | 94 | 1105 |
| 960x544 | INT8 | 19* | 261 | 105 | 105 | 464 | 140 | 140 | 1300 |
| 960x544 | INT8 | 18* | 252 | 102 | 102 | 442 | 133 | 133 | 1280 |
| 384x240 | INT8 | 95* | 1188 | 570 | 570 | 2006 | 750 | 750 | 2530 |
Note: * Inference uses FP16 on Jetson Nano.
All the models in the table above can run solely on DLA. This saves valuable GPU resources to run more complex models.
TLT also supports training on popular detection and segmentation architectures. To learn more about how to train with TLT, refer to the TLT documentation. These models are natively integrated with DeepStream. They are reference models only and are available to download from GitHub.
Model Arch | Inference resolution | Precision | Jetson Nano GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) |
YoloV3 – ResNet18 | 960x544 | INT8 | 11* | 78 | 55 | 55 | 223 | 84 | 84 | 620 |
FasterRCNN – ResNet10 | 480x272 | INT8 | 16* | 127 | N/A | N/A | 281 | N/A | N/A | 932 |
SSD – ResNet18 | 960x544 | INT8 | 10.6* | 124 | 56 | 56 | 216 | 77 | 77 | 760 |
DSSD – ResNet18 | 960x544 | INT8 | 9* | 66 | 45 | 45 | 189 | 67 | 67 | 586 |
RetinaNet – ResNet18 | 960x544 | INT8 | 8.5* | 60 | 45 | 45 | 147 | 41 | 41 | 296 |
MaskRCNN – ResNet50 | 1344x832 | INT8 | 0.6* | 5.4 | 3.2 | 3.2 | 9.2 | 4.5 | 4.5 | 24 |
Note: * Inference uses FP16 on Jetson Nano.
Note: The FasterRCNN model does not run efficiently on the DLA because several of its layers are not supported on the DLA.
DeepStream reference model and tracker
DeepStream SDK ships with a reference DetectNet_v2-ResNet10 model and three ResNet18 classifier models. The detailed instructions to run these models with DeepStream are provided in the next section. The table below shows the performance of these models along with various trackers. DeepStream provides three reference trackers: IoU, KLT, and NvDCF. For more information about trackers, see the Gst-nvtracker section in the DeepStream Plugins Development Guide.
Model Arch | Tracker | Inference resolution | Jetson Nano GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) |
DetectNetV2 – ResNet10 | No Tracker | 480x272 | 92 | 1079 | 510 | 510 | 1800 | 690 | 690 | 2520 |
DetectNetV2 – ResNet10 | KLT | 480x272 | 91 | 512† | 277 | 277 | 892† | 475 | 475 | 2316 |
DetectNetV2 – ResNet10 | NvDCF | 480x272 | 40 | 356† | 170 | 170 | 812† | 244 | 244 | 1293 |
DetectNetV2 – ResNet10 + ResNet18 classifiers | No Tracker | 480x272 + 224x224 | 50 | 718 | N/A | N/A | 1386 | N/A | N/A | 2278 |
DetectNetV2 – ResNet10 (Python App) | No Tracker | 480x272 | 92 | 1057 | N/A | N/A | 1744 | N/A | N/A | 2250 |
Note: † A performance bottleneck has been identified and will be fixed in a future release.
Note:
• All inference uses INT8 precision except on Jetson Nano, which uses FP16.
• Running inference simultaneously on multiple models is not supported on the DLA. You can run only one model at a time on the DLA.
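As a rough way to read the tables above: the FPS figures are aggregate throughput, so dividing by the per-stream frame rate gives an approximate ceiling on the number of real-time streams. The sketch below assumes 30 fps sources; `max_streams` is a hypothetical helper for illustration and is not part of the SDK.

```python
import math


def max_streams(aggregate_fps: float, stream_fps: float = 30.0) -> int:
    """Approximate upper bound on real-time streams for a given throughput."""
    return math.floor(aggregate_fps / stream_fps)


# FPS values copied from the T4 column of the table above.
print(max_streams(2520))  # DetectNetV2 - ResNet10, no tracker -> 84
print(max_streams(1293))  # DetectNetV2 - ResNet10, NvDCF      -> 43
```

Treat the result as an estimate only. The measured stream counts later in this chapter are lower because those pipelines also run a tracker and secondary classifiers.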
To achieve the peak performance shown in the tables above, make sure the devices are properly cooled. For T4, make sure you use a server that meets the thermal and airflow requirements of T4.
Apart from the hardware setup, a few options in the config file need to be set to achieve the published performance. Take one of the config files from the DeepStream SDK and make the following changes to replicate the peak performance:
Turn off output rendering, OSD, and tiler. The OSD (on-screen display) draws bounding boxes, masks, and labels on the screen. If output rendering is disabled, drawing bounding boxes is not required unless the output needs to be streamed over RTSP or saved to disk. The tiler displays the output in an N×M tiled grid and is not needed if rendering is disabled. Output rendering, OSD, and tiler all consume compute resources, which can reduce inference performance.
To disable the OSD, tiled display, and output sink, make the following changes in the DeepStream config file.
• To disable OSD, change enable to 0
[osd]
enable=0
• To disable tiling, change enable to 0
[tiled-display]
enable=0
• To turn off output rendering, change the sink to fakesink.
[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0
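The three changes above can also be applied with a script. The following is a minimal sketch that uses Python's standard `configparser` on an inline sample. `SAMPLE` and `disable_rendering` are illustrative names, not part of the SDK, and the sample holds only the three groups discussed here. Note that `configparser` discards comments when it writes, so treat this as an illustration rather than a drop-in tool for the shipped config files.

```python
import configparser
import io

# Hypothetical excerpt of a deepstream-app config with rendering enabled.
SAMPLE = """\
[osd]
enable=1

[tiled-display]
enable=1

[sink0]
enable=1
type=2
sync=1
"""


def disable_rendering(text: str) -> str:
    """Return the config text with OSD and tiler off and sink0 set to fakesink."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.optionxform = str  # keep key names exactly as written
    cfg.read_string(text)
    cfg["osd"]["enable"] = "0"
    cfg["tiled-display"]["enable"] = "0"
    # type=1 selects FakeSink; sync=0 lets the pipeline run as fast as possible.
    cfg["sink0"].update({"enable": "1", "type": "1", "sync": "0"})
    out = io.StringIO()
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


print(disable_rendering(SAMPLE))
```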
Tesla
This section describes configuration and settings for the DeepStream SDK on NVIDIA® Tesla®.
System Configuration
The system configuration for the DeepStream SDK is listed below:
System Configuration | Specification |
---|
CPU | Dual Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (48 threads total) |
GPU | Tesla T4* |
System Memory | 128 GB DDR4, 2400MHz |
OS | Ubuntu 18.04 |
GPU Driver | 450.51 |
CUDA | 10.2 |
TensorRT | 7.0+ |
GPU clock frequency | 1.3 GHz |
Note: * - For best performance with NVIDIA T4, make sure you use a server that meets the thermal and airflow requirements for NVIDIA T4. |
Application Configuration
Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
Change the following items in the config file:
• The inference resolution of Primary GIE is specified in the ‘samples/models/Primary_detector/resnet10.prototxt’.
• Change the ‘dim’ to 480x272.
• Change batch size under ‘streammux’ and ‘primary-gie’ to match the number of streams.
• Disable tiled display and rendering using instructions above.
• Enable IoU tracker.
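The batch-size and tracker changes in the list above could be scripted along the following lines. This is a sketch under assumptions: the key names `batch-size` and `enable` follow the sample configs shipped with the SDK and should be checked against your own file, and `SAMPLE` and `set_stream_count` are hypothetical names. Selecting the IoU tracker library and editing the prototxt are not shown.

```python
import configparser
import io

# Hypothetical excerpt of a deepstream-app config.
SAMPLE = """\
[streammux]
batch-size=4

[primary-gie]
batch-size=4
interval=0

[tracker]
enable=0
"""


def set_stream_count(text: str, num_streams: int) -> str:
    """Match both batch sizes to the stream count and enable the tracker."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.optionxform = str  # keep key names exactly as written
    cfg.read_string(text)
    for section in ("streammux", "primary-gie"):
        cfg[section]["batch-size"] = str(num_streams)
    cfg["tracker"]["enable"] = "1"
    out = io.StringIO()
    cfg.write(out, space_around_delimiters=False)
    return out.getvalue()


# N=64 is the H.265 stream count used for the T4 measurement.
print(set_stream_count(SAMPLE, 64))
```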
The application configuration for the DeepStream SDK is listed below:
Application Configuration | Specification |
---|
N×1080p 30 fps stream | sample_1080p_h265.mp4 (provided with the SDK) N=64 sample_1080p_h264.mp4 (provided with the SDK) N=39 |
Primary GIE | • Resnet10 (480×272) • Batch Size = N • Interval=0 |
Tracker | Enabled. Processing at 480×272 resolution, IOU tracker enabled. |
3 × Secondary GIEs | All batches are size 32. Asynchronous mode enabled. • Secondary_VehicleTypes (224×224—Resnet18) • Secondary_CarColor (224×224—Resnet18) • Secondary_CarMake (224×224—Resnet18) |
Tiled Display | Disabled |
Rendering | Disabled |
Achieved Performance
The achieved performance of the DeepStream SDK under the specified system and application configuration is as follows:
Stream Type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
H.265 | 64 | 8% to 10% | 58% |
H.264 | 39 | 5% | 31% |
Jetson Performance
This section describes configuration and settings for the DeepStream SDK on NVIDIA Jetson™ platforms. JetPack 4.4 DP is used for software installation.
System Configuration
For the performance test:
1. Max power mode is enabled:
$ sudo nvpmodel -m 0
2. The GPU clocks are stepped to maximum:
$ sudo jetson_clocks
For information about supported power modes, see “Supported Modes and Power Efficiency” in the power management topics of NVIDIA Tegra Linux Driver Package Development Guide, e.g., “Power Management for Jetson AGX Xavier Devices.”
Jetson Nano
Config file: source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt
Change the following items in the config file:
• Change batch size under ‘streammux’ and ‘primary-gie’ to match the number of streams.
• Disable tiled display and rendering using instructions above.
• Enable KLT tracker and change the tracker resolution to 480x272.
The following tables describe performance results for the NVIDIA Jetson Nano.
Pipeline Configuration (deepstream-app) |
---|
Application Configuration | Specification |
---|
N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N = 8 sample_1080p_h264.mp4 (provided with the SDK) N = 8 |
Primary GIE | • Resnet10 (480×272) Asynchronous mode enabled • Batch Size = N • Interval = 4 |
Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled. |
OSD/tiled display | Disabled |
Renderer | Disabled |
Achieved Performance |
Stream Type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
H.265 | 8 | 39% | 67% |
H.264 | 8 | 39% | 65% |
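To put the Interval setting in context, the arithmetic below estimates how many frames per second actually reach the primary detector. It assumes that the interval is the number of consecutive batches skipped between inferences, so Interval = 4 means inference on every fifth batch. `inference_fps` is a hypothetical helper for illustration.

```python
def inference_fps(num_streams: int, stream_fps: float, interval: int) -> float:
    """Frames per second passed to the primary detector."""
    decoded_fps = num_streams * stream_fps
    # With interval = k, one batch is inferred and the next k are skipped.
    return decoded_fps / (interval + 1)


print(inference_fps(8, 30, 4))   # Jetson Nano: 8 streams, Interval = 4 -> 48.0
print(inference_fps(64, 30, 0))  # T4: 64 H.265 streams, Interval = 0   -> 1920.0
```

Under this assumption, the Nano decodes 240 frames per second but runs detection on only 48 of them, with the tracker covering the frames in between.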
Jetson AGX Xavier
Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
Change the following items in the config file:
• The inference resolution of Primary GIE is specified in the ‘samples/models/Primary_detector/resnet10.prototxt’.
• Change the ‘dim’ to 480x272.
• Change batch size under ‘streammux’ and ‘primary-gie’ to match the number of streams.
• Disable tiled display and rendering using instructions above.
• Enable IOU tracker.
The following tables describe performance results for the NVIDIA Jetson AGX Xavier™.
Pipeline Configuration (deepstream-app) |
---|
Application Configuration | Specification |
---|
N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N=45 sample_1080p_h264.mp4 (provided with the SDK) N=32 |
Primary GIE | • Resnet10 (480×272) Asynchronous mode enabled • Batch Size = N • Interval = 0 |
Tracker | Enabled; processing at 480×272 resolution, IOU tracker enabled. |
3× secondary GIEs | All batches are size 32. • Secondary_VehicleTypes (224×224—Resnet18) • Secondary_CarColor (224×224—Resnet18) • Secondary_CarMake (224×224—Resnet18) |
OSD/tiled display | Disabled |
Renderer | Disabled |
Achieved Performance |
Stream Type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
H.265 | 45 | 22% | 95% |
H.264 | 32 | 19% | 71% |
Jetson Xavier NX
Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
Change the following items in the config file:
• The inference resolution of Primary GIE is specified in the ‘samples/models/Primary_detector/resnet10.prototxt’.
• Change the ‘dim’ to 480x272.
• Change batch size under ‘streammux’ and ‘primary-gie’ to match the number of streams.
• Disable tiled display and rendering using instructions above.
• Enable IOU tracker.
The following tables describe performance results for the NVIDIA® Jetson Xavier NX™.
Pipeline Configuration (deepstream-app) |
---|
Application Configuration | Specification |
---|
N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N=23 sample_1080p_h264.mp4 (provided with the SDK) N=16 |
Primary GIE | • Resnet10 (480×272) Asynchronous mode enabled • Batch Size = N • Interval = 0 |
Tracker | Enabled; processing at 480×272 resolution, IOU tracker enabled. |
3× secondary GIEs | All batches are size 32. • Secondary_VehicleTypes (224×224—Resnet18) • Secondary_CarColor (224×224—Resnet18) • Secondary_CarMake (224×224—Resnet18) |
OSD/tiled display | Disabled |
Renderer | Disabled |
Achieved Performance |
Stream Type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
H.265 | 23 | 55% | 93% |
H.264 | 16 | 45% | 65% |
Jetson TX2
Config file: source12_1080p_dec_infer-resnet_tracker_tiled_display_fp16_tx2.txt
Change the following items in the config file:
• Change batch size under ‘streammux’ and ‘primary-gie’ to match the number of streams.
• Disable tiled display and rendering using instructions above.
• Enable KLT tracker and change the tracker resolution to 480x272.
The following tables describe performance results for the NVIDIA Jetson TX2.
Pipeline Configuration (deepstream-app) |
---|
Application Configuration | Specification |
---|
N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N = 15 sample_1080p_h264.mp4 (provided with the SDK) N = 14 |
Primary GIE | • Resnet10 (480×272) Asynchronous mode enabled • Batch Size = N • Interval = 4 |
Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
OSD/tiled display | Disabled |
Renderer | Disabled |
Achieved Performance |
Stream Type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
H.265 | 15 | 35% | 47% |
H.264 | 14 | 34% | 43% |
Jetson TX1
Config file: source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_tx1.txt
Change the following items in the config file:
• Change batch size under ‘streammux’ and ‘primary-gie’ to match the number of streams.
• Disable tiled display and rendering using instructions above.
• Enable KLT tracker and change the tracker resolution to 480x272.
The following tables describe performance results for the NVIDIA Jetson TX1.
Pipeline Configuration (deepstream-app) |
---|
Application Configuration | Specification |
---|
N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N = 13 sample_1080p_h264.mp4 (provided with the SDK) N = 10 |
Primary GIE | • Resnet10 (480×272) Asynchronous mode enabled • Batch Size = N • Interval = 4 |
Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
OSD/tiled display | Disabled |
Renderer | Disabled |
Achieved Performance |
Stream Type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
H.265 | 13 | 56% | 49% |
H.264 | 10 | 43% | 43% |
Running applications using DLA
Jetson AGX Xavier and Jetson Xavier NX each support two DLA engines. DeepStream supports inferencing using the GPU and the DLAs in parallel. You can do this in separate processes or in a single process. You will need three separate sets of configs, configured to run on the GPU, DLA0, and DLA1.
• Separate processes:
When the GPU and DLA are run in separate processes, set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1 in the terminal where the DLA config is running.
• Single process:
The DeepStream reference application supports multiple configs in the same process.
To run the DLA and GPU in the same process, set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 32:
$ export CUDA_DEVICE_MAX_CONNECTIONS=32
$ deepstream-app -c <gpuconfig> -c <dla0config> -c <dla1config>
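If the application is launched from a script rather than a terminal, the same settings can be passed through the process environment. The sketch below only builds the command line and environment and does not start anything. `build_launch` and the config file names are hypothetical placeholders.

```python
import os
import shlex


def build_launch(configs, max_connections):
    """Return (argv, env) for a deepstream-app run; nothing is executed."""
    argv = ["deepstream-app"]
    for path in configs:
        argv += ["-c", path]
    # Copy the current environment and add the connection limit.
    env = dict(os.environ, CUDA_DEVICE_MAX_CONNECTIONS=str(max_connections))
    return argv, env


# Single process: GPU, DLA0, and DLA1 configs together, limit set to 32.
argv, env = build_launch(["gpu.txt", "dla0.txt", "dla1.txt"], 32)
print(shlex.join(argv))

# Separate processes: the DLA process uses a limit of 1.
dla_argv, dla_env = build_launch(["dla0.txt"], 1)
print(shlex.join(dla_argv), dla_env["CUDA_DEVICE_MAX_CONNECTIONS"])

# To actually launch: subprocess.run(argv, env=env, check=True)
```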