Performance¶
The DeepStream application is benchmarked across various NVIDIA TAO Toolkit and open-source models. The measured performance represents end-to-end performance of the entire video analytics application, including video capture and decode, pre-processing, batching, inference, and post-processing to generate metadata. Output rendering is turned off to achieve peak inference performance. For information on disabling output rendering, see the DeepStream Reference Application - deepstream-app chapter.
TAO Toolkit Pre-trained models¶
TAO Toolkit has a set of pretrained models listed in the tables below. If these models satisfy your requirements, you should start with one of them. They can be used for various applications in smart cities or smart spaces. If your application is beyond the scope of these models, you can re-train one of the popular model architectures using TAO Toolkit. The second table shows the expected performance of a few other TAO Toolkit models.
The table below shows the end-to-end performance of highly accurate pre-trained models from TAO Toolkit. All models are available on NGC. These models are natively integrated with DeepStream, and the instructions to run them are in /opt/nvidia/deepstream/deepstream-5.1/samples/configs/tlt_pretrained_models/.
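As a quick check, one of the sample configs in that directory can be launched directly with deepstream-app. The config file name below is only an example of the naming used there; list the directory to see the exact names shipped with your installation:

$ cd /opt/nvidia/deepstream/deepstream-5.1/samples/configs/tlt_pretrained_models/
# Example name only - pick the config that matches the model you want to benchmark
$ deepstream-app -c deepstream_app_source1_trafficcamnet.txt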
| Model Arch | Inference resolution | Precision | Jetson Nano GPU (FPS) | Jetson TX2 GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) | A100 PCIe GPU (FPS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 960x544 | INT8 | 10.7 | 28 | 168 | 54 | 54 | 292 | 70 | 70 | 890 | 3392 |
|  | 960x544 | INT8 | 13.9 | 35 | 218 | 72 | 72 | 395 | 97 | 97 | 1086 | 3841 |
|  | 960x544 | INT8 | 19.5 | 52 | 264 | 105 | 105 | 478 | 140 | 140 | 1358 | 4013 |
|  | 960x544 | INT8 | 17.8 | 46 | 254 | 100 | 100 | 453 | 133 | 133 | 1320 | 3993 |
|  | 384x240 | INT8 | 101 | 275 | 1192 | 553 | 553 | 2010 | 754 | 754 | 2568 | 5549 |
All the models in the table above can run solely on the DLA, which saves valuable GPU resources for running more complex models. TAO Toolkit also supports training on popular detection and segmentation architectures. To learn more about training with TAO Toolkit, refer to the TAO Toolkit documentation. These models are natively integrated with DeepStream, and the reference models are available to download from GitHub.
| Model Arch | Inference resolution | Precision | Jetson Nano GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) |
|---|---|---|---|---|---|---|---|---|---|---|
| YoloV3 – ResNet18 | 960x544 | INT8 | 11 | 78 | 55 | 55 | 223 | 84 | 84 | 620 |
| FasterRCNN – ResNet10 | 480x272 | INT8 | 16 | 127 | N/A | N/A | 281 | N/A | N/A | 932 |
| SSD – ResNet18 | 960x544 | INT8 | 10.6 | 124 | 56 | 56 | 216 | 77 | 77 | 760 |
| DSSD – ResNet18 | 960x544 | INT8 | 9 | 66 | 45 | 45 | 189 | 67 | 67 | 586 |
| RetinaNet – ResNet18 | 960x544 | INT8 | 8.5 | 60 | 45 | 45 | 147 | 41 | 41 | 296 |
| MaskRCNN – ResNet50 | 1344x832 | INT8 | 0.6 | 5.4 | 3.2 | 3.2 | 9.2 | 4.5 | 4.5 | 24 |
Note
The FasterRCNN model will not run efficiently on the DLA because multiple layers are not supported on the DLA.
All inferences on Jetson Nano are done using FP16 precision.
DeepStream reference model and tracker¶
DeepStream SDK ships with a reference DetectNet_v2-ResNet10 model and three ResNet18 classifier models. The detailed instructions to run these models with DeepStream are provided in the next section. The table below shows the performance of these models along with various trackers. DeepStream provides three reference trackers: IoU, KLT, and NvDCF. For more information about trackers, see the Gst-nvtracker section.
| Model Arch | Tracker | Inference resolution | Jetson Nano GPU (FPS) | Jetson TX2 GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) | A100 PCIe GPU (FPS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DetectNetV2 – ResNet10 | No Tracker | 480x272 | 92 | 241 | 1075 | 502 | 502 | 1815 | 692 | 692 | 2547 | 5541 |
| DetectNetV2 – ResNet10 | KLT | 480x272 | 90 | 231 | 512 | 276 | 276 | 905 | 475 | 475 | 2236 | 1345 |
| DetectNetV2 – ResNet10 | NvDCF | 480x272 | 48 | 160 | 522 | 265 | 265 | 921 | 460 | 460 | 2334 | 4819 |
| DetectNetV2 – ResNet10 + ResNet18 classifiers | IoU | 480x272 + 224x224 | 51 | 136 | 720 | N/A | N/A | 1399 | N/A | N/A | 2342 | 5534 |
| DetectNetV2 – ResNet10 + ResNet18 classifiers | KLT | 480x272 + 224x224 | 48 | 126 | 442 | N/A | N/A | 780 | N/A | N/A | 2290 | 1373 |
| DetectNetV2 – ResNet10 + ResNet18 classifiers | NvDCF | 480x272 + 224x224 | 34 | 99 | 415 | N/A | N/A | 766 | N/A | N/A | 2326 | 4829 |
Note
* Performance bottleneck identified; this will be fixed in a future release.
All inferences are done using INT8 precision except on Jetson Nano™, where FP16 is used.
Running inference simultaneously on multiple models is not supported on the DLA. You can only run one model at a time on the DLA.
To achieve the peak performance shown in the tables above, make sure the devices are properly cooled. For T4, make sure you use a server that meets its thermal and airflow requirements. Along with the hardware setup, a few other options in the config file need to be set to achieve the published performance. Make the required changes to one of the config files from the DeepStream SDK to replicate the peak performance.
Turn off output rendering, OSD, and tiler
OSD (on-screen display) is used to display bounding boxes, masks, and labels on the screen. If output rendering is disabled, creating bounding boxes is not required unless the output needs to be streamed over RTSP or saved to disk. The tiler is used to display the output in an N×M tiled grid; it is not needed if rendering is disabled. Output rendering, OSD, and the tiler all consume some compute resources, so they can reduce inference performance.
To disable OSD, tiled display, and the output sink, make the following changes in the DeepStream config file.
To disable OSD, change enable to 0:

[osd]
enable=0

To disable tiling, change enable to 0:

[tiled-display]
enable=0

To turn off output rendering, change the sink to fakesink:

[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0
DeepStream reference model¶
Data center GPU - GA100¶
This section describes configuration and settings for the DeepStream SDK on NVIDIA Data center GPU - GA100.
System Configuration¶
The system configuration for the DeepStream SDK is listed below:
| System Configuration | Specification |
|---|---|
| CPU | AMD EPYC 7742 @ 2.25 GHz, 3.4 GHz Turbo (Rome), HT Off |
| GPU | A100-PCIE-40GB (GA100), 1 × 40537 MiB, 1 × 108 SM |
| Ubuntu | Ubuntu 18.04 |
| GPU Driver | 460.32 |
| CUDA | 11.1 |
| TensorRT | 7.2.X+ |
| GPU clock frequency | 1410 MHz |
Application Configuration¶
Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
Change the following items in the config file:
The inference resolution of the Primary GIE is specified in samples/models/Primary_detector/resnet10.prototxt. Change the dim to 480x272.
Change the batch size under streammux and primary-gie to match the number of streams.
Disable tiled display and rendering using the instructions above.
Enable the IoU tracker (a consolidated sketch of these edits follows this list).
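Taken together, and assuming a default DeepStream 5.1 installation, the relevant groups of source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt end up looking roughly like the sketch below. The batch size and the tracker library path are illustrative assumptions; substitute your stream count and verify the library location on your system.

# Sketch only: illustrative values for the groups that change; not a complete config file.
[tiled-display]
enable=0

[osd]
enable=0

[sink0]
enable=1
#Type - 1=FakeSink 2=EglSink 3=File
type=1
sync=0

[streammux]
# Set to N, the number of input streams
batch-size=64

[primary-gie]
# Keep in sync with the streammux batch size
batch-size=64

[tracker]
enable=1
tracker-width=480
tracker-height=272
# IoU low-level tracker library (path assumed for a default DeepStream 5.1 install)
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.1/lib/libnvds_mot_iou.so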
The application configuration for the DeepStream SDK is listed below:
| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps stream | sample_1080p_h265.mp4 (provided with the SDK), N=64; sample_1080p_h264.mp4 (provided with the SDK), N=39 |
| Primary GIE | Resnet10 (480×272), Batch Size = N, Interval = 0 |
| Tracker | Enabled; processing at 480×272 resolution, IoU tracker enabled |
| 3 × Secondary GIEs | All batch sizes 32, asynchronous mode enabled: Secondary_VehicleTypes (224×224, Resnet18), Secondary_CarColor (224×224, Resnet18), Secondary_CarMake (224×224, Resnet18) |
| Tiled Display | Disabled |
| Rendering | Disabled |
Achieved Performance
The table below shows the achieved performance of the DeepStream SDK under the specified system and application configuration:
| Stream type | No. of streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 158 | 4.5% | 46.08% |
| H.264 | 91 | 2.83% | 28.69% |
Data center GPU - T4¶
This section describes configuration and settings for the DeepStream SDK on NVIDIA Data center GPU - T4.
System Configuration¶
The system configuration for the DeepStream SDK is listed below:
| System Configuration | Specification |
|---|---|
| CPU | Dual Intel® Xeon® CPU E5-2650 v4 @ 2.20 GHz (48 threads total) |
| GPU | Tesla T4* |
| System Memory | 128 GB DDR4, 2400 MHz |
| Ubuntu | Ubuntu 18.04 |
| GPU Driver | 460.32.X |
| CUDA | 11.1 |
| TensorRT | 7.2.X |
| GPU clock frequency | 1513 MHz |
Application Configuration¶
Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
Change the following items in the config file:
The inference resolution of the Primary GIE is specified in samples/models/Primary_detector/resnet10.prototxt. Change the dim to 480x272.
Change the batch size under streammux and primary-gie to match the number of streams.
Disable tiled display and rendering using the instructions above.
Enable the IoU tracker.
The application configuration for the DeepStream SDK is listed below:
| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps stream | sample_1080p_h265.mp4 (provided with the SDK), N=64; sample_1080p_h264.mp4 (provided with the SDK), N=39 |
| Primary GIE | Resnet10 (480×272), Batch Size = N, Interval = 0 |
| Tracker | Enabled; processing at 480×272 resolution, IoU tracker enabled |
| 3 × Secondary GIEs | All batch sizes 32, asynchronous mode enabled: Secondary_VehicleTypes (224×224, Resnet18), Secondary_CarColor (224×224, Resnet18), Secondary_CarMake (224×224, Resnet18) |
| Tiled Display | Disabled |
| Rendering | Disabled |
Achieved Performance
The table below shows the achieved performance of the DeepStream SDK under the specified system and application configuration:
| Stream type | No. of streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 64 | 8% to 10% | 58% |
| H.264 | 39 | 5% | 31% |
Jetson¶
This section describes configuration and settings for the DeepStream SDK on NVIDIA Jetson™ platforms. JetPack 4.5.1 is used for software installation.
System Configuration¶
For the performance test:
Max power mode is enabled:
$ sudo nvpmodel -m 0
The GPU clocks are stepped to maximum:
$ sudo jetson_clocks
For information about supported power modes, see the "Supported Modes and Power Efficiency" section in the power management topics of the NVIDIA Tegra Linux Driver Package Development Guide, e.g., "Power Management for Jetson AGX Xavier Devices."
Jetson Nano¶
Config file: source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt
Change the following items in the config file:
Change the batch size under streammux and primary-gie to match the number of streams.
Disable tiled display and rendering using the instructions above.
Enable the KLT tracker and change the tracker resolution to 480x272 (a minimal sketch of the tracker group follows this list).
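For reference, a minimal sketch of the tracker group after these changes, assuming a default DeepStream 5.1 installation (the KLT library path is an assumption; verify it on your system):

# Sketch only: tracker group configured for KLT at 480x272.
[tracker]
enable=1
tracker-width=480
tracker-height=272
# KLT low-level tracker library (path assumed for a default DeepStream 5.1 install)
ll-lib-file=/opt/nvidia/deepstream/deepstream-5.1/lib/libnvds_mot_klt.so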
The following tables describe performance results for the NVIDIA Jetson Nano.
| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK), N=8; sample_1080p_h264.mp4 (provided with the SDK), N=8 |
| Primary GIE | Resnet10 (480×272), asynchronous mode enabled, Batch Size = N, Interval = 4 |
| Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
| OSD/tiled display | Disabled |
| Renderer | Disabled |
Achieved Performance
| Stream type | No. of streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 8 | 39% | 67% |
| H.264 | 8 | 39% | 65% |
Jetson AGX Xavier¶
Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
Change the following items in the config file:
The inference resolution of the Primary GIE is specified in samples/models/Primary_detector/resnet10.prototxt. Change the dim to 480x272.
Change the batch size under streammux and primary-gie to match the number of streams.
Disable tiled display and rendering using the instructions above.
Enable the IoU tracker.
The following tables describe performance results for the NVIDIA Jetson AGX Xavier™.
| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK), N=45; sample_1080p_h264.mp4 (provided with the SDK), N=32 |
| Primary GIE | Resnet10 (480×272), asynchronous mode enabled, Batch Size = N, Interval = 0 |
| Tracker | Enabled; processing at 480×272 resolution, IoU tracker enabled |
| 3× secondary GIEs | All batches are size 32: Secondary_VehicleTypes (224×224, Resnet18), Secondary_CarColor (224×224, Resnet18), Secondary_CarMake (224×224, Resnet18) |
| OSD/tiled display | Disabled |
| Renderer | Disabled |
Achieved Performance
| Stream type | No. of streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 45 | 22% | 95% |
| H.264 | 32 | 19% | 71% |
Jetson NX¶
Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt
Change the following items in the config file:
The inference resolution of the Primary GIE is specified in samples/models/Primary_detector/resnet10.prototxt. Change the dim to 480x272.
Change the batch size under streammux and primary-gie to match the number of streams.
Disable tiled display and rendering using the instructions above.
Enable the IoU tracker.
The following tables describe performance results for the NVIDIA Jetson NX™.
| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK), N=23; sample_1080p_h264.mp4 (provided with the SDK), N=16 |
| Primary GIE | Resnet10 (480×272) |
| Tracker | Enabled; processing at 480×272 resolution, IoU tracker enabled |
| 3× secondary GIEs | All batches are size 32 |
| OSD/tiled display | Disabled |
| Renderer | Disabled |
Achieved Performance
| Stream type | No. of streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 23 | 55% | 93% |
| H.264 | 16 | 45% | 65% |
Jetson TX2¶
Config file: source12_1080p_dec_infer-resnet_tracker_tiled_display_fp16_tx2.txt
Change the following in the config file:
Change the batch size under streammux and primary-gie to match the number of streams.
Disable tiled display and rendering using the instructions above.
Enable the KLT tracker and change the tracker resolution to 480x272.
The following tables describe performance results for the Jetson™ TX2.
| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK), N=15; sample_1080p_h264.mp4 (provided with the SDK), N=14 |
| Primary GIE | |
| Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
| OSD/tiled display | Disabled |
| Renderer | Disabled |
Achieved Performance
| Stream type | No. of streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 15 | 35% | 47% |
| H.264 | 14 | 34% | 43% |
Jetson TX1¶
Config file: source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_tx1.txt
Change the following in the config file:
Change the batch size under streammux and primary-gie to match the number of streams.
Disable tiled display and rendering using the instructions above.
Enable the KLT tracker and change the tracker resolution to 480x272.
The following tables describe performance results for the Jetson™ TX1.
| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK), N=13; sample_1080p_h264.mp4 (provided with the SDK), N=10 |
| Primary GIE | |
| Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
| OSD/tiled display | Disabled |
| Renderer | Disabled |
Achieved Performance
| Stream type | No. of streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 13 | 56% | 49% |
| H.264 | 10 | 43% | 43% |
Running applications using DLA¶
Jetson AGX Xavier and Jetson NX have two DLA engines. DeepStream supports inferencing on the GPU and the DLAs in parallel, either in separate processes or in a single process. You will need three separate sets of configs, configured to run on the GPU, DLA0, and DLA1:
Separate processes:
When the GPU and DLA are run in separate processes, set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1 in the terminal where the DLA config is running.
Single process:
The DeepStream reference application supports multiple configs in the same process. To run DLA and GPU in the same process, set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 32:

$ deepstream-app -c <gpuconfig> -c <dla0config> -c <dla1config>
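As a sketch of how this can be put together: the config file names below are placeholders, and the DLA placement keys shown are standard nvinfer properties set in the model config referenced by each DLA pipeline.

# Sketch only. In the nvinfer model config used by each DLA pipeline,
# place the engine on a DLA core (use use-dla-core=1 for the second DLA).
[property]
enable-dla=1
use-dla-core=0

The pipelines can then be launched as follows (config file names are placeholders):

# Separate processes: run each DLA config from its own terminal with CUDA_DEVICE_MAX_CONNECTIONS=1
$ CUDA_DEVICE_MAX_CONNECTIONS=1 deepstream-app -c <dla0config>
$ CUDA_DEVICE_MAX_CONNECTIONS=1 deepstream-app -c <dla1config>
$ deepstream-app -c <gpuconfig>

# Single process: all three configs in one deepstream-app instance
$ CUDA_DEVICE_MAX_CONNECTIONS=32 deepstream-app -c <gpuconfig> -c <dla0config> -c <dla1config>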