Performance

The DeepStream application is benchmarked across various NVIDIA TAO Toolkit and open source models. The measured performance is the end-to-end performance of the entire video analytics application, covering video capture and decode, pre-processing, batching, inference, and post-processing to generate metadata. Output rendering is turned off to achieve peak inference performance. For information on disabling output rendering, see the DeepStream Reference Application - deepstream-app chapter.

TAO Toolkit Pre-trained models

TAO Toolkit has a set of pre-trained models, listed in the table below. If these models satisfy your requirements, you should start with one of them. They can be used for various applications in smart cities or smart places. If your application is beyond the scope of these models, you can re-train one of the popular model architectures using TAO Toolkit. The table below shows the end-to-end performance of highly accurate pre-trained models from TAO Toolkit; the second table shows the expected performance of a few other TAO Toolkit models. All models are available on NGC. These models are natively integrated with DeepStream, and the instructions to run them are in /opt/nvidia/deepstream/deepstream-5.1/samples/configs/tlt_pretrained_models/.

Performance - pretrained models

| Model Arch | Inference resolution | Precision | Jetson Nano GPU (FPS) | Jetson TX2 GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) | A100 PCIe GPU (FPS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PeopleNet - ResNet34 | 960x544 | INT8 | 10.7 | 28 | 168 | 54 | 54 | 292 | 70 | 70 | 890 | 3392 |
| PeopleNet - ResNet18 | 960x544 | INT8 | 13.9 | 35 | 218 | 72 | 72 | 395 | 97 | 97 | 1086 | 3841 |
| TrafficCamNet - ResNet18 | 960x544 | INT8 | 19.5 | 52 | 264 | 105 | 105 | 478 | 140 | 140 | 1358 | 4013 |
| DashCamNet - ResNet18 | 960x544 | INT8 | 17.8 | 46 | 254 | 100 | 100 | 453 | 133 | 133 | 1320 | 3993 |
| FaceDetectIR - ResNet18 | 384x240 | INT8 | 101 | 275 | 1192 | 553 | 553 | 2010 | 754 | 754 | 2568 | 5549 |
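
The FPS figures in the table above can be turned into a rough estimate of how many 30 fps streams a platform could serve with a given model. The following is a minimal Python sketch, assuming each FPS figure is the aggregate throughput across all streams; the function name max_streams is hypothetical and not part of the DeepStream SDK. Treat the result as an upper bound, since decode capacity and other pipeline stages can limit the stream count further.

```python
import math


def max_streams(aggregate_fps: float, stream_fps: float = 30.0) -> int:
    """Estimate how many streams at stream_fps fit into aggregate_fps.

    Hypothetical helper: assumes the published FPS is the total throughput
    summed over all streams, which is an assumption of this sketch.
    """
    if stream_fps <= 0:
        raise ValueError("stream_fps must be positive")
    # Round down: a partially served stream would drop below stream_fps.
    return math.floor(aggregate_fps / stream_fps)


if __name__ == "__main__":
    # FPS values copied from the table above.
    print(max_streams(890))    # PeopleNet - ResNet34 on T4 -> 29
    print(max_streams(3392))   # PeopleNet - ResNet34 on A100 PCIe -> 113
    print(max_streams(10.7))   # PeopleNet - ResNet34 on Jetson Nano -> 0
```

A result of 0 means the platform cannot sustain even one full 30 fps stream with that model at the listed resolution.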

All the models in the table above can run solely on the DLA. This saves valuable GPU resources for running more complex models. TAO Toolkit also supports training on popular detection and segmentation architectures. To learn more about how to train with TAO Toolkit, refer to the TAO Toolkit documentation. These models are natively integrated with DeepStream. The reference models are available to download from GitHub.

Performance - pretrained models - detection and segmentation

| Model Arch | Inference resolution | Precision | Jetson Nano GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) |
|---|---|---|---|---|---|---|---|---|---|---|
| YoloV3 - ResNet18 | 960x544 | INT8 | 11 | 78 | 55 | 55 | 223 | 84 | 84 | 620 |
| FasterRCNN - ResNet10 | 480x272 | INT8 | 16 | 127 | N/A | N/A | 281 | N/A | N/A | 932 |
| SSD - ResNet18 | 960x544 | INT8 | 10.6 | 124 | 56 | 56 | 216 | 77 | 77 | 760 |
| DSSD - ResNet18 | 960x544 | INT8 | 9 | 66 | 45 | 45 | 189 | 67 | 67 | 586 |
| RetinaNet - ResNet18 | 960x544 | INT8 | 8.5 | 60 | 45 | 45 | 147 | 41 | 41 | 296 |
| MaskRCNN - ResNet50 | 1344x832 | INT8 | 0.6 | 5.4 | 3.2 | 3.2 | 9.2 | 4.5 | 4.5 | 24 |

Note

  • The FasterRCNN model does not run efficiently on the DLA because multiple layers are not supported on the DLA.

  • All inference on Jetson Nano is done using FP16 precision.

DeepStream reference model and tracker

DeepStream SDK ships with a reference DetectNet_v2-ResNet10 model and three ResNet18 classifier models. The detailed instructions to run these models with DeepStream are provided in the next section. The table below shows the performance of these models along with various trackers. DeepStream provides three reference trackers: IoU, KLT, and NvDCF. For more information about trackers, see the Gst-nvtracker section.

Performance - DeepStream reference models

| Model Arch | Tracker | Inference resolution | Jetson Nano GPU (FPS) | Jetson TX2 GPU (FPS) | Jetson Xavier NX GPU (FPS) | Jetson Xavier NX DLA1 (FPS) | Jetson Xavier NX DLA2 (FPS) | Jetson AGX Xavier GPU (FPS) | Jetson AGX Xavier DLA1 (FPS) | Jetson AGX Xavier DLA2 (FPS) | T4 GPU (FPS) | A100 PCIe GPU (FPS) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DetectNetV2 - ResNet10 | No Tracker | 480x272 | 92 | 241 | 1075 | 502 | 502 | 1815 | 692 | 692 | 2547 | 5541 |
| DetectNetV2 - ResNet10 | KLT | 480x272 | 90 | 231 | 512 | 276 | 276 | 905 | 475 | 475 | 2236 | 1345 |
| DetectNetV2 - ResNet10 | NvDCF | 480x272 | 48 | 160 | 522 | 265 | 265 | 921 | 460 | 460 | 2334 | 4819 |
| DetectNetV2 - ResNet10 + ResNet18 classifiers | IoU | 480x272 + 224x224 | 51 | 136 | 720 | N/A | N/A | 1399 | N/A | N/A | 2342 | 5534 |
| DetectNetV2 - ResNet10 + ResNet18 classifiers | KLT | 480x272 + 224x224 | 48 | 126 | 442 | N/A | N/A | 780 | N/A | N/A | 2290 | 1373 |
| DetectNetV2 - ResNet10 + ResNet18 classifiers | NvDCF | 480x272 + 224x224 | 34 | 99 | 415 | N/A | N/A | 766 | N/A | N/A | 2326 | 4829 |

Note

  • * - A performance bottleneck has been identified and will be fixed in a future release.

  • All inferences are done using INT8 precision except on Jetson Nano™. On Nano, it is FP16.

  • Running inference simultaneously on multiple models is not supported on the DLA. You can only run one model at a time on the DLA.

To achieve the peak performance shown in the tables above, make sure the devices are properly cooled. For T4, make sure you use a server that meets the thermal and airflow requirements. Along with the hardware setup, a few other options in the config file need to be set to achieve the published performance. Make the required changes to one of the config files from DeepStream SDK to replicate the peak performance.

Turn off output rendering, OSD, and tiler

OSD (on-screen display) is used to display bounding boxes, masks, and labels on the screen. If output rendering is disabled, drawing bounding boxes is not required unless the output needs to be streamed over RTSP or saved to disk. The tiler is used to display the output in an NxM tiled grid; it is not needed if rendering is disabled. Output rendering, OSD, and the tiler each use some compute resources, so they can reduce inference performance.

To disable OSD, tiled display and output sink, make the following changes in the DeepStream config file.

  • To disable OSD, change enable to 0

    [osd]
    enable=0
    
  • To disable tiling, change enable to 0

    [tiled-display]
    enable=0
    
  • To turn off output rendering, change the sink to fakesink.

    [sink0]
    enable=1
    #Type - 1=FakeSink 2=EglSink 3=File
    type=1
    sync=0
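
The three changes above can also be applied programmatically. The following is a minimal Python sketch, assuming the config file can be parsed by Python's standard configparser module; the sample config text and the function name disable_rendering are hypothetical. Note that configparser drops comment lines when writing, so treat this as an illustration rather than a drop-in tool.

```python
import configparser
import io

# Settings taken from the steps above: OSD off, tiler off, sink0 as fakesink.
OVERRIDES = {
    "osd": {"enable": "0"},
    "tiled-display": {"enable": "0"},
    "sink0": {"enable": "1", "type": "1", "sync": "0"},
}

# Hypothetical, shortened config used only for illustration.
SAMPLE_CONFIG = """\
[osd]
enable=1

[tiled-display]
enable=1

[sink0]
enable=1
type=2
sync=1
"""


def disable_rendering(config_text: str) -> str:
    """Return config_text with OSD, tiler, and on-screen rendering disabled."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(config_text)
    for group, settings in OVERRIDES.items():
        if not parser.has_section(group):
            parser.add_section(group)
        for key, value in settings.items():
            parser.set(group, key, value)
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


if __name__ == "__main__":
    print(disable_rendering(SAMPLE_CONFIG))
```

Working on a copy of the config file keeps the original, with its comments, available for reference.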
    

DeepStream reference model

Data center GPU - GA100

This section describes configuration and settings for the DeepStream SDK on NVIDIA Data center GPU - GA100.

System Configuration

The system configuration for the DeepStream SDK is listed below:

GA100 System configuration

| System Configuration | Specification |
|---|---|
| CPU | AMD EPYC 7742@2.25GHz 3.4GHz Turbo (Rome) HT Off |
| GPU | A100-PCIE-40GB(GA100) 1*40537 MiB 1*108 SM |
| Ubuntu | Ubuntu 18.04 |
| GPU Driver | 460.32 |
| CUDA | 11.1 |
| TensorRT | 7.2.X+ |
| GPU clock frequency | 1410 MHz |

Application Configuration

Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt

Change the following items in the config file:

  • The inference resolution of Primary GIE is specified in the samples/models/Primary_detector/resnet10.prototxt.

  • Change the dim to 480x272.

  • Change batch size under streammux and primary-gie to match the number of streams.

  • Disable tiled display and rendering using instructions above.

  • Enable IoU tracker.
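
A quick check of the batch-size change can be sketched in Python. This is a minimal sketch, assuming the batch size is stored under the batch-size key in the [streammux] and [primary-gie] groups and that the file parses with Python's standard configparser module; the function name batch_size_mismatches and the sample text are hypothetical.

```python
import configparser

# Hypothetical, shortened config used only for illustration.
SAMPLE_CONFIG = """\
[streammux]
batch-size=4

[primary-gie]
enable=1
batch-size=4
"""


def batch_size_mismatches(config_text: str, num_streams: int) -> dict:
    """Return the groups whose batch-size differs from num_streams.

    The result maps each offending group name to the value found
    (None if the group or key is missing).
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep key names exactly as written
    parser.read_string(config_text)
    mismatches = {}
    for group in ("streammux", "primary-gie"):
        value = parser.get(group, "batch-size", fallback=None)
        if value is None or int(value) != num_streams:
            mismatches[group] = value
    return mismatches


if __name__ == "__main__":
    # The sample is configured for 4 streams, so 4 passes and 8 does not.
    print(batch_size_mismatches(SAMPLE_CONFIG, 4))
    print(batch_size_mismatches(SAMPLE_CONFIG, 8))
```

An empty result means both groups already match the intended number of streams.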

The application configuration for the DeepStream SDK is listed below:

GA100 application configuration

| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps stream | sample_1080p_h265.mp4 (provided with the SDK) N=158; sample_1080p_h264.mp4 (provided with the SDK) N=91 |
| Primary GIE | Resnet10 (480×272); Batch Size = N; Interval = 0 |
| Tracker | Enabled. Processing at 480×272 resolution, IoU tracker enabled. |
| 3 × Secondary GIEs | All batches size 32. Asynchronous mode enabled. Secondary_VehicleTypes (224×224, Resnet18); Secondary_CarColor (224×224, Resnet18); Secondary_CarMake (224×224, Resnet18) |
| Tiled Display | Disabled |
| Rendering | Disabled |

Achieved Performance

The table below shows the achieved performance of the DeepStream SDK under the specified system and application configuration:

| Stream type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 158 | 4.5% | 46.08% |
| H.264 | 91 | 2.83% | 28.69% |

Data center GPU - T4

This section describes configuration and settings for the DeepStream SDK on NVIDIA Data center GPU - T4.

System Configuration

The system configuration for the DeepStream SDK is listed below:

T4 System configuration

| System Configuration | Specification |
|---|---|
| CPU | Dual Intel® Xeon® CPU E5-2650 v4 @ 2.20GHz (48 threads total) |
| GPU | Tesla T4* |
| System Memory | 128 GB DDR4, 2400MHz |
| Ubuntu | Ubuntu 18.04 |
| GPU Driver | 460.32.X |
| CUDA | 11.1 |
| TensorRT | 7.2.X |
| GPU clock frequency | 1513 MHz |

Application Configuration

Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt

Change the following items in the config file:

  • The inference resolution of Primary GIE is specified in the samples/models/Primary_detector/resnet10.prototxt.

  • Change the dim to 480x272.

  • Change batch size under streammux and primary-gie to match the number of streams.

  • Disable tiled display and rendering using instructions above.

  • Enable IoU tracker.

The application configuration for the DeepStream SDK is listed below:

T4 application configuration

| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps stream | sample_1080p_h265.mp4 (provided with the SDK) N=64; sample_1080p_h264.mp4 (provided with the SDK) N=39 |
| Primary GIE | Resnet10 (480×272); Batch Size = N; Interval = 0 |
| Tracker | Enabled. Processing at 480×272 resolution, IoU tracker enabled. |
| 3 × Secondary GIEs | All batches size 32. Asynchronous mode enabled. Secondary_VehicleTypes (224×224, Resnet18); Secondary_CarColor (224×224, Resnet18); Secondary_CarMake (224×224, Resnet18) |
| Tiled Display | Disabled |
| Rendering | Disabled |

Achieved Performance

The table below shows the achieved performance of the DeepStream SDK under the specified system and application configuration:

| Stream type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 64 | 8% to 10% | 58% |
| H.264 | 39 | 5% | 31% |

Jetson

This section describes configuration and settings for the DeepStream SDK on NVIDIA Jetson™ platforms. JetPack 4.5.1 is used for software installation.

System Configuration

For the performance test:

  1. Max power mode is enabled: $ sudo nvpmodel -m 0

  2. The GPU clocks are stepped to maximum: $ sudo jetson_clocks

For information about supported power modes, see the “Supported Modes and Power Efficiency” section in the power management topics of NVIDIA Tegra Linux Driver Package Development Guide, e.g., “Power Management for Jetson AGX Xavier Devices.”

Jetson Nano

Config file: source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_nano.txt

Change the following items in the config file:

  • Change batch size under streammux and primary-gie to match the number of streams.

  • Disable tiled display and rendering using instructions above.

  • Enable KLT tracker and change the tracker resolution to 480x272.

The following tables describe performance results for the NVIDIA Jetson Nano.

Jetson Nano application configuration

| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N = 8; sample_1080p_h264.mp4 (provided with the SDK) N = 8 |
| Primary GIE | Resnet10 (480×272), asynchronous mode enabled; Batch Size = N; Interval = 4 |
| Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
| OSD/tiled display | Disabled |
| Renderer | Disabled |

Achieved Performance

| Stream type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 8 | 39% | 67% |
| H.264 | 8 | 39% | 65% |

Jetson AGX Xavier

Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt

Change the following items in the config file:

  • The inference resolution of Primary GIE is specified in the ‘samples/models/Primary_detector/resnet10.prototxt’.

  • Change the dim to 480x272.

  • Change batch size under streammux and primary-gie to match the number of streams.

  • Disable tiled display and rendering using instructions above.

  • Enable IOU tracker.

The following tables describe performance results for the NVIDIA Jetson AGX Xavier™.

Jetson AGX Xavier Pipeline Configuration (deepstream-app)

| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N=45; sample_1080p_h264.mp4 (provided with the SDK) N=32 |
| Primary GIE | Resnet10 (480×272), asynchronous mode enabled; Batch Size = N; Interval = 0 |
| Tracker | Enabled; processing at 480×272 resolution, IoU tracker enabled. |
| 3× secondary GIEs | All batches are size 32. Secondary_VehicleTypes (224×224, Resnet18); Secondary_CarColor (224×224, Resnet18); Secondary_CarMake (224×224, Resnet18) |
| OSD/tiled display | Disabled |
| Renderer | Disabled |

Achieved Performance

| Stream type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 45 | 22% | 95% |
| H.264 | 32 | 19% | 71% |

Jetson NX

Config file: source4_1080p_dec_infer-resnet_tracker_sgie_tiled_display_int8.txt

Change the following items in the config file:

  • The inference resolution of Primary GIE is specified in the ‘samples/models/Primary_detector/resnet10.prototxt’.

  • Change the dim to 480x272.

  • Change batch size under streammux and primary-gie to match the number of streams.

  • Disable tiled display and rendering using instructions above.

  • Enable IOU tracker.

The following tables describe performance results for the NVIDIA Jetson NX™.

Jetson NX Pipeline Configuration (deepstream-app)

| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N=23; sample_1080p_h264.mp4 (provided with the SDK) N=16 |
| Primary GIE | Resnet10 (480×272), asynchronous mode enabled; Batch Size = N; Interval = 0 |
| Tracker | Enabled; processing at 480×272 resolution, IoU tracker enabled. |
| 3× secondary GIEs | All batches are size 32. Secondary_VehicleTypes (224×224, Resnet18); Secondary_CarColor (224×224, Resnet18); Secondary_CarMake (224×224, Resnet18) |
| OSD/tiled display | Disabled |
| Renderer | Disabled |

Achieved Performance

| Stream type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 23 | 55% | 93% |
| H.264 | 16 | 45% | 65% |

Jetson TX2

Config file: source12_1080p_dec_infer-resnet_tracker_tiled_display_fp16_tx2.txt

Change the following in the config file:

  • Change batch size under streammux and primary-gie to match the number of streams.

  • Disable tiled display and rendering using instructions above.

  • Enable KLT tracker and change the tracker resolution to 480x272.

The following tables describe performance results for the Jetson™ TX2.

Jetson TX2 Pipeline Configuration (deepstream-app)

| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N=15; sample_1080p_h264.mp4 (provided with the SDK) N=14 |
| Primary GIE | Resnet10 (480×272), asynchronous mode enabled; Batch Size = N; Interval = 4 |
| Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
| OSD/tiled display | Disabled |
| Renderer | Disabled |

Achieved Performance

| Stream type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 15 | 35% | 47% |
| H.264 | 14 | 34% | 43% |

Jetson TX1

Config file: source8_1080p_dec_infer-resnet_tracker_tiled_display_fp16_tx1.txt

Change the following in the config file:

  • Change batch size under streammux and primary-gie to match the number of streams.

  • Disable tiled display and rendering using instructions above.

  • Enable KLT tracker and change the tracker resolution to 480x272.

The following tables describe performance results for the Jetson™ TX1.

Jetson TX1 Pipeline Configuration (deepstream-app)

| Application Configuration | Specification |
|---|---|
| N×1080p 30 fps streams | sample_1080p_h265.mp4 (provided with the SDK) N=13; sample_1080p_h264.mp4 (provided with the SDK) N=10 |
| Primary GIE | Resnet10 (480×272), asynchronous mode enabled; Batch Size = N; Interval = 4 |
| Tracker | Enabled; processing at 480×272 resolution, KLT tracker enabled |
| OSD/tiled display | Disabled |
| Renderer | Disabled |

Achieved Performance

| Stream type | No. of Streams @ 30 FPS | CPU Utilization | GPU Utilization |
|---|---|---|---|
| H.265 | 13 | 56% | 49% |
| H.264 | 10 | 43% | 43% |

Running applications using DLA

Jetson AGX Xavier and Jetson NX support 2 DLA engines. DeepStream supports inferencing using the GPU and the DLAs in parallel. You can do this in separate processes or in a single process. You will need three separate sets of configs, configured to run on GPU, DLA0, and DLA1:

  • Separate processes:

When the GPU and DLA are run in separate processes, set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1 in the terminal where the DLA config is running.

  • Single process:

The DeepStream reference application supports multiple configs in the same process. To run DLA and GPU in the same process, set the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 32:

    $ deepstream-app -c <gpuconfig> -c <dla0config> -c <dla1config>
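
The two launch modes can be scripted. The following is a minimal Python sketch that only builds the command line and environment for each mode; the helper name build_launch and the config file names are hypothetical placeholders, and the actual launch call is left as a comment because it requires deepstream-app to be installed.

```python
import os


def build_launch(configs, max_connections):
    """Build the deepstream-app command and environment for one process.

    configs: config file paths, each passed with its own -c option.
    max_connections: value for CUDA_DEVICE_MAX_CONNECTIONS
    (1 for a separate DLA process, 32 for a single shared process).
    """
    env = dict(os.environ)
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(max_connections)
    argv = ["deepstream-app"]
    for config in configs:
        argv.extend(["-c", config])
    return argv, env


if __name__ == "__main__":
    # Single process: GPU and both DLAs share one deepstream-app instance.
    # The file names below are placeholders, not files shipped with the SDK.
    argv, env = build_launch(
        ["gpu_config.txt", "dla0_config.txt", "dla1_config.txt"], 32
    )
    print(" ".join(argv), env["CUDA_DEVICE_MAX_CONNECTIONS"])

    # Separate processes: the process that runs a DLA config uses 1.
    dla_argv, dla_env = build_launch(["dla0_config.txt"], 1)
    print(" ".join(dla_argv), dla_env["CUDA_DEVICE_MAX_CONNECTIONS"])

    # To launch for real (requires deepstream-app on the PATH):
    #   import subprocess
    #   subprocess.run(argv, env=env, check=True)
```

Setting the variable in a per-process environment, rather than globally, keeps the two modes from interfering with each other on the same device.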