RT-CV Performance
Overview
The Real-Time Computer Vision (RT-CV) microservice uses the NVIDIA DeepStream SDK to perform continuous object detection and multi-object tracking on live RTSP streams. The benchmarks measure the maximum number of concurrent 1080p streams that a single GPU can sustain at 30 FPS with object detection and the tracker enabled.
Figure: Max concurrent 1080p streams per GPU. RT-DETR Resnet50 provides the highest stream count on server GPUs (H100, RTX Pro 6000 SE); edge platforms (DGX Spark, AGX Thor) are best suited for lighter-weight deployments.
Test Configuration

| Parameter | Value |
|---|---|
| VSS Release | 3.1 |
| Video resolution | 1920×1080 (1080p) |
| Input format | H.264 |
| Configured stream FPS | 30 |
| Model precision | FP16 |
| Inference engine | TensorRT (TRT) |
| Tracker | Enabled |
| Models tested | RT-DETR (Resnet50 backbone), RT-DETR (EfficientViT/L2 backbone), Grounding DINO (GDINO) |
| GPUs tested | H100, RTX Pro 6000 SE, L40S, DGX Spark, AGX Thor |
Performance by GPU

| Model | Backbone | Max Streams | Avg Latency (ms) | p90 Latency (ms) | p95 Latency (ms) | GPU Core (%) | CPU Core (%) |
|---|---|---|---|---|---|---|---|
| RT-DETR | Resnet50 | 50 | 408.46 | 858.32 | 901.02 | 92.1 | 2.6 |
| RT-DETR | EfficientViT/L2 | 11 | 56.0 | 67.34 | 69.11 | 88.4 | 1.1 |
| Grounding DINO | - | 6 | 61.47 | 75.25 | 75.61 | 90.0 | 0.9 |

| Model | Backbone | Max Streams | Avg Latency (ms) | p90 Latency (ms) | p95 Latency (ms) | GPU Core (%) | CPU Core (%) |
|---|---|---|---|---|---|---|---|
| RT-DETR | Resnet50 | 29 | 196.59 | 258.85 | 327.31 | 90.6 | 1.3 |
| RT-DETR | EfficientViT/L2 | 17 | 62.02 | 73.18 | 74.84 | 87.0 | 0.8 |
| Grounding DINO | - | 5 | 61.08 | 70.09 | 72.43 | 86.5 | 0.6 |

| Model | Backbone | Max Streams | Avg Latency (ms) | p90 Latency (ms) | p95 Latency (ms) | GPU Core (%) | CPU Core (%) |
|---|---|---|---|---|---|---|---|
| RT-DETR | Resnet50 | 15 | 65.64 | 72.97 | 73.72 | 87.3 | 0.9 |
| RT-DETR | EfficientViT/L2 | 4 | 53.14 | 66.07 | 70.5 | 88.9 | 0.6 |
| Grounding DINO | - | 3 | 52.35 | 64.27 | 68.04 | 84.6 | 0.5 |

| Model | Backbone | Max Streams | Avg Latency (ms) | p90 Latency (ms) | p95 Latency (ms) | GPU Core (%) | CPU Core (%) |
|---|---|---|---|---|---|---|---|
| RT-DETR | Resnet50 | 5 | 171.91 | 206.27 | 221.29 | 95.5 | 21.9 |
| RT-DETR | EfficientViT/L2 | 3 | 116.67 | 127.64 | 128.9 | 95.0 | 19.4 |
| Grounding DINO* | - | 1 | 26.71 | 44.37 | 45.07 | 54.3 | 13.7 |

| Model | Backbone | Max Streams | Avg Latency (ms) | p90 Latency (ms) | p95 Latency (ms) | GPU Core (%) | CPU Core (%) |
|---|---|---|---|---|---|---|---|
| RT-DETR | Resnet50 | 4 | 56.39 | 68.25 | 72.33 | 59.8 | 22.3 |
| RT-DETR | EfficientViT/L2 | 3 | 60.31 | 76.68 | 78.48 | 88.1 | 20.6 |
| Grounding DINO* | - | 1 | 43.23 | 62.2 | 64.0 | 67.7 | 17.6 |
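
Because every stream is configured at 30 FPS, a maximum stream count converts directly into the aggregate frame rate the GPU has to decode and process. The following is a minimal Python sketch of that arithmetic. The function names `aggregate_fps` and `frame_period_ms` are hypothetical helpers for illustration and are not part of RT-CV or DeepStream; the stream counts in the loop are the RT-DETR Resnet50 values from the tables.

```python
def aggregate_fps(streams: int, fps_per_stream: float = 30.0) -> float:
    """Total frames per second arriving across all concurrent streams."""
    if streams < 0 or fps_per_stream <= 0:
        raise ValueError("streams must be >= 0 and fps_per_stream must be > 0")
    return streams * fps_per_stream


def frame_period_ms(fps_per_stream: float = 30.0) -> float:
    """Time between consecutive frames of a single stream, in milliseconds."""
    if fps_per_stream <= 0:
        raise ValueError("fps_per_stream must be > 0")
    return 1000.0 / fps_per_stream


if __name__ == "__main__":
    # RT-DETR Resnet50 maximum stream counts from the tables above.
    for streams in (50, 29, 15, 5, 4):
        print(f"{streams:>2} streams -> {aggregate_fps(streams):.0f} frames/s")
    print(f"per-stream frame period: {frame_period_ms():.1f} ms")
```

This counts frames arriving from the sources. When inference is configured to skip frames, the number of frames actually inferred per second is lower.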
Note
All benchmarks were measured at 30 FPS with 1080p H.264 input, FP16 precision on TensorRT, and the object tracker enabled. For production deployments, plan for 10–15% headroom below the maximum stream counts listed above.
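
The headroom guidance can be turned into a simple sizing calculation. The sketch below assumes "headroom" means running each GPU at 85–90% of its benchmarked maximum, rounded down to whole streams; that reading, and the helper names `planned_streams` and `gpus_needed`, are illustrative assumptions rather than anything defined by RT-CV.

```python
def planned_streams(max_streams: int, headroom_pct: int = 15) -> int:
    """Streams to schedule per GPU after reserving headroom_pct percent.

    Integer arithmetic is used so the result is rounded down exactly.
    """
    if max_streams < 0:
        raise ValueError("max_streams must be >= 0")
    if not 0 <= headroom_pct < 100:
        raise ValueError("headroom_pct must be in [0, 100)")
    return max_streams * (100 - headroom_pct) // 100


def gpus_needed(cameras: int, max_streams: int, headroom_pct: int = 15) -> int:
    """GPUs required to serve `cameras` streams at the planned per-GPU load."""
    per_gpu = planned_streams(max_streams, headroom_pct)
    if per_gpu == 0:
        raise ValueError("headroom leaves no usable streams on this GPU")
    return -(-cameras // per_gpu)  # ceiling division


if __name__ == "__main__":
    # Example: a benchmarked maximum of 50 streams per GPU.
    print(planned_streams(50, 15))   # 42
    print(planned_streams(50, 10))   # 45
    print(gpus_needed(100, 50, 15))  # 3
```

For configurations whose benchmarked maximum is a single stream, rounding down leaves no usable capacity, so the percentage rule cannot be applied mechanically there and the sketch raises an error instead.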
Note
* Grounding DINO on DGX Spark and AGX Thor runs with interval=1, meaning inference is performed on every alternate frame, to reach the reported stream counts.
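
To illustrate what interval=1 implies for the inference rate, the sketch below assumes the DeepStream convention in which `interval` is the number of consecutive frames (batches) skipped between inferences. The helper names are hypothetical, and which frame in each cycle is inferred first is an assumption made for illustration only.

```python
def inferred_frames(num_frames: int, interval: int) -> list[int]:
    """Indices of frames that run inference when `interval` frames are
    skipped after each inferred frame (assumes frame 0 is inferred)."""
    if num_frames < 0 or interval < 0:
        raise ValueError("num_frames and interval must be >= 0")
    return [i for i in range(num_frames) if i % (interval + 1) == 0]


def inference_rate(stream_fps: float, interval: int) -> float:
    """Inferences per second for one stream at the given skip interval."""
    if interval < 0:
        raise ValueError("interval must be >= 0")
    return stream_fps / (interval + 1)


if __name__ == "__main__":
    print(inferred_frames(6, 1))    # [0, 2, 4]
    print(inference_rate(30.0, 1))  # 15.0
```

Under these assumptions, a 30 FPS stream with interval=1 runs inference 15 times per second, with the tracker presumably carrying objects across the skipped frames.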