VIOS Performance

Runtime profiling of the VIOS microservice as a function of the number of RTSP/WebRTC streams.

Profiling of VIOS Component

| Description | GPU Utilization | CPU Memory | CPU Utilization | Comment |
|---|---|---|---|---|
| RTSP Streams Perf | Decoder: 11.33%, Encoder: 0%, GPU: 26% | 10.21 GiB | 28.41% | Number of Streams: 30. GPU usage will go up based on how many users are accessing the UI and using the overlay feature. |
| WebRTC Streams Perf | Decoder: 35%, Encoder: 0%, GPU: 41% | 10.5 GiB | 73.00% | Number of Streams: 30. GPU usage will go up based on how many users are accessing the UI and using the overlay feature. |

Metrics were obtained on the following system configuration:

  • GPU: NVIDIA A100

  • CPU: AMD EPYC 7313P @ 3 GHz, 16 cores


Streams per Replica/POD

The table below summarizes the maximum number of concurrent sensor streams that a single Unified Stream Processing MS replica (pod) can support with the default VIOS configuration (out-of-the-box deployment, no custom tuning). All test streams use the same profile — H.264, 1080p, 30 FPS, 4 Mbps.

Streams Supported per Replica

| Test | GPU | Use Case | Max Streams per Replica/Pod |
|---|---|---|---|
| 1 | NVIDIA L40 (HW encoder via NVENC) | Recording × 70; RTSP × 70; Video Download × 70 (offset 100, with transcode); Picture Download × 20; WebRTC × 8 | 70 |
| 2 | NVIDIA H100 (SW encoder, no NVENC) | Recording × 20; RTSP × 20; Video Download × 20 (offset 100, with transcode); Picture Download × 10; WebRTC × 8 | 20 |

Note

These results were measured on x86_64 hosts with the following specs:

  • Test 1 (L40): Intel Xeon Platinum 8362 @ 2.80 GHz (128 CPUs) and ~1 TB RAM.

  • Test 2 (H100): Intel Xeon Platinum 8480+ (224 CPUs) and ~2 TB RAM.

Stream counts may differ on other host architectures or on smaller / lower-spec hosts.
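As a rough capacity-planning aid, the per-replica limits above can be turned into a replica count with a ceiling division. The sketch below is illustrative only: the function name `replicas_needed` and the dictionary are hypothetical, and it assumes that every stream matches the test profile (H.264, 1080p, 30 FPS, 4 Mbps) and that replicas scale independently of each other, which the measurements above do not confirm.

```python
import math

# Max concurrent streams per replica, copied from the table above.
# Keys are illustrative labels, not identifiers used by VIOS.
MAX_STREAMS_PER_REPLICA = {
    "L40": 70,   # Test 1: hardware encoder via NVENC
    "H100": 20,  # Test 2: software encoder, no NVENC
}


def replicas_needed(total_streams: int, gpu: str) -> int:
    """Return the minimum number of replicas for total_streams on the given GPU.

    Assumes each replica can carry the full measured limit, which may not
    hold on smaller or differently configured hosts.
    """
    if total_streams < 0:
        raise ValueError("total_streams must be non-negative")
    per_replica = MAX_STREAMS_PER_REPLICA[gpu]
    return math.ceil(total_streams / per_replica)


if __name__ == "__main__":
    for gpu in MAX_STREAMS_PER_REPLICA:
        print(f"{gpu}: {replicas_needed(200, gpu)} replica(s) for 200 streams")
```

Treat the result as a lower bound and validate it on the target host, since stream counts may differ on other hardware.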


Per-Request Latency Benchmarks

End-to-end request latency was measured on an NVIDIA RTX PRO 6000 Blackwell GPU paired with an Intel(R) Xeon(R) Gold 6444Y CPU while the Unified Stream Processing MS served video download and picture API requests over RTSP. Each scenario is measured with both H.264 and H.265 streams using the "Very recent" recency pattern, so codec impact can be read directly off each table. All latency values (Avg Latency, P50, P99, Max) are in milliseconds.
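For readers reproducing these columns from their own measurements, the following sketch shows one common way to derive Avg, P50, P99, and Max from a list of per-request latencies. It uses the nearest-rank percentile definition, which is an assumption: the source does not state which percentile method or how many samples per scenario were used. The sample values are made up for illustration.

```python
import math
import statistics


def nearest_rank_percentile(sorted_samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


def summarize(latencies_ms):
    """Return avg, p50, p99, and max (all in ms) for a list of latencies."""
    ordered = sorted(latencies_ms)
    return {
        "avg": round(statistics.fmean(ordered)),
        "p50": nearest_rank_percentile(ordered, 50),
        "p99": nearest_rank_percentile(ordered, 99),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical per-request latencies in milliseconds.
    samples = [182, 181, 183, 188, 193, 180, 182, 184, 181, 182]
    print(summarize(samples))
```

With the nearest-rank method, P99 equals Max whenever a scenario has fewer than 100 samples, so identical P99 and Max values in a row do not by themselves indicate a measurement problem.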

Single-Stream Video Download

Latency for a single in-flight video-download request, measured for both H.264 and H.265 with and without server-side transcoding across nine (clip duration, offset) combinations.

Single-Stream Video Download (NVIDIA RTX PRO 6000 Blackwell)

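| Codec | Clip Duration | Offset (ms) | Transcode | Avg Latency (ms) | P50 (ms) | P99 (ms) | Max (ms) |
|---|---|---|---|---|---|---|---|
| H.264 | 15s | 100 | without | 183 | 182 | 188 | 193 |
| H.265 | 15s | 100 | without | 196 | 184 | 300 | 300 |
| H.264 | 15s | 500 | without | 36 | 37 | 40 | 41 |
| H.265 | 15s | 500 | without | 34 | 34 | 36 | 36 |
| H.264 | 15s | 1000 | without | 36 | 37 | 38 | 41 |
| H.265 | 15s | 1000 | without | 34 | 35 | 38 | 39 |
| H.264 | 30s | 100 | without | 184 | 180 | 185 | 216 |
| H.265 | 30s | 100 | without | 195 | 186 | 286 | 286 |
| H.264 | 30s | 500 | without | 65 | 46 | 249 | 249 |
| H.265 | 30s | 500 | without | 54 | 34 | 242 | 242 |
| H.264 | 30s | 1000 | without | 45 | 46 | 48 | 48 |
| H.265 | 30s | 1000 | without | 37 | 38 | 39 | 41 |
| H.264 | 60s | 100 | without | 177 | 178 | 187 | 187 |
| H.265 | 60s | 100 | without | 182 | 183 | 185 | 189 |
| H.264 | 60s | 500 | without | 74 | 73 | 79 | 80 |
| H.265 | 60s | 500 | without | 43 | 44 | 46 | 46 |
| H.264 | 60s | 1000 | without | 74 | 74 | 79 | 86 |
| H.265 | 60s | 1000 | without | 45 | 45 | 49 | 50 |
| H.264 | 15s | 100 | with | 466 | 479 | 489 | 491 |
| H.265 | 15s | 100 | with | 493 | 490 | 498 | 535 |
| H.264 | 15s | 500 | with | 474 | 479 | 488 | 494 |
| H.265 | 15s | 500 | with | 484 | 482 | 500 | 596 |
| H.264 | 15s | 1000 | with | 482 | 483 | 488 | 490 |
| H.265 | 15s | 1000 | with | 489 | 487 | 496 | 499 |
| H.264 | 30s | 100 | with | 493 | 480 | 601 | 601 |
| H.265 | 30s | 100 | with | 484 | 488 | 494 | 496 |
| H.264 | 30s | 500 | with | 479 | 476 | 486 | 490 |
| H.265 | 30s | 500 | with | 626 | 531 | 743 | 746 |
| H.264 | 30s | 1000 | with | 497 | 479 | 662 | 662 |
| H.265 | 30s | 1000 | with | 534 | 489 | 608 | 745 |
| H.264 | 60s | 100 | with | 960 | 974 | 986 | 999 |
| H.265 | 60s | 100 | with | 986 | 985 | 995 | 1002 |
| H.264 | 60s | 500 | with | 973 | 975 | 987 | 996 |
| H.265 | 60s | 500 | with | 983 | 987 | 995 | 996 |
| H.264 | 60s | 1000 | with | 974 | 972 | 984 | 986 |
| H.265 | 60s | 1000 | with | 817 | 753 | 999 | 1058 |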

Single-Stream Picture API

Latency for a single picture-API request, with and without overlay compositing on the returned frame.

Single-Stream Picture API (NVIDIA RTX PRO 6000 Blackwell)

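| Codec | Offset (ms) | Overlay | Avg Latency (ms) | P50 (ms) | P99 (ms) | Max (ms) |
|---|---|---|---|---|---|---|
| H.264 | 100 | without | 404 | 399 | 463 | 579 |
| H.265 | 100 | without | 524 | 501 | 603 | 606 |
| H.264 | 500 | without | 193 | 194 | 200 | 203 |
| H.265 | 500 | without | 252 | 271 | 303 | 313 |
| H.264 | 1000 | without | 190 | 191 | 197 | 199 |
| H.265 | 1000 | without | 233 | 218 | 305 | 308 |
| H.264 | 100 | with | 399 | 398 | 453 | 456 |
| H.265 | 100 | with | 400 | 401 | 497 | 497 |
| H.264 | 500 | with | 206 | 207 | 215 | 216 |
| H.265 | 500 | with | 251 | 221 | 318 | 322 |
| H.264 | 1000 | with | 206 | 206 | 212 | 218 |
| H.265 | 1000 | with | 237 | 233 | 266 | 333 |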

Concurrent Video Download

Latency under concurrent load for video-download requests, measured for both H.264 and H.265 across three concurrency levels and three offsets. Each clip is 15 seconds long.

Concurrent Video Download (NVIDIA RTX PRO 6000 Blackwell)

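| Codec | Concurrency | Offset (ms) | Transcode | Avg Latency (ms) | P50 (ms) | P99 (ms) | Max (ms) |
|---|---|---|---|---|---|---|---|
| H.264 | 3 | 100 | without | 231 | 231 | 231 | 232 |
| H.265 | 3 | 100 | without | 191 | 191 | 191 | 193 |
| H.264 | 3 | 500 | without | 53 | 53 | 53 | 54 |
| H.265 | 3 | 500 | without | 33 | 35 | 35 | 35 |
| H.264 | 3 | 1000 | without | 50 | 51 | 51 | 52 |
| H.265 | 3 | 1000 | without | 32 | 33 | 33 | 33 |
| H.264 | 7 | 100 | without | 213 | 213 | 219 | 222 |
| H.265 | 7 | 100 | without | 140 | 142 | 143 | 144 |
| H.264 | 7 | 500 | without | 97 | 100 | 107 | 109 |
| H.265 | 7 | 500 | without | 51 | 52 | 53 | 53 |
| H.264 | 7 | 1000 | without | 96 | 98 | 99 | 101 |
| H.265 | 7 | 1000 | without | 45 | 44 | 50 | 52 |
| H.264 | 10 | 100 | without | 304 | 303 | 317 | 318 |
| H.265 | 10 | 100 | without | 151 | 148 | 157 | 162 |
| H.264 | 10 | 500 | without | 237 | 160 | 1042 | 1042 |
| H.265 | 10 | 500 | without | 66 | 67 | 71 | 73 |
| H.264 | 10 | 1000 | without | 156 | 158 | 169 | 172 |
| H.265 | 10 | 1000 | without | 63 | 63 | 67 | 67 |
| H.264 | 3 | 100 | with | 476 | 483 | 483 | 488 |
| H.265 | 3 | 100 | with | 542 | 538 | 559 | 559 |
| H.264 | 3 | 500 | with | 667 | 668 | 668 | 685 |
| H.265 | 3 | 500 | with | 587 | 584 | 598 | 598 |
| H.264 | 3 | 1000 | with | 653 | 660 | 660 | 665 |
| H.265 | 3 | 1000 | with | 600 | 603 | 603 | 621 |
| H.264 | 7 | 100 | with | 1082 | 1070 | 1211 | 1211 |
| H.265 | 7 | 100 | with | 1202 | 1218 | 1224 | 1226 |
| H.264 | 7 | 500 | with | 1094 | 1098 | 1112 | 1113 |
| H.265 | 7 | 500 | with | 1252 | 1257 | 1267 | 1275 |
| H.264 | 7 | 1000 | with | 1030 | 1032 | 1053 | 1055 |
| H.265 | 7 | 1000 | with | 1303 | 1307 | 1359 | 1366 |
| H.264 | 10 | 100 | with | 1597 | 1584 | 1634 | 1638 |
| H.265 | 10 | 100 | with | 1832 | 1840 | 1897 | 1931 |
| H.264 | 10 | 500 | with | 1726 | 1735 | 1744 | 1753 |
| H.265 | 10 | 500 | with | 1699 | 1744 | 1784 | 1799 |
| H.264 | 10 | 1000 | with | 1641 | 1642 | 1669 | 1669 |
| H.265 | 10 | 1000 | with | 1776 | 1850 | 1882 | 1892 |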

Concurrent Picture API

Latency under concurrent load for picture-api requests, measured for both H.264 and H.265 across three concurrency levels and three offsets.

Concurrent Picture API (NVIDIA RTX PRO 6000 Blackwell)

| Codec | Concurrency | Offset (ms) | Overlay | Avg Latency (ms) | P50 (ms) | P99 (ms) | Max (ms) |
|---|---|---|---|---|---|---|---|
| H.264 | 3 | 100 | without | 634 | 679 | 679 | 693 |
| H.265 | 3 | 100 | without | 702 | 703 | 703 | 720 |
| H.264 | 3 | 500 | without | 426 | 420 | 466 | 466 |
| H.265 | 3 | 500 | without | 468 | 477 | 477 | 497 |
| H.264 | 3 | 1000 | without | 406 | 394 | 450 | 450 |
| H.265 | 3 | 1000 | without | 504 | 497 | 550 | 550 |
| H.264 | 7 | 100 | without | 972 | 1050 | 1141 | 1200 |
| H.265 | 7 | 100 | without | 1249 | 1226 | 1301 | 1351 |
| H.264 | 7 | 500 | without | 922 | 902 | 1032 | 1051 |
| H.265 | 7 | 500 | without | 1076 | 1138 | 1163 | 1186 |
| H.264 | 7 | 1000 | without | 896 | 916 | 1002 | 1047 |
| H.265 | 7 | 1000 | without | 1037 | 1077 | 1198 | 1344 |
| H.264 | 10 | 100 | without | 1473 | 1492 | 1647 | 1696 |
| H.265 | 10 | 100 | without | 1812 | 1845 | 1980 | 2031 |
| H.264 | 10 | 500 | without | 1326 | 1315 | 1459 | 1482 |
| H.265 | 10 | 500 | without | 1584 | 1595 | 1772 | 1792 |
| H.264 | 10 | 1000 | without | 1262 | 1264 | 1461 | 1496 |
| H.265 | 10 | 1000 | without | 1670 | 1778 | 1866 | 1939 |
| H.264 | 3 | 100 | with | 600 | 613 | 613 | 628 |
| H.265 | 3 | 100 | with | 634 | 656 | 656 | 670 |
| H.264 | 3 | 500 | with | 435 | 447 | 447 | 464 |
| H.265 | 3 | 500 | with | 538 | 537 | 557 | 557 |
| H.264 | 3 | 1000 | with | 449 | 449 | 449 | 467 |
| H.265 | 3 | 1000 | with | 491 | 480 | 530 | 530 |
| H.264 | 7 | 100 | with | 1102 | 1132 | 1194 | 1216 |
| H.265 | 7 | 100 | with | 1180 | 1247 | 1321 | 1342 |
| H.264 | 7 | 500 | with | 859 | 845 | 873 | 1246 |
| H.265 | 7 | 500 | with | 1124 | 1168 | 1251 | 1316 |
| H.264 | 7 | 1000 | with | 918 | 880 | 1073 | 1095 |
| H.265 | 7 | 1000 | with | 982 | 1064 | 1127 | 1152 |
| H.264 | 10 | 100 | with | 1437 | 1369 | 1652 | 1665 |
| H.265 | 10 | 100 | with | 1610 | 1595 | 1979 | 1992 |
| H.264 | 10 | 500 | with | 1295 | 1244 | 1543 | 1576 |
| H.265 | 10 | 500 | with | 1694 | 1615 | 1950 | 1983 |
| H.264 | 10 | 1000 | with | 1259 | 1309 | 1482 | 1520 |
| H.265 | 10 | 1000 | with | 1496 | 1530 | 1711 | 1776 |

Key Observations

  • Transcoding dominates single-stream video-download cost. With transcode disabled, latency is ~32-75 ms at offsets ≥ 500 ms; with transcode enabled it jumps to ~470-630 ms (15 s and 30 s clips) and ~820-1000 ms (60 s clips) — roughly an order of magnitude higher for the same workload.

  • Offset, not codec, drives non-transcode latency. Without transcode, the 100 ms-offset rows (~177-196 ms) are ~2.4-5.8× slower than the 500 ms / 1000 ms rows for the same clip duration and codec (~32-75 ms), because a smaller offset leaves less buffered data ready to serve. Beyond that offset effect, H.264 and H.265 are within tens of milliseconds of each other.

  • Transcode latency shows broad codec parity. With transcode enabled, H.264 and H.265 single-stream latencies are nearly identical at 15 s and 60 s clips (e.g., 60 s / 100 ms: 960 ms H.264 vs 986 ms H.265; 15 s / 100 ms: 466 ms vs 493 ms). At 30 s, H.265 runs modestly higher (e.g., 30 s / 500 ms: 626 ms H.265 vs 479 ms H.264), but neither codec carries a consistent transcode-cost advantage overall.

  • Transcode latency is roughly flat through 30 s, then steps up at 60 s. Both codecs stay around ~470-630 ms for 15 s and 30 s clips, then climb to ~820-1000 ms at 60 s — a step at the longest clip rather than smooth scaling with duration.

  • Picture API latency is tightly bounded. Single-stream picture-api stays within ~190-525 ms across both codecs and with/without overlay; overlay compositing adds at most ~16 ms to the average (e.g., H.264 / 1000 ms: 190 ms without vs 206 ms with). The 100 ms-offset rows (~399-524 ms) sit well above offsets ≥ 500 ms (~190-252 ms), and H.265 runs ~30-60 ms higher than H.264 at the larger offsets.

  • Concurrent video-download without transcode favors H.265. Under concurrent load H.265 is consistently faster than H.264 (e.g., 10 concurrent / 1000 ms: 63 ms H.265 vs 156 ms H.264). Scaling 3 → 10 concurrent at 1000 ms offset is sub- to near-linear: H.265 32 → 63 ms (~2.0×), H.264 50 → 156 ms (~3.1×) against the 3.3× concurrency increase.

  • Concurrent video-download with transcode scales ~3.4× and stays close to codec-neutral. At 100 ms offset, 3 → 10 concurrent goes 476 → 1597 ms for H.264 and 542 → 1832 ms for H.265 (~3.4× each). H.265 runs slightly heavier at 7-10 concurrent, with the gap staying within ~25% (e.g., 7 concurrent / 1000 ms: 1030 ms H.264 vs 1303 ms H.265).

  • Concurrent picture-api scales ~2.3-3.4× for the 3.3× concurrency increase and favors H.264. Across overlay modes, 3 → 10 concurrent goes from ~406-700 ms to ~1259-1812 ms: ~2.3-2.6× at the 100 ms offset and ~2.8-3.4× at the 500 ms and 1000 ms offsets. H.264 is modestly faster than H.265 under load (e.g., 10 concurrent / 100 ms without overlay: 1473 ms vs 1812 ms), and overlay overhead remains small.
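The scaling factors quoted in these observations are simple ratios of average latencies at 10 and 3 concurrent requests, compared against the 10/3 increase in concurrency. The sketch below recomputes a few of them from values in the concurrent tables; the row labels and the function name `scaling_factor` are illustrative, not part of any VIOS tooling.

```python
# (label, avg latency at 3 concurrent, avg latency at 10 concurrent), in ms.
# Values are copied from the concurrent tables above.
ROWS = [
    ("video-download, no transcode, 1000 ms, H.265", 32, 63),
    ("video-download, no transcode, 1000 ms, H.264", 50, 156),
    ("video-download, transcode, 100 ms, H.264", 476, 1597),
    ("video-download, transcode, 100 ms, H.265", 542, 1832),
    ("picture-api, no overlay, 100 ms, H.264", 634, 1473),
    ("picture-api, no overlay, 100 ms, H.265", 702, 1812),
]

# Going from 3 to 10 concurrent requests is a ~3.3x increase in load.
CONCURRENCY_RATIO = 10 / 3


def scaling_factor(latency_low_ms, latency_high_ms):
    """Ratio of latency at the higher concurrency to latency at the lower one."""
    return latency_high_ms / latency_low_ms


if __name__ == "__main__":
    for label, at_3, at_10 in ROWS:
        factor = scaling_factor(at_3, at_10)
        # Below the concurrency ratio means latency grew slower than load.
        relation = "sub-linear" if factor < CONCURRENCY_RATIO else "linear or worse"
        print(f"{label}: {factor:.2f}x ({relation})")
```

A factor below 10/3 means latency grew more slowly than the offered load; a factor at or above it suggests requests are effectively being serialized.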


How to Run Profiling Tools

Run the tools below from inside any VIOS container; they are located at /home/vst/vst_release/tools. The tools are based on x86_64 Ubuntu 22.04.

  • To measure RTSP streams perf:

    # Specify RTSP streams or VIOS/NVStreamer endpoint to play & get FPS.
    
    #./testRTSPClient
    Usage: ./testRTSPClient --urls "<rtsp-url-1>, <rtsp-url-2>, ... <rtsp-url-N>"
            (where each <rtsp-url-i> is a "rtsp://" URL)
    
    =================== Options =========================
    --tcp                                => stream over tcp (Default is udp)
    --urls                               => Specify rtsp urls in "<rtsp-url-1>, <rtsp-url-2>, ... <rtsp-url-N>"
    --live555 or -l                      => use live555 stack instead of gst plugin
    --decode                             => Publish decoder fps instead of source fps. Applicable only in case of gst
    --port <port_number>                 => client start port number
    --fps                                => Display received fps of the stream
    --fps-interval <interval_in_seconds> => Publish fps at this interval (Default interval is 5sec)
    --csv-file <filePath to dump fps>    => Provide path & filename Eg. fps_report.csv
    --num-streams <Max num of streams>   => Max number of streams to be played (Default no limit)
    --socket-buffer-size                 => OS socket buffer size in bytes (needs to be root)
    --jitter-buffer-size                 => JitterBuffer/reordering buffer size in ms (default 200ms)
    --gst-frame-debug                    => Print every frame info in case of gst
    --vst-endpoint <ip:port>             => VIOS endpoint to fetch rtsp streams
    --vst-rtspserver-endpoint <ip:port>  => VIOS rtspserver endpoint to fetch rtsp streams
    
    ================== Examples =========================
    1. Play two RTSP streams
            ./testRTSPClient --urls "rtsp://10.0.0.1:8554/stream1,rtsp://10.0.0.1:8554/stream2"
    
    2. Play two rtsp streams, stream over tcp & use client start port number as 30000
            ./testRTSPClient --tcp --port 30000 --urls "rtsp://10.0.0.1:8554/stream1,rtsp://10.0.0.1:8554/stream2"
    
    3. Play two rtsp streams, specify Names
            ./testRTSPClient --urls "Amcrest_1|rtsp://10.0.0.1:8554/stream1, Amcrest_2|rtsp://10.0.0.1:8554/stream2"
    
    4. Play two rtsp streams & log fps on console every 5 seconds
            ./testRTSPClient --fps --fps-interval 5 --urls "Amcrest_1|rtsp://10.0.0.1:8554/stream1, Amcrest_2|rtsp://10.0.0.1:8554/stream2"
    
    5. Play two rtsp streams & dump fps data in given csv file
            ./testRTSPClient --fps --csv-file /home/vst/fps_report.csv --urls "Amcrest_1|rtsp://10.0.0.1:8554/stream1, Amcrest_2|rtsp://10.0.0.1:8554/stream2"
    
    6. Play all rtsp streams from given vst endpoint & dump fps in csv file
            ./testRTSPClient --fps --csv-file /home/vst/fps_report.csv --vst-endpoint 10.0.0.1:30000
    
    7. Play 4 rtsp streams from given vst endpoint & dump fps in csv file
            ./testRTSPClient --fps --csv-file /home/vst/fps_report.csv --num-streams 4 --vst-endpoint 10.0.0.1:30000
    
    8. Play 4 rtsp streams from given vst rtsp-server endpoint & dump fps in csv file
            ./testRTSPClient --fps --csv-file /home/vst/fps_report.csv --num-streams 4 --vst-rtspserver-endpoint 10.0.0.1:31000
    
  • To measure WebRTC streams perf:

    # specify VIOS/NVStreamer endpoint to play the WebRTC streams
      ./testWebrtcClient --duration 600 --vst-endpoint 10.0.0.1:30000 --fps-interval 10 --num-streams 4 --csv-file ./fps_file.csv
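When many named streams have to be passed to testRTSPClient, assembling the --urls value by hand is error-prone. The following sketch builds that value in the "Name|rtsp://..." form shown in the examples above and prints a shell-quoted command line. The helper names `build_urls_arg` and `build_command` are hypothetical, and only flags listed in the usage text (--tcp, --fps, --csv-file, --urls) are used; the script prints the command rather than running it.

```python
import shlex


def build_urls_arg(streams):
    """Join (name, url) pairs into a single --urls value.

    A name of None or "" yields a bare URL; otherwise "name|url" is used,
    matching the named-stream examples in the usage text.
    """
    parts = []
    for name, url in streams:
        if not url.startswith("rtsp://"):
            raise ValueError(f"not an rtsp:// URL: {url}")
        parts.append(f"{name}|{url}" if name else url)
    return ", ".join(parts)


def build_command(streams, csv_file=None, tcp=False):
    """Return a shell-quoted testRTSPClient command line as a string."""
    args = ["./testRTSPClient"]
    if tcp:
        args.append("--tcp")
    if csv_file:
        # --fps enables FPS reporting; --csv-file selects where it is dumped.
        args += ["--fps", "--csv-file", csv_file]
    args += ["--urls", build_urls_arg(streams)]
    return shlex.join(args)


if __name__ == "__main__":
    streams = [
        ("Amcrest_1", "rtsp://10.0.0.1:8554/stream1"),
        ("Amcrest_2", "rtsp://10.0.0.1:8554/stream2"),
    ]
    print(build_command(streams, csv_file="/home/vst/fps_report.csv"))
```

The printed command can be pasted into a shell inside the VIOS container; quoting is handled by `shlex.join`, so stream names containing spaces do not split the argument.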