RT-VLM Performance

Overview

The Real-Time VLM (RT-VLM) microservice provides real-time video understanding for live RTSP streams and pre-recorded video files, generating captions and detecting incidents with a Vision Language Model (VLM). It uses the Cosmos Reason 2 (CR2-8B) model served via vLLM.

Video is processed in 10-second chunks: the microservice accumulates 80 frames per chunk at 448×448 resolution, producing 7,840 vision tokens per inference call. Benchmarks cover two operating modes and two output sequence lengths (OSL):

Streaming mode: the microservice reads live RTSP streams continuously. Latency is measured from when a chunk is ready until the VLM response is received.
File processing mode: pre-recorded video files are processed as fast as possible, since all frames are immediately available.

Output sequence length (OSL) determines the use case:

OSL = 1 (Alerting): the VLM outputs a single Yes/No token in response to a binary alert prompt, which gives lower latency and higher throughput.
OSL = 100 (Captioning): the VLM generates a descriptive caption of the video chunk. The higher output token count leads to longer latency and lower throughput.

Figure: Max concurrent live RTSP streams per GPU for the alerting and captioning use cases. Alerting (OSL=1) consistently supports more concurrent streams than captioning (OSL=100) due to its single-token output. H100 and RTX Pro 6000 SE are the recommended platforms for multi-stream deployments.

Test Configuration

Parameter	Value
VSS Release	3.1
Model	Cosmos Reason 2 (CR2-8B)
Model precision	FP8
Inference engine	vLLM
Chunk duration	10 seconds
Frames per chunk	80
Image resolution	448×448
Vision tokens per chunk	7,840
Input sequence length (ISL)	56 text tokens
GPUs tested	H100, RTX Pro 6000 SE, DGX Spark, AGX Thor
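
The per-call input size follows directly from the values in this table. The sketch below only restates that arithmetic. The variable names are illustrative and not part of any RT-VLM API, and it assumes the text prompt and the vision tokens are both counted as model input for a single inference call.

```python
# Values taken from the Test Configuration table.
CHUNK_SECONDS = 10
FRAMES_PER_CHUNK = 80
VISION_TOKENS_PER_CHUNK = 7_840
TEXT_TOKENS_ISL = 56

# Sampling rate implied by 80 frames per 10-second chunk.
frames_per_second = FRAMES_PER_CHUNK / CHUNK_SECONDS

# Average vision tokens contributed by each sampled frame.
vision_tokens_per_frame = VISION_TOKENS_PER_CHUNK / FRAMES_PER_CHUNK

# Assumption: one inference call carries the vision tokens plus the text prompt.
input_tokens_per_call = VISION_TOKENS_PER_CHUNK + TEXT_TOKENS_ISL

print(f"frames per second:       {frames_per_second}")
print(f"vision tokens per frame: {vision_tokens_per_frame}")
print(f"input tokens per call:   {input_tokens_per_call}")
```

Under these assumptions each call processes roughly 7.9K input tokens, so prefill of the vision tokens is likely to dominate the work per chunk, especially at OSL = 1.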

Performance by GPU

H100

Streaming Mode

Max Concurrent Streams

Use Case	Max Streams	Chunk E2E Avg (s)	p90 (s)	p95 (s)	GPU Core (%)	GPU Mem (%)
Alerting (OSL=1)	51	3.6	4.86	4.95	94.8	86.8
Captioning (OSL=100)	33	4.78	6.31	6.44	83.8	83.9
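
To put the stream counts in terms of model load, the sketch below estimates the aggregate input token rate at the H100 maximums. It assumes every stream submits one chunk per 10 seconds and that each call carries 7,840 vision tokens plus 56 text tokens. The function name is hypothetical and used here for illustration only.

```python
CHUNK_SECONDS = 10
INPUT_TOKENS_PER_CALL = 7_840 + 56  # vision tokens + text prompt tokens

def input_tokens_per_second(streams: int) -> float:
    """Aggregate input tokens per second when every stream sends one chunk per interval."""
    return streams * INPUT_TOKENS_PER_CALL / CHUNK_SECONDS

# Max concurrent streams on H100, from the table above.
h100_max_streams = {"Alerting (OSL=1)": 51, "Captioning (OSL=100)": 33}

for use_case, streams in h100_max_streams.items():
    rate = input_tokens_per_second(streams)
    print(f"{use_case:<22} {streams:>3} streams -> {rate:,.1f} input tokens/s")
```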

Chunk Latency vs. Concurrent Streams

Use Case	Concurrent Streams	Chunk E2E Avg (s)	GPU Core Avg (%)
Alerting (OSL=1)	1	0.58	0.4
Alerting (OSL=1)	10	1.10	21.0
Alerting (OSL=1)	20	1.33	30.8
Captioning (OSL=100)	1	1.23	8.2
Captioning (OSL=100)	10	2.21	31.2
Captioning (OSL=100)	20	3.72	52.2

File Processing Mode

Video File Latency (Concurrency = 1)

Use Case	Video Length	E2E Latency (s)
Alerting (OSL=1)	10 s	0.67
Alerting (OSL=1)	10 min	18.1
Alerting (OSL=1)	60 min	108.0
Captioning (OSL=100)	10 s	1.31
Captioning (OSL=100)	10 min	22.0
Captioning (OSL=100)	60 min	131.1
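
One way to read these latencies is as a speedup over real time: video duration divided by end-to-end latency. The sketch below computes that ratio for the H100 rows above. The helper name `realtime_factor` is illustrative, not part of the microservice.

```python
def realtime_factor(video_seconds: float, e2e_latency_seconds: float) -> float:
    """How many seconds of video are processed per second of wall-clock time."""
    return video_seconds / e2e_latency_seconds

# (use case, video length in seconds, E2E latency in seconds) from the H100 table above.
H100_FILE_LATENCY = [
    ("Alerting (OSL=1)", 10, 0.67),
    ("Alerting (OSL=1)", 600, 18.1),
    ("Alerting (OSL=1)", 3600, 108.0),
    ("Captioning (OSL=100)", 10, 1.31),
    ("Captioning (OSL=100)", 600, 22.0),
    ("Captioning (OSL=100)", 3600, 131.1),
]

for use_case, video_s, latency_s in H100_FILE_LATENCY:
    factor = realtime_factor(video_s, latency_s)
    print(f"{use_case:<22} {video_s:>5} s video -> {factor:5.1f}x real time")
```

For the 10-minute and 60-minute files this works out to roughly 33x real time for alerting and roughly 27x for captioning.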

File Processing Throughput — Alerting (OSL = 1)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)	p90 (s)	p95 (s)
1	0.61	1.64	0.74	0.74
2	0.60	3.33	0.75	0.75
4	0.67	5.97	0.80	0.81
8	0.72	11.11	1.07	1.14
16	1.12	14.29	1.75	1.76
32	1.58	20.25	2.38	2.44

File Processing Throughput — Captioning (OSL = 100)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)	p90 (s)	p95 (s)
1	1.13	0.88	1.35	1.35
2	1.19	1.68	1.43	1.43
4	1.25	3.20	1.53	1.54
8	1.69	4.73	1.67	1.68
16	1.92	8.33	2.33	2.34
32	2.65	12.08	3.69	3.71
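
In both H100 throughput tables, the throughput column matches concurrency divided by average E2E latency to within rounding. This is an observation about the published numbers, not a documented definition of the metric. The sketch below checks it row by row.

```python
# (concurrency, avg E2E latency in s, reported throughput in req/s) from the H100 tables.
H100_ALERTING = [
    (1, 0.61, 1.64), (2, 0.60, 3.33), (4, 0.67, 5.97),
    (8, 0.72, 11.11), (16, 1.12, 14.29), (32, 1.58, 20.25),
]
H100_CAPTIONING = [
    (1, 1.13, 0.88), (2, 1.19, 1.68), (4, 1.25, 3.20),
    (8, 1.69, 4.73), (16, 1.92, 8.33), (32, 2.65, 12.08),
]

def estimated_throughput(concurrency: int, avg_latency_s: float) -> float:
    """Requests per second if `concurrency` requests are always in flight."""
    return concurrency / avg_latency_s

# Largest gap between the estimate and the reported value, per table.
max_abs_error = {}
for name, rows in (("alerting", H100_ALERTING), ("captioning", H100_CAPTIONING)):
    errors = [abs(estimated_throughput(c, lat) - reported) for c, lat, reported in rows]
    max_abs_error[name] = max(errors)
    print(f"{name}: max |estimate - reported| = {max_abs_error[name]:.4f} req/s")
```

If the relationship holds for a deployment, the same division gives a quick throughput estimate from a measured average latency at a given concurrency.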

RTX Pro 6000 SE

Streaming Mode

Max Concurrent Streams

Use Case	Max Streams	Chunk E2E Avg (s)	p90 (s)	p95 (s)	GPU Core (%)	GPU Mem (%)
Alerting (OSL=1)	33	3.4	4.61	4.61	94.0	81.1
Captioning (OSL=100)	18	6.4	8.02	8.46	93.6	78.8

Chunk Latency vs. Concurrent Streams

Use Case	Concurrent Streams	Chunk E2E Avg (s)	GPU Core Avg (%)
Alerting (OSL=1)	1	0.73	6.2
Alerting (OSL=1)	10	1.48	32.1
Alerting (OSL=1)	20	2.07	56.0
Captioning (OSL=100)	1	2.23	17.5
Captioning (OSL=100)	10	4.69	54.3
Captioning (OSL=100)	20	9.42	74.2

File Processing Mode

Video File Latency (Concurrency = 1)

Use Case	Video Length	E2E Latency (s)
Alerting (OSL=1)	10 s	0.44
Alerting (OSL=1)	10 min	27.6
Alerting (OSL=1)	60 min	164.5
Captioning (OSL=100)	10 s	1.93
Captioning (OSL=100)	10 min	35.8
Captioning (OSL=100)	60 min	219.0

File Processing Throughput — Alerting (OSL = 1)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)	p90 (s)	p95 (s)
1	0.46	2.17	0.52	0.52
2	0.37	5.41	0.54	0.54
4	0.46	8.70	0.80	0.82
8	0.41	19.51	0.52	0.53
16	0.61	26.23	1.02	1.03
32	1.10	29.09	1.50	1.53

File Processing Throughput — Captioning (OSL = 100)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)	p90 (s)	p95 (s)
1	1.76	0.57	1.98	1.98
2	1.88	1.06	2.09	2.09
4	1.98	2.02	1.95	1.96
8	2.22	3.60	2.43	2.43
16	2.77	5.78	3.01	3.07
32	3.92	8.16	4.32	4.34

DGX Spark

Streaming Mode

Max Concurrent Streams

Use Case	Max Streams	Chunk E2E Avg (s)	p90 (s)	p95 (s)	GPU Core (%)
Alerting (OSL=1)	2	4.08	4.24	4.24	90.9
Captioning (OSL=100)	1	8.61	8.64	8.65	89.2

Chunk Latency vs. Concurrent Streams

Use Case	Concurrent Streams	Chunk E2E Avg (s)
Alerting (OSL=1)	1	4.03
Alerting (OSL=1)	2	6.92
Captioning (OSL=100)	1	8.62
Captioning (OSL=100)	2	27.14
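
Each stream produces a new chunk every 10 seconds, so a rough sanity check is whether the average chunk latency stays below the chunk duration. This criterion is an assumption made for illustration; the source does not state how the maximum stream counts were determined. The sketch applies it to the DGX Spark rows above.

```python
CHUNK_SECONDS = 10.0

# (use case, concurrent streams, avg chunk E2E latency in s) from the DGX Spark table above.
DGX_SPARK_STREAMING = [
    ("Alerting", 1, 4.03),
    ("Alerting", 2, 6.92),
    ("Captioning", 1, 8.62),
    ("Captioning", 2, 27.14),
]

def keeps_up(avg_chunk_latency_s: float, chunk_seconds: float = CHUNK_SECONDS) -> bool:
    """Assumed criterion: a chunk finishes before the next one from the same stream is ready."""
    return avg_chunk_latency_s < chunk_seconds

results = {
    (use_case, streams): keeps_up(latency_s)
    for use_case, streams, latency_s in DGX_SPARK_STREAMING
}

for (use_case, streams), ok in results.items():
    print(f"{use_case:<11} {streams} stream(s): {'keeps up' if ok else 'falls behind'}")
```

Under this assumed criterion, only captioning with 2 streams falls behind, which is consistent with the reported maximums of 2 streams for alerting and 1 for captioning.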

File Processing Mode

Video File Latency (Concurrency = 1)

Use Case	Video Length	E2E Latency (s)
Alerting (OSL=1)	10 s	1.63
Alerting (OSL=1)	10 min	176.3
Captioning (OSL=100)	10 s	6.04
Captioning (OSL=100)	10 min	218.1

File Processing Throughput — Alerting (OSL = 1)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)
1	1.63	0.61
2	1.15	1.74
4	1.31	3.05
8	1.96	4.08

File Processing Throughput — Captioning (OSL = 100)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)
1	6.04	0.17
2	5.28	0.38
4	7.37	0.54
8	10.36	0.77

AGX Thor

Streaming Mode

Max Concurrent Streams

Use Case	Max Streams	Chunk E2E Avg (s)	p90 (s)	p95 (s)	GPU Core (%)
Alerting (OSL=1)	3	6.52	8.12	8.14	81.1
Captioning (OSL=100)	1	7.62	7.65	7.65	76.6

Chunk Latency vs. Concurrent Streams

Use Case	Concurrent Streams	Chunk E2E Avg (s)
Alerting (OSL=1)	1	3.22
Alerting (OSL=1)	2	5.39
Captioning (OSL=100)	1	7.19
Captioning (OSL=100)	2	15.70

File Processing Mode

Video File Latency (Concurrency = 1)

Use Case	Video Length	E2E Latency (s)
Alerting (OSL=1)	10 s	0.37
Alerting (OSL=1)	10 min	179.6
Captioning (OSL=100)	10 s	6.72
Captioning (OSL=100)	10 min	240.9

File Processing Throughput — Alerting (OSL = 1)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)
1	0.37	2.70
2	0.47	4.26
4	0.72	5.56
8	0.99	8.08

File Processing Throughput — Captioning (OSL = 100)

Concurrency	E2E Latency Avg (s)	Throughput (req/s)
1	6.72	0.15
2	7.34	0.27
4	9.40	0.43
8	9.22	0.87

Note

All benchmarks use CR2-8B, FP8, vLLM, 10-second chunks, 80 frames per chunk, 448×448 resolution, 7,840 vision tokens, and ISL=56 text tokens. For streaming deployments, plan for 10–15% headroom below the maximum concurrent stream counts. GPU Memory Utilization is not available for DGX Spark and AGX Thor. p90/p95 latency is not available for DGX Spark and AGX Thor in file processing mode.
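
The headroom guidance can be turned into a provisioning number with simple arithmetic. The sketch below reserves a percentage of the measured maximum and rounds down. The function name and the choice to round down are assumptions for illustration, not a documented sizing rule. It covers only H100 and RTX Pro 6000 SE, the platforms recommended above for multi-stream deployments.

```python
def planned_streams(max_streams: int, headroom_percent: int) -> int:
    """Streams to provision after reserving `headroom_percent` of the measured maximum."""
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    # Integer arithmetic avoids floating-point rounding surprises; result is rounded down.
    return max_streams * (100 - headroom_percent) // 100

# Max concurrent streams from the streaming-mode tables: (alerting, captioning).
MAX_STREAMS = {
    "H100": (51, 33),
    "RTX Pro 6000 SE": (33, 18),
}

for gpu, (alerting, captioning) in MAX_STREAMS.items():
    for pct in (10, 15):
        print(
            f"{gpu:<16} {pct}% headroom -> "
            f"alerting {planned_streams(alerting, pct)}, "
            f"captioning {planned_streams(captioning, pct)}"
        )
```

With 15% headroom, for example, an H100 would be planned at 43 alerting streams or 28 captioning streams instead of the measured 51 and 33.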