Video Summarization Performance#
Overview#
The Video Summarization microservice processes uploaded video files through RTVI-VLM for per-chunk vision inference and a summarization LLM (CA-RAG) for event aggregation. The tables below report mean end-to-end (E2E) latency for file summarization and maximum concurrent request capacity (burst) across video lengths. Use these numbers to size GPU count and topology for batch summarization workloads on H100, RTX Pro 6000 SE, and L40S platforms.
Test Configuration#
| Parameter | Value |
|---|---|
| VSS release | 3.2 |
| VLM | Cosmos Reason 2 8B (CR2-8B) |
| Model precision | FP8 |
| Summarization LLM | Nemotron 3 Nano (default Video Summarization stack) |
| Chunk duration | 10 seconds |
| Frames per chunk | 20 |
| Vision input tokens | 9k |
| Output max tokens | 100 |
| Test scenario | Warehouse safety monitoring |
| Test videos | 1, 10, 30, 60, 120, and 720 minute MP4 |
| GPU topologies | 4×4 (4 VLM + 4 LLM GPUs), 2×2 (2 VLM + 2 LLM GPUs), 1 (1 GPU, VLM and LLM colocated) |
| Platforms tested | H100, RTX Pro 6000 SE, L40S |
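The chunking parameters determine how much work a single video generates. The sketch below estimates that workload, assuming every chunk is exactly 10 seconds long and yields 20 frames. It also assumes the 9k vision input token figure applies per chunk, which the configuration table does not state explicitly. The function and constant names are illustrative and are not part of any VSS API.

```python
import math

CHUNK_SECONDS = 10               # chunk duration from the test configuration
FRAMES_PER_CHUNK = 20            # frames sampled per chunk
VISION_TOKENS_PER_CHUNK = 9_000  # assumption: the "9k" figure applies per chunk
MAX_OUTPUT_TOKENS = 100          # output token cap per VLM call


def estimate_workload(video_minutes: float) -> dict:
    """Estimate chunk count, frame count, and token volume for one video."""
    chunks = math.ceil(video_minutes * 60 / CHUNK_SECONDS)
    return {
        "chunks": chunks,
        "frames": chunks * FRAMES_PER_CHUNK,
        "vision_input_tokens": chunks * VISION_TOKENS_PER_CHUNK,
        "max_output_tokens": chunks * MAX_OUTPUT_TOKENS,
    }


if __name__ == "__main__":
    for minutes in (1, 10, 30, 60, 120, 720):
        print(minutes, estimate_workload(minutes))
```

Under these assumptions a 60-minute video produces 360 chunks and 7,200 frames, so the per-chunk VLM cost dominates as videos get longer.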
E2E Latency#
H100
Topology ``4×4`` (4 VLM + 4 LLM GPUs)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 5.16 | 20.1 | 52.9 | 64.7 | 119 |
720-minute E2E Avg: 615 s
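One way to read these latencies is as a real-time factor: seconds of video summarized per second of wall-clock time. The sketch below computes it from the 4×4 figures directly above. The helper name `realtime_factor` is illustrative, and the metric itself is a derived convenience rather than something the benchmark reports.

```python
# E2E averages in seconds, copied from the 4x4 table directly above,
# keyed by video length in minutes (including the 720-minute result).
E2E_AVG_S = {1: 5.16, 10: 20.1, 30: 52.9, 60: 64.7, 120: 119, 720: 615}


def realtime_factor(video_minutes: float, e2e_seconds: float) -> float:
    """Return seconds of video summarized per second of wall-clock time."""
    if e2e_seconds <= 0:
        raise ValueError("e2e_seconds must be positive")
    return video_minutes * 60 / e2e_seconds


if __name__ == "__main__":
    for minutes, latency in E2E_AVG_S.items():
        factor = realtime_factor(minutes, latency)
        print(f"{minutes:>4} min video: {factor:.1f}x real time")
```

A factor above 1 means the video is summarized faster than it plays back. The factor grows with video length in this table, which suggests that fixed per-request overhead is amortized over more chunks.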
Topology ``2×2`` (2 VLM + 2 LLM GPUs)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 5.70 | 23.3 | 59.7 | 128 | 205 |
720-minute E2E Avg: 1151 s
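When choosing between topologies, it can help to quantify how much latency improves from doubling the GPU count. The following sketch divides the 2×2 latencies by the 4×4 latencies from the two tables directly above. The variable and function names are illustrative only.

```python
VIDEO_MINUTES = [1, 10, 30, 60, 120, 720]

# E2E averages in seconds, copied from the two tables directly above.
E2E_4X4_S = [5.16, 20.1, 52.9, 64.7, 119, 615]
E2E_2X2_S = [5.70, 23.3, 59.7, 128, 205, 1151]


def speedups(slower: list[float], faster: list[float]) -> list[float]:
    """Return the element-wise latency ratio slower / faster."""
    if len(slower) != len(faster):
        raise ValueError("latency lists must have the same length")
    return [s / f for s, f in zip(slower, faster)]


if __name__ == "__main__":
    for minutes, ratio in zip(VIDEO_MINUTES, speedups(E2E_2X2_S, E2E_4X4_S)):
        print(f"{minutes:>4} min video: 4x4 is {ratio:.2f}x faster than 2x2")
```

On these figures the ratio stays near 1.1x for videos up to 30 minutes and rises to roughly 1.7x to 2.0x for videos of 60 minutes and longer, so the larger topology pays off mainly for long videos.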
Topology ``1`` (1 GPU — VLM and LLM colocated)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 12.1 | 50.7 | 117 | 265 | 438 |
RTX Pro 6000 SE
Topology ``4×4`` (4 VLM + 4 LLM GPUs)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 6.50 | 25.9 | 51.6 | 95.2 | 187 |
720-minute E2E Avg: 935 s
Topology ``2×2`` (2 VLM + 2 LLM GPUs)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 7.05 | 33.0 | 91.6 | 184 | 334 |
720-minute E2E Avg: 1805 s
Topology ``1`` (1 GPU — VLM and LLM colocated)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 17.6 | 71.7 | 170 | 318 | 623 |
L40S
Topology ``4×4`` (4 VLM + 4 LLM GPUs)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 8.99 | 39.1 | 91.4 | 164 | 330 |
720-minute E2E Avg: 1890 s
Topology ``2×2`` (2 VLM + 2 LLM GPUs)
| Video length | 1 min | 10 min | 30 min | 60 min | 120 min |
|---|---|---|---|---|---|
| E2E Avg (s) | 12.2 | 65.8 | 168 | 340 | 609 |
720-minute E2E Avg: 3957 s
Maximum Concurrency#
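Maximum number of concurrent summarization requests sustained at each target mean E2E latency. Each target equals the video duration (60 s for a 1-minute video, 300 s for 5 minutes, 600 s for 10 minutes). Test videos are 1-, 5-, and 10-minute MP4 files.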
H100
Topology ``4×4`` (4 VLM + 4 LLM GPUs)
| Video length | 1 min | 5 min | 10 min |
|---|---|---|---|
| Target avg latency (s) | 60 | 300 | 600 |
| Max concurrency | 73 | 114 | 125 |
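A concurrency limit like the ones above can be turned into a rough deployment count for a given burst size. The sketch below assumes that requests are split evenly across independent, identical deployments and that capacity scales linearly with the number of deployments. Neither assumption is measured here. The function name `deployments_needed` is illustrative.

```python
import math

# Max concurrency from the 4x4 table directly above,
# keyed by video length in minutes.
MAX_CONCURRENCY_4X4 = {1: 73, 5: 114, 10: 125}


def deployments_needed(peak_requests: int, max_concurrency: int) -> int:
    """Return how many identical deployments are needed to absorb a burst.

    Assumes even load balancing and linear scaling across deployments.
    """
    if max_concurrency <= 0:
        raise ValueError("max_concurrency must be positive")
    if peak_requests < 0:
        raise ValueError("peak_requests must be non-negative")
    return math.ceil(peak_requests / max_concurrency)


if __name__ == "__main__":
    burst = 200  # hypothetical burst of simultaneous 1-minute videos
    count = deployments_needed(burst, MAX_CONCURRENCY_4X4[1])
    print(f"{burst} concurrent 1-minute videos -> {count} deployments of 4x4")
```

For a hypothetical burst of 200 simultaneous 1-minute videos, this gives 3 deployments of the 4×4 topology, or 24 GPUs in total. Treat the result as a starting point and validate it with a load test on the target hardware.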
Topology ``2×2`` (2 VLM + 2 LLM GPUs)
| Video length | 1 min | 5 min | 10 min |
|---|---|---|---|
| Target avg latency (s) | 60 | 300 | 600 |
| Max concurrency | 42 | 65 | 72 |
RTX Pro 6000 SE
Topology ``4×4`` (4 VLM + 4 LLM GPUs)
| Video length | 1 min | 5 min | 10 min |
|---|---|---|---|
| Target avg latency (s) | 60 | 300 | 600 |
| Max concurrency | 41 | 71 | 79 |
Topology ``2×2`` (2 VLM + 2 LLM GPUs)
| Video length | 1 min | 5 min | 10 min |
|---|---|---|---|
| Target avg latency (s) | 60 | 300 | 600 |
| Max concurrency | 22 | 38 | 41 |
L40S
Topology ``4×4`` (4 VLM + 4 LLM GPUs)
| Video length | 1 min | 5 min | 10 min |
|---|---|---|---|
| Target avg latency (s) | 60 | 300 | 600 |
| Max concurrency | 15 | 36 | 41 |
Topology ``2×2`` (2 VLM + 2 LLM GPUs)
| Video length | 1 min | 5 min | 10 min |
|---|---|---|---|
| Target avg latency (s) | 60 | 300 | 600 |
| Max concurrency | 8 | 17 | 20 |
Note
All measurements use CR2-8B FP8 with Nemotron 3 Nano, 20 frames per 10-second chunk, 9k vision tokens, and max_tokens=100.