RT-Embed Performance#

Overview#

The Real-Time Embedding (RT-Embedding) microservice provides video and text embedding generation capabilities using the Cosmos-Embed1 model, enabling video search and similarity matching. Benchmarks cover two operating modes:

Streaming mode — the microservice reads live RTSP streams continuously. Every 10 seconds of video, it samples 8 frames and produces one video embedding vector. Latency is measured from when a 10-second chunk is ready until the embedding is returned.
Non-streaming mode — the microservice handles discrete embedding requests for video clips or text queries. Benchmarks measure throughput (requests per second) and latency across varying concurrency levels.

Each video embedding encodes 8 frames sampled from a 10-second chunk scaled to 448×448 resolution into a 768-dimensional vector. Text embeddings encode up to 128 tokens into the same 768-dimensional space, enabling cross-modal search.

RT-Embedding video embedding throughput vs concurrency for H100 and RTX Pro 6000 SE — Video embedding throughput (req/s) vs. concurrent requests. The H100 reaches up to 16 req/s while the RTX Pro 6000 SE peaks near 15 req/s; the RTX Pro 6000 SE reaches peak throughput at lower concurrency and saturates earlier, while the H100 sustains throughput through higher concurrency levels.#

Test Configuration#

Parameter	Value
VSS Release	3.1
Model	Cosmos-Embed1-448p
Model precision	FP16
Inference engine	TensorRT (TRT)
Frames per chunk (video)	8
Image resolution	448×448
Embedding vector length	768
Chunk duration	10 seconds
Text ISL (text embedding)	20 tokens
GPUs tested	H100, RTX Pro 6000 SE

Performance by GPU#

H100

Streaming Mode

Max Concurrent Streams

Max Streams	Avg Latency (s)	p90 (s)	p95 (s)	Max (s)	GPU Core (%)	NVdec (%)
170	3.00	6.76	8.62	12.89	66.3	93.9

Stream Latency Profile

Concurrent Streams	Avg (s)	p90 (s)	p95 (s)	Min (s)	Max (s)	GPU Core (%)	NVdec (%)
1	0.13	0.13	0.19	0.08	0.32	0.0	0.6
16	0.20	0.20	0.25	0.08	0.29	4.0	6.4
32	0.24	0.24	0.31	0.08	0.34	8.0	11.7

Request Mode

Video Embedding Throughput

Concurrent Requests	Avg Latency (s)	Throughput (req/s)
1	0.41	2.41
2	0.42	4.68
4	0.44	8.74
8	0.50	14.79
16	0.94	15.74
32	1.81	16.28

Text Embedding Throughput (ISL = 20 tokens)

Concurrent Requests	Avg Latency (s)	Throughput (req/s)	p90 (s)	p95 (s)
1	0.007	132.1	0.007	0.008
2	0.008	216.7	0.009	0.011
4	0.010	368.0	0.010	0.012
8	0.013	563.6	0.015	0.016
16	0.018	800.5	0.023	0.023
32	0.027	1044.3	0.031	0.031

Video File Embedding Latency (Concurrency = 1)

Video Duration	E2E Latency (s)	Avg Chunk Inference (s)	Decode Latency (s)	GPU Core (%)	NVdec (%)
10 s	0.412	0.039	0.369	0.0	0.0
10 min	5.203	0.039	0.615	48.9	72.6
60 min	28.8	0.039	0.601	54.3	87.6

RTX Pro 6000 SE