# Performance
## Request Type Selection
Choose the right request type for your workload:
request_type: "query": Best for the following:Single items (1 items)
Low-latency applications requiring immediate responses
Interactive applications
Sending video data as part of the request
request_type: "bulk_text": Best for the following:Large batches of text-only inputs (up to 64 strings)
High-throughput text processing
When you have many text captions to process at once
request_type: "bulk_video": Best for the following:Large batches of video-only inputs (up to 64 videos)
Maximum throughput for video processing
Batch processing workflows
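As a rough illustration, the sketch below sends a single `query` request and a `bulk_text` request. The endpoint path and payload field names (`text`, `embeddings`) are assumptions for illustration, not the service's confirmed schema; adapt them to your deployment.

```python
import requests

BASE_URL = "http://localhost:8000/v1/embeddings"  # assumed local deployment

# Low-latency single-item request: one caption, immediate response.
query_payload = {
    "request_type": "query",
    "text": ["a dog catching a frisbee"],
}

# High-throughput text batch: up to 64 strings in one round trip.
bulk_payload = {
    "request_type": "bulk_text",
    "text": [f"caption {i}" for i in range(64)],
}

for payload in (query_payload, bulk_payload):
    response = requests.post(BASE_URL, json=payload, timeout=60)
    response.raise_for_status()
    print(payload["request_type"], len(response.json().get("embeddings", [])))
```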
## Bulk Video Throughput Performance
The following benchmarks were conducted on a pipelined build with load-balanced queues, where decoding (pyNvVideoCodec) runs in parallel with CV-CUDA pre-processing and TensorRT FP16 inference. The setup uses best-effort load balancing, MPS (Multi-Process Service), and IPC (Inter-Process Communication), with 2 downloader, 11 decoder, 2 preprocessor, and 2 embedder processes.
A100 GPU:

- Configuration: 15-second, 1080p videos in `bulk_video` mode with a request size of 64
- Throughput: 16 videos/second

H100 GPU:

- Configuration: 15-second, 1080p videos in `bulk_video` mode with a request size of 64
- Throughput: 28.57 videos/second
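As a back-of-the-envelope check, per-request wall time follows directly from the request size and the measured throughput above; the snippet below only rearranges those published numbers.

```python
# Estimate wall time per bulk_video request from the benchmark throughput.
REQUEST_SIZE = 64  # videos per bulk_video request

throughput = {"A100": 16.0, "H100": 28.57}  # videos/second, from above

for gpu, videos_per_second in throughput.items():
    print(f"{gpu}: ~{REQUEST_SIZE / videos_per_second:.1f} s per request")
# A100: ~4.0 s per request
# H100: ~2.2 s per request
```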
### Additional GPUs
Expect roughly proportional scaling with decode and compute capacity across other supported GPUs (e.g., L40S, L4, H20, L20). For GPUs with less memory, reduce batch/request sizes to avoid memory pressure.
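One simple way to reduce memory pressure client-side is to cap the number of items per bulk request for smaller GPUs. A minimal sketch follows; the per-GPU caps are illustrative assumptions, not published limits.

```python
# Split a large workload into smaller bulk requests. The caps below are
# illustrative assumptions, not measured per-GPU limits.
MAX_ITEMS_PER_REQUEST = {"H100": 64, "A100": 64, "L4": 16}

def chunk(items, gpu):
    size = MAX_ITEMS_PER_REQUEST.get(gpu, 32)
    for start in range(0, len(items), size):
        yield items[start:start + size]

videos = [f"video_{i}.mp4" for i in range(100)]
for batch in chunk(videos, gpu="L4"):
    print(len(batch))  # 16, 16, 16, 16, 16, 16, 4
```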
## Latency Performance
The following table shows representative single-embedding latency (in seconds) for text and video with `request_type: "query"`, measured on MSRVTT clips.
| GPU | Video (s) | Text (s) |
|---|---|---|
| NVIDIA H100 80GB HBM3 | 0.0431 | 0.0058 |
| NVIDIA H100 NVL | 0.0433 | 0.0055 |
| NVIDIA H100 PCIe | 0.0446 | 0.0058 |
| NVIDIA L40S | 0.0511 | 0.0061 |
| NVIDIA A100 80GB PCIe | 0.0563 | 0.0064 |
| NVIDIA A100-SXM4-80GB | 0.0613 | 0.0069 |
| NVIDIA A100-SXM4-40GB | 0.0672 | 0.0074 |
| NVIDIA A100-PCIE-40GB | 0.0683 | 0.0093 |
| NVIDIA H20 | 0.0705 | 0.0077 |
| NVIDIA L4 | 0.1096 | 0.0085 |
| NVIDIA L20 | 0.0738 | 0.0064 |
| NVIDIA RTX 6000 Ada | 0.0527 | 0.0067 |
## Resource Management
### Memory Optimization

- Monitor GPU memory usage with larger batch sizes.
- The service automatically scales workers based on available resources.
- Use `CUDA_VISIBLE_DEVICES` to control GPU allocation.
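For example, you can pin the service to specific GPUs by setting `CUDA_VISIBLE_DEVICES` in the environment before launching it. A minimal sketch, where `serve.py` is a placeholder for whatever command starts the service:

```python
# Launch the service pinned to GPUs 0 and 1; inside the process CUDA
# renumbers them as devices 0 and 1. "serve.py" is a hypothetical launcher.
import os
import subprocess

env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")
subprocess.run(["python", "serve.py"], env=env, check=True)
```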
### Concurrency

- The service handles multiple concurrent requests efficiently.
- For sustained high throughput, maintain a consistent request rate.
- Consider connection pooling for client applications.
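A `requests.Session` reuses TCP connections across calls, which gives a minimal form of client-side connection pooling. The endpoint and payload fields are the same assumed schema as in the earlier sketch.

```python
# Reuse one HTTP connection pool across many requests instead of opening
# a new connection per call. Endpoint and payload fields are assumptions.
import requests

session = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_connections=4, pool_maxsize=4)
session.mount("http://", adapter)

for caption in ["a red car", "a city at night", "waves on a beach"]:
    response = session.post(
        "http://localhost:8000/v1/embeddings",
        json={"request_type": "query", "text": [caption]},
        timeout=60,
    )
    response.raise_for_status()
```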
### Error Handling

- The service includes automatic retry logic for transient failures.
- Monitor the `X-Restart-Count` header for service health.
- Implement exponential backoff in client code for failed requests.
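A minimal client-side sketch of both recommendations; the retry budget and base delay are arbitrary choices, and the endpoint is the same assumed URL as above.

```python
# Retry failed requests with exponential backoff and surface the service's
# X-Restart-Count health header. Endpoint and payload fields are assumptions.
import time
import requests

def embed_with_retry(payload, url="http://localhost:8000/v1/embeddings",
                     max_retries=5, base_delay=0.5):
    for attempt in range(max_retries):
        try:
            response = requests.post(url, json=payload, timeout=60)
            response.raise_for_status()
            restarts = response.headers.get("X-Restart-Count")
            if restarts is not None:
                print(f"service restart count: {restarts}")
            return response.json()
        except requests.RequestException:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, 4s, ...
```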