# Performance

## Request Type Selection
Choose the right request type for your workload (see the client sketch after this list):

`request_type: "query"` is best for:

- Single items (one item per request)
- Low-latency applications requiring immediate responses
- Interactive applications
- Sending video data as part of the request

`request_type: "bulk_text"` is best for:

- Large batches of text-only inputs (up to 64 strings)
- High-throughput text processing
- Processing many text captions at once

`request_type: "bulk_video"` is best for:

- Large batches of video-only inputs (up to 64 videos)
- Maximum throughput for video processing
- Batch processing workflows
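As a rough sketch, a client might select the request type per workload as follows. The endpoint URL and payload fields here are illustrative assumptions, not the service's documented API; consult the API reference for the exact schema.

```python
import requests

# Hypothetical embedding endpoint; adjust to your deployment.
EMBED_URL = "http://localhost:8000/v1/embeddings"

# Low-latency, single-item request: use "query".
single = requests.post(EMBED_URL, json={
    "request_type": "query",
    "input": ["a cat chasing a laser pointer"],
})

# High-throughput text batch (up to 64 strings): use "bulk_text".
captions = [f"caption {i}" for i in range(64)]
batch = requests.post(EMBED_URL, json={
    "request_type": "bulk_text",
    "input": captions,
})

print(single.status_code, batch.status_code)
```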
## Bulk Video Throughput Performance
The following benchmarks were conducted with a pipelined configuration using load-balanced queues, in which decoding (pyNvVideoCodec) runs in parallel with CV-CUDA preprocessing and TensorRT FP16 inference. The setup uses best-effort load balancing, MPS (Multi-Process Service), and IPC (inter-process communication), with 2 downloader, 11 decoder, 2 preprocessor, and 2 embedder processes.
**A100 GPU:**

- Configuration: 15-second, 1080p videos in `bulk_video` mode with a request size of 64
- Throughput: 16 videos/second

**H100 GPU:**

- Configuration: 15-second, 1080p videos in `bulk_video` mode with a request size of 64
- Throughput: 28.57 videos/second
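At these rates, a full 64-video request completes in roughly 64 / 16 = 4 seconds end to end on the A100, and about 64 / 28.57 ≈ 2.2 seconds on the H100.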
### Additional GPUs
Expect roughly proportional scaling based on decode and compute capacity across other supported GPUs (for example, L40S, L4, H20, and L20). For GPUs with less memory, reduce batch and request sizes to avoid memory pressure, as in the sketch below.
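A minimal sketch of capping request size on lower-memory GPUs; the batch limit of 16 is an illustrative assumption to tune against your GPU's memory headroom.

```python
from typing import Iterable, List

def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield successive fixed-size chunks from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# On a lower-memory GPU, send smaller requests instead of the 64-item maximum.
MAX_REQUEST_SIZE = 16  # illustrative cap; tune to your GPU

videos = [f"video_{i}.mp4" for i in range(100)]
for batch in chunked(videos, MAX_REQUEST_SIZE):
    ...  # submit each batch as its own bulk_video request
```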
## Latency Performance

The following table shows representative single-embedding latency in seconds for `request_type: "query"` with text and video inputs, measured on MSRVTT clips.
| GPU | Video (s) | Text (s) |
|---|---|---|
| NVIDIA H100 80GB HBM3 | 0.0431 | 0.0058 |
| NVIDIA H100 NVL | 0.0433 | 0.0055 |
| NVIDIA H100 PCIe | 0.0446 | 0.0058 |
| NVIDIA L40S | 0.0511 | 0.0061 |
| NVIDIA A100 80GB PCIe | 0.0563 | 0.0064 |
| NVIDIA A100-SXM4-80GB | 0.0613 | 0.0069 |
| NVIDIA A100-SXM4-40GB | 0.0672 | 0.0074 |
| NVIDIA A100-PCIE-40GB | 0.0683 | 0.0093 |
| NVIDIA H20 | 0.0705 | 0.0077 |
| NVIDIA L4 | 0.1096 | 0.0085 |
| NVIDIA L20 | 0.0738 | 0.0064 |
| NVIDIA RTX 6000 Ada | 0.0527 | 0.0067 |
## Resource Management

**Memory Optimization**

- Monitor GPU memory usage when using larger batch sizes.
- The service automatically scales workers based on available resources.
- Use `CUDA_VISIBLE_DEVICES` to control GPU allocation, as in the sketch below.
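For example, to pin the service to specific GPUs, set `CUDA_VISIBLE_DEVICES` in the service's environment before launch. The launch command below is a placeholder; only the environment-variable handling is the point.

```python
import os
import subprocess

# Restrict the process to GPUs 0 and 1; CUDA enumerates only these devices.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")

# Placeholder command; substitute your actual service entrypoint.
subprocess.run(["python", "-m", "my_embedding_service"], env=env, check=True)
```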
**Concurrency**

- The service handles multiple concurrent requests efficiently.
- For sustained high throughput, maintain a consistent request rate.
- Consider connection pooling for client applications, as in the sketch below.
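As a sketch, a `requests.Session` pools and reuses TCP connections across requests to the same host, which helps sustain a steady request rate; the endpoint URL and payload are the same illustrative assumptions as above.

```python
import requests

session = requests.Session()  # pools and reuses connections per host

EMBED_URL = "http://localhost:8000/v1/embeddings"  # hypothetical endpoint

payloads = [{"request_type": "query", "input": [f"caption {i}"]} for i in range(100)]
for payload in payloads:
    resp = session.post(EMBED_URL, json=payload)
    resp.raise_for_status()
```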
**Error Handling**

- The service includes automatic retry logic for transient failures.
- Monitor the `X-Restart-Count` header for service health.
- Implement exponential backoff in client code for failed requests, as in the sketch below.
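A minimal backoff sketch: it retries transient failures with exponentially increasing waits and surfaces the `X-Restart-Count` header when present. The endpoint and payload shape remain assumptions.

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, max_attempts: int = 5) -> requests.Response:
    """POST with exponential backoff on transient failures."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(url, json=payload, timeout=60)
            # Log the restart count as a rough service-health signal.
            restarts = resp.headers.get("X-Restart-Count")
            if restarts is not None:
                print(f"service restarts so far: {restarts}")
            # Treat 5xx as transient and retry; return anything else.
            if resp.status_code < 500:
                return resp
        except requests.RequestException as exc:
            print(f"attempt {attempt} failed: {exc}")
        time.sleep(delay)
        delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"request failed after {max_attempts} attempts")
```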