# Performance

## Request Type Selection
Choose the right request type for your workload (see the client sketch after this list):

`request_type: "query"` is best for:

- Single items (one item per request)
- Low-latency applications requiring immediate responses
- Interactive applications
- Sending video data as part of the request

`request_type: "bulk_text"` is best for:

- Large batches of text-only inputs (up to 64 strings)
- High-throughput text processing
- Processing many text captions at once

`request_type: "bulk_video"` is best for:

- Large batches of video-only inputs (up to 64 videos)
- Maximum throughput for video processing
- Batch processing workflows
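As a rough sketch, a client might select the request type per workload as follows. The endpoint URL and payload fields here are illustrative assumptions, not the service's documented API; consult the API reference for the exact schema.

```python
import requests

# Hypothetical embedding endpoint; adjust to your deployment.
EMBED_URL = "http://localhost:8000/v1/embeddings"

# Low-latency, single-item request: use "query".
single = requests.post(EMBED_URL, json={
    "request_type": "query",
    "input": ["a cat chasing a laser pointer"],
})

# High-throughput text batch (up to 64 strings): use "bulk_text".
captions = [f"caption {i}" for i in range(64)]
batch = requests.post(EMBED_URL, json={
    "request_type": "bulk_text",
    "input": captions,
})

print(single.status_code, batch.status_code)
```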
## Bulk Video Throughput Performance
The following benchmarks were conducted with a pipelined configuration using load-balanced queues, in which decoding (pyNvVideoCodec) runs in parallel with CV-CUDA preprocessing and TensorRT FP16 inference. The setup uses best-effort load balancing, MPS (Multi-Process Service), and IPC (inter-process communication), with 2 downloader, 11 decoder, 2 preprocessor, and 2 embedder processes.
**A100 GPU:**

- Configuration: 15-second, 1080p videos in `bulk_video` mode with a request size of 64
- Throughput: 16 videos/second

**H100 GPU:**

- Configuration: 15-second, 1080p videos in `bulk_video` mode with a request size of 64
- Throughput: 28.57 videos/second
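At these rates, a full 64-video request completes in roughly 64 / 16 = 4 seconds end to end on the A100, and about 64 / 28.57 ≈ 2.2 seconds on the H100.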
### Additional GPUs
Expect roughly proportional scaling based on decode and compute capacity across other supported GPUs (for example, L40S, L4, H20, and L20). For GPUs with less memory, reduce batch and request sizes to avoid memory pressure, as in the sketch below.
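A minimal sketch of capping request size on lower-memory GPUs; the batch limit of 16 is an illustrative assumption to tune against your GPU's memory headroom.

```python
from typing import Iterable, List

def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield successive fixed-size chunks from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# On a lower-memory GPU, send smaller requests instead of the 64-item maximum.
MAX_REQUEST_SIZE = 16  # illustrative cap; tune to your GPU

videos = [f"video_{i}.mp4" for i in range(100)]
for batch in chunked(videos, MAX_REQUEST_SIZE):
    ...  # submit each batch as its own bulk_video request
```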
## Latency Performance

The following table shows representative single-embedding latency in seconds for `request_type: "query"` with text and video inputs, measured on MSRVTT clips.
| GPU | Video (s) | Text (s) |
|---|---|---|
| NVIDIA H100 80GB HBM3 | 0.0431 | 0.0058 |
| NVIDIA H100 NVL | 0.0433 | 0.0055 |
| NVIDIA H100 PCIe | 0.0446 | 0.0058 |
| NVIDIA L40S | 0.0511 | 0.0061 |
| NVIDIA A100 80GB PCIe | 0.0563 | 0.0064 |
| NVIDIA A100-SXM4-80GB | 0.0613 | 0.0069 |
| NVIDIA A100-SXM4-40GB | 0.0672 | 0.0074 |
| NVIDIA A100-PCIE-40GB | 0.0683 | 0.0093 |
| NVIDIA H20 | 0.0705 | 0.0077 |
| NVIDIA L4 | 0.1096 | 0.0085 |
| NVIDIA L20 | 0.0738 | 0.0064 |
| NVIDIA RTX 6000 Ada | 0.0527 | 0.0067 |
## Resource Management

**Memory Optimization**

- Monitor GPU memory usage when using larger batch sizes.
- The service automatically scales workers based on available resources.
- Use `CUDA_VISIBLE_DEVICES` to control GPU allocation, as in the sketch below.
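For example, to pin the service to specific GPUs, set `CUDA_VISIBLE_DEVICES` in the service's environment before launch. The launch command below is a placeholder; only the environment-variable handling is the point.

```python
import os
import subprocess

# Restrict the process to GPUs 0 and 1; CUDA enumerates only these devices.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="0,1")

# Placeholder command; substitute your actual service entrypoint.
subprocess.run(["python", "-m", "my_embedding_service"], env=env, check=True)
```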
**Concurrency**

- The service handles multiple concurrent requests efficiently.
- For sustained high throughput, maintain a consistent request rate.
- Consider connection pooling for client applications, as in the sketch below.
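As a sketch, a `requests.Session` pools and reuses TCP connections across requests to the same host, which helps sustain a steady request rate; the endpoint URL and payload are the same illustrative assumptions as above.

```python
import requests

session = requests.Session()  # pools and reuses connections per host

EMBED_URL = "http://localhost:8000/v1/embeddings"  # hypothetical endpoint

payloads = [{"request_type": "query", "input": [f"caption {i}"]} for i in range(100)]
for payload in payloads:
    resp = session.post(EMBED_URL, json=payload)
    resp.raise_for_status()
```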
**Error Handling**

- The service includes automatic retry logic for transient failures.
- Monitor the `X-Restart-Count` header for service health.
- Implement exponential backoff in client code for failed requests, as in the sketch below.
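A minimal backoff sketch: it retries transient failures with exponentially increasing waits and surfaces the `X-Restart-Count` header when present. The endpoint and payload shape remain assumptions.

```python
import time
import requests

def post_with_backoff(url: str, payload: dict, max_attempts: int = 5) -> requests.Response:
    """POST with exponential backoff on transient failures."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(url, json=payload, timeout=60)
            # Log the restart count as a rough service-health signal.
            restarts = resp.headers.get("X-Restart-Count")
            if restarts is not None:
                print(f"service restarts so far: {restarts}")
            # Treat 5xx as transient and retry; return anything else.
            if resp.status_code < 500:
                return resp
        except requests.RequestException as exc:
            print(f"attempt {attempt} failed: {exc}")
        time.sleep(delay)
        delay *= 2  # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"request failed after {max_attempts} attempts")
```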