Performance#

Request Type Selection#

Choose the right request type for your workload:

  • request_type: "query": Best for the following:

    • Single items (1 item per request)

    • Low-latency applications requiring immediate responses

    • Interactive applications

    • Sending video data as part of the request

  • request_type: "bulk_text": Best for the following:

    • Large batches of text-only inputs (up to 64 strings)

    • High-throughput text processing

    • When you have many text captions to process at once

  • request_type: "bulk_video": Best for the following:

    • Large batches of video-only inputs (up to 64 videos)

    • Maximum throughput for video processing

    • Batch processing workflows
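Assuming a JSON request body, the selection rules above can be sketched as a small payload builder that enforces the per-type item limits. The field names (`request_type`, `inputs`) are illustrative assumptions, not the service's actual schema:

```python
# Illustrative payload builder; field names are assumptions, not the
# service's actual schema. Limits follow the documented per-type maximums.
def build_payload(request_type, items):
    """Build a request body, enforcing the per-request item limits."""
    limits = {"query": 1, "bulk_text": 64, "bulk_video": 64}
    if request_type not in limits:
        raise ValueError(f"unknown request_type: {request_type!r}")
    limit = limits[request_type]
    if len(items) > limit:
        raise ValueError(f"{request_type} accepts at most {limit} items")
    return {"request_type": request_type, "inputs": list(items)}
```

For example, `build_payload("bulk_text", captions)` accepts up to 64 caption strings, while `build_payload("query", ...)` rejects anything beyond a single item.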

Bulk Video Throughput Performance#

The following benchmarks were conducted using a pipelined version with load-balanced queues, where decoding (pyNvVideoCodec) runs in parallel with CV-CUDA pre-processing and inference (TRT fp16). This setup utilizes best-effort load balancing, MPS (Multi-Process Service), and IPC (Inter-Process Communication). The processes involved are 2 downloaders, 11 decoders, 2 preprocessors, and 2 embedders.

  • A100 GPU:

    • Configuration: 15-second, 1080p video in bulk_video mode with a request size of 64.

    • Throughput: 16 videos/second

  • H100 GPU:

    • Configuration: 15-second, 1080p video in bulk_video mode with a request size of 64.

    • Throughput: 28.57 videos/second
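From the published throughput figures, the expected wall-clock time for one full 64-video request is roughly the batch size divided by throughput:

```python
# Back-of-envelope batch latency from the published throughput numbers.
batch = 64
a100_tput = 16.0    # videos/second (A100, bulk_video, request size 64)
h100_tput = 28.57   # videos/second (H100, same configuration)

a100_batch_s = batch / a100_tput   # about 4.0 s per 64-video request
h100_batch_s = batch / h100_tput   # about 2.24 s per 64-video request
```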

Additional GPUs#

  • Expect proportional scaling based on decode and compute capacity across supported GPUs (e.g., L40S, L4, H20, L20). For GPUs with less memory, reduce batch/request sizes to avoid memory pressure.

Latency Performance#

The following table shows representative single-embedding latency in seconds for request_type: "query" with text and video inputs, measured on MSRVTT clips.

| GPU                    | Video (s) | Text (s) |
|------------------------|-----------|----------|
| NVIDIA H100 80GB HBM3  | 0.0431    | 0.0058   |
| NVIDIA H100 NVL        | 0.0433    | 0.0055   |
| NVIDIA H100 PCIe       | 0.0446    | 0.0058   |
| NVIDIA L40S            | 0.0511    | 0.0061   |
| NVIDIA A100 80GB PCIe  | 0.0563    | 0.0064   |
| NVIDIA A100-SXM4-80GB  | 0.0613    | 0.0069   |
| NVIDIA A100-SXM4-40GB  | 0.0672    | 0.0074   |
| NVIDIA A100-PCIE-40GB  | 0.0683    | 0.0093   |
| NVIDIA H20             | 0.0705    | 0.0077   |
| NVIDIA L4              | 0.1096    | 0.0085   |
| NVIDIA L20             | 0.0738    | 0.0064   |
| NVIDIA RTX 6000 Ada    | 0.0527    | 0.0067   |

Resource Management#

Memory Optimization

  • Monitor GPU memory usage with larger batch sizes.

  • The service automatically scales workers based on available resources.

  • Use CUDA_VISIBLE_DEVICES to control GPU allocation.

Concurrency

  • The service handles multiple concurrent requests efficiently.

  • For sustained high throughput, maintain a consistent request rate.

  • Consider connection pooling for client applications.
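One generic way to pool client connections is to keep a fixed set of client objects in a queue and reuse them across requests. The sketch below is library-agnostic (the `factory` callable is whatever creates your HTTP client); if you use the `requests` library, a shared `requests.Session` already pools connections internally:

```python
import queue

class ClientPool:
    """Minimal connection-pool sketch: reuse a fixed set of client
    objects instead of opening a new connection per request."""

    def __init__(self, factory, size=4):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self):
        # Blocks if all clients are currently in use, which also
        # caps client-side concurrency against the service.
        return self._pool.get()

    def release(self, client):
        self._pool.put(client)
```

Bounding the pool size doubles as a simple client-side concurrency limit, which helps maintain the consistent request rate recommended above.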

Error Handling

  • The service includes automatic retry logic for transient failures.

  • Monitor the X-Restart-Count header for service health.

  • Implement exponential backoff in client code for failed requests.
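A minimal client-side retry wrapper with exponential backoff and jitter might look like the following (the retry policy and parameter defaults are illustrative choices, not values mandated by the service):

```python
import random
import time

def with_backoff(call, max_attempts=5, base=0.5, cap=30.0):
    """Retry `call` on failure, doubling the delay each attempt.

    Jitter (a random factor) spreads out retries from many clients so
    they do not hammer the service in lockstep after an outage.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the last error
            delay = min(cap, base * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```

In practice you would catch only transient error types (timeouts, HTTP 5xx) rather than bare `Exception`, and let non-retryable errors propagate immediately.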