Performance Benchmarks#

Overview#

This document provides comprehensive performance benchmarks for the Cosmos Dataset Search (CDS) system, covering three key performance areas:

  1. Bulk Ingestion Performance: GPU-accelerated embedding ingestion using Milvus GPU_CAGRA indexing

  2. Video Ingestion Performance: End-to-end video processing and embedding generation

  3. Text Search Latency Performance: Query response times for semantic search

All benchmarks were conducted on NVIDIA L40 GPU hardware with the optimized Milvus configuration from /deploy/standalone/milvus_l40_standalone_optimized.yaml.

To apply this configuration, ensure it is mounted in /deploy/standalone/docker-compose.build.yml under the milvus service (using a volumes entry):

services:
  milvus:
    # ...
    volumes:
      - ./deploy/standalone/milvus_l40_standalone_optimized.yaml:/milvus/configs/milvus.yaml

This ensures the Milvus instance uses the performance-optimized configuration for the benchmarks described below.
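With the volume in place, restart the stack through the same Compose file so Milvus picks up the optimized configuration. A typical invocation from the repository root (assuming the standard Docker Compose workflow for this deployment):

# Recreate the Milvus service with the mounted configuration
docker compose -f deploy/standalone/docker-compose.build.yml up -d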

Reference Documentation#

Refer to the Milvus, GPU_CAGRA (cuVS), and Cosmos Embed1 NIM documentation for more information about the technologies used in these benchmarks.

Test Environment#

Hardware Configuration#

| Component | Specification |
|-----------|---------------|
| GPU | NVIDIA L40 (48GB VRAM) |
| CPU | AMD EPYC 7232P 8-Core Processor (16 threads) |
| RAM | 126GB DDR4 |
| Storage | 1.7TB NVMe SSD |
| OS | Ubuntu 22.04 LTS (Linux 5.15.0-153) |

Software Stack#

| Component | Version |
|-----------|---------|
| Milvus | 2.4.4-gpu (standalone mode) |
| CUDA | 13.0 |
| NVIDIA Driver | 580.65.06 |
| Index Type | GPU_CAGRA |
| Storage Backend | LocalStack S3 (development) |


1. Bulk Ingestion Performance#

Overview#

Bulk ingestion benchmarks measure the throughput of loading pre-computed embeddings into Milvus with GPU_CAGRA indexing. This is useful for initial dataset loading or batch updates.

How to Run Bulk Ingestion Benchmarks#

Step 1: Generate Embedding Data#

source .venv/bin/activate

Use the generate_data.py script to create synthetic embedding datasets in Parquet format:

# Generate 10M vectors (256-dim, float32)
python3 scripts/evals/bulk_ingestion_performance/generate_data.py \
  --output benchmark_data/embeddings_10m.parquet \
  --num-vectors 10000000 \
  --embedding-dim 256 \
  --batch-size 250000

Parameters:

  • --output: The output Parquet file path

  • --num-vectors: The number of vectors to generate (default: 1M)

  • --embedding-dim: The embedding dimension (default: 256)

  • --batch-size: The batch size for writing (default: 100K)

Output: A parquet file with the following schema:

  • id (string): A unique vector ID

  • embedding (list of float32): A normalized embedding vector

  • $meta (string): JSON metadata for each vector
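For reference, a minimal pyarrow sketch of producing a file with this schema (illustrative only; the actual generate_data.py script writes in configurable batches and may differ in detail):

import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

num_vectors, dim = 100_000, 256  # small example; the real script streams larger batches

ids = pa.array([f"vec_{i}" for i in range(num_vectors)], type=pa.string())

vectors = np.random.rand(num_vectors, dim).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # L2-normalize each vector
embeddings = pa.array(vectors.tolist(), type=pa.list_(pa.float32()))

meta = pa.array([json.dumps({"source": "synthetic"})] * num_vectors, type=pa.string())

table = pa.Table.from_arrays([ids, embeddings, meta], names=["id", "embedding", "$meta"])
pq.write_table(table, "embeddings_100k.parquet")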

Step 2: Run the Benchmark#

Use the run_benchmark.py script to upload data to S3 and measure ingestion throughput:

# Full benchmark (upload to localstack + create collection + ingest)
python3 scripts/evals/bulk_ingestion_performance/run_benchmark.py \
  --parquet-file benchmark_data/embeddings_10m.parquet \
  --num-vectors 10000000

# Skip upload if data already in S3
python3 scripts/evals/bulk_ingestion_performance/run_benchmark.py \
  --parquet-file benchmark_data/embeddings_10m.parquet \
  --num-vectors 10000000 \
  --skip-upload

# Use existing collection
python3 scripts/evals/bulk_ingestion_performance/run_benchmark.py \
  --parquet-file benchmark_data/embeddings_10m.parquet \
  --num-vectors 10000000 \
  --skip-upload \
  --skip-create \
  --collection-id <collection_id>

The benchmark will:

  1. Upload the Parquet file to LocalStack S3 (if not skipped).

  2. Create a GPU_CAGRA collection (if not skipped).

  3. Initiate the bulk insert via the /v1/insert-data API.

  4. Monitor progress every 5 seconds.

  5. Report final throughput metrics.

Step 3: Configuration#

Key Settings:

  • GPU Memory Pool: 16GB init / 30GB max (optimized for L40 48GB VRAM)

  • Segment Size: 2GB (optimal for GPU_CAGRA indexing)

  • Build Parallel: 1 (CUVS team recommendation for single GPU)

  • Concurrent Import Tasks: 16 (balanced for S3 I/O)

  • Storage V2: Disabled (required for Milvus 2.4.4 bulk import compatibility)

10 Million Embeddings Benchmark#

Test Configuration:

  • Dataset Size: 10,000,000 vectors

  • Embedding Dimension: 256-dimensional float32

  • File Size: 9.9 GB (parquet format)

  • Batch Size: 250,000 vectors per batch

  • Segment Size: 2GB (optimized)

Results:

| Metric | Value |
|--------|-------|
| Total Time | 419.38 seconds |
| Average Throughput | 23,845 vectors/second |
| Throughput (per minute) | 1,430,700 vectors/minute |
| Throughput (per hour) | 85,842,000 vectors/hour |
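The derived rates follow directly from the totals: 10,000,000 vectors ÷ 419.38 s ≈ 23,845 vectors/second, which scales to roughly 1.43 million vectors/minute and 85.8 million vectors/hour.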

Performance Breakdown#

The bulk ingestion process consists of three main phases:

  1. Data Upload to S3: ~20 seconds (S3 transfer)

  2. Data Loading: ~280 seconds (61% of total time)

  3. GPU Index Building: ~140 seconds (32% of total time)

GPU Indexing Performance (Isolated)#

When measured independently (without S3 I/O overhead), the GPU_CAGRA indexing performance of this setup is significantly higher:

| Metric | Value |
|--------|-------|
| Pure GPU Indexing Throughput | ~36,966 vectors/second |

Key Insight: The GPU can index vectors roughly 1.5x faster than the end-to-end pipeline achieves (~36,966 vs. 23,845 vectors/second), confirming that S3 I/O, not GPU compute capacity, is the primary bottleneck. The GPU spends significant time idle, waiting for data to arrive from storage.

GPU Memory Allocation#

The system uses a conservative GPU memory allocation strategy:

  • Cosmos-embed NIM: ~12 GB (embedding model)

  • Milvus GPU Pool: 13.4 GB active (configured: 16GB init / 30GB max)

  • Available Headroom: ~28 GB unused
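To verify this split on a running system, nvidia-smi can report overall and per-process GPU memory usage:

# Overall GPU memory usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Per-process usage (shows the Cosmos-embed NIM and Milvus allocations separately)
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv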

Performance Analysis#

Current Scenario#

S3 I/O Throughput (Primary Bottleneck)

  • LocalStack adds significant overhead compared to production S3.

  • The data loading phase dominates the total processing time (61%).

  • There is network transfer and deserialization overhead.

Optimization Opportunities#

High-Impact Optimizations#

  1. Production S3 Backend

    • Expected Improvement: 2-3x throughput

    • Replaces LocalStack with AWS S3 or MinIO cluster

    • Reduces I/O bottleneck significantly

  2. Increase Segment Size

    • Current: 2GB segments (~2M vectors)

    • Recommended: 4GB segments (~4M vectors)

    • Benefit: Better GPU batching, fewer index builds

  3. Pipeline Parallelism

    • Set indexNode.scheduler.buildParallel to 2 (up from 1)

    • Benefit: Overlap data loading with index building

    • Risk: Potential GPU memory contention

  4. Increase Concurrent Tasks

    • Current: 16 concurrent import tasks

    • Recommended: 32-64 concurrent tasks

    • Benefit: Better S3 parallelization, keeps GPU fed

Medium-Impact Optimizations#

  1. Larger Read Buffers

    • Increases from 16MB to 64-128MB

    • Reduces number of S3 requests

  2. Pre-stage Data

    • Uploads data to S3 before benchmarking

    • Measures pure indexing performance

  3. Multiple Parquet Files

    • Splits 10M vectors into multiple files

    • Enables parallel file processing
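Taken together, these optimizations map onto a handful of keys in the Milvus configuration shown in Section 4. A sketch of the adjusted values (illustrative and unbenchmarked; validate against GPU memory headroom before adopting):

dataCoord:
  segment:
    maxSize: 4096            # 4GB segments (~4M vectors per index build)

indexNode:
  scheduler:
    buildParallel: 2         # overlap data loading with index building; watch GPU memory

dataNode:
  import:
    maxConcurrentTaskNum: 32 # better S3 parallelization
    readBufferSizeInMB: 64   # fewer, larger S3 reads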

Expected Performance with Optimizations#

| Optimization Level | Expected Throughput | Estimated Time (10M) |
|--------------------|---------------------|----------------------|
| Current | 23,845 vec/s | 6.98 minutes |


2. Video Ingestion Performance#

Overview#

Video ingestion benchmarks measure the end-to-end performance of processing video files, generating embeddings using Cosmos Embed1 NIM, and storing them in Milvus. This represents the real-world use case of ingesting video datasets.

How to Run Video Ingestion Benchmarks#

Step 1: Download Video Dataset#

Download the MSR-VTT test dataset, which contains 1000 videos:

# Download using prepared configuration
make prepare-dataset CONFIG=scripts/msrvtt_test_1000.yaml

# Videos will be downloaded to: ~/datasets/msrvtt/videos/

Step 2: Host Videos via HTTP#

Start a local HTTP server to serve the video files:

# Host videos on local network interface
python3 scripts/evals/video_ingestion_performance/http_file_host.py \
  --url-host-mode interface \
  --interface enp2s0f0np0 \
  --root ~/datasets/msrvtt/videos \
  --port 8234 \
  --csv-path hosted_files.csv

This will perform the following steps:

  • Start an HTTP server on port 8234.

  • Generate a CSV file (hosted_files.csv) with video URLs.

  • Output the base URL (e.g. http://10.63.179.185:8234).

Parameters:

  • --url-host-mode: The method for determining the host URL (interface, localhost, or ip)

  • --interface: The network interface name (e.g., enp2s0f0np0)

  • --root: The root directory containing videos

  • --port: The HTTP server port

  • --csv-path: The output CSV file with video URLs
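Before ingesting, it can help to confirm the server is reachable from the machine running CDS; a quick check against one of the hosted files (substitute a real file name from hosted_files.csv) might be:

# HEAD request against a hosted video (host/port taken from the server output above)
curl -I http://10.63.179.185:8234/<video_file>.mp4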

Step 3: Create Collection#

Create a new collection for video ingestion:

cds collections create \
  --pipeline cosmos_video_search_milvus \
  --name "MSR-VTT Performance Test"

Note the returned collection_id value for the next step.

Step 4: Run Ingestion Benchmark#

Ingest the videos using the url_ingestion.py script:

# Run video ingestion benchmark
python3 scripts/evals/video_ingestion_performance/url_ingestion.py \
  --collection-id <collection_id> \
  --csv hosted_files.csv \
  --verbose \
  --max-videos 1000 \
  --batch-size 64 \
  --base-url http://localhost:8888 \
  --nim-base-url http://localhost:9000 \
  --measure-embed-delta \
  --csv-out report.csv

Parameters:

  • --collection-id: The target collection ID

  • --csv: A CSV file with video URLs (generated in Step 2)

  • --max-videos: The maximum number of videos to process

  • --batch-size: The number of videos per batch sent to the embedding service

  • --base-url: The CDS API endpoint

  • --nim-base-url: The Cosmos Embed1 NIM endpoint

  • --measure-embed-delta: A flag that enables measuring the NIM embedding latency separately

  • --csv-out: The output CSV file for detailed per-batch metrics
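The CDS overhead figures reported below are derived by subtracting the NIM latency from the end-to-end CDS latency for each batch. A minimal sketch of that calculation, using a few of the measured batch values as illustrative inputs (not the script's actual internals):

# Per-batch latencies in seconds (values from the batch table below)
cds_latencies = [4.25, 2.92, 3.26]   # end-to-end CDS API time per batch
nim_latencies = [2.86, 2.77, 2.79]   # Cosmos-embed NIM time per batch

overheads = [c - n for c, n in zip(cds_latencies, nim_latencies)]
avg_overhead = sum(overheads) / len(overheads)
overhead_pct = 100 * avg_overhead / (sum(cds_latencies) / len(cds_latencies))

print(f"avg CDS overhead: {avg_overhead:.2f}s ({overhead_pct:.1f}% of batch time)")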

MSR-VTT 1000 Videos Benchmark#

Test Configuration:

  • Dataset: MSR-VTT test set (1000 videos)

  • Video Format: MP4

  • Embedding Model: Cosmos Embed1 NIM (running on the same L40 GPU)

  • Batch Size: 64 videos per batch

  • Total Batches: 16 batches (15 full batches of 64, plus 1 partial batch of 40)

  • Video Hosting: Local HTTP server (port 8234)

Results:

| Metric | Value |
|--------|-------|
| Total Videos | 1000 |
| Total Time | 53.09 seconds |
| Average Throughput | 18.84 videos/second |
| Throughput (per minute) | 1,130 videos/minute |
| Throughput (per hour) | 67,800 videos/hour |
| Per-Video Processing Time | 0.053 seconds |
| Success Rate | 100% (1000/1000) |

Per-Batch Performance Breakdown#

CDS API (End-to-End):

| Metric | Value |
|--------|-------|
| Min Latency | 2.09 seconds |
| Avg Latency | 3.14 seconds |
| Max Latency | 4.25 seconds |

Cosmos-embed NIM (Embedding Generation):

| Metric | Value |
|--------|-------|
| Min Latency | 1.71 seconds |
| Avg Latency | 2.75 seconds |
| Max Latency | 2.96 seconds |

CDS Overhead (CDS API - NIM):

| Metric | Value |
|--------|-------|
| Avg Overhead | 0.39 seconds |
| Overhead Percentage | 12.4% of total time |

Performance Analysis#

Key Insights:

  1. Embedding generation dominates the processing time: The Cosmos-embed NIM takes ~2.75s per batch (64 videos), which is 87.6% of total processing time. This is the primary bottleneck.

  2. CDS API overhead is minimal: The CDS API overhead (video download from HTTP, Milvus insertion, orchestration) is only ~0.39s per batch (12.4% of total time), showing efficient pipeline implementation.

  3. Batch processing is efficient: Processing 64 videos in ~3.14s achieves high GPU utilization on the embedding model, with minimal per-video overhead.

  4. Consistent performance: Low latency variance (2.09s - 4.25s) indicates stable processing across all batches.

  5. High throughput: At 18.84 videos/second, the system can process approximately 68K videos per hour, making it suitable for large-scale video dataset ingestion.

Detailed Batch Statistics#

The following are samples of per-batch performance:

| Batch | Size | CDS Latency (s) | NIM Latency (s) | Overhead (s) |
|-------|------|-----------------|-----------------|--------------|
| 1 | 64 | 4.25 | 2.86 | 1.39 |
| 2 | 64 | 2.92 | 2.77 | 0.16 |
| 3 | 64 | 3.26 | 2.79 | 0.46 |
| 8 | 64 | 3.19 | 2.80 | 0.38 |
| 15 | 64 | 2.98 | 2.83 | 0.15 |
| 16 | 40 | 2.09 | 1.71 | 0.38 |

Observations:

  • The first batch has higher latency (4.25s) due to cold start / initialization.

  • Subsequent batches stabilize around 3.0-3.3s.

  • The last batch of 40 videos is proportionally faster (2.09s).

  • The NIM latency is very consistent (2.75s ± 0.1s).

Video Ingestion Optimization Opportunities#

  1. Optimize Batch Size

    • Current: 64 videos/batch

    • Trade-off: Larger batches = better GPU utilization but longer latency per batch

    • Smaller batches = more frequent progress updates but potential GPU underutilization

    • Recommendation: Test batch sizes 16, 32, 64 to find optimal balance

  2. Dedicated GPU for Embedding

    • Current: Cosmos-embed NIM shares L40 GPU with Milvus

    • Alternative: Dedicated GPU for embeddings

    • Benefit: Eliminate GPU contention, potentially 20-30% throughput improvement

  3. Parallel Processing

    • Current: Single-threaded batch processing

    • Alternative: Multiple parallel workers with smaller batches; the cds CLI implements Ray workers for ingestion

    • Benefit: Better utilization of multi-core CPU for video decoding and HTTP downloads


3. Text Search Latency Performance#

Overview#

Search latency benchmarks measure the query response time for semantic text-to-video search over large collections. This represents the end-user experience when searching the video dataset.

How to Run Search Latency Benchmarks#

Use the latency_test.py script to measure search performance:

# Run 60-second latency test with 200 diverse queries
python3 scripts/evals/video_ingestion_performance/latency_test.py \
  --base-url http://localhost:8888 \
  --latency-test \
  --verbose \
  --collection-id <collection_id> \
  --csv-out latency_report.csv \
  --nim-base-url http://localhost:9000 \
  --query-pool-size 200 \
  --duration 60 \
  --top-k 20

Parameters:

  • --base-url: The CDS API endpoint

  • --collection-id: The target collection ID

  • --nim-base-url: The Cosmos Embed1 NIM endpoint

  • --query-pool-size: The number of diverse queries to generate (default: 200)

  • --duration: The test duration in seconds (default: 60)

  • --top-k: The number of results to retrieve per query (default: 20)

  • --csv-out: The output CSV file for detailed results

The benchmark will perform the following steps:

  1. Generate a diverse pool of text queries.

  2. Measure baseline API latency (via health checks).

  3. Run continuous searches for the specified duration.

  4. Measure NIM embedding latency separately.

  5. Report detailed latency statistics and breakdowns.
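The reported statistics can also be reproduced offline from the CSV report. A small pandas sketch, assuming the report contains per-search rows with query and latency_s columns (the column names are assumptions about the report format):

import pandas as pd

df = pd.read_csv("latency_report.csv")  # produced via --csv-out

# Per-query latency statistics, mirroring the distribution table below
stats = df.groupby("query")["latency_s"].agg(["count", "min", "mean", "max"])
print(stats.sort_values("mean"))

print(f"overall avg: {df['latency_s'].mean():.3f}s over {len(df)} searches")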

10 Million Embeddings Search Benchmark#

Test Configuration:

  • Collection Size: 10,000,000 embeddings (256-dim)

  • Index Type: GPU_CAGRA

  • Query Pool: 200 diverse text queries

  • Test Duration: 60 seconds

  • Top-K: 20 results per query

  • Pattern: Continuous (no delays between queries)

Results:

| Metric | Value |
|--------|-------|
| Total Searches | 190 queries |
| Test Duration | 60.2 seconds |
| Throughput | 3.16 searches/second |
| Min Latency | 0.147 seconds |
| Avg Latency | 0.317 seconds |
| Max Latency | 0.464 seconds |

Latency Breakdown#

| Component | Avg Latency | Percentage |
|-----------|-------------|------------|
| Total End-to-End | 0.317 seconds | 100% |
| Milvus Search | 0.305 seconds | 96.2% |
| Cosmos-embed NIM | 0.007 seconds | 2.2% |
| Network/API Overhead | 0.004 seconds | 1.3% |
| CDS API Baseline | 0.004 seconds | 1.3% |

Key Insights:

  1. Milvus GPU search dominates: The GPU_CAGRA index search takes 96.2% of total latency (0.305s), which is expected for a 10M vector collection.

  2. Embedding generation is fast: Text-to-embedding conversion via Cosmos-embed NIM is only 0.007s (2.2%), showing excellent NIM performance.

  3. Low API Overhead: The CDS API adds minimal overhead (0.004s or 1.3%).

  4. Consistent performance: Latency variance is low (0.147s-0.464s), with most queries completing in the 0.2-0.4s range.

  5. Sub-second response times: The average end-to-end latency is 317ms, providing a good user experience for semantic search.

Query Performance Distribution#

The following are samples of query latencies from the benchmark:

| Query | Count | Min (s) | Avg (s) | Max (s) |
|-------|-------|---------|---------|---------|
| waves lapping against a pier | 1 | 0.147 | 0.147 | 0.147 |
| cargo ship entering harbor | 2 | 0.191 | 0.193 | 0.195 |
| penguins waddling on ice | 2 | 0.195 | 0.198 | 0.201 |
| street food vendor cooking | 5 | 0.197 | 0.321 | 0.404 |
| chocolate melting in bowl | 3 | 0.382 | 0.397 | 0.409 |
| loading clothes into washer | 1 | 0.464 | 0.464 | 0.464 |

Observations:

  • Simple queries (e.g. “waves lapping”) complete faster (~0.15s).

  • Complex queries (e.g. “loading clothes”) take longer (~0.46s).

  • Repeated queries show consistent latency.

  • Most queries complete in the 0.2-0.4s range.

Search Latency Optimization Opportunities#

  1. Increase GPU Memory Allocation

    • Current: 13.4GB active (29% of 46GB available)

    • Recommended: Increase to 20-25GB

    • Benefit: More index data cached on GPU, faster searches

  2. Tune GPU_CAGRA Parameters

    • Current: Default settings

    • Optimize: intermediate_graph_degree, graph_degree, itopk_size

    • Trade-off: Search speed vs. recall accuracy

  3. Query Batching

    • Current: Single query per request

    • Alternative: Batch multiple queries

    • Benefit: Better GPU utilization, higher throughput

  4. Index Optimization

    • Current: GPU_CAGRA with default build parameters

    • Tune: Build-time parameters for search performance

    • Benefit: Faster searches at cost of longer index build time
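For the GPU_CAGRA tuning in item 2 above, the build- and search-time parameters are set through the standard pymilvus APIs. A minimal sketch (the collection name, metric type, and parameter values are assumptions to be tuned against recall targets, not recommendations):

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("cds_benchmark")  # assumed collection name

# Build-time parameters: larger graph degrees improve recall but lengthen index builds
index_params = {
    "index_type": "GPU_CAGRA",
    "metric_type": "IP",  # assumed; must match the collection's metric
    "params": {"intermediate_graph_degree": 128, "graph_degree": 64},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()

# Search-time parameter: larger itopk_size trades latency for recall
search_params = {"metric_type": "IP", "params": {"itopk_size": 128}}
query_embedding = [0.0] * 256  # placeholder; normally produced by the Cosmos-embed NIM
results = collection.search(
    data=[query_embedding],
    anns_field="embedding",
    param=search_params,
    limit=20,
)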


4. Configuration Parameters#

Key Milvus Settings#

# GPU Memory Pool (L40 48GB)
gpu:
  initMemSize: 16384  # 16GB initial
  maxMemSize: 30720   # 30GB maximum

# Segment Configuration
dataCoord:
  segment:
    maxSize: 2048  # 2GB segments

# Index Building
indexNode:
  scheduler:
    buildParallel: 1  # Conservative for stability

# Data Import
dataNode:
  import:
    maxConcurrentTaskNum: 16
    maxImportFileSizeInGB: 16
    readBufferSizeInMB: 16

Scaling Considerations#

GPU Memory Capacity#

The L40 GPU with 48GB of VRAM can accommodate the following (approximately):

  • With 30GB allocated to Milvus: ~25-30M vectors (256-dim) with GPU_CAGRA index

  • Theoretical maximum: ~40M vectors before requiring batch processing

Distributed vs. Standalone#

Standalone Mode (Current)#

  • Simpler configuration

  • Lower overhead

  • Suitable for <30M vectors

  • Single point of failure

  • Limited by single GPU

Distributed Mode (Multi-node: DataNode and IndexNode)#

  • Allows horizontal scaling by deploying multiple DataNode and IndexNode processes.

  • Enables parallel data ingestion and simultaneous GPU index building across nodes.

  • Allows higher throughput with multiple GPUs, as DataNode and IndexNode workloads are distributed.

  • Includes built-in fault tolerance, since node failures don’t halt the cluster.

  • Requires more complex cluster management and monitoring.

  • Carries higher operational and infrastructure overhead, but supports >50M to 100M+ vectors and production-scale workloads.

5. Summary and Recommendations#

Performance Summary#

| Benchmark Type | Key Metric | Value | Notes |
|----------------|------------|-------|-------|
| Bulk Ingestion | Throughput | 23,845 vectors/s | 10M embeddings in 6.98 min |
| Video Ingestion | Throughput | 18.84 videos/s | 1000 videos in 53 seconds |
| Search Latency | Avg Response | 0.317 seconds | 10M collection, top-20 |

For Development/Testing#

  • Current configuration is optimal.

  • Focus on application development.

  • LocalStack S3 is sufficient.

For Production Deployment#

  1. Immediate Actions

    • Switch to production S3 (AWS/MinIO cluster).

    • Increase segment size to 4GB.

    • Increase concurrent tasks to 32.

  2. Performance Tuning

    • Monitor GPU utilization during bulk imports.

    • Adjust buildParallel based on memory usage.

    • Tune read buffer sizes based on S3 latency.

  3. Scaling Strategy

    • Use standalone mode for <30M vectors.

    • Consider using distributed mode for >50M vectors.

    • Plan to use multiple L40 GPUs for >100M vectors.


Last Updated: October 24, 2025
Benchmark Version: 1.0
Configuration: milvus_l40_standalone_optimized.yaml