Performance Benchmarks#
Overview#
This document provides comprehensive performance benchmarks for the Cosmos Dataset Search (CDS) system, covering three key performance areas:
Bulk Ingestion Performance: GPU-accelerated embedding ingestion using Milvus GPU_CAGRA indexing
Video Ingestion Performance: End-to-end video processing and embedding generation
Text Search Latency Performance: Query response times for semantic search
All benchmarks were conducted on NVIDIA L40 GPU hardware with the optimized Milvus configuration from /deploy/standalone/milvus_l40_standalone_optimized.yaml.
To apply this configuration, ensure it is mounted in /deploy/standalone/docker-compose.build.yml under the milvus service (using a volumes entry):
services:
  milvus:
    # ...
    volumes:
      - ./deploy/standalone/milvus_l40_standalone_optimized.yaml:/milvus/configs/milvus.yaml
This ensures the Milvus instance uses the performance-optimized configuration for the benchmarks described below.
Reference Documentation#
Refer to the following resources for more information about the technologies used in these benchmarks:
- GPU_CAGRA Index: Milvus GPU_CAGRA Documentation: Official documentation for `GPU_CAGRA` index configuration and usage
- GPU_CAGRA Technical Overview: Milvus Introduces GPU Index CAGRA: Technical deep dive into CAGRA architecture and performance
- Milvus GPU Configuration: Install Milvus Standalone with Docker Compose (GPU): GPU setup and configuration guide
- CAGRA Parameter Tuning: cuVS Automated Tuning Guide: Guide for automated hyperparameter optimization (HPO) of CAGRA graph parameters using GPU-accelerated tuning
- Comparing Vector Indexes: cuVS Comparing Performance of Vector Indexes: Methodology for recall-aware performance comparison and benchmarking of vector search indexes
Test Environment#
Hardware Configuration#
| Component | Specification |
|---|---|
| GPU | NVIDIA L40 (48GB VRAM) |
| CPU | AMD EPYC 7232P 8-Core Processor (16 threads) |
| RAM | 126GB DDR4 |
| Storage | 1.7TB NVMe SSD |
| OS | Ubuntu 22.04 LTS (Linux 5.15.0-153) |
Software Stack#
| Component | Version |
|---|---|
| Milvus | 2.4.4-gpu (standalone mode) |
| CUDA | 13.0 |
| NVIDIA Driver | 580.65.06 |
| Index Type | GPU_CAGRA |
| Storage Backend | LocalStack S3 (development) |
1. Bulk Ingestion Performance#
Overview#
Bulk ingestion benchmarks measure the throughput of loading pre-computed embeddings into Milvus with GPU_CAGRA indexing. This is useful for initial dataset loading or batch updates.
How to Run Bulk Ingestion Benchmarks#
Step 1: Generate Embedding Data#
Activate the project virtual environment:

source .venv/bin/activate
Use the generate_data.py script to create synthetic embedding datasets in Parquet format:
# Generate 10M vectors (256-dim, float32)
python3 scripts/evals/bulk_ingestion_performance/generate_data.py \
--output benchmark_data/embeddings_10m.parquet \
--num-vectors 10000000 \
--embedding-dim 256 \
--batch-size 250000
Parameters:
- `--output`: The output Parquet file path
- `--num-vectors`: The number of vectors to generate (default: 1M)
- `--embedding-dim`: The embedding dimension (default: 256)
- `--batch-size`: The batch size for writing (default: 100K)
Output: A parquet file with the following schema:
- `id` (string): A unique vector ID
- `embedding` (list of float32): A normalized embedding vector
- `$meta` (string): JSON metadata for each vector
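The snippet below is a minimal sketch of how such a file can be produced with pyarrow and numpy. It mirrors the schema above but is not the actual generate_data.py implementation, and the metadata contents are placeholders.

```python
import json
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

NUM_VECTORS, DIM, BATCH = 1_000_000, 256, 100_000

schema = pa.schema([
    ("id", pa.string()),
    ("embedding", pa.list_(pa.float32())),
    ("$meta", pa.string()),
])

with pq.ParquetWriter("embeddings_1m.parquet", schema) as writer:
    for start in range(0, NUM_VECTORS, BATCH):
        n = min(BATCH, NUM_VECTORS - start)
        vecs = np.random.rand(n, DIM).astype(np.float32)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)      # L2-normalize each vector
        writer.write_table(pa.table({
            "id": [f"vec_{start + i}" for i in range(n)],
            "embedding": list(vecs),                              # stored as list<float32>
            "$meta": [json.dumps({"source": "synthetic"})] * n,   # placeholder metadata
        }, schema=schema))
```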
Step 2: Run the Benchmark#
Use the run_benchmark.py script to upload data to S3 and measure ingestion throughput:
# Full benchmark (upload to localstack + create collection + ingest)
python3 scripts/evals/bulk_ingestion_performance/run_benchmark.py \
--parquet-file benchmark_data/embeddings_10m.parquet \
--num-vectors 10000000
# Skip upload if data already in S3
python3 scripts/evals/bulk_ingestion_performance/run_benchmark.py \
--parquet-file benchmark_data/embeddings_10m.parquet \
--num-vectors 10000000 \
--skip-upload
# Use existing collection
python3 scripts/evals/bulk_ingestion_performance/run_benchmark.py \
--parquet-file benchmark_data/embeddings_10m.parquet \
--num-vectors 10000000 \
--skip-upload \
--skip-create \
--collection-id <collection_id>
The benchmark will:

1. Upload the Parquet file to LocalStack S3 (if not skipped).
2. Create a `GPU_CAGRA` collection (if not skipped).
3. Initiate bulk insert via the `/v1/insert-data` API.
4. Monitor progress every 5 seconds.
5. Report final throughput metrics.
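For reference, the sketch below shows what the same flow looks like at the pymilvus level (bulk insert plus 5-second progress polling). The actual benchmark drives this through the CDS `/v1/insert-data` API; the collection name and object-store path here are assumptions.

```python
import time
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")

# Start a bulk insert of a Parquet file already staged in the object store
# (the path is relative to the bucket Milvus is configured to use).
task_id = utility.do_bulk_insert(
    collection_name="benchmark_collection",        # hypothetical collection name
    files=["bulk_data/embeddings_10m.parquet"],    # hypothetical object key
)

start = time.time()
while True:
    state = utility.get_bulk_insert_state(task_id=task_id)
    print(f"{time.time() - start:7.1f}s  state={state.state_name}  rows={state.row_count}")
    if state.state_name == "Completed" or "Failed" in state.state_name:
        break
    time.sleep(5)                                  # poll every 5 seconds, as in the benchmark

elapsed = time.time() - start
print(f"Throughput: {state.row_count / elapsed:,.0f} vectors/second")
```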
Step 3: Configuration#
Key Settings:
GPU Memory Pool: 16GB init / 30GB max (optimized for L40 48GB VRAM)
Segment Size: 2GB (optimal for GPU_CAGRA indexing)
Build Parallel: 1 (cuVS team recommendation for a single GPU)
Concurrent Import Tasks: 16 (balanced for S3 I/O)
Storage V2: Disabled (required for Milvus 2.4.4 bulk import compatibility)
10 Million Embeddings Benchmark#
Test Configuration:
Dataset Size: 10,000,000 vectors
Embedding Dimension: 256-dimensional float32
File Size: 9.9 GB (parquet format)
Batch Size: 250,000 vectors per batch
Segment Size: 2GB (optimized)
Results:
| Metric | Value |
|---|---|
| Total Time | 419.38 seconds |
| Average Throughput | 23,845 vectors/second |
| Throughput (per minute) | 1,430,700 vectors/minute |
| Throughput (per hour) | 85,842,000 vectors/hour |
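The per-minute and per-hour figures follow directly from the measured total time:

```python
total_vectors, total_seconds = 10_000_000, 419.38

per_second = total_vectors / total_seconds   # ~23,845 vectors/second
per_minute = per_second * 60                 # ~1,430,700 vectors/minute
per_hour = per_second * 3600                 # ~85,842,000 vectors/hour
```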
Performance Breakdown#
The bulk ingestion process consists of three main phases:
Data Upload to S3: ~20 seconds (S3 transfer)
Data Loading: ~280 seconds (61% of total time)
GPU Index Building: ~140 seconds (32% of total time)
GPU Indexing Performance (Isolated)#
When measured independently (without S3 I/O overhead), the GPU_CAGRA indexing performance for this setup is significantly higher:
| Metric | Value |
|---|---|
| Pure GPU Indexing Throughput | ~36,966 vectors/second |
Key Insight: The GPU can index vectors roughly 1.5x faster than the end-to-end throughput (~36,966 vs. 23,845 vectors/second), confirming that S3 I/O is the primary bottleneck, not GPU compute capacity. The GPU spends significant time idle, waiting for data to arrive from storage.
GPU Memory Allocation#
The system uses a conservative GPU memory allocation strategy:
Cosmos-embed NIM: ~12 GB (embedding model)
Milvus GPU Pool: 13.4 GB active (configured: 16GB init / 30GB max)
Available Headroom: ~28 GB unused
Performance Analysis#
Current Scenario#
S3 I/O Throughput (Primary Bottleneck)
LocalStack adds significant overhead compared to production S3.
The data loading phase dominates the total processing time (61%).
There is network transfer and deserialization overhead.
Optimization Opportunities#
High-Impact Optimizations#
Production S3 Backend
Expected Improvement: 2-3x throughput
Replaces LocalStack with AWS S3 or MinIO cluster
Reduces I/O bottleneck significantly
Increase Segment Size
Current: 2GB segments (~2M vectors)
Recommended: 4GB segments (~4M vectors)
Benefit: Better GPU batching, fewer index builds
Pipeline Parallelism
Recommended: `buildParallel=2`
Benefit: Overlap data loading with index building
Risk: Potential GPU memory contention
Increase Concurrent Tasks
Current: 16 concurrent import tasks
Recommended: 32-64 concurrent tasks
Benefit: Better S3 parallelization, keeps GPU fed
Medium-Impact Optimizations#
Larger Read Buffers
Increases the read buffer size from 16MB to 64-128MB
Reduces the number of S3 requests
Pre-stage Data
Uploads data to S3 before benchmarking
Measures pure indexing performance
Multiple Parquet Files
Splits 10M vectors into multiple files
Enables parallel file processing
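As a rough illustration of the multiple-files option, the sketch below splits one large Parquet file into equal parts with pyarrow; `split_parquet` is a hypothetical helper, not part of the benchmark scripts.

```python
import pyarrow.parquet as pq

def split_parquet(src: str, parts: int = 8) -> list[str]:
    # Loads the full table into memory (acceptable with 126GB RAM);
    # for a streaming split, iterate over row groups instead.
    table = pq.read_table(src)
    rows_per_part = -(-table.num_rows // parts)   # ceiling division
    outputs = []
    for i in range(parts):
        chunk = table.slice(i * rows_per_part, rows_per_part)
        if chunk.num_rows == 0:
            break
        out = src.replace(".parquet", f"_part{i:02d}.parquet")
        pq.write_table(chunk, out)
        outputs.append(out)
    return outputs

# Example: 10M vectors -> 8 files of ~1.25M vectors each.
print(split_parquet("benchmark_data/embeddings_10m.parquet", parts=8))
```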
Expected Performance with Optimizations#
| Optimization Level | Expected Throughput | Estimated Time (10M) |
|---|---|---|
| Current | 23,845 vec/s | 6.98 minutes |
2. Video Ingestion Performance#
Overview#
Video ingestion benchmarks measure the end-to-end performance of processing video files, generating embeddings using Cosmos Embed1 NIM, and storing them in Milvus. This represents the real-world use case of ingesting video datasets.
How to Run Video Ingestion Benchmarks#
Step 1: Download Video Dataset#
Download the MSR-VTT test dataset, which contains 1000 videos:
# Download using prepared configuration
make prepare-dataset CONFIG=scripts/msrvtt_test_1000.yaml
# Videos will be downloaded to: ~/datasets/msrvtt/videos/
Step 2: Host Videos via HTTP#
Start a local HTTP server to serve the video files:
# Host videos on local network interface
python3 scripts/evals/video_ingestion_performance/http_file_host.py \
--url-host-mode interface \
--interface enp2s0f0np0 \
--root ~/datasets/msrvtt/videos \
--port 8234 \
--csv-path hosted_files.csv
This will perform the following steps:

1. Start an HTTP server on port 8234.
2. Generate a CSV file (`hosted_files.csv`) with video URLs.
3. Output the base URL (e.g. `http://10.63.179.185:8234`).
Parameters:
- `--url-host-mode`: The method for determining the host URL (`interface`, `localhost`, or `ip`)
- `--interface`: The network interface name (e.g., `enp2s0f0np0`)
- `--root`: The root directory containing videos
- `--port`: The HTTP server port
- `--csv-path`: The output CSV file with video URLs
Step 3: Create Collection#
Create a new collection for video ingestion:
cds collections create \
--pipeline cosmos_video_search_milvus \
--name "MSR-VTT Performance Test"
Note the returned collection_id value for the next step.
Step 4: Run Ingestion Benchmark#
Ingest the videos using the url_ingestion.py script:
# Run video ingestion benchmark
python3 scripts/evals/video_ingestion_performance/url_ingestion.py \
--collection-id <collection_id> \
--csv hosted_files.csv \
--verbose \
--max-videos 1000 \
--batch-size 64 \
--base-url http://localhost:8888 \
--nim-base-url http://localhost:9000 \
--measure-embed-delta \
--csv-out report.csv
Parameters:
- `--collection-id`: The target collection ID
- `--csv`: A CSV file with video URLs (generated in Step 2)
- `--max-videos`: The maximum number of videos to process
- `--batch-size`: The number of videos per batch sent to the embedding service
- `--base-url`: The CDS API endpoint
- `--nim-base-url`: The Cosmos Embed1 NIM endpoint
- `--measure-embed-delta`: A flag that enables measuring the NIM embedding latency separately
- `--csv-out`: The output CSV file for detailed per-batch metrics
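Once the run finishes, the per-batch CSV can be summarized with a few lines of Python. The column names below are assumptions; check the header of your `report.csv` for the exact fields written by `url_ingestion.py`.

```python
import csv
import statistics

CDS_COL, NIM_COL = "cds_latency_s", "nim_latency_s"   # hypothetical column names

with open("report.csv", newline="") as f:
    rows = list(csv.DictReader(f))

cds = [float(r[CDS_COL]) for r in rows]
nim = [float(r[NIM_COL]) for r in rows]
overhead = [c - n for c, n in zip(cds, nim)]

print(f"batches={len(rows)}")
print(f"CDS API   avg={statistics.mean(cds):.2f}s  min={min(cds):.2f}s  max={max(cds):.2f}s")
print(f"NIM       avg={statistics.mean(nim):.2f}s  min={min(nim):.2f}s  max={max(nim):.2f}s")
print(f"Overhead  avg={statistics.mean(overhead):.2f}s "
      f"({100 * statistics.mean(overhead) / statistics.mean(cds):.1f}% of CDS latency)")
```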
MSR-VTT 1000 Videos Benchmark#
Test Configuration:
Dataset: MSR-VTT test set (1000 videos)
Video Format: MP4
Embedding Model: Cosmos Embed1 NIM (running on the same L40 GPU)
Batch Size: 64 videos per batch
Total Batches: 16 batches (15 full batches of 64, plus 1 partial batch of 40)
Video Hosting: Local HTTP server (port 8234)
Results:
| Metric | Value |
|---|---|
| Total Videos | 1000 |
| Total Time | 53.09 seconds |
| Average Throughput | 18.84 videos/second |
| Throughput (per minute) | 1,130 videos/minute |
| Throughput (per hour) | 67,800 videos/hour |
| Per-Video Processing Time | 0.053 seconds |
| Success Rate | 100% (1000/1000) |
Per-Batch Performance Breakdown#
CDS API (End-to-End):
| Metric | Value |
|---|---|
| Min Latency | 2.09 seconds |
| Avg Latency | 3.14 seconds |
| Max Latency | 4.25 seconds |
Cosmos-embed NIM (Embedding Generation):
| Metric | Value |
|---|---|
| Min Latency | 1.71 seconds |
| Avg Latency | 2.75 seconds |
| Max Latency | 2.96 seconds |
CDS Overhead (CDS API - NIM):
| Metric | Value |
|---|---|
| Avg Overhead | 0.39 seconds |
| Overhead Percentage | 12.4% of total time |
Performance Analysis#
Key Insights:
Embedding generation dominates the processing time: The Cosmos-embed NIM takes ~2.75s per batch (64 videos), which is 87.6% of total processing time. This is the primary bottleneck.
CDS API overhead is minimal: The CDS API overhead (video download from HTTP, Milvus insertion, orchestration) is only ~0.39s per batch (12.4% of total time), showing efficient pipeline implementation.
Batch processing is efficient: Processing 64 videos in ~3.14s achieves high GPU utilization on the embedding model, with minimal per-video overhead.
Consistent performance: Low latency variance (2.09s - 4.25s) indicates stable processing across all batches.
High throughput: At 18.84 videos/second, the system can process approximately 68K videos per hour, making it suitable for large-scale video dataset ingestion.
Detailed Batch Statistics#
The following are samples of per-batch performance:
| Batch | Size | CDS Latency (s) | NIM Latency (s) | Overhead (s) |
|---|---|---|---|---|
| 1 | 64 | 4.25 | 2.86 | 1.39 |
| 2 | 64 | 2.92 | 2.77 | 0.16 |
| 3 | 64 | 3.26 | 2.79 | 0.46 |
| 8 | 64 | 3.19 | 2.80 | 0.38 |
| 15 | 64 | 2.98 | 2.83 | 0.15 |
| 16 | 40 | 2.09 | 1.71 | 0.38 |
Observations:
The first batch has higher latency (4.25s) due to cold start / initialization.
Subsequent batches stabilize around 3.0-3.3s.
The last batch of 40 videos is proportionally faster (2.09s).
The NIM latency is very consistent (2.75s ± 0.1s).
Video Ingestion Optimization Opportunities#
Optimize Batch Size
Current: 64 videos/batch
Trade-off: Larger batches = better GPU utilization but longer latency per batch
Smaller batches = more frequent progress updates but potential GPU underutilization
Recommendation: Test batch sizes 16, 32, and 64 to find the optimal balance (see the sweep sketch after this list)
Dedicated GPU for Embedding
Current: Cosmos-embed NIM shares L40 GPU with Milvus
Alternative: Dedicated GPU for embeddings
Benefit: Eliminate GPU contention, potentially 20-30% throughput improvement
Parallel Processing
Current: Single-threaded batch processing
Alternative: Multiple parallel workers with smaller batches; the `cds` CLI implements Ray workers for ingestion
Benefit: Better utilization of multi-core CPU for video decoding and HTTP downloads
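A simple way to run the batch-size comparison recommended above is to invoke the ingestion script once per setting and compare the resulting reports. This sketch assumes the same `hosted_files.csv` and a collection that can be reused (or recreated) between runs.

```python
import subprocess

# Sweep batch sizes and write one per-batch report per setting.
for batch_size in (16, 32, 64):
    subprocess.run(
        [
            "python3", "scripts/evals/video_ingestion_performance/url_ingestion.py",
            "--collection-id", "<collection_id>",   # substitute your collection ID
            "--csv", "hosted_files.csv",
            "--max-videos", "1000",
            "--batch-size", str(batch_size),
            "--base-url", "http://localhost:8888",
            "--nim-base-url", "http://localhost:9000",
            "--measure-embed-delta",
            "--csv-out", f"report_batch{batch_size}.csv",
        ],
        check=True,
    )
```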
3. Text Search Latency Performance#
Overview#
Search latency benchmarks measure the query response time for semantic text-to-video search over large collections. This represents the end-user experience when searching the video dataset.
How to Run Search Latency Benchmarks#
Use the latency_test.py script to measure search performance:
# Run 60-second latency test with 200 diverse queries
python3 scripts/evals/video_ingestion_performance/latency_test.py \
--base-url http://localhost:8888 \
--latency-test \
--verbose \
--collection-id <collection_id> \
--csv-out latency_report.csv \
--nim-base-url http://localhost:9000 \
--query-pool-size 200 \
--duration 60 \
--top-k 20
Parameters:
- `--base-url`: The CDS API endpoint
- `--collection-id`: The target collection ID
- `--nim-base-url`: The Cosmos Embed1 NIM endpoint
- `--query-pool-size`: The number of diverse queries to generate (default: 200)
- `--duration`: The test duration in seconds (default: 60)
- `--top-k`: The number of results to retrieve per query (default: 20)
- `--csv-out`: The output CSV file for detailed results
The benchmark will perform the following steps:
Generate a diverse pool of text queries.
Measure baseline API latency (using health checks).
Run continuous searches for the specified duration.
Measure NIM embedding latency separately.
Report detailed latency statistics and breakdowns.
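After a run, tail-latency percentiles can be computed from the output CSV. The column name below is an assumption; check the header of `latency_report.csv` for the exact field written by `latency_test.py`.

```python
import csv
import statistics

with open("latency_report.csv", newline="") as f:
    latencies = sorted(float(row["total_latency_s"])     # hypothetical column name
                       for row in csv.DictReader(f))

q = statistics.quantiles(latencies, n=100)                # 99 percentile cut points
p50, p95, p99 = q[49], q[94], q[98]
print(f"n={len(latencies)}  min={latencies[0]:.3f}s  p50={p50:.3f}s  "
      f"p95={p95:.3f}s  p99={p99:.3f}s  max={latencies[-1]:.3f}s")
```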
10 Million Embeddings Search Benchmark#
Test Configuration:
Collection Size: 10,000,000 embeddings (256-dim)
Index Type: `GPU_CAGRA`
Query Pool: 200 diverse text queries
Test Duration: 60 seconds
Top-K: 20 results per query
Pattern: Continuous (no delays between queries)
Results:
| Metric | Value |
|---|---|
| Total Searches | 190 queries |
| Test Duration | 60.2 seconds |
| Throughput | 3.16 searches/second |
| Min Latency | 0.147 seconds |
| Avg Latency | 0.317 seconds |
| Max Latency | 0.464 seconds |
Latency Breakdown#
| Component | Avg Latency | Percentage |
|---|---|---|
| Total End-to-End | 0.317 seconds | 100% |
| Milvus Search | 0.305 seconds | 96.2% |
| Cosmos-embed NIM | 0.007 seconds | 2.2% |
| Network/API Overhead | 0.004 seconds | 1.3% |
| CDS API Baseline | 0.004 seconds | 1.3% |
Key Insights:
Milvus GPU search dominates: The `GPU_CAGRA` index search takes 96.2% of total latency (0.305s), which is expected for a 10M vector collection.
Embedding generation is fast: Text-to-embedding conversion via Cosmos-embed NIM is only 0.007s (2.2%), showing excellent NIM performance.
Low API Overhead: The CDS API adds minimal overhead (0.004s or 1.3%).
Consistent performance: Latency variance is low (0.147s-0.464s), with most queries completing in the 0.2-0.4s range.
Sub-second response times: The average end-to-end latency is 317ms, providing a good user experience for semantic search.
Query Performance Distribution#
The following are samples of query latencies from the benchmark:
| Query | Count | Min (s) | Avg (s) | Max (s) |
|---|---|---|---|---|
| waves lapping against a pier | 1 | 0.147 | 0.147 | 0.147 |
| cargo ship entering harbor | 2 | 0.191 | 0.193 | 0.195 |
| penguins waddling on ice | 2 | 0.195 | 0.198 | 0.201 |
| street food vendor cooking | 5 | 0.197 | 0.321 | 0.404 |
| chocolate melting in bowl | 3 | 0.382 | 0.397 | 0.409 |
| loading clothes into washer | 1 | 0.464 | 0.464 | 0.464 |
Observations:
Simple queries (e.g. “waves lapping”) complete faster (~0.15s).
Complex queries (e.g. “loading clothes”) take longer (~0.46s).
Repeated queries show consistent latency.
Most queries complete in the 0.2-0.4s range.
Search Latency Optimization Opportunities#
Increase GPU Memory Allocation
Current: 13.4GB active (29% of 46GB available)
Recommended: Increase to 20-25GB
Benefit: More index data cached on GPU, faster searches
Tune GPU_CAGRA Parameters
Current: Default settings
Optimize: `intermediate_graph_degree`, `graph_degree`, `itopk_size` (see the pymilvus sketch after this list)
Trade-off: Search speed vs. recall accuracy
Query Batching
Current: Single query per request
Alternative: Batch multiple queries
Benefit: Better GPU utilization, higher throughput
Index Optimization
Current: GPU_CAGRA with default build parameters
Tune: Build-time parameters for search performance
Benefit: Faster searches at cost of longer index build time
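The sketch below shows where these parameters plug in with pymilvus, using the build-time and search-time parameter names from the Milvus GPU_CAGRA documentation. The collection name, field name, and IP metric (reasonable for normalized embeddings) are assumptions, not the CDS pipeline's actual settings.

```python
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("benchmark_collection")      # hypothetical collection name

# Build-time parameters: denser graphs build more slowly but can search faster.
# Assumes the collection has no index yet (drop/release an existing index first).
index_params = {
    "index_type": "GPU_CAGRA",
    "metric_type": "IP",
    "params": {"intermediate_graph_degree": 64, "graph_degree": 32},
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()

# Search-time parameter: larger itopk_size improves recall at higher latency.
results = collection.search(
    data=[[0.0] * 256],                              # placeholder query embedding
    anns_field="embedding",
    param={"metric_type": "IP", "params": {"itopk_size": 128}},
    limit=20,
)
```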
4. Configuration Parameters#
Key Milvus Settings#
# GPU Memory Pool (L40 48GB)
gpu:
  initMemSize: 16384      # 16GB initial
  maxMemSize: 30720       # 30GB maximum

# Segment Configuration
dataCoord:
  segment:
    maxSize: 2048         # 2GB segments

# Index Building
indexNode:
  scheduler:
    buildParallel: 1      # Conservative for stability

# Data Import
dataNode:
  import:
    maxConcurrentTaskNum: 16
    maxImportFileSizeInGB: 16
    readBufferSizeInMB: 16
Recommended Production Settings#
For production deployments with real S3:
# Aggressive segment sizing
dataCoord:
  segment:
    maxSize: 4096         # 4GB segments

# Increased parallelism
dataNode:
  import:
    maxConcurrentTaskNum: 32
    readBufferSizeInMB: 64

# Experimental: Pipeline overlap
indexNode:
  scheduler:
    buildParallel: 2      # Test carefully
Scaling Considerations#
GPU Memory Capacity#
The L40 GPU with 48GB of VRAM can accommodate the following (approximately):
With 30GB allocated to Milvus: ~25-30M vectors (256-dim) with a `GPU_CAGRA` index
Theoretical maximum: ~40M vectors before requiring batch processing
Distributed vs. Standalone#
Standalone Mode (Current)#
Simpler configuration
Lower overhead
Suitable for <30M vectors
Single point of failure
Limited by single GPU
Distributed Mode (Multi-node: DataNode and IndexNode)#
Scales horizontally by deploying multiple DataNode and IndexNode processes.
Enables parallel data ingestion and simultaneous GPU index building across nodes.
Achieves higher throughput with multiple GPUs, as DataNode and IndexNode workloads are distributed.
Provides built-in fault tolerance, since node failures don't halt the cluster.
Requires more complex cluster management and monitoring.
Carries higher operational and infrastructure overhead, but supports collections of 50M-100M+ vectors and production-scale workloads.
5. Summary and Recommendations#
Performance Summary#
| Benchmark Type | Key Metric | Value | Notes |
|---|---|---|---|
| Bulk Ingestion | Throughput | 23,845 vectors/s | 10M embeddings in 6.98 min |
| Video Ingestion | Throughput | 18.84 videos/s | 1000 videos in 53 seconds |
| Search Latency | Avg Response | 0.317 seconds | 10M collection, top-20 |
For Development/Testing#
Current configuration is optimal.
Focus on application development.
LocalStack S3 is sufficient.
For Production Deployment#
Immediate Actions
Switch to production S3 (AWS/MinIO cluster).
Increase segment size to 4GB.
Increase concurrent tasks to 32.
Performance Tuning
Monitor GPU utilization during bulk imports.
Adjust `buildParallel` based on memory usage.
Tune read buffer sizes based on S3 latency.
Scaling Strategy
Use standalone mode for <30M vectors.
Consider using distributed mode for >50M vectors.
Plan to use multiple L40 GPUs for >100M vectors.
Last Updated: October 24, 2025
Benchmark Version: 1.0
Configuration: milvus_l40_standalone_optimized.yaml