Benchmark Datasets
cuVS Bench datasets provide the vectors to index, the queries to search, and the exact nearest neighbors used to measure recall. This page explains the expected file layout, supported binary formats, built-in dataset helpers, ground-truth generation, and the YAML descriptors used for custom datasets.
Dataset files
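Most datasets contain four binary files: the base vectors to index, the query vectors, the ground-truth neighbor indices, and the ground-truth distances.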
The vector files are used for build and search. Ground-truth files are tied to a distance metric and are used only for evaluation.
Binary format
Dataset suffixes describe the stored type: .fbin for float32, .f16bin for float16, .ibin for int32, .u8bin for uint8, and .i8bin for int8.
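All binary files are little-endian. The first 8 bytes form a header of two uint32_t values: num_vectors, then num_dimensions. The remaining bytes store num_vectors * num_dimensions values of the element type indicated by the suffix, in row-major order, so each vector occupies a contiguous run of num_dimensions values.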
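The layout described above is simple enough to read and write directly. The following is a minimal sketch using NumPy, not part of cuVS Bench: the helper names write_bin and read_bin are hypothetical, and the caller is assumed to pass the element type that matches the file suffix.

```python
import numpy as np


def write_bin(path, vectors):
    """Write a 2-D array as: uint32 num_vectors, uint32 num_dimensions, row-major data."""
    vectors = np.ascontiguousarray(vectors)
    if vectors.ndim != 2:
        raise ValueError("expected a 2-D array of shape (num_vectors, num_dimensions)")
    with open(path, "wb") as f:
        # 8-byte header: two little-endian uint32 values.
        np.array(vectors.shape, dtype="<u4").tofile(f)
        # Payload: force little-endian byte order, keep the element type.
        vectors.astype(vectors.dtype.newbyteorder("<"), copy=False).tofile(f)


def read_bin(path, dtype):
    """Read a file written in the layout above; dtype must match the file suffix."""
    dtype = np.dtype(dtype).newbyteorder("<")
    with open(path, "rb") as f:
        header = np.fromfile(f, dtype="<u4", count=2)
        if header.size != 2:
            raise ValueError("file is too short to contain the 8-byte header")
        num_vectors, num_dimensions = int(header[0]), int(header[1])
        data = np.fromfile(f, dtype=dtype, count=num_vectors * num_dimensions)
    if data.size != num_vectors * num_dimensions:
        raise ValueError("file is shorter than its header implies")
    return data.reshape(num_vectors, num_dimensions)
```

A float32 file with 3 vectors of 4 dimensions occupies 8 + 3 * 4 * 4 = 56 bytes, which is a quick sanity check on any file produced this way.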
Some implementations can use float16 vectors for better performance. Convert float32 files with:
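As a rough illustration of what such a conversion involves (this is not the cuVS Bench converter), the sketch below rewrites a float32 file as float16 with NumPy while keeping the 8-byte header unchanged. The function name fbin_to_f16bin and the chunk_rows parameter are hypothetical.

```python
import numpy as np


def fbin_to_f16bin(src_path, dst_path, chunk_rows=100_000):
    """Copy a float32 vector file to float16, converting chunk_rows vectors at a time."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        header = np.fromfile(src, dtype="<u4", count=2)
        if header.size != 2:
            raise ValueError("source file is too short to contain the header")
        num_vectors, num_dimensions = int(header[0]), int(header[1])
        header.tofile(dst)  # the header is identical in both files
        remaining = num_vectors
        while remaining > 0:
            rows = min(chunk_rows, remaining)
            chunk = np.fromfile(src, dtype="<f4", count=rows * num_dimensions)
            if chunk.size != rows * num_dimensions:
                raise ValueError("source file is shorter than its header implies")
            chunk.astype("<f2").tofile(dst)  # narrow each value to float16
            remaining -= rows
    return num_vectors, num_dimensions
```

Processing in chunks keeps memory use bounded for large base files. Note that float16 has a much smaller range and precision than float32, so values beyond about 65504 in magnitude become infinity after conversion.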
Built-in datasets
Use cuvs_bench.get_dataset to download and prepare common benchmark datasets. Files are stored under RAPIDS_DATASET_ROOT_DIR when set, or under a local datasets directory otherwise.
Common built-in datasets include:
These datasets include ground truth for 100 neighbors, so benchmark k must be 100 or smaller.
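The limit on k follows from how recall is measured against the stored neighbors: only the first k ground-truth columns can be compared, and there are 100 of them. The sketch below shows one common definition of recall@k with NumPy. It is an illustration under that assumption, not the evaluation code used by cuVS Bench, and recall_at_k is a hypothetical name.

```python
import numpy as np


def recall_at_k(found, groundtruth, k):
    """Fraction of the true top-k neighbors that appear in the returned top-k."""
    found = np.asarray(found)
    groundtruth = np.asarray(groundtruth)
    if k > groundtruth.shape[1]:
        raise ValueError("k exceeds the number of ground-truth neighbors per query")
    if k > found.shape[1]:
        raise ValueError("k exceeds the number of returned neighbors per query")
    hits = 0
    for found_row, true_row in zip(found[:, :k], groundtruth[:, :k]):
        # Order within the top-k does not matter, only membership.
        hits += np.intersect1d(found_row, true_row).size
    return hits / (found.shape[0] * k)
```

With ground truth stored for 100 neighbors, any k from 1 to 100 can be evaluated by slicing the first k columns; a larger k has nothing to compare against.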
Dataset sources
Million-scale datasets are available from ann-benchmarks. Convert the HDF5 files to cuVS Bench binaries with:
Use -n to normalize base and query vectors. This is useful when benchmarking angular-distance datasets with inner-product search, because the inner product of unit-length vectors equals their cosine similarity.
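To see why normalization makes inner-product search suitable for angular datasets, the following NumPy sketch scales each vector to unit length. It illustrates the idea only and is not the converter's implementation; l2_normalize and its eps parameter are hypothetical.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; all-zero rows are left as zeros."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero rows.
    return vectors / np.maximum(norms, eps)


base = l2_normalize([[3.0, 4.0], [1.0, 0.0]])
queries = l2_normalize([[0.0, 2.0]])

# After normalization, the inner product is the cosine of the angle
# between the original vectors.
similarity = base @ queries.T
```

Here the first base vector (3, 4) and the query (0, 2) have cosine similarity 8 / (5 * 2) = 0.8, which is exactly what the inner product of the normalized vectors yields.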
Billion-scale datasets are available from big-ann-benchmarks. Split their combined ground-truth files before benchmarking:
This produces groundtruth.neighbors.ibin and groundtruth.distances.fbin.
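The split can be pictured with a short NumPy sketch. It assumes the combined file holds a header of two uint32 values (number of queries, then k), followed by all neighbor IDs as uint32 and then all distances as float32; verify that layout against the dataset source before relying on it. The function name split_groundtruth is hypothetical, and this is not the cuVS Bench splitting tool.

```python
import numpy as np


def split_groundtruth(combined_path, neighbors_path, distances_path):
    """Split a combined ground-truth file into separate neighbor and distance files."""
    with open(combined_path, "rb") as f:
        header = np.fromfile(f, dtype="<u4", count=2)
        if header.size != 2:
            raise ValueError("combined file is too short to contain the header")
        num_queries, k = int(header[0]), int(header[1])
        count = num_queries * k
        # Assumed layout: all neighbor IDs first, then all distances.
        neighbors = np.fromfile(f, dtype="<u4", count=count)
        distances = np.fromfile(f, dtype="<f4", count=count)
    if neighbors.size != count or distances.size != count:
        raise ValueError("combined file is shorter than its header implies")
    if count and int(neighbors.max()) > np.iinfo(np.int32).max:
        raise ValueError("neighbor IDs do not fit in int32")
    with open(neighbors_path, "wb") as f:
        header.tofile(f)
        neighbors.astype("<i4").tofile(f)  # .ibin stores int32 values
    with open(distances_path, "wb") as f:
        header.tofile(f)
        distances.tofile(f)  # .fbin stores float32 values
    return num_queries, k
```

Both output files reuse the same 8-byte header, so each can be read on its own with the standard binary layout.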
The wiki-all dataset contains 88M 768-dimensional vectors, plus 1M and 10M subsets, for realistic RAG/LLM-scale benchmarking. See the Wiki-all Dataset Guide to download it.
Generate ground truth
If a dataset does not include ground truth, generate it with cuvs_bench.generate_groundtruth:
For billion-scale sources that provide ground truth for only the first 10M or 100M base vectors, use subset_size in the dataset configuration so the benchmark uses the matching prefix of the base file.
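Conceptually, exact ground truth is a brute-force nearest-neighbor search over the base vectors, and a subset simply restricts that search to a prefix. The NumPy sketch below illustrates this for Euclidean distance on small inputs. It is not how cuvs_bench.generate_groundtruth is implemented; exact_neighbors is a hypothetical name, and reporting squared distances is an assumption made here for simplicity.

```python
import numpy as np


def exact_neighbors(base, queries, k, subset_size=None):
    """Brute-force k nearest neighbors under squared Euclidean distance."""
    base = np.asarray(base, dtype=np.float32)
    queries = np.asarray(queries, dtype=np.float32)
    if subset_size is not None:
        base = base[:subset_size]  # use only the leading prefix of the base file
    if k > base.shape[0]:
        raise ValueError("k exceeds the number of base vectors")
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed for all pairs at once.
    q_sq = (queries ** 2).sum(axis=1)
    b_sq = (base ** 2).sum(axis=1)
    d2 = q_sq[:, None] - 2.0 * (queries @ base.T) + b_sq[None, :]
    np.maximum(d2, 0.0, out=d2)  # clamp small negatives caused by rounding
    order = np.argsort(d2, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(d2, order, axis=1)
    return order.astype(np.int32), distances


base = [[0, 0], [1, 0], [0, 3], [5, 5]]
queries = [[4, 4]]
full_ids, full_dist = exact_neighbors(base, queries, k=2)
prefix_ids, prefix_dist = exact_neighbors(base, queries, k=2, subset_size=3)
```

In this toy example the nearest neighbor over the full base is vector 3, which lies outside the 3-vector prefix, so the prefix yields a different neighbor list. Ground truth computed over one range of base vectors is therefore not valid for another. The full distance matrix is held in memory, so this approach only suits small datasets.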
Dataset configurations
Each benchmark dataset needs a YAML descriptor with file names and basic properties. Common descriptors are available in datasets.yaml.
The default ${CUVS_HOME}/python/cuvs_bench/cuvs_bench/config/datasets/datasets.yaml includes entries like:
For a new dataset, create a descriptor such as mydataset.yaml:
Choose any name for the dataset and pass it as --dataset. File paths are relative to --dataset-path. The optional subset_size restricts the benchmark to the first subset_size vectors of the base file, which lets you benchmark subsets without duplicating base files. Generate separate ground truth for each subset, because the exact neighbors depend on which base vectors are included.
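When several subsets share one base file, it can help to generate their descriptors rather than write each by hand. The sketch below renders a single descriptor entry as YAML text using only the standard library. The key names (name, base_file, subset_size, dims, query_file, groundtruth_neighbors_file, distance) are assumptions modeled on typical entries, and the file names are placeholders; check both against the shipped datasets.yaml. The function render_descriptor is hypothetical.

```python
def render_descriptor(name, base_file, query_file, groundtruth_neighbors_file,
                      dims, distance, subset_size=None):
    """Render one dataset entry as YAML text (key names are assumed, not confirmed)."""
    fields = [("name", name), ("base_file", base_file)]
    if subset_size is not None:
        fields.append(("subset_size", subset_size))
    fields += [
        ("dims", dims),
        ("query_file", query_file),
        ("groundtruth_neighbors_file", groundtruth_neighbors_file),
        ("distance", distance),
    ]
    lines = []
    for index, (key, value) in enumerate(fields):
        # The first key opens the list item; the rest are indented under it.
        prefix = "- " if index == 0 else "  "
        lines.append(f"{prefix}{key}: {value}")
    return "\n".join(lines) + "\n"


# Placeholder names: one shared base file, with ground truth specific to the 1M subset.
descriptor = render_descriptor(
    name="mydataset-1M",
    base_file="mydataset/base.fbin",
    query_file="mydataset/query.fbin",
    groundtruth_neighbors_file="mydataset-1M/groundtruth.neighbors.ibin",
    dims=128,
    distance="euclidean",
    subset_size=1_000_000,
)
```

Each subset gets its own name and its own ground-truth file while pointing at the same base file, which matches the intent of subset_size.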
Run the custom dataset with:
Summary
cuVS Bench expects a small set of binary vector and ground-truth files plus a YAML descriptor that tells the benchmark runner where those files live. Built-in helpers can download common datasets, convert HDF5 sources, split ground truth, and generate exact neighbors when needed. For custom datasets, prepare the files, write a descriptor, and pass it with --dataset-configuration.