Vamana | cuVS

Vamana is a graph-building algorithm used by DiskANN. Think of every vector as a point, and think of the graph as a map that connects each point to nearby points. DiskANN uses that map to search very large indexes efficiently, including indexes that are stored on SSD.

NVIDIA cuVS provides a GPU-accelerated Vamana build path. It builds the graph on the GPU, then serializes it in a format compatible with the open-source DiskANN library for CPU search.

Vamana works well when you want to build large DiskANN-compatible graph indexes quickly on the GPU and use them in CPU or SSD-backed DiskANN search workflows.

Example API Usage

C API | C++ API | Python API | Rust API

Vamana currently supports build and serialize operations in NVIDIA cuVS. Search is performed by loading the serialized index with DiskANN. Java and Go do not currently expose standalone Vamana bindings.

Building an index

C

C++

Python

Rust

1 #include <cuvs/core/c_api.h>
2 #include <cuvs/neighbors/vamana.h>
3 
4 cuvsResources_t res;
5 cuvsVamanaIndexParams_t index_params;
6 cuvsVamanaIndex_t index;
7 DLManagedTensor *dataset;
8 
9 // Populate tensor with row-major data.
10 load_dataset(dataset);
11 
12 cuvsResourcesCreate(&res);
13 cuvsVamanaIndexParamsCreate(&index_params);
14 cuvsVamanaIndexCreate(&index);
15 
16 index_params->metric = L2Expanded;
17 index_params->graph_degree = 64;
18 index_params->visited_size = 128;
19 index_params->queue_size = 255;
20 
21 cuvsVamanaBuild(res, index_params, dataset, index);
22 
23 cuvsVamanaIndexDestroy(index);
24 cuvsVamanaIndexParamsDestroy(index_params);
25 cuvsResourcesDestroy(res);

Serializing an index

C

C++

Python

Rust

1 #include <cuvs/core/c_api.h>
2 #include <cuvs/neighbors/vamana.h>
3 
4 cuvsResources_t res;
5 cuvsVamanaIndexParams_t index_params;
6 cuvsVamanaIndex_t index;
7 DLManagedTensor *dataset;
8 
9 cuvsResourcesCreate(&res);
10 cuvsVamanaIndexParamsCreate(&index_params);
11 cuvsVamanaIndexCreate(&index);
12 
13 // ... build index from dataset ...
14 cuvsVamanaBuild(res, index_params, dataset, index);
15 
16 // Writes DiskANN-compatible files using this path prefix.
17 cuvsVamanaSerialize(res, "/tmp/cuvs-vamana/index", index, true);
18 
19 cuvsVamanaIndexDestroy(index);
20 cuvsVamanaIndexParamsDestroy(index_params);
21 cuvsResourcesDestroy(res);

Loading a serialized index

NVIDIA cuVS Vamana writes DiskANN-compatible files but does not currently expose a Vamana deserialization or search API. Load the serialized output with DiskANN or another DiskANN-compatible search layer.

How Vamana works

Vamana builds a directed graph over the dataset. Each vector keeps up to graph_degree outgoing edges to other vectors.

At a high level, the builder:

Starts from a medoid vector, which is a central entry point into the graph.
Inserts vectors in batches.
Uses graph traversal to find candidate neighbors for each inserted vector.
Adds forward and reverse edges.
Prunes edges so each vector keeps a compact set of useful neighbors.

The visited_size and queue_size parameters control how much of the graph each insertion can explore. The alpha parameter controls pruning aggressiveness.

When to use Vamana

Use Vamana when you need a DiskANN-compatible graph and want GPU-accelerated build.

Vamana is a good fit for very large datasets, SSD-backed search workflows, and hybrid workflows where a GPU-built graph is converted to CPU for DiskANN search.

Avoid Vamana when you need an in-NVIDIA cuVS search API today. In that case, use CAGRA for GPU graph search, or use the serialized Vamana output with DiskANN for CPU search.

Interoperability with CPU DiskANN

The Vamana serialize APIs write files in a DiskANN-compatible format. This lets NVIDIA cuVS build the graph quickly on the GPU, then lets DiskANN load the serialized files for CPU search.

Set include_dataset = true when the serialized index should include the dataset. Set it to false when the dataset is already available in the format expected by the downstream DiskANN workflow, or when writing a sector-aligned index with externally prepared quantized data.

The C++ API also supports loading DiskANN-style PQ codebooks before build:

1 #include <cuvs/neighbors/vamana.hpp>
2 
3 using namespace cuvs::neighbors;
4 
5 raft::device_resources res;
6 auto dataset = load_dataset();
7 vamana::index_params index_params;
8 
9 index_params.codebooks =
10     vamana::deserialize_codebooks("/tmp/diskann/pivots", dataset.extent(1));
11 
12 auto index = vamana::build(res, index_params, dataset);
13 vamana::serialize(
14     res,
15     "/tmp/cuvs-vamana/index",
16     index,
17     false,
18     true);

Configuration parameters

Build parameters

Name	Default	Description
`metric`	`L2Expanded` / `sqeuclidean`	Distance metric used during graph construction. Vamana currently supports L2-style metrics.
`metric_arg`	`2.0`	Extra argument for metrics that need one, such as Minkowski distance.
`graph_degree`	`32`	Maximum number of outgoing graph edges kept per vector. This corresponds to the `R` parameter in Vamana literature.
`visited_size`	`64`	Maximum number of visited nodes saved during each graph traversal. This loosely corresponds to the `L` parameter in Vamana literature.
`vamana_iters`	`1.0`	Number of full insertion passes over the dataset. More passes can improve graph quality, but increase build time.
`alpha`	`1.2`	Pruning parameter. Larger values usually prune less aggressively and can improve recall at higher graph cost.
`max_fraction`	`0.06`	Maximum fraction of the dataset inserted in one batch. Larger batches can build faster but may reduce graph quality.
`batch_base`	`2.0`	Growth factor for insertion batch sizes.
`queue_size`	`127`	Candidate queue size used during graph traversal. This should be one less than a power of two and should be larger than `visited_size`.
`reverse_batchsize`	`1000000`	Maximum reverse-edge processing batch size. Lower values can reduce temporary memory during reverse-edge processing.
`codebooks`	None	Optional C++ PQ codebook parameters loaded with `deserialize_codebooks` for DiskANN-compatible quantized output workflows.

Tuning

Tune graph_degree first. Larger values give each vector more outgoing edges, which can improve downstream DiskANN recall but increases graph memory, build time, and serialized index size.

Tune visited_size and queue_size together. Larger values let insertion searches explore more of the graph, which can improve graph quality but increases temporary memory and build cost. Keep queue_size larger than visited_size.

Increase vamana_iters when graph quality matters more than build time. Each extra iteration reinserts all vectors, so build time can increase substantially.

Use max_fraction, batch_base, and reverse_batchsize to manage build speed and temporary memory. Larger insertion batches are faster but can reduce graph quality. Smaller reverse-edge batches reduce peak memory at the cost of more processing chunks.

Memory footprint

Vamana memory has two main parts: the dataset and the graph. During build, the dataset must be available to the GPU. The final graph stores up to graph_degree neighbor IDs for every vector.

To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.

N: Number of database vectors, or rows in the dataset being indexed.
D: Vector dimension, or number of values in each vector.
B: Bytes stored for each vector value. Use 4 for fp32, 1 for int8 and uint8, or the byte width of the dataset representation.
G: Final graph degree. This is the graph_degree build parameter.
V: Number of visited nodes kept during traversal. This is the visited_size build parameter.
Q_c: Candidate queue size. This is the queue_size build parameter.
F: Maximum insertion batch fraction. This is the max_fraction build parameter.
R_b: Reverse-edge processing batch size. This is the reverse_batchsize build parameter.
S_idx: Bytes per graph neighbor ID. Vamana uses 32-bit graph IDs, so this is usually 4.

The named terms in the formulas are also memory sizes:

dataset_size: Device memory used by vectors during build.
graph_size: Device memory used by final graph neighbor IDs.
build_batch_size: Maximum number of vectors inserted in one batch.
reverse_batch_size: Maximum number of vectors processed in one reverse-edge batch.
insertion_scratch_size: Temporary memory used while inserting one batch.
reverse_scratch_size: Temporary memory used while processing reverse edges.
build_peak: Approximate peak device memory during build.
host_serialize_size: Host memory used while serializing.
metadata_size: Small fixed index metadata. It is usually negligible compared with the dataset and graph, but include a measured value if exact accounting is required.

Scratch and maximum vectors

The formulas below include the major insertion and reverse-edge scratch buffers. Additional scratch comes from allocator padding, graph update temporaries, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.25 for build estimates. If you can measure a representative smaller run, use:

H_{\text{measured}} = \frac{\text{observed\_peak} - \text{formula\_without\_scratch}} {\text{formula\_without\_scratch}}

Then set:

M_{\text{usable}} = (M_{\text{free}} - M_{\text{other}}) \cdot (1 - H)

The capacity variables in this subsection are:

M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.
M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.
H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.
M_usable: Memory budget left for the formula after subtracting M_other and reserving headroom.
observed_peak: Peak memory observed during a smaller representative run.
formula_without_scratch: Value of the selected peak formula with explicit scratch terms removed and without applying headroom.
peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usually N for rows or vectors and B for K-selection batch rows.
B_per_row / B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such as D, K, Q, and L are substituted.
B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.
N_max / B_max: Estimated largest row, vector, or batch-row count that fits in M_usable.

For fixed D, G, V, Q_c, F, and R_b, the build peak is mostly linear in N until reverse_batch_size reaches R_b. Solve the full max(...) expression or use the linear shortcut in the active region:

N_{\max} = \left\lfloor \frac{M_{\text{usable}} - B_{\text{fixed}}} {B_{\text{per\_vector}}} \right\rfloor

If reverse_batch_size = min(N, R_b) is capped, treat the reverse scratch term as fixed after N >= R_b.

Baseline memory after build

The baseline device footprint after index construction is:

\begin{aligned} \text{dataset\_size} &= N \times D \times B \end{aligned}

\begin{aligned} \text{graph\_size} &= N \times G \times S_{\text{idx}} \end{aligned}

\begin{aligned} \text{index\_size} &\approx \text{dataset\_size} \\ &\quad + \text{graph\_size} \\ &\quad + \text{metadata\_size} \end{aligned}

Example (N = 1e6, D = 128, fp32, graph_degree = 64, IdxT = uint32):

dataset_size = 512,000,000 B = 488.28 MiB
graph_size = 256,000,000 B = 244.14 MiB
index_size ~= 732.42 MiB + small metadata

Build peak memory usage

Vamana build inserts vectors in batches and also creates reverse edges. These temporary buffers are affected by max_fraction, queue_size, visited_size, graph_degree, and reverse_batchsize.

The maximum insertion batch size is:

\begin{aligned} \text{build\_batch\_size} &= \left\lceil F \times N \right\rceil \end{aligned}

The reverse-edge processing batch size is:

\begin{aligned} \text{reverse\_batch\_size} &= \min(N, R_b) \end{aligned}

A practical insertion scratch estimate is:

\begin{aligned} \text{insertion\_scratch\_size} &\approx \text{build\_batch\_size} \times (Q_c + V + G) \times S_{\text{idx}} \end{aligned}

A practical reverse-edge scratch estimate is:

\begin{aligned} \text{reverse\_scratch\_size} &\approx \text{reverse\_batch\_size} \times (2 + G) \times S_{\text{idx}} \end{aligned}

The approximate build peak is:

\begin{aligned} \text{build\_peak} &\approx \text{dataset\_size} \\ &\quad + \text{graph\_size} \\ &\quad + \max\!\big( \text{insertion\_scratch\_size}, \\ &\qquad\qquad \text{reverse\_scratch\_size} \big) \end{aligned}

Example (N = 1e6, D = 128, fp32, G = 64, V = 128, Q_c = 255, F = 0.06, R_b = 1e6, S_idx = 4):

build_batch_size = 60,000
insertion_scratch_size = 107,280,000 B = 102.31 MiB
reverse_scratch_size = 264,000,000 B = 251.77 MiB
build_peak ~= 984.19 MiB

Lower reverse_batchsize if reverse-edge processing dominates peak memory. Lower max_fraction, visited_size, or queue_size if insertion scratch dominates.

Serialization memory usage

Serialization writes a DiskANN-compatible index. The graph must be available on the host while writing. If include_dataset = true, the dataset is also included in the serialized output.

\begin{aligned} \text{host\_graph\_size} &= N \times G \times S_{\text{idx}} \end{aligned}

\begin{aligned} \text{host\_dataset\_size} &= \begin{cases} N \times D \times B, & \text{include dataset} \\ 0, & \text{do not include dataset} \end{cases} \end{aligned}

\begin{aligned} \text{host\_serialize\_size} &\approx \text{host\_graph\_size} \\ &\quad + \text{host\_dataset\_size} \end{aligned}

The serialized files can be smaller than the fixed-degree in-memory graph because DiskANN-style output can store a compact graph representation.