Vamana
Vamana is a graph-building algorithm used by DiskANN. Think of every vector as a point, and think of the graph as a map that connects each point to nearby points. DiskANN uses that map to search very large indexes efficiently, including indexes that are stored on SSD.
cuVS provides a GPU-accelerated Vamana build path. It builds the graph on the GPU, then serializes it in a format compatible with the open-source DiskANN library for CPU search.
Vamana works well when you want to build large DiskANN-compatible graph indexes quickly on the GPU and use them in CPU or SSD-backed DiskANN search workflows.
Example API Usage
C API | C++ API | Python API | Rust API
Vamana currently supports build and serialize operations in cuVS. Search is performed by loading the serialized index with DiskANN. Java and Go do not currently expose standalone Vamana bindings.
Building an index
C
C++
Python
Rust
Serializing an index
C
C++
Python
Rust
Loading a serialized index
cuVS Vamana writes DiskANN-compatible files but does not currently expose a Vamana deserialization or search API. Load the serialized output with DiskANN or another DiskANN-compatible search layer.
How Vamana works
Vamana builds a directed graph over the dataset. Each vector keeps up to graph_degree outgoing edges to other vectors.
At a high level, the builder:
- Starts from a medoid vector, which is a central entry point into the graph.
- Inserts vectors in batches.
- Uses graph traversal to find candidate neighbors for each inserted vector.
- Adds forward and reverse edges.
- Prunes edges so each vector keeps a compact set of useful neighbors.
The visited_size and queue_size parameters control how much of the graph each insertion can explore. The alpha parameter controls pruning aggressiveness.
When to use Vamana
Use Vamana when you need a DiskANN-compatible graph and want GPU-accelerated build.
Vamana is a good fit for very large datasets, SSD-backed search workflows, and hybrid workflows where a GPU-built graph is converted to CPU for DiskANN search.
Avoid Vamana when you need an in-cuVS search API today. In that case, use CAGRA for GPU graph search, or use the serialized Vamana output with DiskANN for CPU search.
Interoperability with CPU DiskANN
The Vamana serialize APIs write files in a DiskANN-compatible format. This lets cuVS build the graph quickly on the GPU, then lets DiskANN load the serialized files for CPU search.
Set include_dataset = true when the serialized index should include the dataset. Set it to false when the dataset is already available in the format expected by the downstream DiskANN workflow, or when writing a sector-aligned index with externally prepared quantized data.
The C++ API also supports loading DiskANN-style PQ codebooks before build:
Configuration parameters
Build parameters
Tuning
Tune graph_degree first. Larger values give each vector more outgoing edges, which can improve downstream DiskANN recall but increases graph memory, build time, and serialized index size.
Tune visited_size and queue_size together. Larger values let insertion searches explore more of the graph, which can improve graph quality but increases temporary memory and build cost. Keep queue_size larger than visited_size.
Increase vamana_iters when graph quality matters more than build time. Each extra iteration reinserts all vectors, so build time can increase substantially.
Use max_fraction, batch_base, and reverse_batchsize to manage build speed and temporary memory. Larger insertion batches are faster but can reduce graph quality. Smaller reverse-edge batches reduce peak memory at the cost of more processing chunks.
Memory footprint
Vamana memory has two main parts: the dataset and the graph. During build, the dataset must be available to the GPU. The final graph stores up to graph_degree neighbor IDs for every vector.
To keep the formulas readable, this section uses short symbols. All estimates are in bytes. The examples convert bytes to MiB by dividing by 1024 * 1024.
N: Number of database vectors, or rows in the dataset being indexed.D: Vector dimension, or number of values in each vector.B: Bytes stored for each vector value. Use4for fp32,1for int8 and uint8, or the byte width of the dataset representation.G: Final graph degree. This is thegraph_degreebuild parameter.V: Number of visited nodes kept during traversal. This is thevisited_sizebuild parameter.Q_c: Candidate queue size. This is thequeue_sizebuild parameter.F: Maximum insertion batch fraction. This is themax_fractionbuild parameter.R_b: Reverse-edge processing batch size. This is thereverse_batchsizebuild parameter.S_idx: Bytes per graph neighbor ID. Vamana uses 32-bit graph IDs, so this is usually4.
The named terms in the formulas are also memory sizes:
dataset_size: Device memory used by vectors during build.graph_size: Device memory used by final graph neighbor IDs.build_batch_size: Maximum number of vectors inserted in one batch.reverse_batch_size: Maximum number of vectors processed in one reverse-edge batch.insertion_scratch_size: Temporary memory used while inserting one batch.reverse_scratch_size: Temporary memory used while processing reverse edges.build_peak: Approximate peak device memory during build.host_serialize_size: Host memory used while serializing.metadata_size: Small fixed index metadata. It is usually negligible compared with the dataset and graph, but include a measured value if exact accounting is required.
Scratch and maximum vectors
The formulas below include the major insertion and reverse-edge scratch buffers. Additional scratch comes from allocator padding, graph update temporaries, CUDA library workspaces, and memory held by the active memory resource. Use H = 0.25 for build estimates. If you can measure a representative smaller run, use:
Then set:
The capacity variables in this subsection are:
M_free: Free memory in the relevant memory space before the operation starts. Use device memory for GPU-resident formulas and host memory for formulas explicitly marked as host memory.M_other: Memory reserved for arrays, memory pools, concurrent work, or application buffers that are not included in the formula.H: Scratch headroom fraction reserved for temporary buffers and allocator overhead.M_usable: Memory budget left for the formula after subtractingM_otherand reserving headroom.observed_peak: Peak memory observed during a smaller representative run.formula_without_scratch: Value of the selected peak formula with explicitscratchterms removed and without applying headroom.peak_without_scratch(count): The selected peak formula rewritten as a function of the count being estimated, excluding scratch and headroom. The count is usuallyNfor rows or vectors andBfor K-selection batch rows.B_per_row/B_per_vector: Bytes added by one more row or vector in the selected formula. For linear formulas, add the coefficients of the count being estimated after fixed values such asD,K,Q, andLare substituted.B_fixed: Bytes in the selected formula that do not change with the estimated count, such as codebooks, centroids, fixed query batches, capped training buffers, or metadata.N_max/B_max: Estimated largest row, vector, or batch-row count that fits inM_usable.
For fixed D, G, V, Q_c, F, and R_b, the build peak is mostly linear in N until reverse_batch_size reaches R_b. Solve the full max(...) expression or use the linear shortcut in the active region:
If reverse_batch_size = min(N, R_b) is capped, treat the reverse scratch term as fixed after N >= R_b.
Baseline memory after build
The baseline device footprint after index construction is:
Example (N = 1e6, D = 128, fp32, graph_degree = 64, IdxT = uint32):
dataset_size = 512,000,000 B = 488.28 MiBgraph_size = 256,000,000 B = 244.14 MiBindex_size ~= 732.42 MiB + small metadata
Build peak memory usage
Vamana build inserts vectors in batches and also creates reverse edges. These temporary buffers are affected by max_fraction, queue_size, visited_size, graph_degree, and reverse_batchsize.
The maximum insertion batch size is:
The reverse-edge processing batch size is:
A practical insertion scratch estimate is:
A practical reverse-edge scratch estimate is:
The approximate build peak is:
Example (N = 1e6, D = 128, fp32, G = 64, V = 128, Q_c = 255, F = 0.06, R_b = 1e6, S_idx = 4):
build_batch_size = 60,000insertion_scratch_size = 107,280,000 B = 102.31 MiBreverse_scratch_size = 264,000,000 B = 251.77 MiBbuild_peak ~= 984.19 MiB
Lower reverse_batchsize if reverse-edge processing dominates peak memory. Lower max_fraction, visited_size, or queue_size if insertion scratch dominates.
Serialization memory usage
Serialization writes a DiskANN-compatible index. The graph must be available on the host while writing. If include_dataset = true, the dataset is also included in the serialized output.
The serialized files can be smaller than the fixed-degree in-memory graph because DiskANN-style output can store a compact graph representation.