HNSW

Python module: cuvs.neighbors.hnsw

AceParams

cdef class AceParams

Parameters for ACE (Augmented Core Extraction) graph build for HNSW.

ACE enables building HNSW indices for datasets too large to fit in GPU memory by partitioning the dataset and building sub-indices for each partition independently.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| npartitions | int, default = 0 (optional) | Number of partitions for the ACE partitioned build. When set to 0 (default), the number of partitions is derived automatically from the available host and GPU memory to maximize partition size while ensuring the build fits in memory. Small values might improve recall but potentially degrade performance and increase memory usage. Partitions should not be too small, to prevent issues in KNN graph construction. The average partition size is 2 * (n_rows / npartitions) * dim * sizeof(T); the factor of 2 accounts for the core and augmented vectors. Account for imbalance in the partition sizes (up to 3x in our tests). If the specified number of partitions results in partitions that exceed available memory, the value is automatically increased to fit memory constraints and a warning is issued. |
| build_dir | string, default = "/tmp/hnsw_ace_build" (optional) | Directory to store ACE build artifacts (KNN graph, optimized graph). Used when use_disk is true or when the graph does not fit in memory. |
| use_disk | bool, default = False (optional) | Whether to use disk-based storage for the ACE build. When true, enables disk-based operations for memory-efficient graph construction. |
| max_host_memory_gb | float, default = 0 (optional) | Maximum host memory to use for the ACE build, in GiB. When set to 0 (default), uses available host memory. Useful for testing or when running alongside other memory-intensive processes. |
| max_gpu_memory_gb | float, default = 0 (optional) | Maximum GPU memory to use for the ACE build, in GiB. When set to 0 (default), uses available GPU memory. Useful for testing or when running alongside other memory-intensive processes. |
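
The sizing rule for npartitions can be turned into a quick back-of-the-envelope check. The sketch below is illustrative only: the helper names `avg_partition_bytes` and `min_partitions_for_budget` are hypothetical and not part of cuVS, and treating the 3x imbalance as a worst case is an assumption. It does not reproduce the library's own partition-selection logic.

```python
import math

# Bytes per element (sizeof(T)) for the dtypes that build() accepts.
ITEMSIZE = {"float32": 4, "float16": 2, "int8": 1, "uint8": 1}


def avg_partition_bytes(n_rows, dim, dtype, npartitions):
    """Average partition size: 2 * (n_rows / npartitions) * dim * sizeof(T).

    The factor 2 accounts for the core and augmented vectors.
    """
    return 2 * (n_rows / npartitions) * dim * ITEMSIZE[dtype]


def min_partitions_for_budget(n_rows, dim, dtype, budget_bytes, imbalance=3.0):
    """Smallest npartitions whose largest partition fits in budget_bytes.

    Hypothetical helper: assumes the largest partition is `imbalance` times
    the average size, which is a pessimistic reading of the 3x note.
    """
    total = 2 * n_rows * dim * ITEMSIZE[dtype] * imbalance
    return max(1, math.ceil(total / budget_bytes))


# 100 million float32 vectors of dimension 128, split into 16 partitions.
size = avg_partition_bytes(100_000_000, 128, "float32", 16)
print(f"{size / 2**30:.2f} GiB per partition on average")

# Partitions needed so that even a 3x oversized partition fits in 16 GiB.
print(min_partitions_for_budget(100_000_000, 128, "float32", 16 * 2**30))
```

An estimate like this is only a starting point; leaving npartitions at 0 lets the library derive the value from the memory it actually detects.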

Constructor

def __init__(self, *, npartitions=0, build_dir="/tmp/hnsw_ace_build", use_disk=False, max_host_memory_gb=0, max_gpu_memory_gb=0)

Members

| Name | Kind |
| --- | --- |
| npartitions | property |
| build_dir | property |
| use_disk | property |
| max_host_memory_gb | property |
| max_gpu_memory_gb | property |

npartitions

def npartitions(self)

build_dir

def build_dir(self)

use_disk

def use_disk(self)

max_host_memory_gb

def max_host_memory_gb(self)

max_gpu_memory_gb

def max_gpu_memory_gb(self)

IndexParams

cdef class IndexParams

Parameters to build index for HNSW nearest neighbor search

Parameters

| Name | Type | Description |
| --- | --- | --- |
| hierarchy | string, default = "gpu" (optional) | The hierarchy of the HNSW index. Valid values are ["none", "cpu", "gpu"]. "none": no hierarchy is built. "cpu": the hierarchy is built using the CPU. "gpu": the hierarchy is built using the GPU. |
| ef_construction | int, default = 200 (optional) | Maximum size of the candidate list used during construction when hierarchy is cpu. |
| num_threads | int, default = 0 (optional) | Number of CPU threads used to increase construction parallelism when hierarchy is cpu or gpu. When the value is 0, the number of threads is automatically set to the maximum number of threads available. NOTE: When hierarchy is gpu, the majority of the work is done on the GPU, but initialization of the HNSW index itself and some other work are parallelized with CPU threads. |
| M | int, default = 32 (optional) | HNSW M parameter: number of bi-directional links per node (used when building with ACE). graph_degree = m * 2, intermediate_graph_degree = m * 3. |
| metric | string, default = "sqeuclidean" (optional) | Distance metric to use. Valid values: ["sqeuclidean", "inner_product"]. |
| ace_params | AceParams, default = None (optional) | ACE parameters for building the HNSW index with the ACE algorithm. If set, enables the build() function to use ACE for index construction. |
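
The relationship between M and the two graph degrees stated for the M parameter can be written out directly. `degrees_from_m` is a hypothetical helper used only for illustration; it is not a cuVS function.

```python
def degrees_from_m(m):
    """Return (graph_degree, intermediate_graph_degree) for a given HNSW M."""
    graph_degree = m * 2
    intermediate_graph_degree = m * 3
    return graph_degree, intermediate_graph_degree


# With the default M = 32:
graph_degree, intermediate_graph_degree = degrees_from_m(32)
print(graph_degree, intermediate_graph_degree)  # 64 96
```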

Constructor

def __init__(self, *, hierarchy="gpu", ef_construction=200, num_threads=0, M=32, metric="sqeuclidean", ace_params=None)

Members

| Name | Kind |
| --- | --- |
| hierarchy | property |
| ef_construction | property |
| num_threads | property |
| m | property |
| ace_params | property |

hierarchy

def hierarchy(self)

ef_construction

def ef_construction(self)

num_threads

def num_threads(self)

m

def m(self)

ace_params

def ace_params(self)

Index

cdef class Index

HNSW index object. This object stores the trained HNSW index state which can be used to perform nearest neighbors searches.

Members

| Name | Kind |
| --- | --- |
| trained | property |

trained

def trained(self)

ExtendParams

cdef class ExtendParams

Parameters to extend the HNSW index with new data

Parameters

| Name | Type | Description |
| --- | --- | --- |
| num_threads | int, default = 0 (optional) | Number of CPU threads used to increase construction parallelism. When set to 0, the number of threads is automatically determined. |

Constructor

def __init__(self, *, num_threads=0)

Members

| Name | Kind |
| --- | --- |
| num_threads | property |

num_threads

def num_threads(self)

build

@auto_sync_resources

def build(IndexParams index_params, dataset, resources=None)

Build an HNSW index using the ACE (Augmented Core Extraction) algorithm.

ACE enables building HNSW indices for datasets too large to fit in GPU memory by partitioning the dataset and building sub-indices for each partition independently.

NOTE: This function requires index_params.ace_params to be set with an instance of AceParams.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| index_params | IndexParams | Parameters for the HNSW index with ACE configuration. Must have ace_params set. |
| dataset | Host array interface compliant matrix shape (n_samples, dim) | Supported dtype [float32, float16, int8, uint8] |
| resources | cuvs.common.Resources, optional | |

Returns

| Name | Type | Description |
| --- | --- | --- |
| index | Index | Trained HNSW index ready for search. |

Examples

>>> import numpy as np
>>> from cuvs.neighbors import hnsw
>>>
>>> n_samples = 50000
>>> n_features = 50
>>> # np.random.random_sample() has no dtype argument; cast afterwards
>>> dataset = np.random.random_sample(
...     (n_samples, n_features)).astype(np.float32)
>>>
>>> # Create ACE parameters
>>> ace_params = hnsw.AceParams(
...     npartitions=4,
...     use_disk=True,
...     build_dir="/tmp/hnsw_ace_build"
... )
>>>
>>> # Create index parameters with ACE
>>> index_params = hnsw.IndexParams(
...     hierarchy="gpu",
...     ace_params=ace_params,
...     ef_construction=120,
...     M=32,
...     metric="sqeuclidean"
... )
>>>
>>> # Build the index
>>> index = hnsw.build(index_params, dataset)
>>>
>>> # Search the index with host (CPU) float32 queries
>>> queries = np.random.random_sample(
...     (10, n_features)).astype(np.float32)
>>> distances, neighbors = hnsw.search(
...     hnsw.SearchParams(ef=200),
...     index,
...     queries,
...     k=10
... )

extend

@auto_sync_resources

def extend(ExtendParams extend_params, Index index, data, resources=None)

Extends the HNSW index with new data.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| extend_params | ExtendParams | |
| index | Index | Trained HNSW index. |
| data | Host array interface compliant matrix shape (n_samples, dim) | Supported dtype [float32, float16, int8, uint8] |
| resources | cuvs.common.Resources, optional | |

Examples

>>> import numpy as np
>>> from cuvs.neighbors import hnsw, cagra
>>>
>>> n_samples = 50000
>>> n_features = 50
>>> # Cast to float32: random_sample() returns float64, which is unsupported
>>> dataset = np.random.random_sample(
...     (n_samples, n_features)).astype(np.float32)
>>>
>>> # Build a CAGRA index (uses cagra.IndexParams, not hnsw.IndexParams)
>>> index = cagra.build(cagra.IndexParams(), dataset)
>>> # Convert it to a mutable HNSW index
>>> hnsw_index = hnsw.from_cagra(hnsw.IndexParams(hierarchy="cpu"), index)
>>> # Extend the index with new data
>>> new_data = np.random.random_sample(
...     (n_samples, n_features)).astype(np.float32)
>>> hnsw.extend(hnsw.ExtendParams(), hnsw_index, new_data)

SearchParams

cdef class SearchParams

HNSW search parameters

Parameters

| Name | Type | Description |
| --- | --- | --- |
| ef | int, default = 200 | Maximum size of the candidate list used during search. |
| num_threads | int, default = 0 | Number of CPU threads used to increase search parallelism. When set to 0, the number of threads is automatically determined using OpenMP's omp_get_max_threads(). |

Constructor

def __init__(self, *, ef=200, num_threads=0)

Members

| Name | Kind |
| --- | --- |
| ef | property |
| num_threads | property |

ef

def ef(self)

num_threads

def num_threads(self)

load

@auto_sync_resources

def load(IndexParams index_params, filename, dim, dtype, metric="sqeuclidean", resources=None)

Loads an HNSW index. If the index was constructed with hnsw.IndexParams(hierarchy="none"), then the loaded index is immutable and can only be searched by the hnswlib wrapper in cuVS, as the format is not compatible with the original hnswlib. However, if the index was constructed with hnsw.IndexParams(hierarchy="cpu"), then the loaded index is mutable and compatible with the original hnswlib.

Saving / loading the index is experimental. The serialization format is subject to change, therefore loading an index saved with a previous version of cuVS is not guaranteed to work.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| index_params | IndexParams | Parameters that were used to convert the CAGRA index to an HNSW index. |
| filename | string | Name of the file. |
| dim | int | Dimensionality of the training dataset. |
| dtype | np.dtype of the saved index | Valid values for dtype: [np.float32, np.byte, np.ubyte] |
| metric | string denoting the metric type, default = "sqeuclidean" | Valid values for metric: ["sqeuclidean", "inner_product"], where sqeuclidean is the Euclidean distance without the square root operation, i.e. distance(a, b) = \sum_i (a_i - b_i)^2, and inner_product distance is defined as distance(a, b) = \sum_i a_i * b_i. |
| resources | cuvs.common.Resources, optional | |
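
The two metric definitions given for the metric parameter can be checked on a small example. The functions below are plain-Python transcriptions of those formulas, written for illustration; they are not the kernels cuVS uses.

```python
def sqeuclidean(a, b):
    """Squared Euclidean distance: sum_i (a_i - b_i)^2 (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product(a, b):
    """Inner product: sum_i a_i * b_i."""
    return sum(x * y for x, y in zip(a, b))


a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 4.0]
print(sqeuclidean(a, b))    # 26.0  (9 + 16 + 1)
print(inner_product(a, b))  # 28.0  (4 + 12 + 12)
```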

Returns

| Name | Type | Description |
| --- | --- | --- |
| index | Index | Loaded HNSW index. |

Examples

>>> import cupy as cp
>>> import numpy as np
>>> from cuvs.neighbors import cagra
>>> from cuvs.neighbors import hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> index = cagra.build(cagra.IndexParams(), dataset)
>>> # Convert the CAGRA index to HNSW and serialize it
>>> hnsw_params = hnsw.IndexParams()
>>> hnsw_index = hnsw.from_cagra(hnsw_params, index)
>>> hnsw.save("my_index.bin", hnsw_index)
>>> # Load it back, passing the same index parameters first
>>> index = hnsw.load(hnsw_params, "my_index.bin", n_features, np.float32,
...                   "sqeuclidean")

save

@auto_sync_resources

def save(filename, Index index, resources=None)

Saves the CAGRA index to a file as an hnswlib index. If the index was constructed with hnsw.IndexParams(hierarchy="none"), then the saved index is immutable and can only be searched by the hnswlib wrapper in cuVS, as the format is not compatible with the original hnswlib. However, if the index was constructed with hnsw.IndexParams(hierarchy="cpu"), then the saved index is mutable and compatible with the original hnswlib.

Saving / loading the index is experimental. The serialization format is subject to change.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| filename | string | Name of the file. |
| index | Index | Trained HNSW index. |
| resources | cuvs.common.Resources, optional | |

Examples

>>> import cupy as cp
>>> from cuvs.neighbors import cagra
>>> from cuvs.neighbors import hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> cagra_index = cagra.build(cagra.IndexParams(), dataset)
>>> # Convert the CAGRA index to HNSW and serialize it
>>> hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), cagra_index)
>>> hnsw.save("my_index.bin", hnsw_index)

search

@auto_sync_resources @auto_convert_output

def search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, resources=None)

Find the k nearest neighbors for each query.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| search_params | SearchParams | |
| index | Index | Trained HNSW index. |
| queries | CPU array interface compliant matrix shape (n_queries, dim) | Supported dtype [float, int] |
| k | int | The number of neighbors. |
| neighbors | Optional CPU array interface compliant matrix shape (n_queries, k), dtype uint64_t | If supplied, neighbor indices will be written here in-place. (default None) |
| distances | Optional CPU array interface compliant matrix shape (n_queries, k) | If supplied, the distances to the neighbors will be written here in-place. (default None) |
| resources | cuvs.common.Resources, optional | |
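
To make the (n_queries, k) output layout concrete, here is a brute-force reference in plain Python. It is not how HNSW searches (HNSW is approximate and graph-based), and `brute_force_knn` is a hypothetical helper, not part of cuVS. It only shows what each row of `distances` and `neighbors` represents, assuming the sqeuclidean metric.

```python
import heapq


def brute_force_knn(dataset, queries, k):
    """Exact k-NN by squared Euclidean distance.

    Returns (distances, neighbors): one row of length k per query,
    ordered from nearest to farthest.
    """
    all_distances, all_neighbors = [], []
    for q in queries:
        # Pair every dataset row's distance to q with its row index.
        scored = (
            (sum((x - y) ** 2 for x, y in zip(q, row)), idx)
            for idx, row in enumerate(dataset)
        )
        nearest = heapq.nsmallest(k, scored)
        all_distances.append([d for d, _ in nearest])
        all_neighbors.append([i for _, i in nearest])
    return all_distances, all_neighbors


dataset = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
queries = [[0.9, 0.1], [2.5, 2.5]]
distances, neighbors = brute_force_knn(dataset, queries, k=2)
print(neighbors)  # [[1, 0], [3, 2]]
```

On small datasets, comparing the indices returned by an approximate search against an exact reference like this is a simple way to estimate recall.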

Examples

>>> import cupy as cp
>>> from cuvs.neighbors import cagra, hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> index = cagra.build(cagra.IndexParams(), dataset)
>>> # Convert CAGRA index to HNSW
>>> hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), index)
>>> # search() expects queries in host (CPU) memory
>>> queries = cp.random.random_sample((n_queries, n_features),
...                                   dtype=cp.float32)
>>> queries = cp.asnumpy(queries)
>>> k = 10
>>> search_params = hnsw.SearchParams(
...     ef=200,
...     num_threads=0
... )
>>> # Search the HNSW index (not the CAGRA index)
>>> distances, neighbors = hnsw.search(search_params, hnsw_index, queries,
...                                     k)
>>> # Optionally copy the results to the GPU
>>> neighbors = cp.asarray(neighbors)
>>> distances = cp.asarray(distances)

from_cagra

@auto_sync_resources

def from_cagra(IndexParams index_params, cagra.Index cagra_index, temporary_index_path=None, resources=None)

Returns an HNSW index from a CAGRA index.

NOTE: When index_params.hierarchy is:

  1. NONE: This method uses the filesystem to write the CAGRA index in /tmp/<random_number>.bin before reading it as an hnswlib index, then deleting the temporary file. The returned index is immutable and can only be searched by the hnswlib wrapper in cuVS, as the format is not compatible with the original hnswlib.
  2. CPU: The returned index is mutable and can be extended with additional vectors. The serialized index is also compatible with the original hnswlib library.

Saving / loading the index is experimental. The serialization format is subject to change.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| index_params | IndexParams | Parameters to convert the CAGRA index to HNSW index. |
| cagra_index | cagra.Index | Trained CAGRA index. |
| temporary_index_path | string, default = None | Path to save the temporary index file. If None, the temporary file will be saved in /tmp/<random_number>.bin. |
| resources | cuvs.common.Resources, optional | |
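
When /tmp is small or shared, it may help to pass an explicit temporary_index_path. The sketch below only builds such a path with the standard library; the directory prefix and file name are arbitrary choices made for illustration. The from_cagra call is shown as a comment because it needs cuVS, a GPU, and a trained CAGRA index.

```python
import os
import tempfile

# Create a private scratch directory and a file path inside it.
scratch_dir = tempfile.mkdtemp(prefix="hnsw_from_cagra_")
temporary_index_path = os.path.join(scratch_dir, "cagra_as_hnsw.bin")

# The path would then be handed to from_cagra:
# hnsw_index = hnsw.from_cagra(
#     hnsw.IndexParams(hierarchy="none"),
#     cagra_index,
#     temporary_index_path=temporary_index_path,
# )
print(temporary_index_path)
```

The scratch directory is created by the caller here, so removing it afterwards (for example with `shutil.rmtree`) is also left to the caller.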

Examples

>>> import cupy as cp
>>> from cuvs.neighbors import cagra
>>> from cuvs.neighbors import hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> index = cagra.build(cagra.IndexParams(), dataset)
>>> # Convert the CAGRA index to an HNSW index
>>> hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), index)