Multi-GPU | cuVS

NVIDIA cuVS multi-GPU APIs use RAFT resources to coordinate work across GPUs. The resource object owns CUDA streams, memory resources, and communication state, so NVIDIA cuVS algorithms can be written against one interface and then run in different distributed environments.

The RAFT communicator is the part of that interface that handles rank metadata and collective communication. This lets an algorithm use the same communication pattern whether the surrounding application is launched with MPI, Dask, Ray, or another distributed runtime. The runtime is still responsible for starting workers, assigning ranks, and placing data shards; RAFT gives NVIDIA cuVS a common way to communicate once those pieces exist.

NCCL is the primary communicator backend used by NVIDIA cuVS multi-GPU algorithms. Most users interact with NCCL through one of two paths:

Single-node multi-GPU resources, where one process controls multiple GPUs on the same node.
Multi-node multi-GPU resources, where each process owns a rank and attaches an externally created NCCL communicator to a RAFT handle.

For multi-GPU vector indexes, see the Multi-GPU indexing guide.

Example API Usage

C resources API | Python resources API

The examples below cover the high-level NVIDIA cuVS language surfaces that currently expose multi-GPU resource initialization: C, C++, and Python. Rust, Go, and Java do not currently expose matching high-level multi-GPU resource wrappers.

Single-node multi-GPU

Use the single-node path when one process can see and control all GPUs used by the operation. This is the simplest setup for one machine with multiple GPUs. In C++ this is raft::device_resources_snmg; in C and Python it is exposed through the NVIDIA cuVS multi-GPU resource wrappers.

C


#include <cuvs/core/c_api.h>

cuvsResources_t resources;
cuvsMultiGpuResourcesCreate(&resources);

// Use resources with cuVS multi-GPU C APIs.
// For example, pass it to cuvsMultiGpuCagraBuild().

cuvsMultiGpuResourcesDestroy(resources);

When an application should restrict NVIDIA cuVS to a subset of visible GPUs, use the device-id-specific resource constructor for that language:

C: cuvsMultiGpuResourcesCreateWithDeviceIds()
C++: raft::device_resources_snmg(std::vector<int>{...})
Python: MultiGpuResources(device_ids=[...])
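
Applications often receive the device subset as configuration, for example a comma-separated string. The following is a minimal C++ sketch of turning such a string into the std::vector<int> that the C++ constructor form above takes. parse_device_ids is a hypothetical helper, not part of cuVS or RAFT, and device_count is assumed to come from cudaGetDeviceCount(); it is a plain parameter here so the sketch has no CUDA dependency.

```cpp
#include <cstddef>
#include <exception>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not a cuVS or RAFT API): parse "0,2,3" into device IDs,
// rejecting malformed entries, duplicates, and IDs outside [0, device_count).
inline std::vector<int> parse_device_ids(const std::string& spec, int device_count)
{
  std::vector<int> ids;
  std::set<int> seen;
  std::stringstream stream(spec);
  std::string token;
  while (std::getline(stream, token, ',')) {
    std::size_t consumed = 0;
    int id = -1;
    try {
      id = std::stoi(token, &consumed);
    } catch (const std::exception&) {
      throw std::invalid_argument("not a device ID: '" + token + "'");
    }
    // Reject trailing characters such as "2x".
    if (consumed != token.size()) {
      throw std::invalid_argument("not a device ID: '" + token + "'");
    }
    if (id < 0 || id >= device_count) {
      throw std::out_of_range("device ID not visible: " + token);
    }
    if (!seen.insert(id).second) {
      throw std::invalid_argument("duplicate device ID: " + token);
    }
    ids.push_back(id);
  }
  if (ids.empty()) { throw std::invalid_argument("no device IDs given"); }
  return ids;
}
```

The returned vector could then be handed to the device-id-specific constructor for C++ listed above. Validating before construction keeps configuration mistakes separate from errors raised during resource setup.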

Multi-node NCCL communicator

Use the multi-node path when each process controls one rank, often one GPU, and the application runtime provides launch, rank assignment, and data placement. This API is currently exposed only in C++. The application creates an ncclComm_t, attaches it to a RAFT handle, and then passes that handle to NVIDIA cuVS APIs that accept raft::resources.

C++

#include <cuvs/cluster/kmeans.hpp>

#include <raft/comms/std_comms.hpp>
#include <raft/core/device_mdarray.hpp>
#include <raft/core/handle.hpp>

#include <cuda_runtime_api.h>
#include <mpi.h>
#include <nccl.h>

#include <optional>

using namespace cuvs::cluster;

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank;
  int world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Bind this rank to one of the GPUs visible on its node.
  int device_count;
  cudaGetDeviceCount(&device_count);
  cudaSetDevice(rank % device_count);

  // Rank 0 creates the NCCL unique ID; MPI broadcasts it to every rank.
  ncclUniqueId nccl_id;
  if (rank == 0) { ncclGetUniqueId(&nccl_id); }
  MPI_Bcast(&nccl_id, sizeof(nccl_id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t nccl_comm;
  ncclCommInitRank(&nccl_comm, world_size, nccl_id, rank);

  // Attach the NCCL communicator to the RAFT handle.
  raft::handle_t handle;
  raft::comms::build_comms_nccl_only(&handle, nccl_comm, world_size, rank);

  // Each rank owns one local shard on its GPU.
  // load_local_dataset is an application-defined placeholder (not shown).
  auto local_dataset = load_local_dataset<float, int>(rank);

  kmeans::params params;
  params.n_clusters = 1024;

  auto centroids =
      raft::make_device_matrix<float, int>(
          handle, params.n_clusters, local_dataset.extent(1));

  float inertia;
  int n_iter;

  kmeans::fit(handle,
              params,
              local_dataset.view(),
              std::nullopt,
              centroids.view(),
              raft::make_host_scalar_view(&inertia),
              raft::make_host_scalar_view(&n_iter));

  handle.sync_stream();

  ncclCommDestroy(nccl_comm);
  MPI_Finalize();
  return 0;
}

The example uses MPI only to launch ranks and broadcast the NCCL unique ID. A Ray, Dask, or service-based runtime can provide the same rank metadata and NCCL communicator setup through its own worker lifecycle.
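
The rank % device_count mapping in the example gives each rank its own GPU only when every node hosts a consecutive block of ranks that is no larger than its GPU count. With a round-robin placement, two ranks on the same node can land on the same device. One common alternative is to derive a node-local rank first. The sketch below computes it from a list of per-rank hostnames, which the runtime would have to gather beforehand (for example with MPI_Allgather). local_rank_of and device_for_rank are hypothetical helpers, not cuVS or RAFT functions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: the number of lower-numbered ranks running on the same
// host as `rank`. hostnames[r] is the host that rank r runs on.
inline int local_rank_of(const std::vector<std::string>& hostnames, int rank)
{
  if (rank < 0 || static_cast<std::size_t>(rank) >= hostnames.size()) {
    throw std::out_of_range("rank outside the world");
  }
  const std::size_t self = static_cast<std::size_t>(rank);
  int local_rank = 0;
  for (std::size_t r = 0; r < self; ++r) {
    if (hostnames[r] == hostnames[self]) { ++local_rank; }
  }
  return local_rank;
}

// Hypothetical helper: pick a device index from the node-local rank and fail
// loudly instead of silently sharing a GPU between two ranks.
inline int device_for_rank(const std::vector<std::string>& hostnames,
                           int rank,
                           int device_count)
{
  const int local_rank = local_rank_of(hostnames, rank);
  if (local_rank >= device_count) {
    throw std::runtime_error("more ranks on this node than visible GPUs");
  }
  return local_rank;
}
```

For a round-robin layout such as hosts {"a", "b", "a", "b"} with two GPUs per node, rank 2 maps to device 1 here, whereas rank % device_count would map both rank 0 and rank 2 to device 0.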

RAFT communicator role

The communicator makes distributed NVIDIA cuVS code less tied to one scheduler. NVIDIA cuVS algorithms call collectives through RAFT resources instead of embedding MPI, Dask, or Ray-specific logic in the algorithm itself. This is what allows the same algorithm implementation to be reused in different deployment systems.

In practice, the communicator provides:

The rank ID and world size for the current worker.
Collective operations used by distributed algorithms.
A common place for NVIDIA cuVS to find communication state alongside CUDA streams and memory resources.
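
As a rough illustration of that role, the sketch below models a communicator as a small interface and writes a distributed mean against it. All names here (communicator, single_process_comm, simulated_comm, global_mean) are hypothetical and far simpler than the actual RAFT communicator types. The two implementations exist only so the sketch runs without NCCL.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a communicator:
// rank metadata plus one collective operation.
struct communicator {
  virtual ~communicator() = default;
  virtual int rank() const = 0;
  virtual int size() const = 0;
  // Replace every element of `values` with its sum over all ranks.
  virtual void allreduce_sum(std::vector<double>& values) const = 0;
};

// World of one rank: the sum over all ranks is just the local value.
struct single_process_comm final : communicator {
  int rank() const override { return 0; }
  int size() const override { return 1; }
  void allreduce_sum(std::vector<double>&) const override {}
};

// Test double that pretends the other ranks contributed `remote` to each reduction.
struct simulated_comm final : communicator {
  simulated_comm(int world_size, std::vector<double> remote)
    : world_size_(world_size), remote_(std::move(remote))
  {
  }
  int rank() const override { return 0; }
  int size() const override { return world_size_; }
  void allreduce_sum(std::vector<double>& values) const override
  {
    for (std::size_t i = 0; i < values.size() && i < remote_.size(); ++i) {
      values[i] += remote_[i];
    }
  }

 private:
  int world_size_;
  std::vector<double> remote_;
};

// Algorithm code depends only on the interface, not on how the ranks were launched.
inline double global_mean(const communicator& comm, const std::vector<double>& local_shard)
{
  std::vector<double> sum_and_count{0.0, static_cast<double>(local_shard.size())};
  for (double x : local_shard) { sum_and_count[0] += x; }
  comm.allreduce_sum(sum_and_count);  // now holds the global sum and global count
  return sum_and_count[1] > 0.0 ? sum_and_count[0] / sum_and_count[1] : 0.0;
}
```

global_mean never asks which scheduler started the process. Swapping single_process_comm for another implementation changes the result of the reduction without touching the algorithm, which is the separation the communicator is meant to provide.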

NCCL is the communicator used for GPU collectives in NVIDIA cuVS. MPI, Ray, Dask, or another framework may still be used to launch workers, distribute data, and exchange the NCCL unique ID before the RAFT handle is initialized.

Choosing a setup

Setup	Languages	Typical use	Who creates communication state?
Single-node multi-GPU	C, C++, Python	One process uses several GPUs on one machine.	NVIDIA cuVS/RAFT creates resources for the visible devices or requested device IDs.
Multi-node multi-GPU	C++	Many ranks run across one or more nodes.	The application runtime creates ranks and initializes NCCL.

Use single-node resources when all GPUs are local to one process. Use an explicit NCCL communicator when the work is already distributed across ranks, nodes, or worker processes.