Memory Management
NVIDIA cuVS uses RAPIDS Memory Manager (RMM) through RAFT so GPU algorithms can allocate temporary buffers, output arrays, and staging memory through one configurable memory layer. RMM helps NVIDIA cuVS interoperate with GPU libraries that either use RMM directly or provide RMM adaptors, including RAPIDS libraries, PyTorch, CuPy, Faiss, and TensorFlow. This lets applications share memory resources and allocations across library boundaries without unnecessary copies.
The most common choice is to configure a device memory pool before creating NVIDIA cuVS resources or allocating device arrays. Pooling avoids repeated cudaMalloc and cudaFree calls, which can synchronize the device and add allocator overhead to workloads with many temporary buffers.
Example API Usage
C resources API | Java provider API | Go memory API
C, Java, Go, and Rust configure RMM through NVIDIA cuVS wrappers over the C API. C++ and Python applications usually use RMM directly because RMM is part of the RAPIDS stack used by RAFT and NVIDIA cuVS.
Setting a device pool
Use a pool memory resource when a workload repeatedly builds indexes, searches in batches, runs clustering iterations, or allocates many temporary buffers. Configure the pool early, then keep it alive until all allocations that use it are complete.
Examples: C | C++ | Python | Java | Rust | Go
The C, Java, Rust, and Go examples change the current device resource through the NVIDIA cuVS C API. This has a process-wide effect for the current device, so configure it before allocating long-lived objects and reset it only after those objects are destroyed.
Allocating device memory
Use device memory for GPU-resident inputs, outputs, indexes, and scratch buffers. The default RMM device resource allocates CUDA device memory directly. A pool resource can sit above that default resource and serve the same allocations from a cached block of memory.
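To make the layering concrete, here is a minimal sketch in plain Python of a pool that sits above an upstream resource and serves repeated requests from cached blocks. It is a toy model, not RMM code: `CountingUpstream`, `ToyPool`, and `run_batches` are hypothetical names, and the upstream only counts calls where a real resource would call cudaMalloc and cudaFree. RMM's actual pool is more sophisticated than this size-keyed cache.

```python
class CountingUpstream:
    """Stand-in for a direct device resource; counts calls instead of touching a GPU."""

    def __init__(self):
        self.alloc_calls = 0
        self.free_calls = 0

    def allocate(self, nbytes):
        self.alloc_calls += 1
        return (self.alloc_calls, nbytes)  # opaque handle: (id, size)

    def deallocate(self, block):
        self.free_calls += 1


class ToyPool:
    """Caches freed blocks by size and reuses them before asking upstream."""

    def __init__(self, upstream):
        self.upstream = upstream
        self._free = {}  # size in bytes -> list of cached blocks

    def allocate(self, nbytes):
        cached = self._free.get(nbytes)
        if cached:
            return cached.pop()  # served from the cache, no upstream call
        return self.upstream.allocate(nbytes)

    def deallocate(self, block):
        # Keep the block for reuse instead of returning it upstream.
        self._free.setdefault(block[1], []).append(block)

    def release(self):
        # Return every cached block upstream, as when the pool is destroyed.
        for blocks in self._free.values():
            for block in blocks:
                self.upstream.deallocate(block)
        self._free.clear()


def run_batches(resource, batches=100, scratch_bytes=1 << 20):
    """Hypothetical workload: one scratch buffer per batch."""
    for _ in range(batches):
        buf = resource.allocate(scratch_bytes)
        resource.deallocate(buf)


direct = CountingUpstream()
run_batches(direct)

pooled_upstream = CountingUpstream()
pool = ToyPool(pooled_upstream)
run_batches(pool)
pool.release()

print(direct.alloc_calls, pooled_upstream.alloc_calls)  # prints "100 1"
```

In this model the workload code is identical in both runs. Only the resource it allocates from changes, and the pooled run reaches the upstream once instead of once per batch.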
For examples of passing dense arrays into NVIDIA cuVS APIs across different languages and libraries, see Using dense arrays in cuVS APIs in the Dense Arrays guide.
Examples: C | C++ | Python
Most users do not allocate raw buffers directly. NVIDIA cuVS APIs typically accept matrices, tensors, or language-native array wrappers, and those objects allocate through the active RMM resource underneath.
Allocating pinned host memory
Pinned host memory is page-locked CPU memory. It is useful when data must be copied between CPU and GPU asynchronously, or when GPU kernels and CPU code need fast shared access to small host-side coordination buffers.
Use it selectively. Pinned memory is a limited system resource, and overusing it can reduce host-memory flexibility for the operating system.
Examples: C | C++ | Python
In Java, NVIDIA cuVS resources use pinned host buffers internally for batched transfers. Most Java users should rely on CuVSResources and matrix builders instead of managing pinned memory directly.
Using managed memory
Managed memory can simplify workflows where data may migrate between CPU and GPU memory or where an application wants a larger unified allocation model. It is useful for prototyping and some oversubscription workflows, but it can introduce page migration overhead. For performance-sensitive paths, benchmark managed memory against device memory.
Examples: C | C++ | Python | Java | Rust | Go
How Memory Management works
RMM separates allocation policy from algorithm code. An NVIDIA cuVS algorithm asks RAFT for memory through the active resource, and the resource decides whether the allocation comes from direct device memory, a pool, managed memory, or a host allocation path.
This design gives users four practical benefits:
- Allocation behavior can be tuned without changing NVIDIA cuVS API calls.
- NVIDIA cuVS can share memory resources and allocations with other RMM-aware libraries, including RAPIDS libraries, PyTorch, CuPy, Faiss, and TensorFlow.
- Data can move through compatible GPU libraries without extra copies solely to cross library boundaries.
- Temporary allocations can be pooled to reduce allocator synchronization and fragmentation.
The active device resource should be configured before creating long-lived arrays, indexes, or NVIDIA cuVS resources. Changing the resource while live allocations still exist can make ownership hard to reason about, especially in applications that use several GPU libraries at once.
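The separation between allocation policy and algorithm code, and the ownership problem that appears when the active resource changes mid-run, can be sketched in plain Python. This is an illustrative model only: `TaggedResource`, `Buffer`, `set_current_resource`, `get_current_resource`, and `build_index` are hypothetical names and do not correspond to RMM, RAFT, or NVIDIA cuVS APIs.

```python
_current = None  # the active resource for this toy "device"


class Buffer:
    """An allocation that remembers which resource it came from."""

    def __init__(self, owner, nbytes):
        self.owner = owner
        self.nbytes = nbytes

    def free(self):
        # Memory must go back to the resource that produced it.
        self.owner.deallocate(self)


class TaggedResource:
    """Toy resource that tracks how many of its allocations are still live."""

    def __init__(self, name):
        self.name = name
        self.live = 0

    def allocate(self, nbytes):
        self.live += 1
        return Buffer(self, nbytes)

    def deallocate(self, buf):
        self.live -= 1


def set_current_resource(resource):
    """Install a new active resource and return the previous one."""
    global _current
    previous, _current = _current, resource
    return previous


def get_current_resource():
    return _current


def build_index(n_rows, dim):
    """Hypothetical algorithm: it asks the active resource and never names a policy."""
    return get_current_resource().allocate(n_rows * dim * 4)


direct = TaggedResource("direct")
pool = TaggedResource("pool")

set_current_resource(direct)
first = build_index(1000, 64)

# Swapping the policy changes where memory comes from, not the algorithm call.
set_current_resource(pool)
second = build_index(1000, 64)

# `first` still belongs to `direct`, so `direct` must stay alive until it is freed.
print(first.owner.name, second.owner.name, direct.live)  # prints "direct pool 1"

first.free()
second.free()
```

The same `build_index` call is served by two different resources, which is the tuning benefit. The buffer created before the swap still depends on the earlier resource, which is why configuring once, before any long-lived allocation, is easier to reason about.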
When to use each resource
- Device memory (default resource): GPU-resident inputs, outputs, indexes, and scratch buffers, allocated directly as CUDA device memory.
- Device pool: allocation-heavy workloads that repeatedly build indexes, search in batches, run clustering iterations, or allocate many temporary buffers.
- Pinned host memory: asynchronous copies between CPU and GPU and small host-side coordination buffers. Use it selectively.
- Managed memory: prototyping and some oversubscription workflows. Benchmark it against device memory on performance-sensitive paths.
Configuration choices
Start with a device pool when a workload is allocation-heavy. For C, Java, Rust, and Go wrappers, the initial and maximum pool sizes are expressed as percentages of free device memory at configuration time.
Larger initial pools reduce the chance of later upstream allocations, but reserve more memory immediately. Larger maximum pools allow more growth, but can compete with other applications or GPU libraries on the same device.
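As a worked example of the percentage-based sizing, the sketch below converts an initial and a maximum percentage into byte counts for a given amount of free device memory. The helper name `pool_sizes_from_percent`, the validation rule, and the 256-byte alignment are assumptions made for illustration, not the wrappers' documented behavior. The 40 GiB figure is an arbitrary example value.

```python
def pool_sizes_from_percent(free_bytes, initial_percent, max_percent, alignment=256):
    """Return (initial, maximum) pool sizes in bytes, rounded down to `alignment`.

    Assumes percentages apply to the free device memory measured at
    configuration time, as described above.
    """
    if not (0 < initial_percent <= max_percent <= 100):
        raise ValueError("expected 0 < initial_percent <= max_percent <= 100")

    def align_down(nbytes):
        return nbytes - (nbytes % alignment)

    initial = align_down(free_bytes * initial_percent // 100)
    maximum = align_down(free_bytes * max_percent // 100)
    return initial, maximum


GIB = 1024 ** 3
free_bytes = 40 * GIB  # example: 40 GiB free when the pool is configured

initial, maximum = pool_sizes_from_percent(free_bytes, 50, 80)
print(initial / GIB, maximum / GIB)  # prints "20.0 32.0"
```

With these example numbers, 20 GiB is reserved immediately and the pool may grow to 32 GiB, leaving 8 GiB of the originally free memory for other applications or libraries on the same device.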
For multi-GPU workloads, configure memory on each participating GPU before running the algorithm. See the Multi-GPU guide for resource initialization patterns.
Practical guidance
Use a pool for repeatable performance measurements. Allocator behavior can otherwise appear as noise in benchmark results.
Keep the memory resource alive for at least as long as the allocations that use it. In C++, this means the pool object must outlive arrays allocated from it. In C, Java, Rust, and Go, this means resetting the NVIDIA cuVS memory resource only after NVIDIA cuVS objects have been destroyed.
Avoid changing memory resources in the middle of a workflow unless the application has a clear ownership boundary. The safest pattern is to configure once, allocate and run NVIDIA cuVS work, destroy NVIDIA cuVS objects, then reset.
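One way to enforce the configure, run, destroy, reset ordering is to scope the memory configuration so that the reset always runs last. The sketch below models that pattern in plain Python with an event log. `enable_pool`, `reset_resource`, `pooled_memory`, and `ToyIndex` are hypothetical stand-ins, not NVIDIA cuVS functions or classes.

```python
from contextlib import contextmanager

events = []  # records the order of operations for inspection


def enable_pool(initial_percent, max_percent):
    """Stand-in for enabling a pool resource through a language wrapper."""
    events.append(f"enable pool {initial_percent}/{max_percent}")


def reset_resource():
    """Stand-in for restoring the default memory resource."""
    events.append("reset resource")


@contextmanager
def pooled_memory(initial_percent=50, max_percent=80):
    """Configure the pool on entry and reset it on exit, even if the body raises."""
    enable_pool(initial_percent, max_percent)
    try:
        yield
    finally:
        reset_resource()


class ToyIndex:
    """Stand-in for an object whose memory comes from the active resource."""

    def __init__(self):
        events.append("create index")

    def search(self):
        events.append("search")

    def close(self):
        events.append("destroy index")


with pooled_memory():
    index = ToyIndex()
    try:
        index.search()
    finally:
        index.close()  # destroy objects before the pool scope ends

print(events)
```

Because the index is closed inside the `with` block, the reset can only happen after every object that depends on the pool is gone.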