NVIDIA cuVS APIs use a resources object to keep track of CUDA execution state that should be reused across related calls. In C++ this is usually raft::device_resources; in C it is a cuvsResources_t handle, and most language bindings expose a wrapper around that handle.
For simple examples, the default resource behavior is usually enough. Create and pass an explicit resources object when you want to chain several operations together, control the CUDA stream used by NVIDIA cuVS, reuse expensive CUDA library handles, configure temporary memory, or keep setup costs out of repeated calls.
A CUDA stream is an ordered queue of GPU work. Kernel launches, copies, and library calls enqueued into the same stream run in order. Work in different streams can overlap when the GPU and workload allow it.
Most NVIDIA cuVS algorithms enqueue work on the stream stored in the resources object and return control to the host before that GPU work has necessarily finished. This is intentional: it lets users chain operations without forcing a synchronization point between every call.
Synchronization means blocking the host until queued GPU work has completed. In the CUDA Runtime API, cudaStreamSynchronize() waits for all work queued on a single stream. NVIDIA cuVS resource wrappers expose stream synchronization helpers, so users do not need to fetch the raw CUDA stream in common cases.
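The relationship between in-order queuing, asynchronous return, and synchronization can be illustrated with a host-only analogy. The sketch below is not CUDA and not part of NVIDIA cuVS: `ToyStream` is a hypothetical class that uses a single worker thread to mimic a stream, so that `enqueue()` returns immediately and `sync()` blocks until all queued work is done.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class ToyStream:
    """Hypothetical host-side stand-in for a CUDA stream (analogy only)."""

    def __init__(self):
        # A single worker thread runs tasks strictly in submission order.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending = []

    def enqueue(self, fn, *args):
        # Returns right away; the task may not have started yet.
        future = self._executor.submit(fn, *args)
        self._pending.append(future)
        return future

    def sync(self):
        # Block until everything queued so far has finished.
        wait(self._pending)
        for future in self._pending:
            future.result()  # re-raise any task error on the caller
        self._pending.clear()

    def close(self):
        self._executor.shutdown(wait=True)


log = []
stream = ToyStream()
for step in ("build", "search", "refine", "copy"):
    stream.enqueue(log.append, step)  # queued back-to-back, no waiting

stream.sync()  # only after this point is it safe to read `log`
stream.close()
```

Reading `log` before `stream.sync()` could observe a partial list, which mirrors reading GPU results on the host before the stream has been synchronized.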
A stream pool is a small group of CUDA streams attached to a resources object. Some algorithms can use a stream pool to run independent pieces of work concurrently. Most users do not need to configure one unless an algorithm guide or benchmark suggests it.
Workspace memory is temporary memory used inside an algorithm. Configuring workspace resources can make allocation behavior more predictable for allocation-heavy workloads. See Memory Management for allocator choices such as device pools, pinned host memory, managed memory, and host memory.
C resources API | Python resources API | Java resources API | Rust resources API | Go resources API
Create one resources object near the beginning of a workflow and pass it to related NVIDIA cuVS calls. Reusing the same object keeps those calls ordered on the same CUDA stream and lets NVIDIA cuVS reuse expensive state.
Most NVIDIA cuVS algorithms do not synchronize before returning. This lets you enqueue several operations back-to-back on the same resources object, such as build, search, refine, and copy, without stopping the GPU between steps.
Synchronize explicitly before reading results on the host, measuring elapsed time, passing data to work on a different CUDA stream when nothing else orders the two streams, or destroying objects that may still be used by queued GPU work.
In Python, many APIs create and synchronize an internal resources object when you do not pass one. When you pass your own Resources, you are also taking responsibility for calling resources.sync() at the point where your application needs the results to be complete.
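The pattern of chaining calls on one resources object and synchronizing once can be sketched as follows. This is a minimal illustration under stated assumptions: `run_chained`, `RecordingResources`, and the two step functions are hypothetical names, and the only thing assumed about a real resources object is the `sync()` method mentioned above.

```python
class RecordingResources:
    """Hypothetical stand-in that exposes sync() like a Resources object."""

    def __init__(self):
        self.sync_calls = 0

    def sync(self):
        # A real resources object would wait for its CUDA stream here.
        self.sync_calls += 1


def run_chained(resources, first_input, steps):
    """Feed each step the previous output, then synchronize exactly once."""
    value = first_input
    for step in steps:
        # Each step only enqueues work; no synchronization between steps.
        value = step(resources, value)
    resources.sync()  # the caller's single synchronization point
    return value


# Hypothetical steps; real ones would call cuVS functions and pass
# the same resources object to each of them.
def fake_build(resources, dataset):
    return {"index_of": dataset}


def fake_search(resources, index):
    return ("neighbors", index["index_of"])


res = RecordingResources()
result = run_chained(res, "dataset", [fake_build, fake_search])
```

With the Python bindings, `res` would be your own `Resources` instance, and the single `sync()` at the end is the point where the application takes responsibility for completion.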
raft::resources and the resource wrappers built on top of it are not thread-safe for concurrent use. If an application uses multiple host threads, give each worker thread its own resources object. This keeps each thread’s CUDA stream, library handles, temporary memory, and queued work separate.
Do not share one resources object across threads unless the application uses its own locking and understands that the lock serializes access to that resource. For most workloads, one resources object per worker thread is simpler and faster.
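One way to give each worker thread its own resources object is thread-local storage. The sketch below assumes a hypothetical helper, `PerThreadResources`, that takes any zero-argument factory; the demo uses a plain `object()` as a stand-in so that it runs without a GPU.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class PerThreadResources:
    """Hypothetical helper: lazily creates one resources object per thread."""

    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        resources = getattr(self._local, "resources", None)
        if resources is None:
            # First call on this thread: create and remember its own object.
            resources = self._factory()
            self._local.resources = resources
        return resources


created = []


def make_stand_in():
    # Stand-in factory; a real application would construct a cuVS
    # resources object here instead of a plain object().
    resources = object()
    created.append(resources)
    return resources


per_thread = PerThreadResources(make_stand_in)


def worker(_):
    first = per_thread.get()
    second = per_thread.get()  # same thread, so the same object comes back
    return threading.get_ident(), id(first), first is second


with ThreadPoolExecutor(max_workers=4) as executor:
    seen = list(executor.map(worker, range(32)))
```

Because no object is ever visible to more than one thread, this design needs no locking around the resources themselves.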
Most users can let NVIDIA cuVS create and own the CUDA stream inside the resources object. Supplying an application stream is useful when NVIDIA cuVS work must be ordered with other CUDA kernels, copies, or library calls from the same application.
A stream pool is useful only when an algorithm has independent work that can run concurrently. For example, an algorithm might search several independent shards or launch separate preprocessing work on different streams. Start with the default behavior, then configure a small stream pool only when benchmarks show it helps.
Workspace resources control where algorithms allocate temporary buffers. The default behavior is usually fine for small examples, but production workloads can benefit from configuring workspace memory explicitly before arrays, indexes, or algorithm state are created.
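NVIDIA cuVS uses two related workspace concepts: the workspace resource, which backs ordinary temporary buffers inside an algorithm, and the large workspace resource, which backs especially large temporary buffers.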
For allocator choices such as device pools, pinned host memory, managed memory, and host memory, see the Memory Management guide.
The large workspace resource is the RAFT-side hook for what many applications treat as a big memory resource: memory reserved for especially large temporary buffers. This is useful in services or benchmarking harnesses that run workloads with very different temporary memory shapes.
Use the default resources behavior for simple one-off calls.
Create and reuse a resources object when a workflow has multiple related calls. Operations enqueued on the same resources object run in stream order, so you can chain work and synchronize once at the end.
Use one resources object per host thread. Resources are not thread-safe for concurrent use.
Synchronize explicitly before the host reads GPU results or before measuring runtime. Avoid synchronizing between every NVIDIA cuVS call unless the application actually needs the intermediate result on the host.
Configure memory resources, workspace resources, and stream pools before allocating inputs or building indexes. Changing resource configuration halfway through a workflow makes ownership and lifetime harder to reason about.
For multi-GPU workloads, choose the resource type first: raft::device_resources_snmg for single-node multi-GPU, or a RAFT resources object with an NCCL communicator for multi-node work. The Multi-GPU guide covers those setup patterns.
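The advice to synchronize before measuring runtime can be sketched as a small timing helper. This is an illustration under stated assumptions: `timed` and `SleepyWork` are hypothetical names, and `SleepyWork` uses a background thread to imitate work that returns to the host before it finishes.

```python
import threading
import time


def timed(launch, sync):
    """Return (result, seconds), counting the work through to completion."""
    sync()  # drain earlier work so it is not billed to this call
    start = time.perf_counter()
    result = launch()  # enqueues work and may return early
    sync()  # wait for the queued work before stopping the clock
    return result, time.perf_counter() - start


class SleepyWork:
    """Hypothetical stand-in for asynchronous GPU work."""

    def __init__(self, seconds):
        self._seconds = seconds
        self._thread = None

    def launch(self):
        # Start the "work" in the background and return immediately.
        self._thread = threading.Thread(target=time.sleep, args=(self._seconds,))
        self._thread.start()
        return "launched"

    def sync(self):
        # Block until the background work has finished, if any is running.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


work = SleepyWork(0.05)
result, elapsed = timed(work.launch, work.sync)
```

In a real workflow, `launch` would wrap one or more cuVS calls and `sync` would be the `sync()` method of the resources object passed to them. Without the final `sync()`, the timer would mostly measure launch overhead instead of the work itself.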