C++ Guidelines

This page collects the engineering conventions that keep NVIDIA cuVS APIs stable, predictable, and easy to maintain. Start with the Contributor Guide, then use this page when designing public APIs, writing CUDA/C++ implementation code, or preparing a change for review.

Local Development

Most NVIDIA cuVS changes can be developed directly in this repository. Cross-project CUDA/C++ work may also require a local RAFT build or temporary downstream pin.

If a consuming project supports source builds, pass CPM_raft_SOURCE=/path/to/raft/source to its CMake configuration. If the downstream project must pin a RAFT branch while related changes are under review, update the FORK and PINNED_TAG arguments to find_and_configure_raft, then revert that pin before the downstream change merges.

If source builds are not being used, install the local RAFT C++ artifacts into the consuming project’s environment before testing the downstream change.

Public Interface

General Guidelines

Public C++ APIs should be stateless wrappers around implementation code in a private detail namespace.

Expose only lightweight, predictable types:

Plain data structs used for parameters or metadata.
raft::resources, because it owns execution resources rather than algorithm state.
raft::span and raft::mdspan views for single- and multi-dimensional data.
std::optional for optional values instead of sentinel pointers.

Prefer references for required inputs. Reserve pointers for established output patterns and avoid exposing temporary implementation classes in public headers.
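The guidelines above can be sketched in standard C++ alone. The names `example_params` and `accepts` are hypothetical and not part of cuVS; a real API would also take `raft::resources` and view types.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical plain-data parameter struct: public members with defaults, no behavior.
struct example_params {
  std::int64_t k = 10;                  // required setting with a default
  std::optional<float> max_distance{};  // optional setting; empty means "no limit"
};

// The required input is taken by const reference. The optional value is not
// encoded as a null pointer or a magic number.
inline bool accepts(example_params const& params, float distance)
{
  if (params.max_distance.has_value()) { return distance <= *params.max_distance; }
  return true;  // no limit configured
}
```

Because the optional member has its own empty state, callers never need to know a sentinel value, and the struct stays trivially copyable.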

API Stability

Public APIs are consumed by multiple projects and should change carefully. Add new APIs before removing old ones, deprecate old entry points over a few releases, and avoid changing behavior in ways that downstream users cannot detect at compile time.
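One way to make a deprecation visible at compile time is the standard `[[deprecated]]` attribute. The sketch below assumes a hypothetical `example` namespace and function names; it is not taken from the cuVS code base.

```cpp
#include <cstdint>

namespace example {  // hypothetical namespace, not a cuVS API

// New entry point, added first so callers can migrate.
inline std::int64_t count_neighbors(std::int64_t n_queries, std::int64_t k)
{
  return n_queries * k;
}

// Old entry point kept during the deprecation period. The attribute makes the
// compiler warn at each call site, so downstream users notice the change.
[[deprecated("use example::count_neighbors(n_queries, k) instead")]]
inline std::int64_t num_neighbors(int n_queries, int k)
{
  return count_neighbors(n_queries, k);
}

}  // namespace example
```

The old function forwards to the new one, so both paths share one implementation until the old entry point is removed.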

Stateless C++ APIs

Avoid public APIs that store algorithm state in non-POD wrapper objects:

class ivf_pq_float {
  ivf_pq::index_params params_;
  raft::resources const& res_;

 public:
  ivf_pq_float(raft::resources const& res);

  void train(raft::device_matrix_view<const float, int64_t, raft::row_major> dataset);

  void search(raft::device_matrix_view<const float, int64_t, raft::row_major> queries,
              raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
              raft::device_matrix_view<float, int64_t, raft::row_major> distances);
};

Prefer stateless, instantiated overloads for the supported type combinations. Template implementations can still live in detail, but public entry points should be concrete:

namespace cuvs::neighbors::ivf_pq {

auto build(raft::resources const& res,
           index_params const& params,
           raft::device_matrix_view<const float, int64_t, raft::row_major> dataset)
  -> index<int64_t>;

void build(raft::resources const& res,
           index_params const& params,
           raft::device_matrix_view<const float, int64_t, raft::row_major> dataset,
           index<int64_t>* idx);

void search(raft::resources const& res,
            search_params const& params,
            index<int64_t> const& idx,
            raft::device_matrix_view<const float, int64_t, raft::row_major> queries,
            raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
            raft::device_matrix_view<float, int64_t, raft::row_major> distances);

// Add supported variants, such as half or int8_t, as separate overloads.
}  // namespace cuvs::neighbors::ivf_pq

Functions On State

When an API creates an index or model object, also expose stateless functions for persistence and transfer. Keep those functions in the same public namespace as the owning algorithm:

namespace cuvs::neighbors::ivf_pq {

void serialize(raft::resources const& res, std::ostream& os, index<int64_t> const& idx);

void deserialize(raft::resources const& res, std::istream& is, index<int64_t>* idx);

}  // namespace cuvs::neighbors::ivf_pq
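The same stream-based shape can be tried out on the host with a toy type. The sketch below is an assumption-laden illustration: `toy_index` and its binary layout are hypothetical, it omits the `raft::resources` argument, and it does not reflect the cuVS on-disk format.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

namespace example {  // hypothetical stand-in, not the cuVS implementation

// Toy index holding plain state only.
struct toy_index {
  std::int64_t dim = 0;
  std::vector<float> centers;
};

// Stateless free function: writes the index to any std::ostream.
inline void serialize(std::ostream& os, toy_index const& idx)
{
  auto n     = static_cast<std::int64_t>(idx.centers.size());
  auto bytes = static_cast<std::streamsize>(idx.centers.size() * sizeof(float));
  os.write(reinterpret_cast<char const*>(&idx.dim), sizeof(idx.dim));
  os.write(reinterpret_cast<char const*>(&n), sizeof(n));
  os.write(reinterpret_cast<char const*>(idx.centers.data()), bytes);
}

// Stateless free function: fills an existing index through the output pointer.
inline void deserialize(std::istream& is, toy_index* idx)
{
  std::int64_t n = 0;
  is.read(reinterpret_cast<char*>(&idx->dim), sizeof(idx->dim));
  is.read(reinterpret_cast<char*>(&n), sizeof(n));
  idx->centers.resize(static_cast<std::size_t>(n));
  auto bytes = static_cast<std::streamsize>(idx->centers.size() * sizeof(float));
  is.read(reinterpret_cast<char*>(idx->centers.data()), bytes);
}

}  // namespace example
```

Taking `std::ostream` and `std::istream` lets the same functions target files, in-memory buffers such as `std::stringstream`, or any custom stream.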

Working With Dense Arrays

Use RAFT array types consistently in public C++ APIs and implementation code. For user-facing examples of passing dense arrays into NVIDIA cuVS APIs, see Dense Arrays.

Keep data layout explicit. Most NVIDIA cuVS APIs expect row-major dense matrices unless the API says otherwise.

Prefer public API signatures that accept non-owning views, not owning arrays. This keeps NVIDIA cuVS functions flexible because callers can pass memory owned by RAFT, another RAPIDS library, or an application-specific allocator.

Prefer device_matrix_view and device_vector_view over raw pointers for public NVIDIA cuVS C++ APIs. Views carry shape, layout, memory-space, and constness information, making incorrect dimensions and accidental writes easier to catch.
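To show why a view helps, here is a simplified host-only stand-in. The `matrix_view` type and `row_sums` function are hypothetical and far smaller than the RAFT view types; they only carry a pointer, a shape, and constness.

```cpp
#include <cstdint>
#include <stdexcept>

namespace example {  // hypothetical host-only stand-in for a RAFT matrix view

// Non-owning, row-major view: a pointer plus the shape it describes.
template <typename T>
struct matrix_view {
  T* data;
  std::int64_t n_rows;
  std::int64_t n_cols;

  T& operator()(std::int64_t r, std::int64_t c) const { return data[r * n_cols + c]; }
};

// The input view has a const element type, so writes through it do not compile.
// Shapes travel with the arguments, so a mismatch is rejected up front.
inline void row_sums(matrix_view<float const> in, matrix_view<float> out)
{
  if (out.n_rows != in.n_rows || out.n_cols != 1) {
    throw std::invalid_argument("out must have shape [in.n_rows, 1]");
  }
  for (std::int64_t r = 0; r < in.n_rows; ++r) {
    float sum = 0.0f;
    for (std::int64_t c = 0; c < in.n_cols; ++c) { sum += in(r, c); }
    out(r, 0) = sum;
  }
}

}  // namespace example
```

With raw pointers, the shape check would need extra size arguments that callers can pass in the wrong order. With views, the shape cannot be separated from the data it describes.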

Use owning arrays such as raft::device_matrix, raft::host_matrix, and raft::pinned_matrix when implementation code, tests, or examples need RAFT to allocate storage.

Synchronize appropriately before reading data on the host. Many NVIDIA cuVS operations enqueue GPU work asynchronously on the stream owned by raft::device_resources.

Use pinned arrays only when their transfer or coordination benefits matter. Ordinary host arrays are simpler and should be the default for CPU-only data.

Common Design Considerations

Use .hpp for headers that can be compiled by gcc against the CUDA runtime. Use .cuh when a header requires nvcc.
Keep public types simple. They should store state, not perform computation.
Document every public API with a clear summary, parameter descriptions, and a short usage example when helpful.
Before adding a primitive, check whether an existing primitive can be extended cleanly. Add a new public API only when the behavior is genuinely distinct.

Performance

Prefer small, explicit choices that avoid hidden overhead:

Use cudaDeviceGetAttribute instead of cudaGetDeviceProperties in performance-critical code. See the CUDA developer blog post on fast device property queries.
Reuse the stream pool on the provided raft::resources object instead of creating one raft::resources object per stream. See Threading Model and Resource Management.
Keep CPU work around GPU launches light. If host threads are used, they should coordinate CUDA streams, not perform heavy CPU computation.

Threading Model

NVIDIA cuVS algorithms should be safe to call from multiple host threads when each thread uses its own raft::resources instance. Treat raft::resources as the boundary for CUDA streams, memory resources, communication handles, and library handles.

Inside an algorithm, host threads are acceptable only when they help keep CUDA streams busy. Keep them bounded, prefer OpenMP, and make sure the algorithm still works when OpenMP is disabled.

#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resource/cuda_stream_pool.hpp>
#include <raft/core/resources.hpp>

// preprocess_batch, my_kernel, postprocess_batch, blocks, and threads are
// placeholders for algorithm-specific code.
void run_batches(raft::resources const& res, int n_batches)
{
  // Finish work already queued on the caller's stream before using the pool.
  auto main_stream = raft::resource::get_cuda_stream(res);
  raft::resource::sync_stream(res, main_stream);

#pragma omp parallel for
  for (int i = 0; i < n_batches; ++i) {
    auto stream = raft::resource::get_stream_from_stream_pool(res);

    // Keep host work here light. The thread exists to drive GPU work.
    preprocess_batch(i);
    my_kernel<<<blocks, threads, 0, stream>>>(i);
    postprocess_batch(i);
  }

  // Wait for all pool streams before returning to the caller.
  raft::resource::sync_stream_pool(res);
}

If there is no CPU work before the first kernel, avoid blocking the host on the main stream: record a CUDA event on the main stream and make each internal stream wait on it. If there is no CPU work after each batch, synchronize the stream pool once after the loop instead of synchronizing inside every iteration.
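The requirement that an algorithm still works when OpenMP is disabled can be checked with a small host-only sketch. The `example` namespace and both functions are hypothetical, and the loop body is a placeholder for launching GPU work.

```cpp
#include <cstddef>
#include <vector>

#ifdef _OPENMP
#include <omp.h>
#endif

namespace example {  // hypothetical helpers, not a cuVS API

// Number of host threads the loop below may use; 1 when OpenMP is disabled.
inline int max_host_threads()
{
#ifdef _OPENMP
  return omp_get_max_threads();
#else
  return 1;
#endif
}

// Each iteration writes only its own slot, so the result is the same whether
// the loop runs in parallel or serially.
inline std::vector<int> process_batches(int n_batches)
{
  if (n_batches <= 0) { return {}; }
  std::vector<int> out(static_cast<std::size_t>(n_batches), 0);
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int i = 0; i < n_batches; ++i) {
    out[static_cast<std::size_t>(i)] = i * i;  // placeholder for per-batch work
  }
  return out;
}

}  // namespace example
```

Guarding both the header and the runtime call with `_OPENMP` keeps the code compiling without OpenMP, and iterations that do not share state keep the serial and parallel results identical.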

Asynchronous Operations And Stream Ordering

NVIDIA cuVS algorithms should be asynchronous whenever possible and should avoid the default CUDA stream. For single-stream work, use the stream on raft::resources:

#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resources.hpp>

void foo(raft::resources const& res)
{
  cudaStream_t stream = raft::resource::get_cuda_stream(res);
}

When an algorithm uses internal streams, preserve ordering with the caller’s stream:

Work already queued on raft::resource::get_cuda_stream(res) must complete before internal stream work starts.
Work queued by the caller after the API returns must wait until all internal stream work is complete.

Use CUDA events and cudaStreamWaitEvent to create those dependencies. This lets users compose NVIDIA cuVS operations with their own asynchronous copies and kernels without accidental races.

Using Thrust

Run Thrust algorithms on the intended stream and memory resource by using the execution policy from raft::resources:

#include <raft/core/resource/thrust_policy.hpp>
#include <raft/core/resources.hpp>

#include <thrust/for_each.h>

void foo(raft::resources const& res)
{
  auto policy = raft::resource::get_thrust_policy(res);
  // first, last, and op are placeholders for the caller's range and functor.
  thrust::for_each(policy, first, last, op);
}

Resource Management

Do not create reusable CUDA resources directly inside algorithm implementations. Reuse the handles, streams, events, allocators, and library resources attached to raft::resources. If a reusable resource is missing, file an issue or feature request instead of creating a local long-lived resource.

#include <raft/core/resource/cublas_handle.hpp>
#include <raft/core/resource/cuda_stream_pool.hpp>
#include <raft/core/resources.hpp>

void foo(raft::resources const& res)
{
  cublasHandle_t cublas_handle = raft::resource::get_cublas_handle(res);
  auto stream                  = raft::resource::get_stream_from_stream_pool(res);
}

Users can configure the stream pool once and pass the same raft::resources object through the API:

#include <raft/core/resource/cuda_stream_pool.hpp>
#include <raft/core/resources.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <memory>

void foo(raft::resources const& res);  // defined as in the previous example

int main()
{
  raft::resources res;
  // Attach a pool of four streams once; foo() reuses it through res.
  raft::resource::set_cuda_stream_pool(res, std::make_shared<rmm::cuda_stream_pool>(4));

  foo(res);
}

Multi-GPU

NVIDIA cuVS multi-GPU APIs generally use one of two execution strategies:

Single-process multi-GPU: one host process owns and coordinates multiple GPUs. Use raft::device_resources_snmg in examples and public API documentation for this model. It owns the selected devices, streams, communication resources, and per-device execution state.
One-process-per-GPU: each process owns one GPU and participates as a communication rank. Use raft::device_resources or raft::resources with an initialized raft::comms::comms_t. Access communication through raft::resource::get_comms, and guard optional communication paths with raft::resource::comms_initialized.

APIs may support either strategy or both, but they should document which resource object users are expected to construct. Keep streams, stream pools, memory resources, and communicators attached to the supplied resource object instead of creating unmanaged process-wide state. Single-GPU APIs should not depend on communication libraries or require multi-GPU resources.

For one-process-per-GPU implementations, developers can assume:

raft::comms::comms_t has been initialized correctly.
All participating ranks call the multi-GPU algorithm cooperatively.

Access the communicator through raft::resources:

#include <raft/core/resource/comms.hpp>
#include <raft/core/resources.hpp>

void foo(raft::resources const& res)
{
  auto const& comm = raft::resource::get_comms(res);
  int rank         = comm.get_rank();
  int size         = comm.get_size();
}

Using Just-in-Time Link-Time Optimization

NVIDIA cuVS is moving new kernels toward JIT link-time optimization. Instead of compiling every kernel variant into the binary, JIT LTO compiles fragments and links the needed combination at runtime.

This helps reduce binary size and enables user-defined functions in NVIDIA cuVS CUDA kernels. For runtime and cache behavior, see JIT Compilation. For implementation guidance, see Link-time Optimization.

Coding Style

Formatting

NVIDIA cuVS uses pre-commit to run formatting, linting, spelling, and copyright checks. Install it with conda:

$ conda install -c conda-forge pre-commit

Run the checks on staged files before committing:

$ pre-commit run

Run the full suite across the repository when needed:

$ pre-commit run --all-files

You can also install the git hook:

$ pre-commit install

Core Hooks

C++ and CUDA code are formatted with clang-format. NVIDIA cuVS follows the Google C++ style with a few local adjustments documented in cpp/.clang-format:

Empty functions, records, and namespaces are not split.
Indentation is two spaces, including line continuations.
Comments are not reflowed automatically.

Doxygen checks C++ and CUDA API documentation:

$ ./ci/checks/doxygen.sh

codespell catches spelling issues. To apply suggested fixes interactively, run:

$ codespell -i 3 -w .

Include Style

Use #include "..." only for local files in the same algorithm or nearby directory. Use #include <...> for dependencies, primitives, and headers from other algorithms.

To bulk-fix include style issues, run:

$ python ./cpp/scripts/include_checker.py --inplace cpp/include cpp/tests

Copyright

RAPIDS pre-commit hooks check copyright headers on modified tracked files. To run that check manually:

$ pre-commit run -a verify-copyright

Code Quality

Testing

Public APIs need direct test coverage because downstream projects rely on their compile-time and runtime behavior. Prefer tests that exercise the public entry point, cover edge cases, and make the expected behavior visible without requiring downstream projects to catch regressions first.

Performance Benchmarking

The most important implementation details in NVIDIA cuVS are written in C++ and CUDA, so performance-sensitive changes need benchmarks with clear baselines. Benchmarks should show that regressions have not occurred, intended improvements can be reproduced consistently, and the implementation scales as expected across problem sizes and hardware configurations. For relevant indexing APIs, it is often preferable to use the cuVS Bench Tool for reproducible benchmarks.

Attach benchmark results to the relevant GitHub pull request so future reviewers and contributors have an audit trail. Pull requests that change performance-critical code should not be merged without proper benchmarks in place.

For multi-GPU APIs, include scaling measurements whenever the change affects communication, partitioning, synchronization, or resource use across ranks. A change that improves a single-GPU path should not silently reduce multi-GPU efficiency.

Error Handling

Call CUDA and library APIs through the RAFT helper macros, such as RAFT_CUDA_TRY, RAFT_CUBLAS_TRY, and RAFT_CUSOLVER_TRY. They check return values and throw on failure.

Use the _NO_THROW variants only where throwing is unsafe, such as destructors. Those variants log errors without throwing.
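The difference between the throwing and non-throwing variants can be illustrated with stand-in macros. `EXAMPLE_TRY`, `EXAMPLE_TRY_NO_THROW`, and `fake_api` below are hypothetical and only mimic the pattern described above; they are not the RAFT macros and do not show how RAFT implements them.

```cpp
#include <cstdio>
#include <stdexcept>
#include <string>

namespace example {  // hypothetical stand-ins for a library status and call

enum class status { success, failure };
inline status fake_api(bool ok) { return ok ? status::success : status::failure; }

}  // namespace example

// Throwing variant: checks the return value and throws on failure.
#define EXAMPLE_TRY(call)                                             \
  do {                                                                \
    if ((call) != example::status::success) {                         \
      throw std::runtime_error(std::string("call failed: ") + #call); \
    }                                                                 \
  } while (0)

// Non-throwing variant: logs the failure and continues.
#define EXAMPLE_TRY_NO_THROW(call)                      \
  do {                                                  \
    if ((call) != example::status::success) {           \
      std::fprintf(stderr, "call failed: %s\n", #call); \
    }                                                   \
  } while (0)

namespace example {

// Destructors must not throw, so cleanup uses the logging variant.
struct guard {
  bool ok;
  ~guard() { EXAMPLE_TRY_NO_THROW(fake_api(ok)); }
};

}  // namespace example
```

An exception escaping a destructor terminates the program by default, which is why cleanup paths log instead of throwing.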

Documentation

Public C++ and CUDA APIs require user-facing Doxygen documentation. Document the purpose, parameters, return values, relevant template or overload behavior, and any constraints that affect correct use.
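A documented function might look like the following sketch. The function `neighbors_size` and the `example` namespace are hypothetical and exist only to show one possible comment layout covering purpose, parameters, return value, and constraints.

```cpp
#include <cstdint>
#include <stdexcept>

namespace example {  // hypothetical function, shown only for the comment layout

/**
 * @brief Compute the number of elements in a row-major neighbors matrix.
 *
 * Usage example:
 * @code{.cpp}
 * auto n = example::neighbors_size(100, 10);  // 100 queries, 10 neighbors each
 * @endcode
 *
 * @param[in] n_queries number of query rows; must be non-negative
 * @param[in] k         number of neighbors per query; must be positive
 * @return total number of elements, n_queries * k
 * @throws std::invalid_argument if a constraint on the inputs is violated
 */
inline std::int64_t neighbors_size(std::int64_t n_queries, std::int64_t k)
{
  if (n_queries < 0) { throw std::invalid_argument("n_queries must be non-negative"); }
  if (k <= 0) { throw std::invalid_argument("k must be positive"); }
  return n_queries * k;
}

}  // namespace example
```

Stating each constraint next to its parameter lets readers find the preconditions without reading the implementation.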