C++ Guidelines
This page collects the engineering conventions that keep cuVS APIs stable, predictable, and easy to maintain. Start with the Contributor Guide, then use this page when designing public APIs, writing CUDA/C++ implementation code, or preparing a change for review.
Local Development
Most cuVS changes can be developed directly in this repository. Cross-project CUDA/C++ work may also require a local RAFT build or a temporary downstream pin.
If a consuming project supports source builds, pass CPM_raft_SOURCE=/path/to/raft/source to its CMake configuration. If the downstream project must pin a RAFT branch while related changes are under review, update the FORK and PINNED_TAG arguments to find_and_configure_raft, then revert that pin before the downstream change merges.
If source builds are not being used, install the local RAFT C++ artifacts into the consuming project’s environment before testing the downstream change.
Public Interface
General Guidelines
Public C++ APIs should be stateless wrappers around implementation code in a private detail namespace.
Expose only lightweight, predictable types:
- Plain data structs used for parameters or metadata.
- raft::resources, because it owns execution resources rather than algorithm state.
- raft::span and raft::mdspan views for single- and multi-dimensional data.
- std::optional for optional values instead of sentinel pointers.
Prefer references for required inputs. Reserve pointers for established output patterns and avoid exposing temporary implementation classes in public headers.
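As a minimal sketch of these guidelines, the following uses only standard C++. The names search_params, search_summary, and summarize are hypothetical and are not cuVS APIs. It shows a plain parameter struct, required inputs taken by reference, and std::optional in place of a sentinel pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical plain-data parameter struct: it stores state and performs no computation.
struct search_params {
  int k = 10;                         // number of neighbors requested
  std::optional<std::uint64_t> seed;  // empty means "no seed supplied"
};

// Hypothetical metadata struct returned to the caller.
struct search_summary {
  int effective_k;
  std::uint64_t seed_used;
};

// Required inputs are references, so there is no null-pointer case to handle.
inline search_summary summarize(const search_params& params, const std::vector<float>& dataset)
{
  assert(params.k > 0);
  const int n = static_cast<int>(dataset.size());
  search_summary out{};
  out.effective_k = params.k < n ? params.k : n;  // clamp k to the dataset size
  out.seed_used   = params.seed.value_or(0);      // 0 is an illustrative default only
  return out;
}
```

Because the optional seed carries its own "not set" state, callers never pass nullptr to mean "use the default".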
API Stability
Public APIs are consumed by multiple projects and should change carefully. Add new APIs before removing old ones, deprecate old entry points over a few releases, and avoid changing behavior in ways that downstream users cannot detect at compile time.
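One way to make a pending removal detectable at compile time is the standard [[deprecated]] attribute. The sketch below is illustrative only; the example namespace and both function names are hypothetical, and cuVS may use its own deprecation mechanism.

```cpp
#include <cassert>
#include <vector>

namespace example {

// New entry point, added before the old one is removed.
inline std::vector<float> scale(const std::vector<float>& in, float factor)
{
  std::vector<float> out;
  out.reserve(in.size());
  for (float v : in) {
    out.push_back(v * factor);
  }
  return out;
}

// Old entry point, kept for a few releases. Callers get a compiler warning
// that names the replacement, and behavior is unchanged until removal.
[[deprecated("use example::scale(in, factor) instead")]]
inline std::vector<float> scale_by_two(const std::vector<float>& in)
{
  return scale(in, 2.0f);
}

}  // namespace example
```

The old function forwards to the new one, so the two cannot drift apart during the deprecation window.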
Stateless C++ APIs
Avoid public APIs that store algorithm state in non-POD wrapper objects:
Prefer stateless, instantiated overloads for the supported type combinations. Template implementations can still live in detail, but public entry points should be concrete:
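The pattern can be sketched in standard C++ as follows. This is an illustration under assumptions, not cuVS code: example::resources is a hypothetical stand-in for an execution-resources handle, and sum is a placeholder algorithm.

```cpp
#include <cassert>
#include <vector>

namespace example {

// Hypothetical stand-in for an execution-resources handle (not raft::resources).
struct resources {};

namespace detail {
// The template implementation stays private and can change freely.
template <typename T>
T sum_impl(const resources&, const std::vector<T>& values)
{
  T total{};
  for (const T& v : values) {
    total += v;
  }
  return total;
}
}  // namespace detail

// Public entry points: concrete, stateless overloads for each supported type.
inline float sum(const resources& res, const std::vector<float>& values)
{
  return detail::sum_impl(res, values);
}

inline double sum(const resources& res, const std::vector<double>& values)
{
  return detail::sum_impl(res, values);
}

}  // namespace example
```

Callers only ever see the two concrete signatures, so unsupported type combinations fail at overload resolution instead of deep inside a template.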
Functions On State
When an API creates an index or model object, also expose stateless functions for persistence and transfer. Keep those functions in the same public namespace as the owning algorithm:
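A minimal sketch of this layout, using only standard C++ streams, might look like the following. The toy_index type and the build, serialize, and deserialize functions are hypothetical, and the binary layout is an assumption made for illustration rather than any cuVS file format.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

namespace example {

// Hypothetical index object: it holds state only.
struct toy_index {
  std::uint32_t dim = 0;
  std::vector<float> data;
};

// Stateless function that creates the index.
inline toy_index build(std::uint32_t dim, const std::vector<float>& dataset)
{
  assert(dim > 0 && dataset.size() % dim == 0);
  toy_index idx;
  idx.dim  = dim;
  idx.data = dataset;
  return idx;
}

// Stateless persistence functions in the same namespace as the index.
inline void serialize(std::ostream& os, const toy_index& idx)
{
  const std::uint64_t n = idx.data.size();
  os.write(reinterpret_cast<const char*>(&idx.dim), sizeof(idx.dim));
  os.write(reinterpret_cast<const char*>(&n), sizeof(n));
  os.write(reinterpret_cast<const char*>(idx.data.data()),
           static_cast<std::streamsize>(n * sizeof(float)));
}

inline toy_index deserialize(std::istream& is)
{
  toy_index idx;
  std::uint64_t n = 0;
  is.read(reinterpret_cast<char*>(&idx.dim), sizeof(idx.dim));
  is.read(reinterpret_cast<char*>(&n), sizeof(n));
  idx.data.resize(n);
  is.read(reinterpret_cast<char*>(idx.data.data()),
          static_cast<std::streamsize>(n * sizeof(float)));
  return idx;
}

}  // namespace example
```

Keeping these as free functions means the index type stays a simple state holder, and a round trip through serialize and deserialize can be tested without any member-function machinery.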
Common Design Considerations
- Use .hpp for headers that can be compiled by gcc against the CUDA runtime. Use .cuh when a header requires nvcc.
- Keep public types simple. They should store state, not perform computation.
- Document every public API with a clear summary, parameter descriptions, and a short usage example when helpful.
- Before adding a primitive, check whether an existing primitive can be extended cleanly. Add a new public API only when the behavior is genuinely distinct.
Performance
Prefer small, explicit choices that avoid hidden overhead:
- Use cudaDeviceGetAttribute instead of cudaGetDeviceProperties in performance-critical code. See the CUDA developer blog post on fast device property queries.
- Reuse the stream pool on the provided raft::resources object instead of creating one raft::resources object per stream. See Threading Model and Resource Management.
- Keep CPU work around GPU launches light. If host threads are used, they should coordinate CUDA streams, not perform heavy CPU computation.
Threading Model
cuVS algorithms should be safe to call from multiple host threads when each thread uses its own raft::resources instance. Treat raft::resources as the boundary for CUDA streams, memory resources, communication handles, and library handles.
Inside an algorithm, host threads are acceptable only when they help keep CUDA streams busy. Keep them bounded, prefer OpenMP, and make sure the algorithm still works when OpenMP is disabled.
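The sketch below illustrates the "bounded, and still correct without OpenMP" requirement in plain C++. The process_batches function is hypothetical, and the loop body is a placeholder for the lightweight host work that would enqueue a batch on a stream. The pragma is guarded so the same code compiles and produces the same result when OpenMP is disabled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-batch host loop. max_threads bounds the number of host
// threads when OpenMP is enabled; otherwise the loop simply runs serially.
inline void process_batches(std::vector<int>& batches, int max_threads)
{
  assert(max_threads > 0);
  const int n = static_cast<int>(batches.size());
#ifdef _OPENMP
#pragma omp parallel for num_threads(max_threads)
#endif
  for (int i = 0; i < n; ++i) {
    // Placeholder for "enqueue batch i on a stream"; each iteration touches
    // only its own element, so iterations are independent.
    const std::size_t idx = static_cast<std::size_t>(i);
    batches[idx] = batches[idx] * 2;
  }
}
```

Because iterations do not share state, the result does not depend on whether the compiler honors the pragma.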
If there is no CPU work before the first kernel, make the internal streams wait on the main stream with CUDA events. If there is no CPU work after each batch, synchronize the stream pool once after the loop instead of synchronizing inside every iteration.
Asynchronous Operations And Stream Ordering
cuVS algorithms should be asynchronous whenever possible and should avoid the default CUDA stream. For single-stream work, use the stream on raft::resources:
When an algorithm uses internal streams, preserve ordering with the caller’s stream:
- Work already queued on raft::resource::get_cuda_stream(res) must complete before internal stream work starts.
- Work queued by the caller after the API returns must wait until all internal stream work is complete.
Use CUDA events and cudaStreamWaitEvent to create those dependencies. This lets users compose cuVS operations with their own asynchronous copies and kernels without accidental races.
Using Thrust
Run Thrust algorithms on the intended stream and memory resource by using the execution policy from raft::resources:
Resource Management
Do not create reusable CUDA resources directly inside algorithm implementations. Reuse the handles, streams, events, allocators, and library resources attached to raft::resources. If a reusable resource is missing, file an issue or feature request instead of creating a local long-lived resource.
Users can configure the stream pool once and pass the same raft::resources object through the API:
Multi-GPU
cuVS multi-GPU APIs generally use one of two execution strategies:
- Single-process multi-GPU: one host process owns and coordinates multiple GPUs. Use raft::device_resources_snmg in examples and public API documentation for this model. It owns the selected devices, streams, communication resources, and per-device execution state.
- One-process-per-GPU: each process owns one GPU and participates as a communication rank. Use raft::device_resources or raft::resources with an initialized raft::comms::comms_t. Access communication through raft::resource::get_comms, and guard optional communication paths with raft::resource::comms_initialized.
APIs may support either strategy or both, but they should document which resource object users are expected to construct. Keep streams, stream pools, memory resources, and communicators attached to the supplied resource object instead of creating unmanaged process-wide state. Single-GPU APIs should not depend on communication libraries or require multi-GPU resources.
For one-process-per-GPU implementations, developers can assume:
- raft::comms::comms_t has been initialized correctly.
- All participating ranks call the multi-GPU algorithm cooperatively.
Access the communicator through raft::resources:
Using Just-in-Time Link-Time Optimization
cuVS is moving new kernels toward JIT link-time optimization. Instead of compiling every kernel variant into the binary, JIT LTO compiles fragments and links the needed combination at runtime.
This helps reduce binary size and enables user-defined functions in cuVS CUDA kernels. For runtime and cache behavior, see JIT Compilation. For implementation guidance, see Link-time Optimization.
Coding Style
Formatting
cuVS uses pre-commit to run formatting, linting, spelling, and copyright checks. Install it with conda:
Run checks before committing:
Run the full suite across the repository when needed:
You can also install the git hook:
Core Hooks
C++ and CUDA code are formatted with clang-format. cuVS follows the Google C++ style with a few local adjustments documented in cpp/.clang-format:
- Empty functions, records, and namespaces are not split.
- Indentation is two spaces, including line continuations.
- Comments are not reflowed automatically.
Doxygen checks C++ and CUDA API documentation:
codespell catches spelling issues. To apply suggested fixes interactively, run:
Include Style
Use #include "..." only for local files in the same algorithm or nearby directory. Use #include <...> for dependencies, primitives, and headers from other algorithms.
To bulk-fix include style issues, run:
Copyright
RAPIDS pre-commit hooks check copyright headers on modified tracked files. To run that check manually:
Code Quality
Testing
Public APIs need direct test coverage because downstream projects rely on their compile-time and runtime behavior. Prefer tests that exercise the public entry point, cover edge cases, and make the expected behavior visible without requiring downstream projects to catch regressions first.
Performance Benchmarking
The most important implementation details in cuVS are written in C++ and CUDA, so performance-sensitive changes need benchmarks with clear baselines. Benchmarks should show that regressions have not occurred, intended improvements can be reproduced consistently, and the implementation scales as expected across problem sizes and hardware configurations. For relevant indexing APIs, it is often preferable to use the cuVS Bench Tool for reproducible benchmarks.
Attach benchmark results to the relevant GitHub pull request so future reviewers and contributors have an audit trail. Pull requests that change performance-critical code should not be merged without proper benchmarks in place.
For multi-GPU APIs, include scaling measurements whenever the change affects communication, partitioning, synchronization, or resource use across ranks. A change that improves a single-GPU path should not silently reduce multi-GPU efficiency.
Error Handling
Call CUDA and library APIs through the RAFT helper macros, such as RAFT_CUDA_TRY, RAFT_CUBLAS_TRY, and RAFT_CUSOLVER_TRY. They check return values and throw on failure.
Use the _NO_THROW variants only where throwing is unsafe, such as destructors. Those variants log errors without throwing.
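To show the difference between the throwing and non-throwing styles without depending on CUDA, here is a simplified sketch in standard C++. EXAMPLE_TRY, EXAMPLE_TRY_NO_THROW, the status enum, and release_resource are all hypothetical; they only imitate the general shape of such macros and are not the RAFT definitions.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical status code standing in for a CUDA or library return value.
enum class status { success = 0, failure = 1 };

// Throwing check: evaluates the call once and throws on failure.
#define EXAMPLE_TRY(call) \
  do { \
    const status example_status_ = (call); \
    if (example_status_ != status::success) { \
      throw std::runtime_error(std::string("call failed: ") + #call); \
    } \
  } while (0)

// Non-throwing check: logs the failure instead of throwing.
#define EXAMPLE_TRY_NO_THROW(call) \
  do { \
    const status example_status_ = (call); \
    if (example_status_ != status::success) { \
      std::fprintf(stderr, "call failed: %s\n", #call); \
    } \
  } while (0)

// Hypothetical operation whose result can be forced to fail.
inline status release_resource(bool ok) { return ok ? status::success : status::failure; }

// Destructors must not throw, so cleanup uses the non-throwing variant.
struct guard {
  bool ok;
  ~guard() { EXAMPLE_TRY_NO_THROW(release_resource(ok)); }
};

// Returns true if the throwing variant reports a failed call as an exception.
inline bool throws_on_failure()
{
  try {
    EXAMPLE_TRY(release_resource(false));
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}
```

A failure inside the guard destructor is logged to stderr and execution continues, while the same failure in ordinary code surfaces as an exception the caller can handle.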
Documentation
Public C++ and CUDA APIs require user-facing Doxygen documentation. Document the purpose, parameters, return values, relevant template or overload behavior, and any constraints that affect correct use.
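As an illustration of the level of detail expected, here is a Doxygen comment on a hypothetical function. squared_l2 is not a cuVS API, and the exact tag conventions used in the cuVS codebase may differ from this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

/**
 * @brief Compute the squared L2 distance between two vectors.
 *
 * Hypothetical example used to illustrate documentation style.
 *
 * @param[in] a First vector.
 * @param[in] b Second vector. Must have the same length as `a`.
 * @return Sum of squared element-wise differences, or 0 for empty inputs.
 */
inline float squared_l2(const std::vector<float>& a, const std::vector<float>& b)
{
  assert(a.size() == b.size());  // documented constraint on the inputs
  float total = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    const float d = a[i] - b[i];
    total += d * d;
  }
  return total;
}
```

The comment states the purpose, each parameter, the return value, and the one constraint that affects correct use.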