C++ Guidelines

This page collects the engineering conventions that keep cuVS APIs stable, predictable, and easy to maintain. Start with the Contributor Guide, then use this page when designing public APIs, writing CUDA/C++ implementation code, or preparing a change for review.

Local Development

Most cuVS changes can be developed directly in this repository. Cross-project CUDA/C++ work may also require a local RAFT build or a temporary downstream pin.

If a consuming project supports source builds, pass CPM_raft_SOURCE=/path/to/raft/source to its CMake configuration. If the downstream project must pin a RAFT branch while related changes are under review, update the FORK and PINNED_TAG arguments to find_and_configure_raft, then revert that pin before the downstream change merges.

If source builds are not being used, install the local RAFT C++ artifacts into the consuming project’s environment before testing the downstream change.

Public Interface

General Guidelines

Public C++ APIs should be stateless wrappers around implementation code in a private detail namespace.

Expose only lightweight, predictable types:

  1. Plain data structs used for parameters or metadata.
  2. raft::resources, because it owns execution resources rather than algorithm state.
  3. raft::span and raft::mdspan views for single- and multi-dimensional data.
  4. std::optional for optional values instead of sentinel pointers.

Prefer references for required inputs. Reserve pointers for established output patterns and avoid exposing temporary implementation classes in public headers.
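
As a minimal sketch of these guidelines, the following uses standard-library types only. The `example` namespace, `search_params`, and `resolve_n_probes` are hypothetical names chosen for illustration, not cuVS APIs, and the default of 20 probes is an arbitrary assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

namespace example {

// Plain data struct: it stores parameters and performs no computation.
struct search_params {
  std::uint32_t k = 10;
  // Empty means "use the default"; no sentinel value or null pointer is needed.
  std::optional<std::uint32_t> n_probes;
};

// The required input is taken by const reference.
// The fallback rule (at most 20 probes) is an assumption made for this sketch.
inline std::uint32_t resolve_n_probes(search_params const& params, std::uint32_t n_lists)
{
  assert(n_lists > 0);
  std::uint32_t const fallback = std::min<std::uint32_t>(20, n_lists);
  return std::min(params.n_probes.value_or(fallback), n_lists);
}

}  // namespace example
```

A caller that leaves `n_probes` unset gets the fallback, while a caller that sets it gets the requested value clamped to `n_lists`.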

API Stability

Public APIs are consumed by multiple projects and should change carefully. Add new APIs before removing old ones, deprecate old entry points over a few releases, and avoid changing behavior in ways that downstream users cannot detect at compile time.
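
One way to make a deprecation visible at compile time is the standard `[[deprecated]]` attribute. The sketch below is illustrative only: `sum_of_squares` and `sqnorm` are hypothetical names, not cuVS functions, and the project may use its own deprecation macros instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace example {

// New entry point, added first so callers have something to migrate to.
inline float sum_of_squares(std::vector<float> const& values)
{
  float sum = 0.0f;
  for (float v : values) { sum += v * v; }
  return sum;
}

// Old entry point, kept for a few releases as a thin forwarding wrapper.
// The attribute turns every remaining call site into a compiler warning.
[[deprecated("use example::sum_of_squares(std::vector<float> const&) instead")]]
inline float sqnorm(float const* data, std::size_t n)
{
  assert(data != nullptr || n == 0);
  return sum_of_squares(std::vector<float>(data, data + n));
}

}  // namespace example
```

Because the old function forwards to the new one, both paths return the same result during the deprecation window.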

Stateless C++ APIs

Avoid public APIs that store algorithm state in non-POD wrapper objects:

class ivf_pq_float {
  ivf_pq::index_params params_;
  raft::resources const& res_;

 public:
  ivf_pq_float(raft::resources const& res);

  void train(raft::device_matrix_view<const float, int64_t, raft::row_major> dataset);

  void search(raft::device_matrix_view<const float, int64_t, raft::row_major> queries,
              raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
              raft::device_matrix_view<float, int64_t, raft::row_major> distances);
};

Prefer stateless, instantiated overloads for the supported type combinations. Template implementations can still live in detail, but public entry points should be concrete:

namespace cuvs::neighbors::ivf_pq {

auto build(raft::resources const& res,
           index_params const& params,
           raft::device_matrix_view<const float, int64_t, raft::row_major> dataset)
  -> index<int64_t>;

void build(raft::resources const& res,
           index_params const& params,
           raft::device_matrix_view<const float, int64_t, raft::row_major> dataset,
           index<int64_t>* idx);

void search(raft::resources const& res,
            search_params const& params,
            index<int64_t> const& idx,
            raft::device_matrix_view<const float, int64_t, raft::row_major> queries,
            raft::device_matrix_view<int64_t, int64_t, raft::row_major> neighbors,
            raft::device_matrix_view<float, int64_t, raft::row_major> distances);

// Add supported variants, such as half or int8_t, as separate overloads.
}  // namespace cuvs::neighbors::ivf_pq
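
The split between a template in `detail` and concrete public overloads can be sketched with standard-library types. Everything here is hypothetical (`example::mean` is not a cuVS function), and the overloads are defined inline only to keep the sketch self-contained; a library would more likely declare them in a header and define them in a separate source file.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace example {
namespace detail {

// The template implementation stays private.
template <typename T>
double mean_impl(std::vector<T> const& values)
{
  assert(!values.empty());
  double sum = 0.0;
  for (T v : values) { sum += static_cast<double>(v); }
  return sum / static_cast<double>(values.size());
}

}  // namespace detail

// Public entry points are concrete: one overload per supported element type.
inline double mean(std::vector<float> const& values) { return detail::mean_impl(values); }
inline double mean(std::vector<std::int8_t> const& values) { return detail::mean_impl(values); }

}  // namespace example
```

Callers see only the two overloads, so an unsupported element type fails at compile time instead of silently instantiating a new variant.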

Functions On State

When an API creates an index or model object, also expose stateless functions for persistence and transfer. Keep those functions in the same public namespace as the owning algorithm:

namespace cuvs::neighbors::ivf_pq {

void serialize(raft::resources const& res, std::ostream& os, index<int64_t> const& index);

void deserialize(raft::resources const& res, std::istream& is, index<int64_t>* index);

}  // namespace cuvs::neighbors::ivf_pq
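
To illustrate the shape of such free functions, here is a host-only sketch with a toy index type. `toy_index`, its binary layout, and `round_trip` are assumptions made for this example; the sketch omits the `raft::resources` argument because it touches no device memory, and it does not reflect the actual cuVS serialization format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <utility>
#include <vector>

namespace example {

// Toy index: plain state only, no member functions that compute.
struct toy_index {
  std::int64_t dim = 0;
  std::vector<float> centers;
};

inline void serialize(std::ostream& os, toy_index const& index)
{
  auto const n = static_cast<std::int64_t>(index.centers.size());
  auto const bytes = static_cast<std::streamsize>(index.centers.size() * sizeof(float));
  os.write(reinterpret_cast<char const*>(&index.dim), sizeof(index.dim));
  os.write(reinterpret_cast<char const*>(&n), sizeof(n));
  os.write(reinterpret_cast<char const*>(index.centers.data()), bytes);
  if (!os) { throw std::runtime_error("serialize: write failed"); }
}

inline void deserialize(std::istream& is, toy_index* index)
{
  assert(index != nullptr);
  std::int64_t dim = 0;
  std::int64_t n   = 0;
  is.read(reinterpret_cast<char*>(&dim), sizeof(dim));
  is.read(reinterpret_cast<char*>(&n), sizeof(n));
  if (!is || dim < 0 || n < 0) { throw std::runtime_error("deserialize: bad header"); }

  std::vector<float> centers(static_cast<std::size_t>(n));
  auto const bytes = static_cast<std::streamsize>(centers.size() * sizeof(float));
  is.read(reinterpret_cast<char*>(centers.data()), bytes);
  if (!is) { throw std::runtime_error("deserialize: truncated payload"); }

  // Commit to the output only after the whole payload was read.
  index->dim     = dim;
  index->centers = std::move(centers);
}

// Usage: write an index to an in-memory stream and read it back.
inline toy_index round_trip(toy_index const& in)
{
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  serialize(buffer, in);
  toy_index out;
  deserialize(buffer, &out);
  return out;
}

}  // namespace example
```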

Common Design Considerations

  1. Use .hpp for headers that can be compiled by gcc against the CUDA runtime. Use .cuh when a header requires nvcc.
  2. Keep public types simple. They should store state, not perform computation.
  3. Document every public API with a clear summary, parameter descriptions, and a short usage example when helpful.
  4. Before adding a primitive, check whether an existing primitive can be extended cleanly. Add a new public API only when the behavior is genuinely distinct.

Performance

Prefer small, explicit choices that avoid hidden overhead:

  1. Use cudaDeviceGetAttribute instead of cudaGetDeviceProperties in performance-critical code. See the CUDA developer blog post on fast device property queries.
  2. Reuse the stream pool on the provided raft::resources object instead of creating one raft::resources object per stream. See Threading Model and Resource Management.
  3. Keep CPU work around GPU launches light. If host threads are used, they should coordinate CUDA streams, not perform heavy CPU computation.

Threading Model

cuVS algorithms should be safe to call from multiple host threads when each thread uses its own raft::resources instance. Treat raft::resources as the boundary for CUDA streams, memory resources, communication handles, and library handles.

Inside an algorithm, host threads are acceptable only when they help keep CUDA streams busy. Keep them bounded, prefer OpenMP, and make sure the algorithm still works when OpenMP is disabled.

#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resource/cuda_stream_pool.hpp>
#include <raft/core/resources.hpp>

// my_kernel, blocks, threads, preprocess_batch, and postprocess_batch are placeholders.
void run_batches(raft::resources const& res, int n_batches)
{
  // Finish work already queued on the caller's stream before the internal streams start.
  auto main_stream = raft::resource::get_cuda_stream(res);
  raft::resource::sync_stream(res, main_stream);

#pragma omp parallel for
  for (int i = 0; i < n_batches; ++i) {
    auto stream = raft::resource::get_stream_from_stream_pool(res);

    // Keep host work here light. The thread exists to drive GPU work.
    preprocess_batch(i);
    my_kernel<<<blocks, threads, 0, stream>>>(i);
    // If postprocess_batch reads the kernel's output, synchronize `stream` first.
    postprocess_batch(i);
  }

  // Wait for all internal streams before returning to the caller.
  raft::resource::sync_stream_pool(res);
}

If there is no CPU work before the first kernel, make the internal streams wait on the main stream with CUDA events. If there is no CPU work after each batch, synchronize the stream pool once after the loop instead of synchronizing inside every iteration.
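
The requirement that an algorithm still works when OpenMP is disabled can be sketched in host-only C++. The names `bounded_threads` and `run_batches` are hypothetical, and the loop body is a stand-in for enqueueing GPU work; the point is that the code compiles and produces the same result whether or not `_OPENMP` is defined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

namespace example {

// Bound the number of host threads. Without OpenMP the loop below runs serially.
inline int bounded_threads(int requested)
{
#ifdef _OPENMP
  return std::max(1, std::min(requested, omp_get_max_threads()));
#else
  (void)requested;
  return 1;
#endif
}

// Each iteration writes only its own slot, so the loop is correct with or without OpenMP.
inline std::vector<int> run_batches(int n_batches, int requested_threads)
{
  assert(n_batches >= 0);
  std::vector<int> results(static_cast<std::size_t>(n_batches), 0);
  int const n_threads = bounded_threads(requested_threads);
  (void)n_threads;  // unused when OpenMP is disabled

#pragma omp parallel for num_threads(n_threads)
  for (int i = 0; i < n_batches; ++i) {
    // Stand-in for "enqueue GPU work for batch i on a pooled stream".
    results[static_cast<std::size_t>(i)] = i * i;
  }
  return results;
}

}  // namespace example
```

When the compiler is invoked without OpenMP support, the pragma is ignored and the loop runs on the calling thread.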

Asynchronous Operations And Stream Ordering

cuVS algorithms should be asynchronous whenever possible and should avoid the default CUDA stream. For single-stream work, use the stream on raft::resources:

#include <raft/core/resource/cuda_stream.hpp>
#include <raft/core/resources.hpp>

void foo(raft::resources const& res)
{
  cudaStream_t stream = raft::resource::get_cuda_stream(res);
  // Launch kernels and asynchronous copies on `stream`, not on the default stream.
}

When an algorithm uses internal streams, preserve ordering with the caller’s stream:

  1. Work already queued on raft::resource::get_cuda_stream(res) must complete before internal stream work starts.
  2. Work queued by the caller after the API returns must wait until all internal stream work is complete.

Use CUDA events and cudaStreamWaitEvent to create those dependencies. This lets users compose cuVS operations with their own asynchronous copies and kernels without accidental races.

Using Thrust

Run Thrust algorithms on the intended stream and memory resource by using the execution policy from raft::resources:

#include <raft/core/resource/thrust_policy.hpp>
#include <raft/core/resources.hpp>

#include <thrust/for_each.h>

// first, last, and op are placeholders for the caller's iterators and functor.
void foo(raft::resources const& res)
{
  auto policy = raft::resource::get_thrust_policy(res);
  thrust::for_each(policy, first, last, op);
}

Resource Management

Do not create reusable CUDA resources directly inside algorithm implementations. Reuse the handles, streams, events, allocators, and library resources attached to raft::resources. If a reusable resource is missing, file an issue or feature request instead of creating a local long-lived resource.

#include <raft/core/resource/cublas_handle.hpp>
#include <raft/core/resource/cuda_stream_pool.hpp>
#include <raft/core/resources.hpp>

void foo(raft::resources const& res)
{
  cublasHandle_t cublas_handle = raft::resource::get_cublas_handle(res);
  auto stream                  = raft::resource::get_stream_from_stream_pool(res);
}

Users can configure the stream pool once and pass the same raft::resources object through the API:

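#include <raft/core/resource/cuda_stream_pool.hpp>
#include <raft/core/resources.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <memory>

void foo(raft::resources const& res);  // defined in the previous example

int main()
{
  raft::resources res;
  // Configure a pool of four streams once, then reuse `res` for every call.
  raft::resource::set_cuda_stream_pool(res, std::make_shared<rmm::cuda_stream_pool>(4));

  foo(res);
}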

Multi-GPU

cuVS multi-GPU APIs generally use one of two execution strategies:

  1. Single-process multi-GPU: one host process owns and coordinates multiple GPUs. Use raft::device_resources_snmg in examples and public API documentation for this model. It owns the selected devices, streams, communication resources, and per-device execution state.
  2. One-process-per-GPU: each process owns one GPU and participates as a communication rank. Use raft::device_resources or raft::resources with an initialized raft::comms::comms_t. Access communication through raft::resource::get_comms, and guard optional communication paths with raft::resource::comms_initialized.

APIs may support either strategy or both, but they should document which resource object users are expected to construct. Keep streams, stream pools, memory resources, and communicators attached to the supplied resource object instead of creating unmanaged process-wide state. Single-GPU APIs should not depend on communication libraries or require multi-GPU resources.

For one-process-per-GPU implementations, developers can assume:

  1. raft::comms::comms_t has been initialized correctly.
  2. All participating ranks call the multi-GPU algorithm cooperatively.

Access the communicator through raft::resources:

#include <raft/core/resource/comms.hpp>
#include <raft/core/resources.hpp>

void foo(raft::resources const& res)
{
  auto const& comm = raft::resource::get_comms(res);
  int rank         = comm.get_rank();
  int size         = comm.get_size();
}

JIT Link-Time Optimization

cuVS is moving new kernels toward JIT link-time optimization. Instead of compiling every kernel variant into the binary, JIT LTO compiles fragments and links the needed combination at runtime.

This helps reduce binary size and enables user-defined functions in cuVS CUDA kernels. For runtime and cache behavior, see JIT Compilation. For implementation guidance, see Link-time Optimization.

Coding Style

Formatting

cuVS uses pre-commit to run formatting, linting, spelling, and copyright checks. Install it with conda:

$ conda install -c conda-forge pre-commit

Run the checks on staged files before committing:

$ pre-commit run

Run the full suite across the repository when needed:

$ pre-commit run --all-files

You can also install the git hook:

$ pre-commit install

Core Hooks

C++ and CUDA code are formatted with clang-format. cuVS follows the Google C++ style with a few local adjustments documented in cpp/.clang-format:

  1. Empty functions, records, and namespaces are not split.
  2. Indentation is two spaces, including line continuations.
  3. Comments are not reflowed automatically.

Doxygen checks C++ and CUDA API documentation:

$ ./ci/checks/doxygen.sh

codespell catches spelling issues. To apply suggested fixes interactively, run:

$ codespell -i 3 -w .

Include Style

Use #include "..." only for local files in the same algorithm or nearby directory. Use #include <...> for dependencies, primitives, and headers from other algorithms.

To bulk-fix include style issues, run:

$ python ./cpp/scripts/include_checker.py --inplace cpp/include cpp/tests

Copyright

RAPIDS pre-commit hooks check copyright headers on modified tracked files. To run that check manually:

$ pre-commit run -a verify-copyright

Code Quality

Testing

Public APIs need direct test coverage because downstream projects rely on their compile-time and runtime behavior. Prefer tests that exercise the public entry point, cover edge cases, and make the expected behavior visible without requiring downstream projects to catch regressions first.

Performance Benchmarking

The most important implementation details in cuVS are written in C++ and CUDA, so performance-sensitive changes need benchmarks with clear baselines. Benchmarks should show that regressions have not occurred, intended improvements can be reproduced consistently, and the implementation scales as expected across problem sizes and hardware configurations. For relevant indexing APIs, it is often preferable to use the cuVS Bench Tool for reproducible benchmarks.

Attach benchmark results to the relevant GitHub pull request so future reviewers and contributors have an audit trail. Pull requests that change performance-critical code should not be merged without proper benchmarks in place.

For multi-GPU APIs, include scaling measurements whenever the change affects communication, partitioning, synchronization, or resource use across ranks. A change that improves a single-GPU path should not silently reduce multi-GPU efficiency.

Error Handling

Call CUDA and library APIs through the RAFT helper macros, such as RAFT_CUDA_TRY, RAFT_CUBLAS_TRY, and RAFT_CUSOLVER_TRY. They check return values and throw on failure.

Use the _NO_THROW variants only where throwing is unsafe, such as destructors. Those variants log errors without throwing.
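
The difference between the throwing and non-throwing variants can be sketched with a pair of hypothetical macros. `EXAMPLE_TRY`, `EXAMPLE_TRY_NO_THROW`, `status`, and `fake_api_call` are stand-ins invented for this example; they are not the RAFT macros and do not reproduce their exact messages or exception types.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>

namespace example {

// Stand-in for a C-style status code such as cudaError_t.
enum class status { success, failure };

// Hypothetical API call that reports failure through its return value.
inline status fake_api_call(bool ok) { return ok ? status::success : status::failure; }

}  // namespace example

// Throwing variant: check the return value and raise on failure.
#define EXAMPLE_TRY(call)                                             \
  do {                                                                \
    if ((call) != example::status::success) {                         \
      throw std::runtime_error(std::string("call failed: ") + #call); \
    }                                                                 \
  } while (0)

// Non-throwing variant: log the failure and continue.
#define EXAMPLE_TRY_NO_THROW(call)                   \
  do {                                               \
    if ((call) != example::status::success) {        \
      std::cerr << "call failed: " << #call << '\n'; \
    }                                                \
  } while (0)

namespace example {

// Destructors must not throw, so cleanup uses the logging variant.
struct scoped_thing {
  ~scoped_thing() { EXAMPLE_TRY_NO_THROW(fake_api_call(false)); }
};

inline void usage()
{
  EXAMPLE_TRY(fake_api_call(true));  // success: nothing is thrown

  bool threw = false;
  try {
    scoped_thing guard;  // its destructor logs during unwinding instead of throwing
    EXAMPLE_TRY(fake_api_call(false));
  } catch (std::runtime_error const&) {
    threw = true;
  }
  assert(threw);  // the failure reached the caller as an exception
}

}  // namespace example
```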

Documentation

Public C++ and CUDA APIs require user-facing Doxygen documentation. Document the purpose, parameters, return values, relevant template or overload behavior, and any constraints that affect correct use.
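
As an illustration of the level of detail intended, here is a Doxygen comment on a small host-only function. The function `squared_l2` is hypothetical and exists only to carry the comment; the tags used (`@brief`, `@code`, `@param`, `@return`) are standard Doxygen commands.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace example {

/**
 * @brief Compute the squared L2 distance between two vectors.
 *
 * Usage example:
 * @code{.cpp}
 * std::vector<float> a{0.0f, 0.0f};
 * std::vector<float> b{3.0f, 4.0f};
 * float d = example::squared_l2(a, b);  // d == 25.0f
 * @endcode
 *
 * @param[in] a first vector
 * @param[in] b second vector; must have the same length as `a`
 * @return sum of squared element-wise differences, or 0 for empty inputs
 */
inline float squared_l2(std::vector<float> const& a, std::vector<float> const& b)
{
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    float const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

}  // namespace example
```

The comment states the constraint (equal lengths) that the implementation relies on, so callers do not have to read the body to use the function correctly.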