Framework Interop
Lazy-loaded helpers for allocating CuPy arrays and PyTorch tensors backed by
NCCL-managed memory, plus resolvers that translate framework objects into
the (ptr, count, dtype, device_id) tuple NCCL expects. The submodules
are imported on first attribute access via nccl.core.cupy and
nccl.core.torch.
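The deferred import described above can be pictured with a small stand-in. This is only a sketch of the general technique (an object whose `__getattr__` imports a module the first time it is requested), not the package's actual implementation; `LazySubmodules` and the `json_tools` attribute are hypothetical names, and the standard-library `json` module stands in for a heavy framework.

```python
import importlib


class LazySubmodules:
    """Namespace that imports mapped modules on first attribute access."""

    def __init__(self, targets):
        # Maps attribute name -> importable module path.
        self._targets = dict(targets)

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails,
        # i.e. before the module has been loaded and cached.
        targets = self.__dict__.get("_targets", {})
        if name not in targets:
            raise AttributeError(name)
        module = importlib.import_module(targets[name])
        setattr(self, name, module)  # cache: later accesses skip __getattr__
        return module


# Hypothetical namespace; nothing is imported until the attribute is touched.
core = LazySubmodules({"json_tools": "json"})
```

Accessing `core.json_tools` triggers the import once; afterwards the module is an ordinary attribute. The benefit of this design is that importing the parent package does not require CuPy or PyTorch to be installed.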
CuPy
- nccl.core.interop.cupy.empty(shape: int | tuple[int, ...], dtype: str | np.dtype | cupy.dtype | type = float, order: Literal['C', 'F'] = 'C') → cupy.ndarray
Creates an uninitialized CuPy array backed by NCCL-allocated memory.
The interface mirrors cupy.empty, but the underlying memory is allocated through NCCL’s memory allocator instead of CuPy’s, which suits buffers used for communication in distributed scenarios. The contents of the returned array are uninitialized.
Memory is automatically freed when the array is garbage collected; no explicit free call is required. For zero-copy optimization, register the array using register_buffer() or register_window().
- Parameters:
shape – Shape of the array.
dtype – Data type specifier. Defaults to float.
order – Memory layout. ‘C’ for row-major (C-style), ‘F’ for column-major (Fortran-style). Defaults to ‘C’.
- Returns:
An uninitialized CuPy array backed by NCCL-allocated memory.
- Raises:
NcclInvalid – If order is not ‘C’ or ‘F’.
ModuleNotFoundError – If CuPy is not installed.
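A minimal usage sketch, assuming the module is importable under the qualified name shown in the signature above. The wrapper `alloc_cupy_buffer` is a hypothetical helper, not part of the library; it returns None when the NCCL bindings or CuPy are missing so that calling code can fall back gracefully.

```python
import importlib


def alloc_cupy_buffer(shape, dtype="float32", order="C"):
    """Return an NCCL-backed CuPy array, or None if the stack is unavailable."""
    try:
        # Import path assumed from the qualified name in this reference.
        nccl_cupy = importlib.import_module("nccl.core.interop.cupy")
        return nccl_cupy.empty(shape, dtype=dtype, order=order)
    except ModuleNotFoundError:
        # Raised when the nccl package itself or CuPy is not installed.
        return None


if __name__ == "__main__":
    buf = alloc_cupy_buffer((1024, 1024))
    print("allocated" if buf is not None else "NCCL/CuPy not available")
```

No explicit free is needed for the returned array; per the description above, the memory is released when the array is garbage collected.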
- nccl.core.interop.cupy.resolve_array(array: cupy.ndarray) → tuple[int, int, NcclDataType, int]
Resolves a CuPy array to its NCCL buffer descriptor.
- Parameters:
array – CuPy array to resolve.
- Returns:
Tuple of (ptr, count, dtype, device_id) – device pointer, element count, NCCL data type, and CUDA device ID.
- Raises:
ModuleNotFoundError – If CuPy is not installed.
NcclInvalid – If array is not a CuPy ndarray or its dtype has no NCCL equivalent.
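Because the resolver returns a plain 4-tuple, positional unpacking is easy to get wrong. One option is to wrap the result in a named tuple. The sketch below is an illustration only: `BufferDescriptor` and `describe` are hypothetical names, and the resolver is passed in as an argument so the helper works with either resolve_array or resolve_tensor.

```python
from typing import Any, Callable, NamedTuple


class BufferDescriptor(NamedTuple):
    """Named view of the (ptr, count, dtype, device_id) tuple."""

    ptr: int        # device pointer
    count: int      # number of elements
    dtype: Any      # NCCL data type
    device_id: int  # CUDA device ID


def describe(resolve: Callable[[Any], tuple], obj: Any) -> BufferDescriptor:
    """Call a resolver on obj and label the four fields of its result."""
    return BufferDescriptor(*resolve(obj))
```

With CuPy and the NCCL bindings installed, a call would look like `describe(nccl.core.interop.cupy.resolve_array, arr)`, after which fields can be read as `desc.ptr`, `desc.count`, and so on.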
PyTorch
- nccl.core.interop.torch.empty(*size, dtype: torch.dtype | None = None, device: torch.device | int | str | None = None, morder: Literal['C', 'F'] = 'C') → torch.Tensor
Creates an uninitialized PyTorch tensor backed by NCCL-allocated memory.
The interface mirrors torch.empty, but the underlying memory is allocated through NCCL’s memory allocator instead of PyTorch’s, which suits buffers used for communication in distributed scenarios. The contents of the returned tensor are uninitialized.
Memory is automatically freed when the tensor is garbage collected; no explicit free call is required. For zero-copy optimization, register the tensor using register_buffer() or register_window().
- Parameters:
*size – A sequence of integers defining the shape of the output tensor. Can be a variable number of arguments or a single list/tuple.
dtype – Desired data type of the tensor. If None, uses torch.get_default_dtype(). Defaults to None.
device – Device of the tensor. If None, uses the current CUDA device. Defaults to None.
morder – Memory layout. ‘C’ for row-major (C-style), ‘F’ for column-major (Fortran-style). Defaults to ‘C’.
- Returns:
An uninitialized PyTorch tensor backed by NCCL-allocated memory.
- Raises:
NcclInvalid – If morder is not ‘C’ or ‘F’, or device is not a CUDA device.
ModuleNotFoundError – If PyTorch is not installed.
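The two layout values accepted by morder (and by order in the CuPy helper) differ only in which dimension varies fastest in memory. The following pure-Python sketch computes the per-dimension strides, measured in elements, that each layout implies for a given shape. `element_strides` is a hypothetical helper for illustration, and it raises ValueError where the library is documented to raise NcclInvalid.

```python
def element_strides(shape, order="C"):
    """Return strides (in elements) for a dense array of the given shape."""
    if order not in ("C", "F"):
        # Stand-in for the library's NcclInvalid error.
        raise ValueError(f"order must be 'C' or 'F', got {order!r}")

    # Row-major: the last dimension is contiguous, so walk dims back to front.
    # Column-major: the first dimension is contiguous, so walk front to back.
    dims = list(reversed(shape)) if order == "C" else list(shape)

    strides = []
    step = 1
    for extent in dims:
        strides.append(step)
        step *= extent

    return tuple(reversed(strides)) if order == "C" else tuple(strides)


print(element_strides((2, 3, 4), "C"))  # (12, 4, 1)
print(element_strides((2, 3, 4), "F"))  # (1, 2, 6)
```

For a 2 x 3 x 4 shape, row-major layout steps 12 elements to move along the first axis, while column-major layout steps a single element.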
- nccl.core.interop.torch.resolve_tensor(tensor: torch.Tensor) → tuple[int, int, NcclDataType, int]
Resolves a PyTorch tensor to its NCCL buffer descriptor.
- Parameters:
tensor – PyTorch tensor to resolve.
- Returns:
Tuple of (ptr, count, dtype, device_id) – device pointer, element count, NCCL data type, and CUDA device ID.
- Raises:
ModuleNotFoundError – If PyTorch is not installed.
NcclInvalid – If tensor is not a PyTorch tensor or its dtype has no NCCL equivalent.
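Allocation and resolution are typically used together: allocate a tensor, then resolve it to the descriptor that NCCL calls consume. The sketch below assumes the module is importable under the qualified name used in this reference; `alloc_and_resolve` is a hypothetical helper that returns None when the NCCL bindings or PyTorch are not installed.

```python
import importlib


def alloc_and_resolve(*size):
    """Allocate an NCCL-backed tensor and return its 4-tuple descriptor.

    Returns None if the nccl package or PyTorch is not installed.
    """
    try:
        # Import path assumed from the qualified name in this reference.
        nccl_torch = importlib.import_module("nccl.core.interop.torch")
        tensor = nccl_torch.empty(*size)  # default dtype, current CUDA device
        return nccl_torch.resolve_tensor(tensor)
    except ModuleNotFoundError:
        return None


if __name__ == "__main__":
    descriptor = alloc_and_resolve(2, 3)
    if descriptor is None:
        print("NCCL/PyTorch not available")
    else:
        ptr, count, dtype, device_id = descriptor
        print(f"ptr={ptr:#x} count={count} dtype={dtype} device={device_id}")
```

In real code the tensor must be kept alive for as long as the descriptor is in use, since the memory is freed once the tensor is garbage collected; this helper drops the tensor on return and is only meant to show the call sequence.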