Status and Utility Methods
Methods on Communicator for resource cleanup and error/status
queries.
close_all_resources
- Communicator.close_all_resources() -> None
Closes all resources owned by this communicator.
Called automatically during destroy() and abort(), but can also be called manually. Performs best-effort cleanup, ignoring any errors that occur during resource deallocation. Idempotent: safe to call multiple times.
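Because the call is idempotent, it can be placed in a cleanup path without tracking whether it already ran. The sketch below is illustrative only: FakeCommunicator and resources_closed_on_exit are hypothetical names, and the fake class stands in for Communicator so the example runs without NCCL or a GPU.

```python
from contextlib import contextmanager


class FakeCommunicator:
    """Hypothetical stand-in for Communicator; models only idempotent cleanup."""

    def __init__(self):
        self.open_resources = ["buffer", "window"]
        self.close_calls = 0

    def close_all_resources(self):
        # A repeat call finds nothing left to release and does no harm.
        self.close_calls += 1
        self.open_resources.clear()


@contextmanager
def resources_closed_on_exit(comm):
    """Yield comm and release its resources even if the body raises."""
    try:
        yield comm
    finally:
        comm.close_all_resources()


comm = FakeCommunicator()
try:
    with resources_closed_on_exit(comm):
        raise RuntimeError("simulated failure while using the communicator")
except RuntimeError:
    pass

comm.close_all_resources()  # harmless second call
```

With a real communicator, destroy() or abort() would perform the same cleanup again later, which the idempotency guarantee makes safe.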
get_last_error
- Communicator.get_last_error() -> str
Returns the last error string for this communicator.
- Raises:
NcclInvalid – If the communicator is not initialized.
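One possible use is to attach the last error string to an exception message before logging it. This is a minimal sketch under stated assumptions: FakeCommunicator, describe_failure, and the local NcclInvalid class are hypothetical stand-ins, and the error text is made up for the demonstration.

```python
class NcclInvalid(Exception):
    """Hypothetical stand-in for the NcclInvalid exception named above."""


class FakeCommunicator:
    """Hypothetical stand-in for Communicator with a canned error string."""

    def __init__(self, last_error="", initialized=True):
        self._last_error = last_error
        self._initialized = initialized

    def get_last_error(self):
        if not self._initialized:
            raise NcclInvalid("communicator is not initialized")
        return self._last_error


def describe_failure(comm, exc):
    """Build one log line from an exception and the communicator's last error."""
    try:
        detail = comm.get_last_error() or "none recorded"
    except NcclInvalid as lookup_error:
        # The lookup itself can fail, so it must not mask the original error.
        detail = f"unavailable ({lookup_error})"
    return f"{type(exc).__name__}: {exc}; last NCCL error: {detail}"


error = RuntimeError("reduce failed")
message = describe_failure(FakeCommunicator("made-up error text"), error)
fallback = describe_failure(FakeCommunicator(initialized=False), error)
```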
get_async_error
- Communicator.get_async_error() -> nccl.bindings.nccl.Result
Queries the progress and potential errors of asynchronous NCCL operations.
Operations without a stream argument (e.g. finalize()) are complete when they return ncclSuccess. Operations with a stream argument (e.g. reduce()) return ncclSuccess when posted but may report errors through this method until completed. If any NCCL function returns ncclInProgress, users must query the communicator state until it becomes ncclSuccess before calling another NCCL function.
Before the state becomes ncclSuccess, do not issue CUDA kernels on streams used by NCCL. If an error occurs, destroy the communicator with abort(); nothing can be assumed about the completion or correctness of enqueued operations after an error.
- Returns:
Current state of the communicator (ncclSuccess, ncclInProgress, or an error code).
- Raises:
NcclInvalid – If the communicator is not initialized.
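The polling rule described above can be sketched as a small wait loop. Several things here are assumptions rather than part of the documented API: the Result enum below is a local stand-in that reuses the numeric values of ncclResult_t from nccl.h (the member names in nccl.bindings.nccl.Result may differ), the returned state is assumed to convert with int(), and FakeCommunicator, wait_until_ready, and the timeout policy are hypothetical.

```python
import time
from enum import IntEnum


class Result(IntEnum):
    """Local stand-in; numeric values follow ncclResult_t in nccl.h."""
    SUCCESS = 0
    INTERNAL_ERROR = 3
    IN_PROGRESS = 7


class FakeCommunicator:
    """Hypothetical stand-in that replays a scripted sequence of states."""

    def __init__(self, states):
        self._states = list(states)
        self.aborted = False

    def get_async_error(self):
        # Hand out queued states in order, then keep repeating the last one.
        if len(self._states) > 1:
            return self._states.pop(0)
        return self._states[0]

    def abort(self):
        self.aborted = True


def wait_until_ready(comm, timeout_s=5.0, poll_s=0.001):
    """Poll until the communicator reports success; abort it on any error."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = int(comm.get_async_error())
        if state == Result.SUCCESS:
            return
        if state != Result.IN_PROGRESS:
            comm.abort()  # after an error the communicator must not be reused
            raise RuntimeError(f"NCCL async error, result code {state}")
        if time.monotonic() >= deadline:
            # What to do on timeout is left to the caller in this sketch.
            raise TimeoutError("communicator still in progress at timeout")
        time.sleep(poll_s)


ok = FakeCommunicator([Result.IN_PROGRESS, Result.IN_PROGRESS, Result.SUCCESS])
wait_until_ready(ok)

bad = FakeCommunicator([Result.IN_PROGRESS, Result.INTERNAL_ERROR])
try:
    wait_until_ready(bad)
except RuntimeError:
    pass
```

Sleeping between polls keeps the loop from spinning a CPU core; a real application might instead interleave other host-side work that does not touch the streams used by NCCL.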
get_mem_stat
- Communicator.get_mem_stat(stat: NcclCommMemStat) -> int
Queries communicator memory statistics.
- Parameters:
stat – The memory statistic to query.
- Returns:
The memory statistic value (bytes, or 0/1 for GPU_MEM_SUSPENDED).
- Raises:
NcclInvalid – If the communicator is not initialized.
NcclCommMemStat
- class nccl.core.NcclCommMemStat(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: IntEnum
Memory-statistic selector, mirroring ncclCommMemStat_t.
Used as the stat argument of Communicator.get_mem_stat() to identify which memory statistic to query. All values are returned in bytes except GPU_MEM_SUSPENDED, which is a 0/1 flag.
- GPU_MEM_SUSPEND = 0
Communicator-allocated GPU memory that can be released by Communicator.suspend() (bytes).
- GPU_MEM_SUSPENDED = 1
Whether communicator-allocated GPU memory is currently suspended (0 = active, 1 = suspended).
- GPU_MEM_PERSIST = 2
Communicator-allocated GPU memory that cannot be suspended (bytes).
- GPU_MEM_TOTAL = 3
Total communicator-allocated GPU memory tracked by NCCL (bytes).
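To show how the selector and get_mem_stat() fit together, the sketch below collects every statistic into a dictionary and converts the 0/1 flag to a bool. The enum is a local copy of the values listed above so the example runs without NCCL; FakeCommunicator, memory_report, and the byte counts it returns are hypothetical.

```python
from enum import IntEnum


class NcclCommMemStat(IntEnum):
    """Local copy of the documented values, so the sketch runs without NCCL."""
    GPU_MEM_SUSPEND = 0
    GPU_MEM_SUSPENDED = 1
    GPU_MEM_PERSIST = 2
    GPU_MEM_TOTAL = 3


class FakeCommunicator:
    """Hypothetical stand-in returning made-up numbers."""
    _values = {0: 64 << 20, 1: 0, 2: 16 << 20, 3: 80 << 20}

    def get_mem_stat(self, stat):
        return self._values[int(stat)]


def memory_report(comm):
    """Query every statistic; report the suspended flag as a bool."""
    report = {}
    for stat in NcclCommMemStat:
        value = comm.get_mem_stat(stat)
        if stat is NcclCommMemStat.GPU_MEM_SUSPENDED:
            report[stat.name] = bool(value)  # 0/1 flag, not a byte count
        else:
            report[stat.name] = value  # bytes
    return report


report = memory_report(FakeCommunicator())
```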
get_error_string
Module-level helper to render an NCCL result code as a human-readable string.
- nccl.core.get_error_string(nccl_result: _nccl_bindings.Result | int) -> str
Returns a human-readable error string for an NCCL result code.
- Parameters:
nccl_result – NCCL result code.
- Returns:
Human-readable error message corresponding to the result code.
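A helper like this is commonly wrapped in a small check function that turns failing result codes into exceptions. The sketch below is an assumption-laden illustration: check and fake_describe are hypothetical names, the two constants reuse the numeric values of ncclSuccess and ncclInProgress from nccl.h, and the describe callable is passed in so the example runs without NCCL. In real use, nccl.core.get_error_string would presumably be passed as describe.

```python
NCCL_SUCCESS = 0      # ncclSuccess in nccl.h
NCCL_IN_PROGRESS = 7  # ncclInProgress in nccl.h


def check(result, describe):
    """Return the code for success or in-progress; raise for anything else."""
    code = int(result)
    if code in (NCCL_SUCCESS, NCCL_IN_PROGRESS):
        return code
    raise RuntimeError(f"NCCL result {code}: {describe(code)}")


def fake_describe(code):
    # Placeholder text only; real messages come from the library.
    return f"placeholder message for code {code}"


passed = check(NCCL_SUCCESS, fake_describe)
pending = check(NCCL_IN_PROGRESS, fake_describe)

try:
    check(5, fake_describe)
except RuntimeError as exc:
    failure_text = str(exc)
```

Treating the in-progress code as a non-failure keeps the helper usable with nonblocking communicators, where the caller is expected to poll get_async_error() afterwards.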