Creation and Lifecycle Methods
Methods on Communicator for creation, splitting, growing, and
teardown.
Construction
- classmethod Communicator.init(nranks: int, rank: int, unique_id: UniqueId | Sequence[UniqueId], config: NCCLConfig | None = None) -> Communicator
Initializes a new NCCL communicator.
Creates a communicator that connects multiple ranks. This is a collective operation: all ranks must call this method with the same nranks and unique_id but with different rank values.
- Parameters:
nranks – Total number of ranks in the communicator.
rank – This rank (must be between 0 and nranks - 1).
unique_id – Unique identifier(s) shared by all ranks. A sequence may be passed to use ncclCommInitRankScalable().
config – NCCL configuration options. Defaults to None.
- Returns:
A new Communicator instance.
- Raises:
NcclInvalid – If unique_id has an invalid type.
- classmethod Communicator.init_all(devices: int | Sequence[int] | None = None) -> list[Communicator]
Initializes multiple NCCL communicators for single-process multi-GPU operations.
Creates an array of NCCL communicators, one for each device, within a single process. This is optimized for single-machine scenarios where all GPUs are controlled by the same process. Unlike init(), which requires multi-process coordination (e.g. via MPI), init_all() handles all coordination internally.
Each communicator is bound to its corresponding device and has its rank equal to its index in the returned list. The current device context is preserved by the underlying NCCL API. All communicators must be manually destroyed via destroy() on each one.
- Parameters:
devices – Specifies which devices to initialize. None (the default) initializes all visible CUDA devices. An int creates communicators for devices [0, 1, ..., devices - 1]. A sequence of ints uses the explicit device IDs. If the resulting device list is empty (devices=0, an empty sequence, or no visible devices), returns an empty list without calling into NCCL.
- Returns:
List of initialized communicators, one per device. Rank i uses devices[i] (or device i when devices is an int).
- Raises:
TypeError – If devices is not an int, sequence of ints, or None.
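The rules for the devices argument can be modelled in a few lines of plain Python. This is only a sketch of the behavior described above, not the library's implementation: resolve_devices and its visible_count parameter (a stand-in for the number of visible CUDA devices) are hypothetical names introduced for illustration, and rejecting strings as sequences is an assumption.

```python
from __future__ import annotations

from collections.abc import Sequence


def resolve_devices(devices: int | Sequence[int] | None, visible_count: int) -> list[int]:
    """Model the documented `devices` rules (illustrative, not library code).

    visible_count is a hypothetical stand-in for the number of visible CUDA devices.
    """
    if devices is None:
        # Default: every visible device.
        return list(range(visible_count))
    if isinstance(devices, int):
        # An int n selects devices 0 .. n - 1.
        return list(range(devices))
    if isinstance(devices, Sequence) and not isinstance(devices, (str, bytes)):
        if all(isinstance(d, int) for d in devices):
            # An explicit sequence is used as given; rank i maps to devices[i].
            return list(devices)
    raise TypeError("devices must be an int, a sequence of ints, or None")


all_visible = resolve_devices(None, visible_count=4)
first_two = resolve_devices(2, visible_count=4)
explicit = resolve_devices([3, 1], visible_count=4)
```

In this model, rank 0 of the explicit case would be bound to device 3 and rank 1 to device 1, and an empty result corresponds to the case where an empty list is returned without calling into NCCL.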
- Communicator.initialize(nranks: int, rank: int, unique_id: UniqueId | Sequence[UniqueId], config: NCCLConfig | None = None) -> None
Initializes this communicator in-place.
Instance-method counterpart of the init() classmethod. Allows creating a null communicator first (via Communicator()) and initializing it later. This is a collective operation; all ranks must call this method.
- Parameters:
nranks – Total number of ranks in the communicator.
rank – This rank (must be between 0 and nranks - 1).
unique_id – Unique identifier(s) shared by all ranks.
config – NCCL configuration options. Defaults to None.
- Raises:
NcclInvalid – If unique_id has an invalid type or this communicator is already initialized.
Bootstrap identifier
A UniqueId is generated by one rank (typically rank 0) and
broadcast to all participating ranks; all ranks then pass it to
Communicator.init().
- class nccl.core.UniqueId(_internal: _nccl_bindings.UniqueId | None = None)
Bases: object
NCCL unique identifier for communicator initialization.
A UniqueId is used to coordinate communicator initialization across multiple ranks. All ranks must use the same UniqueId to form a communicator. Typically one rank generates the UniqueId via get_unique_id() and broadcasts it to all other ranks. Three serialization paths are supported:
Bytes: bytes(uid) (or as_bytes) on the producer, from_bytes() on receivers. The bytes of the unique ID can be transmitted through any byte-oriented channel, such as a TCP socket or a shared filesystem.
NumPy: as_ndarray returns an in-place view of the underlying buffer, suitable for NumPy-aware buffer transports such as mpi4py.MPI.Comm.Bcast (uppercase B).
Pickle: instances are picklable directly, so higher-level object broadcast helpers like mpi4py.MPI.Comm.bcast (lowercase b) work out of the box.
- static from_bytes(b: bytes | bytearray | memoryview) -> UniqueId
Reconstructs a UniqueId from a bytes-like buffer.
- property as_ndarray: numpy.ndarray
NumPy array view of the unique ID data.
- property as_bytes: bytes
Bytes representation of the unique ID, suitable for serialization or broadcast.
- nccl.core.get_unique_id(empty: bool = False) -> UniqueId
Generates a new NCCL unique identifier for communicator initialization.
Should be called by one rank (typically rank 0); the resulting UniqueId must then be broadcast (e.g. via MPI) to all other ranks.
- Parameters:
empty – If True, return an empty UniqueId without calling NCCL. Useful when the bytes will be filled in later via UniqueId.from_bytes(). Defaults to False.
- Returns:
A new UniqueId to be shared across ranks.
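The bytes serialization path amounts to moving an opaque byte string from the producer to each receiver. The sketch below shows one way to do that over a socket with length-prefixed framing. It runs without NCCL, so the payload is random stand-in data: with the real library the producer would send bytes(uid) and each receiver would pass the received bytes to UniqueId.from_bytes(). The helper names send_blob, recv_exact and recv_blob are hypothetical.

```python
import os
import socket
import struct


def send_blob(sock: socket.socket, blob: bytes) -> None:
    """Send a byte string preceded by its length as a 4-byte big-endian integer."""
    sock.sendall(struct.pack("!I", len(blob)) + blob)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return fewer."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_blob(sock: socket.socket) -> bytes:
    """Receive one length-prefixed byte string."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


# Stand-in payload (assumption): with the real library this would be bytes(uid).
uid_bytes = os.urandom(32)

producer, receiver = socket.socketpair()
try:
    send_blob(producer, uid_bytes)
    received = recv_blob(receiver)
finally:
    producer.close()
    receiver.close()

# A real receiver would now reconstruct the ID with UniqueId.from_bytes(received).
```

The length prefix means neither side has to hard-code the size of the identifier. With mpi4py available, the pickle or NumPy paths described above avoid this manual framing altogether.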
Splitting and growing
- Communicator.split(color: int | None = None, key: int = 0, config: NCCLConfig | None = None) -> Communicator
Splits this communicator into sub-communicators based on color values.
Ranks that pass the same color value will be part of the same group. If color is None, the rank will not be part of any group and receives a null communicator (a Communicator instance with ptr=0). The key value determines rank ordering; smaller key means smaller rank in the new communicator. If keys are equal, the rank in the original communicator determines ordering.
This is a collective operation: all ranks in the communicator must call this method, even ranks that pass color=None. There must be no outstanding NCCL operations on the communicator to avoid deadlock.
- Parameters:
color – Non-negative color value for grouping ranks. Pass None to exclude this rank from all groups. Defaults to None.
key – Ordering key within the color group. Defaults to 0.
config – Configuration for the new communicator. If None, inherits the parent’s configuration. Defaults to None.
- Returns:
New sub-communicator, or a null communicator if color is None.
- Raises:
NcclInvalid – If the communicator is not initialized.
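The grouping and ordering rules can be checked with a small pure-Python model. This is a sketch of the semantics described above, not a call into the library: simulate_split is a hypothetical helper that takes the (color, key) pair each parent rank would pass and reports which group and new rank it would end up with.

```python
from __future__ import annotations

from collections import defaultdict


def simulate_split(
    requests: list[tuple[int | None, int]],
) -> dict[int, tuple[int, int] | None]:
    """requests[r] is the (color, key) pair passed by parent rank r.

    Returns a mapping parent rank -> (color, new rank), or None for ranks
    that passed color=None and would receive a null communicator.
    """
    groups: dict[int, list[tuple[int, int]]] = defaultdict(list)
    for parent_rank, (color, key) in enumerate(requests):
        if color is not None:
            groups[color].append((key, parent_rank))

    result: dict[int, tuple[int, int] | None] = {r: None for r in range(len(requests))}
    for color, members in groups.items():
        # Sort by key first; ties fall back to the rank in the parent communicator.
        for new_rank, (_key, parent_rank) in enumerate(sorted(members)):
            result[parent_rank] = (color, new_rank)
    return result


# Five parent ranks: ranks 0 and 2 choose color 0, ranks 1 and 4 choose color 1,
# and rank 3 opts out with color=None.
layout = simulate_split([(0, 5), (1, 0), (0, 1), (None, 0), (1, 0)])
```

Here parent rank 2 becomes rank 0 of the color-0 group because its key (1) is smaller than rank 0's key (5), while ranks 1 and 4 tie on key and keep their parent order.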
- Communicator.shrink(exclude_ranks: Sequence[int] | None = None, config: NCCLConfig | None = None, flag: CommShrinkFlag = CommShrinkFlag.DEFAULT) -> Communicator
Creates a new communicator by removing specified ranks from this one.
Ranks listed in exclude_ranks are excluded from the new communicator; the remaining ranks are renumbered to a contiguous [0, n) range.
This is a collective operation. All non-excluded ranks must call this method; excluded ranks must NOT call it. With DEFAULT there must be no outstanding NCCL operations to avoid deadlock; combine with config.shrink_share=True to reuse parent communicator resources. With ABORT outstanding operations are automatically aborted and no resources are shared with the parent.
- Parameters:
exclude_ranks – Ranks to exclude from the new communicator. Defaults to None (no exclusions).
config – Configuration for the new communicator. If None, inherits the parent’s configuration. Defaults to None.
flag – Shrink behavior. Use DEFAULT for normal operation or ABORT after errors. Defaults to DEFAULT.
- Returns:
New communicator without the excluded ranks.
- Raises:
NcclInvalid – If the communicator is not initialized.
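The renumbering to a contiguous [0, n) range can be pictured with a short model. This is a sketch only: simulate_shrink is a hypothetical helper, and it assumes the surviving ranks keep their relative order from the parent communicator, which the text above does not state explicitly.

```python
from __future__ import annotations

from collections.abc import Sequence


def simulate_shrink(parent_size: int, exclude_ranks: Sequence[int] | None = None) -> dict[int, int]:
    """Return a mapping parent rank -> new rank for the surviving ranks.

    Assumption: survivors keep their relative order from the parent.
    """
    excluded = set(exclude_ranks or ())
    for rank in excluded:
        if not 0 <= rank < parent_size:
            raise ValueError(f"rank {rank} is not in the parent communicator")
    survivors = [r for r in range(parent_size) if r not in excluded]
    # Survivors are packed into 0 .. n - 1 with no gaps.
    return {old: new for new, old in enumerate(survivors)}


# Six-rank parent with ranks 1 and 4 excluded (for example, after they failed).
mapping = simulate_shrink(6, exclude_ranks=[1, 4])
```

Only the ranks that appear as keys in the mapping would call shrink(); ranks 1 and 4 must not take part in the call.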
- Communicator.get_unique_id() -> UniqueId
Returns a per-communicator unique ID for use with grow().
Generates a unique identifier bound to this communicator that can be shared with new ranks joining via grow(). This is distinct from the global get_unique_id() used for initial communicator creation. Only one existing rank (the grow root) should call this method.
A new UID cannot be generated while a previous UID is unconsumed; each UID can be used only once and the user must wait for the corresponding grow operation to complete before calling again.
- Returns:
UniqueId for grow operations.
- Raises:
NcclInvalid – If the communicator is not initialized.
- Communicator.grow(nranks: int, unique_id: UniqueId | None = None, rank: int | None = None, config: NCCLConfig | None = None) -> Communicator
Grows the communicator by adding new ranks.
Creates a new communicator that includes both existing ranks from this communicator and new ranks joining the group. There are three roles:
Existing root: the one existing rank that called get_unique_id().
Existing non-root: all other existing ranks.
New ranks: ranks joining via a null communicator (Communicator()).
This is a collective operation. All ranks (existing and new) must call this method. Usage by role:
Existing root: new_comm = existing_comm.grow(nranks, uid)
Existing non-root: new_comm = existing_comm.grow(nranks)
New rank: new_comm = Communicator().grow(nranks, uid, rank=assigned_rank)
The UID is consumed upon successful grow and cannot be reused.
- Parameters:
nranks – Total number of ranks in the new communicator (existing plus new). All roles must pass the same value.
unique_id – Unique identifier from get_unique_id(). Existing root and new ranks must pass the UniqueId; existing non-root must pass None. Defaults to None.
rank – This rank’s ID in the new communicator. New ranks must pass their assigned rank, which must be >= the parent communicator size. Existing ranks must pass None. Defaults to None.
config – Configuration for the new communicator. Defaults to None.
- Returns:
New Communicator containing all ranks.
- Raises:
NcclInvalid – If a new rank is given an initialized communicator, or an existing rank is given a null communicator.
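The per-role argument rules are easy to get wrong, so a small checker can make them explicit. The sketch below builds the keyword arguments each role would pass to grow() and rejects combinations the text above rules out. grow_call_args and the role names "root", "non_root" and "new" are hypothetical, the uid is an arbitrary placeholder object, and the upper bound assigned_rank < nranks is an inference rather than something the documentation states.

```python
from __future__ import annotations

from typing import Any


def grow_call_args(
    role: str,
    nranks: int,
    parent_size: int,
    uid: Any = None,
    assigned_rank: int | None = None,
) -> dict[str, Any]:
    """Return the keyword arguments a rank in the given role would pass to grow()."""
    if role == "root":
        # The one existing rank that generated the UID passes it along.
        if uid is None:
            raise ValueError("the existing root must pass the unique ID")
        return {"nranks": nranks, "unique_id": uid, "rank": None}
    if role == "non_root":
        # Other existing ranks pass neither a UID nor a rank.
        return {"nranks": nranks, "unique_id": None, "rank": None}
    if role == "new":
        if uid is None or assigned_rank is None:
            raise ValueError("a new rank must pass the unique ID and its assigned rank")
        # Documented: assigned_rank >= parent size. Assumed: assigned_rank < nranks.
        if not parent_size <= assigned_rank < nranks:
            raise ValueError("assigned rank must lie in [parent_size, nranks)")
        return {"nranks": nranks, "unique_id": uid, "rank": assigned_rank}
    raise ValueError(f"unknown role: {role!r}")


uid = object()  # placeholder for the UniqueId shared by the grow root
root_args = grow_call_args("root", nranks=6, parent_size=4, uid=uid)
non_root_args = grow_call_args("non_root", nranks=6, parent_size=4)
new_args = grow_call_args("new", nranks=6, parent_size=4, uid=uid, assigned_rank=5)
```

All three dictionaries carry the same nranks value, matching the requirement that every role passes the same total.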
Teardown
- Communicator.destroy() -> None
Destroys the communicator and frees local resources.
If finalize() has not been called explicitly, destroy() will call it internally. If finalize() is called explicitly, users must ensure the communicator state becomes ncclSuccess before calling destroy(). The communicator should not be accessed after destroy() returns.
All resources (registered buffers, windows, custom operators) owned by this communicator are automatically closed before destruction. This is an intra-node collective call: all ranks on the same node must call it to avoid hanging. The recommended pattern is finalize() followed by destroy().
Errors during cleanup are suppressed for safety.
- Communicator.abort() -> None
Aborts the communicator and frees resources, terminating in-flight operations.
Should be called when an unrecoverable error occurs. Unlike destroy(), this immediately aborts uncompleted operations. All active ranks must call this function in order to abort the NCCL communicator successfully.
All resources (registered buffers, windows, custom operators) owned by this communicator are automatically closed before aborting. Errors during cleanup are suppressed for safety. For more details, see the Fault Tolerance section in the NCCL documentation.
- Communicator.finalize() -> None
Finalizes the communicator, flushing uncompleted operations and network resources.
Typically called before destroy() to ensure all operations complete. This is a collective operation that must be called by all ranks.
For nonblocking communicators this is itself nonblocking: success sets the communicator state to ncclInProgress to indicate finalization is in progress. Once all NCCL operations complete, the communicator transitions to ncclSuccess. Users can query the state with get_async_error().
Pause and resume
- Communicator.revoke(flags: int = 0) -> None
Revokes the communicator.
Stops all in-flight operations and marks the communicator state as ncclInProgress. The state transitions to ncclSuccess when the communicator becomes quiescent, after which management operations (destroy(), split(), shrink()) can proceed safely.
Calling finalize() after revoke() is invalid. Resource sharing via split-share / shrink-share is disabled while revoked.
- Parameters:
flags – Reserved for future use. Currently must be 0.
- Raises:
NcclInvalid – If the communicator is not initialized.
- Communicator.suspend(flags: CommSuspendFlag = CommSuspendFlag.MEM) -> None
Suspends communicator operations to free resources.
The communicator cannot be used for communication while suspended. Call resume() to restore it.
- Parameters:
flags – Suspend flags controlling what resources to release. MEM releases dynamic GPU memory allocations.
- Raises:
NcclInvalid – If the communicator is not initialized.
- Communicator.resume() -> None
Resumes all previously suspended communicator resources.
Restores a communicator that was suspended with suspend() so that it can be used for communication again.
- Raises:
NcclInvalid – If the communicator is not initialized.
Flag enums
CommShrinkFlag
- class nccl.core.CommShrinkFlag(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: IntEnum
Behavior flag for Communicator.shrink().
- DEFAULT = 0
Shrink the parent communicator normally; outstanding NCCL operations must already be quiesced.
- ABORT = 1
First terminate ongoing parent operations, then shrink. No resources are shared with the parent.
CommSuspendFlag
- class nccl.core.CommSuspendFlag(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: IntFlag
Behavior flag for Communicator.suspend().
- MEM = 1
Suspend memory by releasing dynamic GPU memory allocations held by the communicator.
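The two enums have different base classes, which affects how their values behave. The sketch below re-declares local mirrors of both enums with the member values listed above, purely so it runs without the library installed; in real code they would be imported from nccl.core. An IntEnum member is one discrete choice, while IntFlag members can be combined with | and tested with in.

```python
from enum import IntEnum, IntFlag


# Local mirrors of the documented enums (illustrative stand-ins, not library code).
class CommShrinkFlag(IntEnum):
    DEFAULT = 0
    ABORT = 1


class CommSuspendFlag(IntFlag):
    MEM = 1


# IntEnum: a raw integer maps back to exactly one named member.
shrink_flag = CommShrinkFlag(1)

# IntFlag: members combine bitwise, and membership can be tested with `in`.
suspend_flags = CommSuspendFlag.MEM | CommSuspendFlag(0)
mem_requested = CommSuspendFlag.MEM in suspend_flags
```

Because both are integer-based, either kind of member can be passed wherever a plain int flag value is expected.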