DistCollective

Performs distributed collective communication operations across multiple GPUs using NCCL (NVIDIA Collective Communications Library) at runtime.

Attributes

collective_operation The collective operation type. Can be one of:

ALL_REDUCE All ranks reduce their input tensors using the specified reduce operation, and all ranks receive the same result.
ALL_GATHER All ranks gather tensors from all other ranks, concatenating them along the first dimension.
BROADCAST The root rank broadcasts its input tensor to all other ranks.
REDUCE All ranks reduce their input tensors to the root rank using the specified reduce operation.
REDUCE_SCATTER All ranks reduce their input tensors and scatter the result, with each rank receiving a portion along the first dimension.
ALL_TO_ALL Each rank exchanges equal-sized blocks with every other rank. input.shape[0] must be divisible by nb_rank.
GATHER Each rank sends its tensor to the root rank, which concatenates contributions along the first dimension.
SCATTER The root rank splits its tensor along the first dimension and sends one chunk to each rank.
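The non-rooted collectives above can be illustrated with a small pure-Python model. This is only a sketch of the data movement, not how the layer or NCCL is implemented: each rank's tensor is a flat list whose elements stand for slices along the first dimension, all function names are hypothetical, and the rank-ordered concatenation is an assumption that follows the usual NCCL/MPI convention.

```python
# Pure-Python model of four collectives (hypothetical helper names).
# inputs[r] is the tensor held by rank r; the return value lists each rank's output.

def all_reduce_sum(inputs):
    """Every rank receives the elementwise sum of all inputs."""
    total = [sum(vals) for vals in zip(*inputs)]
    return [list(total) for _ in inputs]

def all_gather(inputs):
    """Every rank receives all inputs concatenated in rank order (assumed order)."""
    gathered = [x for rank_input in inputs for x in rank_input]
    return [list(gathered) for _ in inputs]

def reduce_scatter_sum(inputs):
    """Sum elementwise, then rank r keeps the r-th equal-sized chunk."""
    nb_rank = len(inputs)
    total = [sum(vals) for vals in zip(*inputs)]
    if len(total) % nb_rank != 0:
        raise ValueError("first dimension must be divisible by nb_rank")
    chunk = len(total) // nb_rank
    return [total[r * chunk:(r + 1) * chunk] for r in range(nb_rank)]

def all_to_all(inputs):
    """Rank j receives block j from every rank, in source-rank order (assumed order)."""
    nb_rank = len(inputs)
    if len(inputs[0]) % nb_rank != 0:
        raise ValueError("first dimension must be divisible by nb_rank")
    chunk = len(inputs[0]) // nb_rank
    return [
        [x for src in inputs for x in src[dst * chunk:(dst + 1) * chunk]]
        for dst in range(nb_rank)
    ]

ranks = [[1, 2], [10, 20]]  # two ranks, two elements each
print(all_reduce_sum(ranks))
print(all_gather(ranks))
print(reduce_scatter_sum(ranks))
print(all_to_all(ranks))
```

With two ranks, ALL_GATHER doubles the leading dimension, REDUCE_SCATTER halves it, and ALL_REDUCE and ALL_TO_ALL preserve it.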

reduce_op The reduction operation for reduction-type collectives. Can be one of SUM, PROD, MAX, MIN, AVG, or NONE. Must not be NONE for ALL_REDUCE, REDUCE, and REDUCE_SCATTER. For ALL_GATHER, BROADCAST, ALL_TO_ALL, GATHER, and SCATTER, reduce_op must be NONE.
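The reduce_op rule above can be written as a small check. The sketch below uses plain strings for the enum values and a hypothetical function name; it mirrors the rule as stated here and is not the validation code used by the library.

```python
# Sketch of the reduce_op rule (hypothetical helper, string stand-ins for enums).
REDUCING = {"ALL_REDUCE", "REDUCE", "REDUCE_SCATTER"}
NON_REDUCING = {"ALL_GATHER", "BROADCAST", "ALL_TO_ALL", "GATHER", "SCATTER"}
REDUCE_OPS = {"SUM", "PROD", "MAX", "MIN", "AVG", "NONE"}

def check_reduce_op(collective, reduce_op):
    """Return True if reduce_op is allowed for the given collective."""
    if reduce_op not in REDUCE_OPS:
        raise ValueError(f"unknown reduce_op: {reduce_op}")
    if collective in REDUCING:
        # Reduction-type collectives need a real reduction.
        return reduce_op != "NONE"
    if collective in NON_REDUCING:
        # Data-movement collectives must not carry a reduction.
        return reduce_op == "NONE"
    raise ValueError(f"unknown collective: {collective}")
```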

root The root rank for root-based operations (BROADCAST, REDUCE, GATHER, SCATTER). Must be >= 0 for those operations. Use -1 for collectives that do not use a root (ALL_REDUCE, ALL_GATHER, REDUCE_SCATTER, ALL_TO_ALL).
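The root rule can be sketched the same way. The helper name is hypothetical, and the check covers only what is stated above: a non-negative root for root-based operations and -1 otherwise. Any upper bound on root is not modeled here.

```python
# Sketch of the root rule (hypothetical helper, string stand-ins for enums).
ROOT_BASED = {"BROADCAST", "REDUCE", "GATHER", "SCATTER"}
ROOTLESS = {"ALL_REDUCE", "ALL_GATHER", "REDUCE_SCATTER", "ALL_TO_ALL"}

def check_root(collective, root):
    """Return True if root is consistent with the given collective."""
    if collective in ROOT_BASED:
        return root >= 0
    if collective in ROOTLESS:
        return root == -1
    raise ValueError(f"unknown collective: {collective}")
```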

nb_rank The number of ranks participating in the collective (default: 1). A value greater than 1 enables multi-device execution. nb_rank determines the output shape and is used for validation:

ALL_GATHER and GATHER: output leading dimension is nb_rank * input.shape[0].
REDUCE_SCATTER and SCATTER: output leading dimension is input.shape[0] / nb_rank; input.shape[0] must be divisible by nb_rank.
ALL_TO_ALL: input and output share the same shape; input.shape[0] must be divisible by nb_rank.

groups Optional array of rank IDs defining a communication group. If empty or not provided, all ranks participate.

group_size Number of elements in the groups array. Must be 0 when groups is empty; must be > 0 when groups is non-empty.
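The relationship between groups and group_size can be expressed as a short consistency check. This is a sketch with a hypothetical helper name; it encodes only the two conditions above plus the definition of group_size as the number of elements in groups.

```python
# Sketch of the groups / group_size consistency rule (hypothetical helper).
def check_groups(groups, group_size):
    """Return True if group_size matches the groups array as described above."""
    groups = list(groups or [])
    if not groups:
        # Empty or missing groups: all ranks participate, group_size must be 0.
        return group_size == 0
    # Non-empty groups: group_size is the element count and therefore > 0.
    return group_size > 0 and group_size == len(groups)
```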

Inputs

input tensor of type T. The input tensor to the collective operation.

For BROADCAST, only the root rank supplies a meaningful input; non-root ranks do not send data.

For SCATTER, only the root rank supplies the full tensor to scatter; non-root inputs are not used by NCCL.

For GATHER, every rank provides an input chunk; only the root rank receives the concatenated result (non-root outputs are not written by NCCL, though buffers may still be allocated).
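The root-based behavior described above can be modeled in pure Python. In this sketch, each rank's tensor is a flat list standing for slices along the first dimension, None marks an input that is not read or an output that is not written, and all function names are hypothetical. The rank-ordered concatenation in gather is an assumption following the usual NCCL convention.

```python
# Pure-Python model of the root-based collectives (hypothetical helper names).
# inputs[r] is the tensor held by rank r; None means "not used" or "not written".

def broadcast(inputs, root):
    """Only inputs[root] is read; every rank receives a copy of it."""
    return [list(inputs[root]) for _ in inputs]

def scatter(inputs, root):
    """Only inputs[root] is read; rank r receives the r-th equal-sized chunk."""
    nb_rank = len(inputs)
    src = inputs[root]
    if len(src) % nb_rank != 0:
        raise ValueError("first dimension must be divisible by nb_rank")
    chunk = len(src) // nb_rank
    return [src[r * chunk:(r + 1) * chunk] for r in range(nb_rank)]

def gather(inputs, root):
    """Every rank contributes; only the root's output is written."""
    gathered = [x for rank_input in inputs for x in rank_input]
    return [gathered if r == root else None for r in range(len(inputs))]

print(broadcast([None, [7, 8]], root=1))
print(scatter([[1, 2, 3, 4], None], root=0))
print(gather([[1], [2]], root=1))
```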

Outputs

output tensor of type T.

For REDUCE, only the root rank has a meaningful output; non-root ranks do not receive reduced data.

For ALL_GATHER: output shape is [nb_rank * input.shape[0], input.shape[1], ...].

For GATHER: the root rank’s output shape is [nb_rank * input.shape[0], input.shape[1], ...], the same rule as ALL_GATHER; see Inputs for non-root behavior.

For REDUCE_SCATTER and SCATTER: output shape is [input.shape[0] / nb_rank, input.shape[1], ...]; input.shape[0] must be divisible by nb_rank.

For ALL_TO_ALL: output has the same shape as input; input.shape[0] must be divisible by nb_rank.

For ALL_REDUCE and BROADCAST, output has the same shape as input. For REDUCE, the root rank’s output has the same shape as each rank’s input tensor.
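The output shape rules in this section can be collected into one function. The following is a minimal sketch with a hypothetical helper name and string stand-ins for the operation types; it restates the rules above and is not the library's shape inference code. For GATHER and REDUCE, the returned shape describes the root rank's output.

```python
# Sketch of the output shape rules (hypothetical helper, string stand-ins for enums).
def output_shape(collective, input_shape, nb_rank):
    """Return the output shape as a tuple for the given collective."""
    d0, rest = input_shape[0], tuple(input_shape[1:])
    if collective in ("ALL_GATHER", "GATHER"):
        # Leading dimension grows by a factor of nb_rank.
        return (nb_rank * d0,) + rest
    if collective in ("REDUCE_SCATTER", "SCATTER", "ALL_TO_ALL"):
        if d0 % nb_rank != 0:
            raise ValueError("input.shape[0] must be divisible by nb_rank")
        if collective == "ALL_TO_ALL":
            return tuple(input_shape)
        # Leading dimension shrinks by a factor of nb_rank.
        return (d0 // nb_rank,) + rest
    if collective in ("ALL_REDUCE", "BROADCAST", "REDUCE"):
        return tuple(input_shape)
    raise ValueError(f"unknown collective: {collective}")

print(output_shape("ALL_GATHER", (8, 16), 4))
print(output_shape("REDUCE_SCATTER", (8, 16), 4))
```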

Data Types

T: float32, float16, bfloat16, float8, int64, int32, int8, uint8, bool

Shape Information

input tensor with shape \([d_0, d_1, ..., d_n]\), \(n \geq 1\).

output for ALL_GATHER and GATHER (root rank only): \([nb\_rank \cdot d_0, d_1, ..., d_n]\).

output for REDUCE_SCATTER and SCATTER: \([d_0 / nb\_rank, d_1, ..., d_n]\) where \(d_0\) must be divisible by nb_rank.

output for ALL_TO_ALL: same shape as input; \(d_0\) must be divisible by nb_rank.

output for ALL_REDUCE, BROADCAST, and REDUCE (root rank only): same shape as each rank’s input tensor.
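As a worked instance of the rules above, take an illustrative input of shape \([8, 16]\) on each rank with \(nb\_rank = 4\) (values chosen only for this example):

\[
\begin{aligned}
\text{ALL\_GATHER, GATHER (root)} &: [4 \cdot 8,\ 16] = [32,\ 16] \\
\text{REDUCE\_SCATTER, SCATTER} &: [8 / 4,\ 16] = [2,\ 16] \\
\text{ALL\_TO\_ALL} &: [8,\ 16] \quad (8 \bmod 4 = 0) \\
\text{ALL\_REDUCE, BROADCAST, REDUCE (root)} &: [8,\ 16]
\end{aligned}
\]

An input of shape \([6, 16]\) with the same \(nb\_rank\) would be rejected for REDUCE_SCATTER, SCATTER, and ALL_TO_ALL, because 6 is not divisible by 4.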

DLA Support

Not supported.

Multi-Device Runtime Requirements

To use DistCollective with nb_rank > 1, the following requirements must be met:

Each participating GPU must have SM >= 80 (Ampere or newer).
An NCCL communicator must be initialized and set on the execution context via IExecutionContext::setCommunicator before inference.
All participating ranks must execute the same network with synchronized execution calls.
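The SM requirement above can be checked from a device's compute capability. The sketch below assumes the standard CUDA convention that the SM number is 10 * major + minor (for example, compute capability 8.0 is SM 80); the helper names are hypothetical, and obtaining major and minor (for instance from the CUDA device properties) is left to the caller.

```python
# Sketch of the SM >= 80 check (hypothetical helpers).
MIN_SM = 80  # Ampere or newer, as stated in the requirements above

def sm_version(major, minor):
    """Convert a compute capability (major, minor) to an SM number."""
    return major * 10 + minor

def meets_sm_requirement(major, minor):
    """Return True if the device satisfies the SM requirement."""
    return sm_version(major, minor) >= MIN_SM

print(meets_sm_requirement(7, 5))  # Turing, SM 75
print(meets_sm_requirement(8, 0))  # Ampere, SM 80
```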

Examples

For a complete multi-device example, refer to sampleDistCollective.

C++ API

For more information about the C++ IDistCollectiveLayer operator, refer to the C++ IDistCollectiveLayer documentation.

Python API

For more information about the Python IDistCollectiveLayer operator, refer to the Python IDistCollectiveLayer documentation.