DistCollective

Performs distributed collective communication operations across multiple GPUs using NCCL (NVIDIA Collective Communications Library) at runtime.

Attributes

collective_operation The collective operation type. Can be one of:

  • ALL_REDUCE All ranks reduce their input tensors using the specified reduce operation, and all ranks receive the same result.

  • ALL_GATHER All ranks gather tensors from all other ranks, concatenating them along the first dimension.

  • BROADCAST The root rank broadcasts its input tensor to all other ranks.

  • REDUCE All ranks reduce their input tensors to the root rank using the specified reduce operation.

  • REDUCE_SCATTER All ranks reduce their input tensors and scatter the result, with each rank receiving a portion along the first dimension.

  • ALL_TO_ALL Each rank exchanges equal-sized blocks with every other rank. input.shape[0] must be divisible by nb_rank.

  • GATHER Each rank sends its tensor to the root rank, which concatenates contributions along the first dimension.

  • SCATTER The root rank splits its tensor along the first dimension and sends one chunk to each rank.
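The data movement of a few of these collectives can be illustrated with a small pure-Python reference model. This is only a sketch for reasoning about semantics, not how NCCL or the layer is implemented: each rank's tensor is modelled as a flat list standing in for the first dimension, and the helper names (`all_reduce`, `all_gather`, `reduce_scatter`, `all_to_all`) are hypothetical. The block ordering used for ALL_TO_ALL (rank r sends its j-th block to rank j) follows the usual MPI_Alltoall convention and is an assumption here.

```python
from functools import reduce
import operator


def all_reduce(inputs, op=operator.add):
    """Every rank receives the elementwise reduction of all inputs."""
    reduced = [reduce(op, column) for column in zip(*inputs)]
    return [list(reduced) for _ in inputs]


def all_gather(inputs):
    """Every rank receives all inputs concatenated in rank order."""
    gathered = [x for rank_input in inputs for x in rank_input]
    return [list(gathered) for _ in inputs]


def reduce_scatter(inputs, op=operator.add):
    """Reduce elementwise, then hand rank r the r-th equal-sized chunk."""
    nb_rank = len(inputs)
    reduced = [reduce(op, column) for column in zip(*inputs)]
    if len(reduced) % nb_rank != 0:
        raise ValueError("input.shape[0] must be divisible by nb_rank")
    chunk = len(reduced) // nb_rank
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(nb_rank)]


def all_to_all(inputs):
    """Rank `dst` receives block `dst` from every rank, in source-rank order
    (assumed MPI_Alltoall-style ordering)."""
    nb_rank = len(inputs)
    if any(len(x) % nb_rank != 0 for x in inputs):
        raise ValueError("input.shape[0] must be divisible by nb_rank")
    chunk = len(inputs[0]) // nb_rank
    return [
        [x for src in range(nb_rank)
         for x in inputs[src][dst * chunk:(dst + 1) * chunk]]
        for dst in range(nb_rank)
    ]


inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, d_0 = 4
print(all_reduce(inputs))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(inputs))  # [[11, 22], [33, 44]]
print(all_to_all(inputs))      # [[1, 2, 10, 20], [3, 4, 30, 40]]
```

The model makes the shape rules visible: ALL_GATHER grows the leading dimension by a factor of nb_rank, REDUCE_SCATTER shrinks it by the same factor, and ALL_REDUCE and ALL_TO_ALL leave it unchanged.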

reduce_op The reduction operation for reduction-type collectives. Can be one of SUM, PROD, MAX, MIN, AVG, or NONE. Must not be NONE for ALL_REDUCE, REDUCE, and REDUCE_SCATTER. For ALL_GATHER, BROADCAST, ALL_TO_ALL, GATHER, and SCATTER, reduce_op must be NONE.

root The root rank for root-based operations (BROADCAST, REDUCE, GATHER, SCATTER). Must be >= 0 for those operations. Use -1 for collectives that do not use a root (ALL_REDUCE, ALL_GATHER, REDUCE_SCATTER, ALL_TO_ALL).
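The constraints that tie reduce_op and root to the collective type can be checked before a network is built. The following is a minimal sketch of such a check written from the rules above; `check_attributes` is a hypothetical helper, not part of any TensorRT API, and the operation names are plain strings rather than the real enum values. Treating a root other than -1 as an error for rootless collectives is stricter than the text above, which only recommends -1, so that check is an assumption.

```python
ROOTED = {"BROADCAST", "REDUCE", "GATHER", "SCATTER"}
REDUCING = {"ALL_REDUCE", "REDUCE", "REDUCE_SCATTER"}
ALL_COLLECTIVES = ROOTED | REDUCING | {"ALL_GATHER", "ALL_TO_ALL"}
REDUCE_OPS = {"SUM", "PROD", "MAX", "MIN", "AVG", "NONE"}


def check_attributes(collective, reduce_op, root):
    """Return a list of rule violations (empty if the combination is valid)."""
    if collective not in ALL_COLLECTIVES:
        return [f"unknown collective_operation: {collective}"]
    if reduce_op not in REDUCE_OPS:
        return [f"unknown reduce_op: {reduce_op}"]

    errors = []
    # Reduction-type collectives need a real reduction; all others need NONE.
    if collective in REDUCING and reduce_op == "NONE":
        errors.append(f"{collective} requires reduce_op other than NONE")
    if collective not in REDUCING and reduce_op != "NONE":
        errors.append(f"{collective} requires reduce_op NONE")

    # Root-based collectives need root >= 0.
    if collective in ROOTED and root < 0:
        errors.append(f"{collective} requires root >= 0")
    # Assumption: flag any root other than -1 on rootless collectives.
    if collective not in ROOTED and root != -1:
        errors.append(f"{collective} does not use a root; use -1")
    return errors


print(check_attributes("ALL_REDUCE", "SUM", -1))  # []
print(check_attributes("REDUCE", "NONE", -1))     # two violations
```

Returning a list instead of raising on the first problem lets a caller report every violated rule at once.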

nb_rank The number of ranks participating in the collective (default: 1). A value greater than 1 enables multi-device execution. nb_rank determines the output shape and is used for validation:

  • ALL_GATHER and GATHER: output leading dimension is nb_rank * input.shape[0].

  • REDUCE_SCATTER and SCATTER: output leading dimension is input.shape[0] / nb_rank; input.shape[0] must be divisible by nb_rank.

  • ALL_TO_ALL: input and output share the same shape; input.shape[0] must be divisible by nb_rank.

groups Optional array of rank IDs defining a communication group. If empty or not provided, all ranks participate.

group_size Number of elements in the groups array. Must be 0 when groups is empty; must be > 0 when groups is non-empty.
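The relationship between groups and group_size amounts to a simple consistency rule. A minimal sketch, assuming groups is passed as a Python sequence of rank IDs and using a hypothetical helper name `check_groups`:

```python
def check_groups(groups, group_size):
    """Validate that group_size matches the groups array as described above."""
    groups = list(groups or [])  # None and [] both mean "all ranks participate"
    if not groups:
        if group_size != 0:
            raise ValueError("group_size must be 0 when groups is empty")
        return
    if group_size <= 0:
        raise ValueError("group_size must be > 0 when groups is non-empty")
    # group_size is the number of elements in groups, so the two must agree.
    if group_size != len(groups):
        raise ValueError("group_size must equal the number of rank IDs in groups")


check_groups([], 0)            # all ranks participate
check_groups([0, 1, 2, 3], 4)  # explicit four-rank group
```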

Inputs

input tensor of type T. The input tensor to the collective operation.

For BROADCAST, only the root rank supplies a meaningful input; non-root ranks do not send data.

For SCATTER, only the root rank supplies the full tensor to scatter; non-root inputs are not used by NCCL.

For GATHER, every rank provides an input chunk; only the root rank receives the concatenated result (non-root outputs are not written by NCCL, though buffers may still be allocated).

Outputs

output tensor of type T.

For REDUCE, only the root rank has a meaningful output; non-root ranks do not receive reduced data.

For ALL_GATHER: output shape is [nb_rank * input.shape[0], input.shape[1], ...].

For GATHER: the root rank's output follows the same shape rule as ALL_GATHER; see Inputs for non-root behavior.

For REDUCE_SCATTER and SCATTER: output shape is [input.shape[0] / nb_rank, input.shape[1], ...]; input.shape[0] must be divisible by nb_rank.

For ALL_TO_ALL: output has the same shape as input; input.shape[0] must be divisible by nb_rank.

For ALL_REDUCE and BROADCAST, output has the same shape as input. For REDUCE, the root rank’s output has the same shape as each rank’s input tensor.

Data Types

T: float32, float16, bfloat16, float8, int64, int32, int8, uint8, bool

Shape Information

input tensor with shape \([d_0, d_1, ..., d_n]\), \(n \geq 1\).

output for ALL_GATHER and GATHER (on the root rank): \([nb\_rank \cdot d_0, d_1, ..., d_n]\).

output for REDUCE_SCATTER and SCATTER: \([d_0 / nb\_rank, d_1, ..., d_n]\) where \(d_0\) must be divisible by nb_rank.

output for ALL_TO_ALL: same shape as input; \(d_0\) must be divisible by nb_rank.

output for ALL_REDUCE, BROADCAST, and REDUCE (root rank only): same shape as the participating rank’s input tensor.
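The shape rules above can be collected into one function. This is a sketch derived only from the rules in this section; `infer_output_shape` is a hypothetical name, the collective is given as a plain string, and the function only requires that a leading dimension exists.

```python
def infer_output_shape(collective, input_shape, nb_rank=1):
    """Compute the output shape of a collective from its input shape."""
    shape = tuple(input_shape)
    if not shape:
        raise ValueError("input must have a leading dimension d_0")
    if nb_rank < 1:
        raise ValueError("nb_rank must be >= 1")
    d0, rest = shape[0], shape[1:]

    if collective in ("ALL_GATHER", "GATHER"):
        # Leading dimension grows: nb_rank * d_0 (root output for GATHER).
        return (nb_rank * d0,) + rest
    if collective in ("REDUCE_SCATTER", "SCATTER", "ALL_TO_ALL"):
        if d0 % nb_rank != 0:
            raise ValueError(f"d_0={d0} must be divisible by nb_rank={nb_rank}")
        if collective == "ALL_TO_ALL":
            return shape  # same shape as input
        return (d0 // nb_rank,) + rest  # leading dimension shrinks
    if collective in ("ALL_REDUCE", "BROADCAST", "REDUCE"):
        return shape  # same shape as input
    raise ValueError(f"unknown collective: {collective}")


print(infer_output_shape("ALL_GATHER", (4, 8), nb_rank=2))      # (8, 8)
print(infer_output_shape("REDUCE_SCATTER", (4, 8), nb_rank=2))  # (2, 8)
```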

DLA Support

Not supported.

Multi-Device Runtime Requirements

To use DistCollective with nb_rank > 1:

  • The GPUs must be SM >= 80 (Ampere or newer).

  • An NCCL communicator must be initialized and set on the execution context via IExecutionContext::setCommunicator before inference.

  • All participating ranks must execute the same network with synchronized execution calls.

Examples

For a complete multi-device example, refer to sampleDistCollective.

C++ API

For more information about the C++ IDistCollectiveLayer operator, refer to the C++ IDistCollectiveLayer documentation.

Python API

For more information about the Python IDistCollectiveLayer operator, refer to the Python IDistCollectiveLayer documentation.