DistCollective

Performs distributed collective communication operations across multiple GPUs using NCCL (NVIDIA Collective Communications Library) at runtime.

Attributes

collective_operation The collective operation type. Can be one of:

  • ALL_REDUCE All ranks reduce their input tensors using the specified reduce operation, and all ranks receive the same result.

  • ALL_GATHER All ranks gather tensors from all other ranks, concatenating them along the first dimension.

  • BROADCAST The root rank broadcasts its input tensor to all other ranks.

  • REDUCE All ranks reduce their input tensors to the root rank using the specified reduce operation.

  • REDUCE_SCATTER All ranks reduce their input tensors and scatter the result, with each rank receiving a portion along the first dimension.
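The semantics above can be illustrated with a small single-process simulation, in which each rank's tensor is a flat Python list and the reduce operation is fixed to SUM. This is a sketch for intuition only; the actual layer launches NCCL kernels across GPUs:

```python
# Single-process sketch of the collective semantics described above.
# Each "rank" holds a flat list of numbers; reduce_op is SUM here.

def all_reduce(rank_inputs):
    # Every rank receives the elementwise sum over all ranks.
    total = [sum(vals) for vals in zip(*rank_inputs)]
    return [total[:] for _ in rank_inputs]

def all_gather(rank_inputs):
    # Every rank receives the concatenation of all ranks' inputs
    # along the first dimension.
    gathered = [x for inp in rank_inputs for x in inp]
    return [gathered[:] for _ in rank_inputs]

def broadcast(rank_inputs, root):
    # Every rank receives the root rank's input.
    return [rank_inputs[root][:] for _ in rank_inputs]

def reduce_(rank_inputs, root):
    # Only the root rank receives the reduced result.
    total = [sum(vals) for vals in zip(*rank_inputs)]
    return [total if r == root else None for r in range(len(rank_inputs))]

def reduce_scatter(rank_inputs):
    # Ranks reduce elementwise, then each rank keeps its slice of
    # the first dimension.
    nb_rank = len(rank_inputs)
    total = [sum(vals) for vals in zip(*rank_inputs)]
    chunk = len(total) // nb_rank
    return [total[r * chunk:(r + 1) * chunk] for r in range(nb_rank)]

inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks
print(all_reduce(inputs))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
print(reduce_scatter(inputs))  # [[11, 22], [33, 44]]
```

Note that ALL_REDUCE is equivalent to REDUCE followed by BROADCAST from the same root, and REDUCE_SCATTER followed by ALL_GATHER also reproduces ALL_REDUCE.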

reduce_op The reduction operation for reduction-type collectives. Must be one of SUM, PROD, MAX, MIN, or AVG for the reduction collectives (ALL_REDUCE, REDUCE, REDUCE_SCATTER), and NONE for the non-reduction operations (ALL_GATHER and BROADCAST).

root The root rank for root-based operations (BROADCAST, REDUCE). Must be >= 0 for these operations. Use -1 for operations that do not require a root.

nb_rank The number of ranks participating in the collective (default: 1). When > 1, enables multi-device execution. Required for proper output shape computation for ALL_GATHER and REDUCE_SCATTER. Once set to a value > 1, it cannot be changed.

groups Optional array of rank IDs defining a communication group. If empty or not provided, all ranks participate.

group_size Number of elements in the groups array. Must be 0 when groups is empty; must be > 0 when groups is non-empty.
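The attribute constraints above can be summarized as a small validation routine. This is an illustrative sketch, not part of the API; the function and parameter names mirror the attribute names in this section:

```python
REDUCTION_OPS = {"ALL_REDUCE", "REDUCE", "REDUCE_SCATTER"}
ROOT_OPS = {"BROADCAST", "REDUCE"}

def validate_attributes(collective_operation, reduce_op, root, nb_rank, groups, group_size):
    # Reduction collectives need a real reduce_op; ALL_GATHER and
    # BROADCAST must use NONE.
    if collective_operation in REDUCTION_OPS:
        if reduce_op not in {"SUM", "PROD", "MAX", "MIN", "AVG"}:
            return False
    elif reduce_op != "NONE":
        return False
    # Root-based operations need root >= 0; all others use -1.
    if collective_operation in ROOT_OPS:
        if root < 0:
            return False
    elif root != -1:
        return False
    # nb_rank defaults to 1; a value > 1 enables multi-device execution.
    if nb_rank < 1:
        return False
    # group_size must be 0 iff groups is empty.
    if group_size != len(groups):
        return False
    return True
```

For example, `validate_attributes("BROADCAST", "NONE", -1, 2, [], 0)` fails because BROADCAST is root-based and requires `root >= 0`.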

Inputs

input tensor of type T. The input tensor to the collective operation.

For BROADCAST, only the root rank has input; non-root ranks do not have input tensors.

Outputs

output tensor of type T.

For REDUCE, only the root rank has output; non-root ranks do not have output tensors.

For ALL_GATHER: output shape is [nb_rank * input.shape[0], input.shape[1], ...] — the first dimension is multiplied by nb_rank.

For REDUCE_SCATTER: output shape is [input.shape[0] / nb_rank, input.shape[1], ...] — the first dimension is divided by nb_rank. The first input dimension must be divisible by nb_rank.

For other operations, output has the same shape as input.
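The output-shape rules above can be expressed directly as a short helper. This is a sketch for illustration; the function name is not part of any API:

```python
def collective_output_shape(collective_operation, input_shape, nb_rank):
    # ALL_GATHER concatenates along the first dimension, so it is
    # multiplied by nb_rank; REDUCE_SCATTER splits it evenly.
    if collective_operation == "ALL_GATHER":
        return [nb_rank * input_shape[0]] + list(input_shape[1:])
    if collective_operation == "REDUCE_SCATTER":
        if input_shape[0] % nb_rank != 0:
            raise ValueError("first dimension must be divisible by nb_rank")
        return [input_shape[0] // nb_rank] + list(input_shape[1:])
    # ALL_REDUCE, BROADCAST, and REDUCE preserve the input shape.
    return list(input_shape)
```

For instance, with `nb_rank = 2` an input of shape `[4, 8]` produces `[8, 8]` for ALL_GATHER and `[2, 8]` for REDUCE_SCATTER.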

Data Types

T: float32, float16, bfloat16, float8, int64, int32, int8, uint8, bool

Shape Information

input tensor with shape \([d_0, d_1, ..., d_n]\), \(n \geq 1\).

output for ALL_GATHER: \([nb\_rank \cdot d_0, d_1, ..., d_n]\).

output for REDUCE_SCATTER: \([d_0 / nb\_rank, d_1, ..., d_n]\) where \(d_0\) must be divisible by nb_rank.

output for ALL_REDUCE, BROADCAST, REDUCE: same shape as input.

DLA Support

Not supported.

Multi-Device Runtime Requirements

To use DistCollective with nb_rank > 1:

  • DistCollective is supported on GPUs with SM >= 80 (Ampere or newer).

  • The engine must be built with PreviewFeature::kMULTIDEVICE_RUNTIME_10_16 enabled in the builder config.

  • An NCCL communicator must be initialized and set on the execution context via IExecutionContext::setCommunicator before inference.

  • All participating ranks must execute the same network with synchronized execution calls.

Examples

For a complete multi-device example, refer to sampleDistCollective.

C++ API

For more information about the C++ IDistCollectiveLayer operator, refer to the C++ IDistCollectiveLayer documentation.

Python API

For more information about the Python IDistCollectiveLayer operator, refer to the Python IDistCollectiveLayer documentation.