DistCollective¶
Performs distributed collective communication operations across multiple GPUs using NCCL (NVIDIA Collective Communications Library) at runtime.
Attributes¶
collective_operation The collective operation type. Can be one of:

ALL_REDUCE: All ranks reduce their input tensors using the specified reduce operation, and all ranks receive the same result.

ALL_GATHER: All ranks gather tensors from all other ranks, concatenating them along the first dimension.

BROADCAST: The root rank broadcasts its input tensor to all other ranks.

REDUCE: All ranks reduce their input tensors to the root rank using the specified reduce operation.

REDUCE_SCATTER: All ranks reduce their input tensors and scatter the result, with each rank receiving a portion along the first dimension.
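The semantics of the five operation types can be sketched in plain Python, with each outer list element standing for one rank's tensor (here a flat list of numbers). This is an illustration of the data movement only, not TensorRT or NCCL code; all function names are hypothetical.

```python
# Each outer list element stands for one rank's input tensor.
# All helpers below are illustrative sketches, not TensorRT APIs.

def all_reduce(rank_inputs, op=sum):
    # Every rank receives the same elementwise reduction of all inputs.
    reduced = [op(vals) for vals in zip(*rank_inputs)]
    return [list(reduced) for _ in rank_inputs]

def all_gather(rank_inputs):
    # Every rank receives all ranks' tensors concatenated along dim 0.
    gathered = [x for tensor in rank_inputs for x in tensor]
    return [list(gathered) for _ in rank_inputs]

def broadcast(rank_inputs, root):
    # Every rank receives a copy of the root rank's tensor.
    return [list(rank_inputs[root]) for _ in rank_inputs]

def reduce_(rank_inputs, root, op=sum):
    # Only the root rank receives the reduction; other ranks get no output.
    reduced = [op(vals) for vals in zip(*rank_inputs)]
    return [reduced if r == root else None for r in range(len(rank_inputs))]

def reduce_scatter(rank_inputs, op=sum):
    # The reduction is split along dim 0: rank r receives slice r.
    nb_rank = len(rank_inputs)
    reduced = [op(vals) for vals in zip(*rank_inputs)]
    chunk = len(reduced) // nb_rank  # dim 0 must be divisible by nb_rank
    return [reduced[r * chunk:(r + 1) * chunk] for r in range(nb_rank)]

# Two ranks, each holding a two-element tensor:
inputs = [[1, 2], [10, 20]]
print(all_reduce(inputs))      # [[11, 22], [11, 22]]
print(all_gather(inputs))      # [[1, 2, 10, 20], [1, 2, 10, 20]]
print(broadcast(inputs, 0))    # [[1, 2], [1, 2]]
print(reduce_(inputs, 1))      # [None, [11, 22]]
print(reduce_scatter(inputs))  # [[11], [22]]
```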
reduce_op The reduction operation for reduction-type collectives. Can be one of SUM, PROD, MAX, MIN, AVG, or NONE for non-reduction operations (ALL_GATHER and BROADCAST).
root The root rank for root-based operations (BROADCAST, REDUCE). Must be >= 0 for these operations. Use -1 for operations that do not require a root.
nb_rank The number of ranks participating in the collective (default: 1). When > 1, enables multi-device execution. Required for proper output shape computation for ALL_GATHER and REDUCE_SCATTER. Once set to a value > 1, it cannot be changed.
groups Optional array of rank IDs defining a communication group. If empty or not provided, all ranks participate.
group_size Number of elements in the groups array. Must be 0 when groups is empty; must be > 0 when groups is non-empty.
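The coupling between groups and group_size can be stated as a small predicate. This is a hypothetical validation helper written for illustration, not a TensorRT API.

```python
# Hypothetical helper illustrating the groups / group_size constraint:
# group_size must be 0 for an empty groups array, and must equal the
# number of rank IDs (hence > 0) when groups is non-empty.

def groups_consistent(groups, group_size):
    if not groups:
        return group_size == 0
    return group_size == len(groups)

print(groups_consistent([], 0))      # True: all ranks participate
print(groups_consistent([0, 1], 2))  # True
print(groups_consistent([0, 1], 0))  # False: size must match
```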
Inputs¶
input tensor of type T. The input tensor to the collective operation.
For BROADCAST, only the root rank has input; non-root ranks do not have input tensors.
Outputs¶
output tensor of type T.
For REDUCE, only the root rank has output; non-root ranks do not have output tensors.
For ALL_GATHER: output shape is [nb_rank * input.shape[0], input.shape[1], ...] — the first dimension is multiplied by nb_rank.
For REDUCE_SCATTER: output shape is [input.shape[0] / nb_rank, input.shape[1], ...] — the first dimension is divided by nb_rank. The first input dimension must be divisible by nb_rank.
For other operations, output has the same shape as input.
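The output-shape rules above can be summarized as a plain Python helper. The function name and signature are illustrative only; they are not part of the TensorRT API.

```python
# Illustrative shape-inference sketch for the rules above
# (not a TensorRT API).

def collective_output_shape(op, input_shape, nb_rank):
    d0, rest = input_shape[0], list(input_shape[1:])
    if op == "ALL_GATHER":
        # First dimension is multiplied by nb_rank.
        return [nb_rank * d0] + rest
    if op == "REDUCE_SCATTER":
        # First dimension is divided by nb_rank and must divide evenly.
        if d0 % nb_rank != 0:
            raise ValueError(f"dim 0 ({d0}) not divisible by nb_rank ({nb_rank})")
        return [d0 // nb_rank] + rest
    # ALL_REDUCE, BROADCAST, REDUCE: output shape matches input shape.
    return [d0] + rest

print(collective_output_shape("ALL_GATHER", [4, 8], 2))      # [8, 8]
print(collective_output_shape("REDUCE_SCATTER", [4, 8], 2))  # [2, 8]
print(collective_output_shape("ALL_REDUCE", [4, 8], 2))      # [4, 8]
```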
Data Types¶
T: float32, float16, bfloat16, float8, int64, int32, int8, uint8, bool
Shape Information¶
input tensor with shape \([d_0, d_1, ..., d_n]\), \(n \geq 1\).
output for ALL_GATHER: \([nb\_rank \cdot d_0, d_1, ..., d_n]\).
output for REDUCE_SCATTER: \([d_0 / nb\_rank, d_1, ..., d_n]\) where \(d_0\) must be divisible by nb_rank.
output for ALL_REDUCE, BROADCAST, REDUCE: same shape as input.
DLA Support¶
Not supported.
Multi-Device Runtime Requirements¶
To use DistCollective with nb_rank > 1:
DistCollective is supported on GPUs with SM >= 80 (Ampere or newer).
The engine must be built with PreviewFeature::kMULTIDEVICE_RUNTIME_10_16 enabled in the builder config.

An NCCL communicator must be initialized and set on the execution context via IExecutionContext::setCommunicator before inference.

All participating ranks must execute the same network with synchronized execution calls.
Examples¶
For a complete multi-device example, refer to sampleDistCollective.
C++ API¶
For more information about the C++ IDistCollectiveLayer operator, refer to the C++ IDistCollectiveLayer documentation.
Python API¶
For more information about the Python IDistCollectiveLayer operator, refer to the Python IDistCollectiveLayer documentation.