Collective Communication Methods
Methods on Communicator for collective communication. See
Collective Communication Functions for the corresponding C API.
allreduce
- Communicator.allreduce(sendbuf: Buffer | SupportsDLPack | SupportsCAI, recvbuf: Buffer | SupportsDLPack | SupportsCAI, op: NcclRedOp | CustomRedOp, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
All-reduce variant of reduce().
Equivalent to reduce(sendbuf, recvbuf, op, root=None, stream=stream): reduces data across all ranks and stores identical copies in each rank’s recvbuf. See reduce() for argument semantics.
broadcast
- Communicator.broadcast(sendbuf: Buffer | SupportsDLPack | SupportsCAI | Any, recvbuf: Buffer | SupportsDLPack | SupportsCAI, root: int, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
Copies data from sendbuf on the root rank to all ranks’ recvbuf. sendbuf is only used on the root rank and is ignored on other ranks.
On the root rank, both buffers must have matching data types and sendcount == recvcount. Element count is inferred from recvbuf: count = recvcount. In-place operation occurs when sendbuf and recvbuf resolve to the same device memory address.
- Parameters:
sendbuf – Source buffer (only used on the root rank).
recvbuf – Destination buffer that will receive the broadcast data.
root – Root rank that broadcasts the data (0 to nranks - 1).
stream – CUDA stream for the operation. Defaults to None (the default stream).
- Raises:
NcclInvalid – If send and receive buffers have mismatched dtypes, mismatched counts, are on the wrong device, are invalid specifications, or the communicator is not initialized.
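The broadcast rules above can be illustrated with a small pure-Python model. This is only a sketch of the documented semantics, not a call into the library: `broadcast_model` is a hypothetical helper, and each rank's device buffer is stood in for by a plain Python list.

```python
def broadcast_model(sendbufs, recvbufs, root):
    """Hypothetical model of broadcast; one list per rank replaces a device buffer."""
    nranks = len(recvbufs)
    if not 0 <= root < nranks:
        raise ValueError("root must be in the range 0 to nranks - 1")
    count = len(recvbufs[root])  # count = recvcount
    if len(sendbufs[root]) != count:
        raise ValueError("root rank requires sendcount == recvcount")
    payload = list(sendbufs[root])  # snapshot, so an in-place root buffer is safe
    for rank in range(nranks):
        # sendbufs[rank] is never read for rank != root
        recvbufs[rank][:count] = payload
    return recvbufs


# Three ranks; only the root (rank 1) supplies a send buffer.
send = [None, [7, 8, 9], None]
recv = [[0, 0, 0] for _ in range(3)]
broadcast_model(send, recv, root=1)
```

Passing `None` for the non-root send buffers mirrors the statement that sendbuf is ignored on those ranks.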
reduce
- Communicator.reduce(sendbuf: Buffer | SupportsDLPack | SupportsCAI, recvbuf: Buffer | SupportsDLPack | SupportsCAI | Any, op: NcclRedOp | CustomRedOp, root: int | None = None, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
Reduces data from all ranks using the specified operation.
Supports two modes. In AllReduce mode (root is None) all ranks receive the reduced result in recvbuf. In Reduce mode (root specified) only the root rank receives the reduced result; recvbuf is ignored on other ranks.
Both buffers must have matching data types where used. Element count is inferred from sendbuf: count = sendcount. In AllReduce mode, all ranks must have recvcount >= sendcount; in Reduce mode, only the root rank requires recvcount >= sendcount. In-place operation occurs when sendbuf and recvbuf resolve to the same device memory address.
- Parameters:
sendbuf – Source buffer containing data to be reduced.
recvbuf – Destination buffer for the reduced result. Only used on the root rank in Reduce mode.
op – Reduction operator (e.g. SUM, MAX, MIN, AVG, PROD, or a CustomRedOp).
root – Root rank that receives the reduced result (0 to nranks - 1). If None, performs an all-reduce. Defaults to None.
stream – CUDA stream for the operation. Defaults to None (the default stream).
- Raises:
NcclInvalid – If send and receive buffers have mismatched dtypes, mismatched counts, are on the wrong device, are invalid specifications, or the communicator is not initialized.
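To make the two modes concrete, the following pure-Python sketch models the documented behaviour with lists in place of device buffers. `reduce_model` and the `OPS` table are hypothetical names used only for illustration, and the model assumes every rank supplies the same sendcount.

```python
from functools import reduce as fold
import operator

# Hypothetical stand-ins for a few reduction operators.
OPS = {"SUM": operator.add, "PROD": operator.mul, "MAX": max, "MIN": min}


def reduce_model(sendbufs, recvbufs, op, root=None):
    """Model of reduce(): root=None is AllReduce mode, otherwise Reduce mode."""
    nranks = len(sendbufs)
    count = len(sendbufs[0])  # count = sendcount
    if any(len(s) != count for s in sendbufs):
        raise ValueError("model assumes the same sendcount on every rank")
    # Element-wise reduction across ranks.
    result = [fold(OPS[op], [s[i] for s in sendbufs]) for i in range(count)]
    targets = range(nranks) if root is None else [root]
    for rank in targets:
        if len(recvbufs[rank]) < count:
            raise ValueError("recvcount must be >= sendcount where recvbuf is used")
        recvbufs[rank][:count] = result
    return recvbufs


send = [[1, 2], [3, 4], [5, 6]]

# AllReduce mode: every rank receives the reduced result.
all_recv = [[0, 0] for _ in range(3)]
reduce_model(send, all_recv, "SUM")

# Reduce mode: only the root's recvbuf is touched; the others may be absent.
root_recv = [None, None, [0, 0]]
reduce_model(send, root_recv, "MAX", root=2)
```

In the Reduce-mode call, the `None` entries are never accessed, which matches the rule that recvbuf is ignored on non-root ranks.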
allgather
- Communicator.allgather(sendbuf: Buffer | SupportsDLPack | SupportsCAI, recvbuf: Buffer | SupportsDLPack | SupportsCAI, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
All-gather variant of gather().
Equivalent to gather(sendbuf, recvbuf, root=None, stream=stream): gathers sendcount values from each rank and places identical copies of the concatenated result in every rank’s recvbuf. See gather() for argument semantics.
reduce_scatter
- Communicator.reduce_scatter(sendbuf: Buffer | SupportsDLPack | SupportsCAI, recvbuf: Buffer | SupportsDLPack | SupportsCAI, op: NcclRedOp | CustomRedOp, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
Reduces data from all ranks and scatters the result across ranks.
Each rank receives a different portion of the reduced result: rank i receives the i-th block in its recvbuf.
Both buffers must have matching data types. Element count is inferred from sendbuf: count = sendcount / nranks. sendcount must be >= nranks and recvcount must be >= count. In-place operation occurs when recvbuf resolves to sendbuf_address + rank * count.
- Parameters:
sendbuf – Source buffer (size >= nranks * recvcount elements).
recvbuf – Destination buffer with recvcount elements.
op – Reduction operator (e.g. SUM, MAX, MIN, AVG, PROD, or a CustomRedOp).
stream – CUDA stream for the operation. Defaults to None (the default stream).
- Raises:
NcclInvalid – If send and receive buffers have mismatched dtypes, sendbuf is too small, are on the wrong device, are invalid specifications, or the communicator is not initialized.
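The block layout described above can be sketched in pure Python. `reduce_scatter_model` is a hypothetical helper, lists stand in for device buffers, the reduction is fixed to a sum for brevity, and the use of floor division for count = sendcount / nranks is an assumption (the text does not say how a remainder is handled).

```python
def reduce_scatter_model(sendbufs, recvbufs):
    """Model of reduce_scatter with a SUM reduction over per-rank lists."""
    nranks = len(sendbufs)
    sendcount = len(sendbufs[0])
    if sendcount < nranks:
        raise ValueError("sendcount must be >= nranks")
    count = sendcount // nranks  # assumption: floor division
    for rank in range(nranks):
        if len(recvbufs[rank]) < count:
            raise ValueError("recvcount must be >= count")
        start = rank * count  # rank i receives the i-th block
        recvbufs[rank][:count] = [
            sum(s[start + i] for s in sendbufs) for i in range(count)
        ]
    return recvbufs


# Two ranks, four elements each, so count = 2.
send = [[1, 2, 3, 4], [10, 20, 30, 40]]
recv = [[0, 0], [0, 0]]
reduce_scatter_model(send, recv)
# rank 0 holds the reduced block 0, rank 1 holds the reduced block 1
```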
alltoall
- Communicator.alltoall(sendbuf: Buffer | SupportsDLPack | SupportsCAI, recvbuf: Buffer | SupportsDLPack | SupportsCAI, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
Each rank sends and receives count values to and from every other rank.
Data sent to destination rank j is taken from sendbuf + j * count and data received from source rank i is placed at recvbuf + i * count.
Both buffers must have matching data types. Element count is inferred from sendbuf: count = sendcount / nranks. sendcount must be >= nranks and recvcount must be >= sendcount.
- Parameters:
sendbuf – Source buffer (size >= nranks * count elements).
recvbuf – Destination buffer (size >= nranks * count elements).
stream – CUDA stream for the operation. Defaults to None (the default stream).
- Raises:
NcclInvalid – If send and receive buffers have mismatched dtypes, buffer sizes are incompatible with nranks, are on the wrong device, are invalid specifications, or the communicator is not initialized.
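The offset rule for all-to-all amounts to a block transpose across ranks. The following pure-Python sketch shows that under stated assumptions: `alltoall_model` is a hypothetical helper, lists replace device buffers, send and receive lists are distinct objects, and floor division is assumed for count = sendcount / nranks.

```python
def alltoall_model(sendbufs, recvbufs):
    """Model of alltoall: block j of rank i's sendbuf lands in block i of rank j's recvbuf."""
    nranks = len(sendbufs)
    sendcount = len(sendbufs[0])
    if sendcount < nranks:
        raise ValueError("sendcount must be >= nranks")
    count = sendcount // nranks  # assumption: floor division
    for src in range(nranks):
        for dst in range(nranks):
            if len(recvbufs[dst]) < sendcount:
                raise ValueError("recvcount must be >= sendcount")
            block = sendbufs[src][dst * count:(dst + 1) * count]
            recvbufs[dst][src * count:(src + 1) * count] = block
    return recvbufs


# Three ranks with count = 1: the tens digit is the sender, the units digit the receiver.
send = [[0, 1, 2], [10, 11, 12], [20, 21, 22]]
recv = [[None] * 3 for _ in range(3)]
alltoall_model(send, recv)
```

After the call, each rank's receive list contains exactly the elements addressed to it, ordered by source rank.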
gather
- Communicator.gather(sendbuf: Buffer | SupportsDLPack | SupportsCAI, recvbuf: Buffer | SupportsDLPack | SupportsCAI | Any, root: int | None = None, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
Gathers sendcount values from all ranks.
Supports two modes. In AllGather mode (root is None) values are gathered from all ranks and identical copies of the result are placed in each recvbuf. In Gather mode (root specified) values are gathered to the specified root rank only; recvbuf is ignored on other ranks.
Both buffers must have matching data types where used. Element count is inferred from sendbuf: count = sendcount. Data from rank i is placed at recvbuf + i * sendcount. AllGather mode requires recvcount >= nranks * sendcount on every rank; Gather mode requires it only on the root rank.
In-place operation occurs when sendbuf resolves to recvbuf_address + rank * sendcount in AllGather mode, or to recvbuf_address + root * sendcount in Gather mode.
- Parameters:
sendbuf – Source buffer containing sendcount elements.
recvbuf – Destination buffer (size >= nranks * sendcount elements). In Gather mode, only used on the root rank.
root – Root rank that receives the gathered data (0 to nranks - 1). If None, performs an all-gather. Defaults to None.
stream – CUDA stream for the operation. Defaults to None (the default stream).
- Raises:
NcclInvalid – If send and receive buffers have mismatched dtypes, recvbuf is too small, are on the wrong device, are invalid specifications, or the communicator is not initialized.
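As a rough illustration of the placement rule (data from rank i at offset i * sendcount), here is a pure-Python model of both modes. `gather_model` is a hypothetical helper, lists stand in for device buffers, and the model assumes every rank contributes the same sendcount.

```python
def gather_model(sendbufs, recvbufs, root=None):
    """Model of gather(): root=None is AllGather mode, otherwise Gather mode."""
    nranks = len(sendbufs)
    sendcount = len(sendbufs[0])  # count = sendcount
    if any(len(s) != sendcount for s in sendbufs):
        raise ValueError("model assumes the same sendcount on every rank")
    total = nranks * sendcount
    targets = range(nranks) if root is None else [root]
    for rank in targets:
        if len(recvbufs[rank]) < total:
            raise ValueError("recvcount must be >= nranks * sendcount")
        for src in range(nranks):
            # Data from rank src is placed at offset src * sendcount.
            start = src * sendcount
            recvbufs[rank][start:start + sendcount] = sendbufs[src]
    return recvbufs


send = [[1, 2], [3, 4]]

# AllGather mode: identical concatenated copies on every rank.
all_recv = [[0] * 4 for _ in range(2)]
gather_model(send, all_recv)

# Gather mode: only the root's recvbuf is written.
root_recv = [[0] * 4, None]
gather_model(send, root_recv, root=0)
```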
scatter
- Communicator.scatter(sendbuf: Buffer | SupportsDLPack | SupportsCAI | Any, recvbuf: Buffer | SupportsDLPack | SupportsCAI, root: int, *, stream: Stream | cuda.core.typing.IsStreamType | int | None = None) -> None
Scatters data from the root rank to all ranks.
Each rank receives count elements from the root rank. On the root rank, count elements from sendbuf + i * count are sent to rank i. sendbuf is not used on non-root ranks.
On the root rank, both buffers must have matching data types. Element count is inferred from recvbuf: count = recvcount. The root rank requires sendcount >= nranks and sendcount / nranks == recvcount. In-place operation occurs when recvbuf resolves to sendbuf_address + root * count.
- Parameters:
sendbuf – Source buffer (only used on the root rank, size >= nranks * count elements).
recvbuf – Destination buffer with count elements.
root – Root rank that scatters the data (0 to nranks - 1).
stream – CUDA stream for the operation. Defaults to None (the default stream).
- Raises:
NcclInvalid – If send and receive buffers have mismatched dtypes, sendbuf is too small on the root rank, are on the wrong device, are invalid specifications, or the communicator is not initialized.
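The scatter layout can be sketched the same way. `scatter_model` is a hypothetical pure-Python helper with lists in place of device buffers; it assumes the check sendcount / nranks == recvcount uses integer division, which the text does not state explicitly.

```python
def scatter_model(sendbufs, recvbufs, root):
    """Model of scatter: the root sends block i of its sendbuf to rank i."""
    nranks = len(recvbufs)
    if not 0 <= root < nranks:
        raise ValueError("root must be in the range 0 to nranks - 1")
    count = len(recvbufs[root])  # count = recvcount
    sendcount = len(sendbufs[root])
    # Assumption: the documented check is an integer division.
    if sendcount < nranks or sendcount // nranks != count:
        raise ValueError("root requires sendcount >= nranks and sendcount / nranks == recvcount")
    source = list(sendbufs[root])  # snapshot, so an in-place root buffer is safe
    for rank in range(nranks):
        # sendbufs[rank] is never read for rank != root
        recvbufs[rank][:count] = source[rank * count:(rank + 1) * count]
    return recvbufs


# Three ranks, count = 2; only the root (rank 1) supplies a send buffer.
send = [None, [1, 2, 3, 4, 5, 6], None]
recv = [[0, 0] for _ in range(3)]
scatter_model(send, recv, root=1)
```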
create_pre_mul_sum
- Communicator.create_pre_mul_sum(scalar: int | float | numpy.ndarray | Buffer | SupportsDLPack | SupportsCAI, datatype: NcclDataType | None = None) -> CustomRedOp
Creates a PreMulSum custom reduction operator.
Performs output = scalar * sum(inputs) and is useful for averaging (scalar = 1/N) or weighted reductions. The returned CustomRedOp is tracked by the communicator and may be released explicitly via its close() method, or automatically when the communicator is destroyed or aborted.
- Parameters:
scalar – Scalar multiplier value. A Python int or float is converted to a NumPy array using host memory. A NumPy array must contain exactly 1 element and uses host memory. An NcclSupportedBuffer is treated as a device buffer with exactly 1 element.
datatype – NCCL data type of the scalar and reduction. If None, it is inferred from scalar: Python int becomes int64 and Python float becomes float64 (NumPy’s natural dtypes); a NumPy array uses the array’s dtype; a device buffer uses the buffer’s dtype.
- Returns:
CustomRedOp for the PreMulSum operator.
- Raises:
NcclInvalid – If the communicator is not initialized; the scalar type is unsupported; the NumPy array or device buffer does not contain exactly 1 element; or the requested datatype does not match a device buffer’s dtype.
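The arithmetic behind PreMulSum, output = scalar * sum(inputs), can be checked with a small pure-Python model. `pre_mul_sum_model` is a hypothetical helper that only mirrors the formula stated above; it does not create a CustomRedOp or touch the library.

```python
def pre_mul_sum_model(inputs, scalar):
    """Element-wise scalar * sum(inputs), with one list per rank."""
    count = len(inputs[0])
    if any(len(buf) != count for buf in inputs):
        raise ValueError("model assumes equal-length inputs")
    return [scalar * sum(buf[i] for buf in inputs) for i in range(count)]


# Averaging across N = 2 ranks by choosing scalar = 1/N.
per_rank = [[2.0, 4.0], [6.0, 8.0]]
average = pre_mul_sum_model(per_rank, scalar=1 / len(per_rank))
```

With scalar = 1/N the result is the per-element mean across ranks, which is the averaging use case mentioned above.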