date 2024-01-25

Implements a distributed AllReduceV primitive. It is based on the idea of a single global tensor which can be distributed along a specified dimension into chunks of variable size. This primitive assumes that each rank holds a different global tensor of the same shape. It then redistributes the chunks of all these tensors such that each rank receives, from every rank, the chunk assigned to it. Each rank then sums the chunks it has received. By design, this primitive thus implements the backward pass of the “all_gather_v” primitive. In that case, the result is a single global gradient tensor distributed across the ranks.

Parameters

tensor (torch.Tensor) – global tensor on each rank (different one on each rank)

sizes (List[int]) – list of the sizes of each chunk on each rank along distributed dimension, valid and set on each rank

dim (int, optional) – dimension along which global tensor is distributed, by default 0

use_fp32 (bool, optional) – flag to specify FP32 precision for the reduction, by default True

group (Optional[dist.ProcessGroup], optional) – process group along which global tensor is shared, by default None

Returns

local tensor, i.e. result of reduction of all corresponding chunks from all global tensors for each rank separately

Return type

torch.Tensor
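The redistribution and summation described above can be checked with a small single-process reference. This is only a sketch of the semantics, not the library's implementation: it uses plain Python lists in place of tensors, covers only a one-dimensional tensor distributed along dim 0, and the helper name `simulate_all_reduce_v` is hypothetical.

```python
from typing import List


def simulate_all_reduce_v(
    tensors: List[List[float]], sizes: List[int]
) -> List[List[float]]:
    """Single-process reference of the AllReduceV semantics (hypothetical helper).

    tensors[r] is the 1-D global tensor held by rank r.
    sizes[r] is the length of the chunk that rank r owns along dim 0.
    Returns one local result per rank.
    """
    world_size = len(tensors)
    if len(sizes) != world_size:
        raise ValueError("need exactly one chunk size per rank")
    if any(len(t) != sum(sizes) for t in tensors):
        raise ValueError("every global tensor must have length sum(sizes)")

    results = []
    offset = 0
    for size in sizes:
        # The slice [offset, offset + size) is the chunk owned by this destination rank.
        chunks = [t[offset:offset + size] for t in tensors]
        # Element-wise sum of that chunk over all source ranks.
        results.append([sum(vals) for vals in zip(*chunks)])
        offset += size
    return results


# Three ranks, global length 6, chunk sizes 1, 2, and 3 along dim 0.
sizes = [1, 2, 3]
tensors = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],              # held by rank 0
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],        # held by rank 1
    [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],  # held by rank 2
]
local = simulate_all_reduce_v(tensors, sizes)
print(local)  # [[111.0], [222.0, 333.0], [444.0, 555.0, 666.0]]
```

Each rank ends up with a tensor whose length equals its own entry in `sizes`. Read as the backward pass of “all_gather_v”, each `tensors[r]` plays the role of the gradient of the gathered tensor on rank r, and `local[r]` is the gradient of the chunk that rank r contributed.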