→ torch.Tensor

Linear layer execution with asynchronous communication and gradient accumulation fusion in backprop.

This can optionally accumulate the result of the backprop calculation directly into an existing gradient buffer, avoiding a separate addition kernel after the gradient calculation.
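The accumulation idea can be sketched in plain PyTorch. This is an illustration only, not the kernel this function uses: `torch.Tensor.addmm_` stands in for a fused accumulate, and the tensor names and shapes (`total_input`, `grad_output`, `main_grad`) are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Hypothetical shapes: 8 tokens, 4 input features, 3 output features.
total_input = torch.randn(8, 4)
grad_output = torch.randn(8, 3)

# Hypothetical pre-existing gradient buffer, shaped like the weight (out, in).
main_grad = torch.ones(3, 4)
expected = main_grad + grad_output.t() @ total_input

# Unfused: compute the weight gradient, then add it in a separate step.
unfused = main_grad.clone()
grad_weight = grad_output.t().matmul(total_input)
unfused.add_(grad_weight)

# Accumulating form: one call computes grad_output^T @ total_input
# and adds it into the existing buffer in place.
fused = main_grad.clone()
fused.addmm_(grad_output.t(), total_input)
```

Both paths produce the same buffer contents; the second avoids materializing `grad_weight` and the extra add.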

Additionally, the tensor-parallel all-reduce of the input gradients can be done asynchronously with the calculation of the weight gradients.

In the case of sequence parallelism, the reduce-scatter of the input gradients is done asynchronously with the calculation of the weight gradients.
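The overlap described above can be sketched with `torch.distributed`. This is a minimal single-process illustration under stated assumptions, not this function's implementation: the helper names `backward_with_overlap` and `run_single_process_demo`, the `gloo` backend, the file-based rendezvous, and all shapes are hypothetical. It shows only the all-reduce case; with sequence parallelism a reduce-scatter would take its place.

```python
import os
import tempfile

import torch
import torch.distributed as dist


def backward_with_overlap(grad_output, total_input, weight):
    """Hypothetical helper: overlap the input-gradient all-reduce
    with the weight-gradient matmul."""
    # Gradient w.r.t. the input: (tokens, out) @ (out, in) -> (tokens, in).
    grad_input = grad_output.matmul(weight)
    # Launch the all-reduce without blocking; a work handle is returned.
    handle = dist.all_reduce(grad_input, async_op=True)
    # Compute the weight gradient while the communication is in flight.
    grad_weight = grad_output.t().matmul(total_input)
    # Block until the reduced input gradient is ready to use.
    handle.wait()
    return grad_input, grad_weight


def run_single_process_demo():
    """Run the helper in a world of size 1 so it executes without a launcher."""
    torch.manual_seed(0)
    total_input = torch.randn(8, 4)
    grad_output = torch.randn(8, 3)
    weight = torch.randn(3, 4)
    ref_grad_input = grad_output.matmul(weight)
    ref_grad_weight = grad_output.t().matmul(total_input)

    # File-based rendezvous in a fresh temporary directory (assumption).
    store_path = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        backend="gloo",
        init_method=f"file://{store_path}",
        rank=0,
        world_size=1,
    )
    try:
        grad_input, grad_weight = backward_with_overlap(
            grad_output, total_input, weight
        )
    finally:
        dist.destroy_process_group()
    return grad_input, grad_weight, ref_grad_input, ref_grad_weight
```

With a world size of 1 the sum over ranks leaves `grad_input` unchanged, so the demo checks the control flow rather than any speedup.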

Use of this module requires that the environment variable CUDA_DEVICE_MAX_CONNECTIONS be set to 1. A few collective operations, noted in the code, should be scheduled before the compute kernels so that the communication overlaps with the computation. This ordering is necessary for a speedup but not for correctness, so the scheduler does not impose it on its own. Setting CUDA_DEVICE_MAX_CONNECTIONS=1 forces the kernels to be scheduled in the order they are called.
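One way to satisfy the requirement from Python is sketched below. It assumes the variable is set before any CUDA work begins in the process, since such settings are typically read when CUDA initializes; `check_max_connections` is a hypothetical helper, not part of this module.

```python
import os

# Set this early, before any CUDA work happens in this process (assumption:
# the value is read when CUDA initializes).
os.environ["CUDA_DEVICE_MAX_CONNECTIONS"] = "1"


def check_max_connections() -> None:
    """Hypothetical guard: fail fast if the variable is missing or not 1."""
    value = os.environ.get("CUDA_DEVICE_MAX_CONNECTIONS")
    if value != "1":
        raise RuntimeError(
            "CUDA_DEVICE_MAX_CONNECTIONS must be set to 1, "
            f"got {value!r}"
        )


check_max_connections()
```

Exporting the variable in the launch script (for example `export CUDA_DEVICE_MAX_CONNECTIONS=1` in the shell) achieves the same thing without touching the Python code.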