# Vector Dot Product

This example shows how to add a custom parallelization to the dot product of two vectors.

Since the dot product needs a global reduction, each shard is processed independently until the last step, an all-reduce that combines the partial results.
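The structure of the sharded computation can be sketched without a distributed runtime. This is an illustrative sketch only (not the contents of `vector_add_sharded.py`): each "rank" computes a partial dot product over its own shard independently, and the final all-reduce is simulated by summing the partials.

```python
def shard(vec, num_ranks, rank):
    """Return the contiguous slice of `vec` owned by `rank`."""
    chunk = (len(vec) + num_ranks - 1) // num_ranks
    return vec[rank * chunk:(rank + 1) * chunk]

def sharded_dot(x, y, num_ranks=8):
    # Independent per-rank work: each rank computes a partial dot
    # product over its shard, with no communication needed.
    partials = [
        sum(a * b for a, b in zip(shard(x, num_ranks, r),
                                  shard(y, num_ranks, r)))
        for r in range(num_ranks)
    ]
    # Global reduction: the all-reduce collapses to a plain sum here.
    return sum(partials)

x = list(range(16))
y = [2.0] * 16
# Agrees with the unsharded dot product: sum(a * b for a, b in zip(x, y))
print(sharded_dot(x, y))
```

In the real sharded script, the per-rank partials would live on separate processes and the final sum would be a collective all-reduce across all 8 ranks.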

Run the baseline example:

```shell
python vector_add_baseline.py
```

And run the sharded example:

```shell
torchrun --nproc-per-node 8 vector_add_sharded.py
```