CPU/GPU Bcast
This feature implements the MCAST Bcast algorithm in UCC. The algorithm is disabled by default; to activate it, users must set the following environment variables (shown here as mpirun -x options):
-x UCC_TL_MLX5_MCAST_NET_DEVICE=<HCA> (e.g., mlx5_0:1)
-x UCC_TL_MLX5_MCAST_ENABLE=1 (Enables MCAST algorithms in TL_MLX5)
-x UCC_TL_MLX5_MIN_TEAM_SIZE=N (N must be at least 2 and at most the number of processes in the job)
-x UCC_TL_MLX5_TUNE=inf (Sets the maximum priority for all MLX5 algorithms)
Additionally, users should adjust the following Open MPI variables:
-x OMPI_UCC_CL_BASIC_TLS=^sharp,nccl
-x OMPI_UCC_CL_HIER_NODE_LEADERS_SBGP_TLS=^sharp,nccl,shm,cuda
Alternatively, users can customize the algorithm tuning for specific memory types by configuring the UCC_TL_MLX5_TUNE variable:
-x UCC_TL_MLX5_TUNE=bcast:host:inf#cuda,cuda_managed:0 (Sets maximum priority for Bcast algorithms on host memory and disables MLX5 for CUDA and CUDA managed memory)
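As a rough sketch, the options above can be combined into a single Open MPI command line. The process count, the team size of 2, and the application path ./my_bcast_app are placeholders chosen for illustration, not recommended values, and mlx5_0:1 is just the example device from the text. The script only assembles and prints the command; it does not launch anything.

```shell
#!/bin/sh
# Sketch only: builds and prints an mpirun command line, does not run it.
# NP, APP, and the team size of 2 are assumptions, not values from the source.
HCA="${HCA:-mlx5_0:1}"          # HCA and port (example value from the text)
NP="${NP:-4}"                   # hypothetical process count
APP="${APP:-./my_bcast_app}"    # hypothetical application binary

# Tuning string, one of the two forms described above:
#   inf                                 maximum priority for all MLX5 algorithms
#   bcast:host:inf#cuda,cuda_managed:0  Bcast on host memory only, MLX5 off for CUDA memory
TUNE="${TUNE:-inf}"

# UCC TL_MLX5 multicast settings
OPTS="-x UCC_TL_MLX5_MCAST_NET_DEVICE=${HCA}"
OPTS="${OPTS} -x UCC_TL_MLX5_MCAST_ENABLE=1"
OPTS="${OPTS} -x UCC_TL_MLX5_MIN_TEAM_SIZE=2"
OPTS="${OPTS} -x UCC_TL_MLX5_TUNE=${TUNE}"

# Open MPI variables listed in the text
OPTS="${OPTS} -x OMPI_UCC_CL_BASIC_TLS=^sharp,nccl"
OPTS="${OPTS} -x OMPI_UCC_CL_HIER_NODE_LEADERS_SBGP_TLS=^sharp,nccl,shm,cuda"

CMD="mpirun -np ${NP} ${OPTS} ${APP}"
printf '%s\n' "${CMD}"
```

Running the script with TUNE='bcast:host:inf#cuda,cuda_managed:0' set in the environment prints the per-memory-type variant instead. Quoting that value is a safe habit because it contains a # character.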