Multi-GPU Support
The feature enables parallelization techniques that involve multiple CUDA GPUs within a single process in the general case, and hybrid MPI techniques in particular (MPI + OpenACC/OpenMP/stdpar), allowing a single MPI rank to manage more than one CUDA GPU.
The implementation resides in the UCX library. Thus, the use of multiple CUDA GPUs is supported both in applications that use the UCX library directly and in MPI applications.
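The following sketch illustrates the pattern with a CUDA-aware MPI library layered on UCX. It is illustrative only: the two-GPUs-per-rank mapping, the one-rank-per-node assumption, and the buffer size are assumptions, and error checking is omitted.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define GPUS_PER_RANK 2           /* assumption: two GPUs per MPI rank */
    #define COUNT (1 << 20)

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* One rank manages several CUDA devices: allocate a buffer on each.
           The device numbering assumes one rank per node. */
        float *buf[GPUS_PER_RANK];
        for (int g = 0; g < GPUS_PER_RANK; ++g) {
            cudaSetDevice(g);
            cudaMalloc((void **)&buf[g], COUNT * sizeof(float));
            cudaMemset(buf[g], 0, COUNT * sizeof(float));
        }

        /* CUDA-aware MPI accepts device pointers from either GPU of this
           rank: send from the buffer on GPU 0, receive into the buffer
           on GPU 1. */
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        MPI_Sendrecv(buf[0], COUNT, MPI_FLOAT, next, 0,
                     buf[1], COUNT, MPI_FLOAT, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        for (int g = 0; g < GPUS_PER_RANK; ++g) {
            cudaSetDevice(g);
            cudaFree(buf[g]);
        }
        MPI_Finalize();
        return 0;
    }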
Along with this functionality, the requirements for setting the CUDA device (cudaSetDevice) in each CPU thread of an application process that uses the UCX library interfaces have also been relaxed. Previously, the user had to set the CUDA device in every thread, including the progress thread. Now, the user may set the CUDA device only once to utilize the CUDA features in UCX. Since CUDA devices in the CUDA Runtime API and CUcontexts in the CUDA Driver API are synonymous in this regard, the same relaxation applies to applications using the CUDA Driver API.
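The sketch below shows the relaxed pattern: the main thread selects the CUDA device once before initializing UCX, and the progress thread calls ucp_worker_progress without its own cudaSetDevice call. It is a minimal sketch; the feature and tag choices are arbitrary and error handling is omitted.

    #include <pthread.h>
    #include <cuda_runtime.h>
    #include <ucp/api/ucp.h>

    static ucp_worker_h worker;
    static volatile int running = 1;

    static void *progress_thread(void *arg)
    {
        (void)arg;
        /* Previously this thread also needed cudaSetDevice(); with the
           relaxed requirement it can progress UCX directly. */
        while (running)
            ucp_worker_progress(worker);
        return NULL;
    }

    int main(void)
    {
        cudaSetDevice(0);              /* set the CUDA device once, here only */

        ucp_params_t params = {0};
        params.field_mask = UCP_PARAM_FIELD_FEATURES;
        params.features   = UCP_FEATURE_TAG;

        ucp_context_h context;
        ucp_init(&params, NULL, &context);

        ucp_worker_params_t wparams = {0};
        wparams.field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE;
        wparams.thread_mode = UCS_THREAD_MODE_MULTI;
        ucp_worker_create(context, &wparams, &worker);

        pthread_t tid;
        pthread_create(&tid, NULL, progress_thread, NULL);

        /* ... create endpoints and exchange GPU memory via UCP here ... */

        running = 0;
        pthread_join(tid, NULL);
        ucp_worker_destroy(worker);
        ucp_cleanup(context);
        return 0;
    }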
By default, the UCX library uses the NICs closest to the CUDA GPU that is set before the library is initialized. This policy is optimal for applications that use one CUDA GPU per process. For applications that use multiple CUDA GPUs, the following combination of UCX parameters can be used to achieve the best performance:
UCX_SELECT_DISTANCE_MD=
UCX_CONNECT_ALL_TO_ALL=y
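These are environment parameters and are normally exported in the job launch environment. As a sketch, they can also be set programmatically before UCX is initialized (UCX reads them when its context is created, directly or inside MPI_Init). The empty value for UCX_SELECT_DISTANCE_MD simply mirrors the listing above.

    #include <stdlib.h>

    int main(void)
    {
        /* Set the UCX tuning parameters before the UCX context is created,
           e.g. before MPI_Init() in an MPI application. */
        setenv("UCX_SELECT_DISTANCE_MD", "", 1);  /* empty value, as listed above */
        setenv("UCX_CONNECT_ALL_TO_ALL", "y", 1);

        /* ... initialize MPI/UCX and run the multi-GPU workload ... */
        return 0;
    }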
The feature is not supported for memory allocated with the CUDA Virtual Memory Management (VMM) API if the device on which the memory is allocated and the device for which access rights are set do not match.
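For illustration, the sketch below constructs exactly that mismatch with the CUDA Driver API: physical memory is created on device 0, but access rights are granted to device 1. The device IDs and allocation size are arbitrary and error checking is omitted.

    #include <cuda.h>

    int main(void)
    {
        cuInit(0);

        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, 0);
        cuDevicePrimaryCtxRetain(&ctx, dev);
        cuCtxSetCurrent(ctx);

        /* Allocate physical memory on device 0 via the VMM API. */
        CUmemAllocationProp prop = {0};
        prop.type          = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id   = 0;               /* memory lives on device 0 */

        size_t gran = 0;
        cuMemGetAllocationGranularity(&gran, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);

        CUmemGenericAllocationHandle handle;
        cuMemCreate(&handle, gran, &prop, 0);

        CUdeviceptr ptr;
        cuMemAddressReserve(&ptr, gran, 0, 0, 0);
        cuMemMap(ptr, gran, 0, handle, 0);

        /* Grant access to device 1 only: the allocating device (0) and the
           device given access rights (1) differ, which is the unsupported
           case described above. */
        CUmemAccessDesc access = {0};
        access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        access.location.id   = 1;
        access.flags         = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
        cuMemSetAccess(ptr, gran, &access, 1);

        /* Passing `ptr` to UCX-based communication is not supported here. */
        cuMemUnmap(ptr, gran);
        cuMemAddressFree(ptr, gran);
        cuMemRelease(handle);
        return 0;
    }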