The NCCL UCX plugin, when enabled, replaces NCCL's default verbs-based inter-node communication routines with UCX-based communication routines.

To use the NCCL UCX plugin:

For NCCL to detect the network plugin, make sure to add plugin_install_dir to the library search path environment variable, as shown below.

    # libnccl_net.so is in <plugin_install_dir>/lib
    $ export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
    $ <run command>

Enable the UCX plugin by defining the NCCL_PLUGIN_P2P=ucx environment variable.

    $ export NCCL_PLUGIN_P2P=ucx
    $ <run command>
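The two steps above can be combined into a small setup script. The following is a minimal sketch, not an official HPC-X script. The PLUGIN_INSTALL_DIR variable and its default path are hypothetical placeholders for your actual plugin install prefix, and the file check is an added convenience that only warns.

```shell
#!/bin/sh
# Hypothetical variable and default path: point this at your real plugin install prefix.
PLUGIN_INSTALL_DIR="${PLUGIN_INSTALL_DIR:-/opt/nccl-ucx-plugin}"

# Prepend the plugin's lib directory so NCCL can locate libnccl_net.so.
# The ${VAR:+...} form avoids a trailing ':' when LD_LIBRARY_PATH is empty
# (an empty path entry would make the loader search the current directory).
export LD_LIBRARY_PATH="${PLUGIN_INSTALL_DIR}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Select the UCX plugin, as described above.
export NCCL_PLUGIN_P2P=ucx

# Optional sanity check (an addition for illustration): warn, but do not fail,
# if the plugin library is not where we expect it.
if [ -f "${PLUGIN_INSTALL_DIR}/lib/libnccl_net.so" ]; then
    echo "found libnccl_net.so in ${PLUGIN_INSTALL_DIR}/lib"
else
    echo "warning: libnccl_net.so not found in ${PLUGIN_INSTALL_DIR}/lib" >&2
fi
```

Source the script in the shell (or job script) that launches the run command, so that the exported variables are inherited by the NCCL processes.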

To achieve the best performance, UCX parameters can be tuned to match the server's hardware configuration.

The example below assumes a hardware configuration in which the GPU and the NIC share the same PCIe switch. In such a scenario, GPU Direct RDMA gives the best possible performance.

To use GPU Direct RDMA for all message sizes in UCX, define the following environment variables as shown.

    $ export NCCL_UCX_RNDV_THRESH=0
    $ export NCCL_UCX_RNDV_SCHEME=get_zcopy
    $ <run command>

Note that for servers with multiple NICs available, you need to define the following additional variable.

    $ export NCCL_UCX_TLS=dc,cuda_copy,cuda_ipc
    $ <run command>
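The GPU Direct RDMA settings above can be applied together, with the multi-NIC variable set only when more than one NIC is present. This is a minimal sketch under one assumption: it counts the entries in /sys/class/infiniband (Linux sysfs) as a rough heuristic for the number of RDMA-capable NICs. That heuristic is not part of the plugin, and you may prefer to set NCCL_UCX_TLS manually.

```shell
#!/bin/sh
# Use GPU Direct RDMA for all message sizes (values taken from the text above).
export NCCL_UCX_RNDV_THRESH=0
export NCCL_UCX_RNDV_SCHEME=get_zcopy

# Assumption: each entry under /sys/class/infiniband is one RDMA device.
# If the directory does not exist, the glob stays unexpanded and the
# -e test below keeps the count at 0.
nic_count=0
for dev in /sys/class/infiniband/*; do
    if [ -e "$dev" ]; then
        nic_count=$((nic_count + 1))
    fi
done

# On servers with multiple NICs, define the additional transport list.
if [ "$nic_count" -gt 1 ]; then
    export NCCL_UCX_TLS=dc,cuda_copy,cuda_ipc
fi

echo "RDMA devices detected: $nic_count"
echo "NCCL_UCX_TLS=${NCCL_UCX_TLS:-<unset>}"
```

Printing the resulting values before the run makes it easy to confirm which configuration a job actually used.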

Note: By default, NCCL is built as a static library to enable portability. In such a case, the plugin may detect memory types incorrectly, which can cause program failures. To avoid this, explicitly disable the memory type cache feature in UCX by defining the UCX_MEMTYPE_CACHE environment variable as follows.

    $ export UCX_MEMTYPE_CACHE=n
    $ <run command>

NCCL tests can be used for NCCL-UCX performance benchmarking (visit https://github.com/nvidia/nccl-tests to run the benchmark).

Example: