NCCL-RDMA-SHARP plugins enable RDMA and switch-based collectives (SHARP) with NVIDIA's NCCL library.
This plugin replaces the default NCCL internal inter-node communication with RDMA-based transports. It implements both a point-to-point transport (Net), with IB Verbs (default) and UCX backends, and a collective transport (CollNet), including the SHARP collective transport.
NCCL UCX Plugin
The NCCL UCX plugin (if enabled) replaces the default NCCL verbs-based inter-node communication routines with UCX-based communication routines.
Running NCCL UCX Plugin
To use the NCCL UCX plugin:
For NCCL to detect the network plugin, add plugin_install_dir to the library search path environment variable (LD_LIBRARY_PATH).
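For example, the library search path can be extended as follows; plugin_install_dir is a placeholder for the actual install prefix on your system:

```shell
# Make the plugin's shared library visible to NCCL's runtime lookup.
# <plugin_install_dir> is a placeholder for your actual install prefix.
export LD_LIBRARY_PATH=<plugin_install_dir>/lib:$LD_LIBRARY_PATH
```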
Enable the UCX plugin by setting the NCCL_PLUGIN_P2P=ucx environment variable.
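In a shell, this is a single export:

```shell
# Select the UCX point-to-point transport instead of the default IB verbs.
export NCCL_PLUGIN_P2P=ucx
```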
To achieve the best performance, various UCX parameters can be tuned depending on the server's hardware configuration. The following assumes a hardware configuration in which the GPU and the NIC share the same PCIe switch; in this scenario, GPUDirect RDMA gives the best possible performance.
To use GPUDirect RDMA for all message sizes in UCX, define the appropriate UCX environment variables. Note that servers with multiple NICs require one additional variable.
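A sketch of such a configuration, assuming the NCCL_UCX_TLS and NCCL_UCX_RNDV_THRESH plugin variables and the UCX_MAX_RNDV_RAILS UCX variable are supported by your plugin and UCX versions:

```shell
# Restrict UCX to the accelerated RC transport plus CUDA copy, and set the
# rendezvous threshold to 0 so the zero-copy RDMA protocol is used for
# all message sizes (assumed plugin-specific variables).
export NCCL_UCX_TLS=rc_x,cuda_copy
export NCCL_UCX_RNDV_THRESH=0

# On servers with multiple NICs, limit rendezvous transfers to a single
# rail so traffic stays on the NIC closest to the GPU.
export UCX_MAX_RNDV_RAILS=1
```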
By default, NCCL is built as a static library for portability. In this case, the plugin may detect memory types incorrectly and fail at runtime. To avoid this, explicitly disable UCX's memory type cache by setting the UCX_MEMTYPE_CACHE environment variable.
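The setting is a single export:

```shell
# Disable UCX's memory type cache to avoid wrong memory type detection
# when NCCL is statically linked.
export UCX_MEMTYPE_CACHE=n
```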
NCCL Tests Benchmark Example
NCCL tests can be used for NCCL-UCX performance benchmarking (visit https://github.com/nvidia/nccl-tests to run the benchmark).
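A hypothetical two-node invocation of the all_reduce_perf benchmark from nccl-tests might look like the following; the hostnames, process counts, and binary path are placeholders to adapt to your cluster:

```shell
# Run the nccl-tests allreduce benchmark over the UCX plugin on two nodes.
# -b/-e set the message size range, -f the size multiplier, -g GPUs per process.
mpirun -np 2 -H host1,host2 \
    -x LD_LIBRARY_PATH \
    -x NCCL_PLUGIN_P2P=ucx \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```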
NCCL SHARP Plugin
The following environment variables enable the SHARP aggregation with NCCL when using the plugin.
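At minimum this involves turning on NCCL's CollNet support; NCCL_COLLNET_ENABLE is the standard NCCL variable for this, while any SHARP-specific tuning variables depend on your SHARP release:

```shell
# Enable the CollNet transport so NCCL can offload collectives to SHARP.
export NCCL_COLLNET_ENABLE=1
```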
Mellanox switches support a limited number of streaming aggregation flows (at most 2). On systems with multiple GPUs and multiple HCAs, NCCL creates one aggregation streaming flow (NCCL ring/channel) per HCA rail. The cluster topology must therefore be built so that the leaf-level switches connect to the same HCA rail on every server.
NCCL Test Benchmark Example
The performance of the setup can be sanity-checked with NCCL tests. Please refer to the NCCL tests here: https://github.com/NVIDIA/nccl-tests.
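A hypothetical benchmark run with SHARP aggregation enabled might look like the following; hostnames and paths are placeholders to adapt to your cluster:

```shell
# Run the nccl-tests allreduce benchmark with the CollNet/SHARP transport
# enabled on two nodes.
mpirun -np 2 -H host1,host2 \
    -x LD_LIBRARY_PATH \
    -x NCCL_COLLNET_ENABLE=1 \
    ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```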