Runtime and MPI issues
Errors
Every NCCL call returns a result code; always check that it equals ncclSuccess. If a call fails and returns a value other than ncclSuccess, set NCCL_DEBUG to “WARN” so that NCCL prints an explicit warning message before returning the error.
Errors are grouped into different categories.
ncclUnhandledCudaError and ncclSystemError indicate that a call to an external library failed.
ncclInvalidArgument and ncclInvalidUsage indicate there was a programming error in the application using NCCL.
In either case, refer to the NCCL warning message to understand how to resolve the problem.
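In a long job log, the warning can be hard to spot. The following is a minimal sketch for isolating it, assuming that NCCL's warning lines contain the literal text "NCCL WARN" and that the application logs to stdout or stderr. The name nccl_warnings is a hypothetical helper and ./my_app is a placeholder; neither is part of NCCL.

```shell
# Hypothetical helper (not part of NCCL): keep only warning lines from a log on stdin.
# Assumption: warning lines contain the literal text "NCCL WARN".
nccl_warnings() {
    grep 'NCCL WARN' || true    # "|| true": finding no warnings is not an error
}

# Usage (./my_app is a placeholder for your application):
#   NCCL_DEBUG=WARN ./my_app 2>&1 | nccl_warnings
```

With MPI launchers, the output of all ranks is usually interleaved, so the hostname and process ID at the start of each line help identify which rank reported the problem.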
Memory issues
Stack size
NCCL’s graph search algorithm is highly recursive and, especially on MNNVL
systems where many ranks are reachable via CUDA P2P, may temporarily require
more than 2 MB of thread stack during communicator creation. While the default
Linux stack size limit (8 MB) is known to be sufficient, we’ve seen crashes
if the limit is changed to unlimited. Due to an idiosyncrasy of GNU libc
(see the man page of pthread_create(3)), such a setting results in a
decrease of the stack size of NCCL’s background threads to just 2 MB,
which may not be sufficiently large. Use ulimit -s in bash to print the
current limit; if needed, reset it to 8192 KB using ulimit -s 8192 (one
also needs to ensure that the new setting is propagated to other nodes when
launching a multi-node NCCL job). Starting with version 2.28, NCCL queries the
default stack size for newly launched threads and, if necessary, changes it to
a safe value for the current job. We still recommend that users on affected
systems get the system-wide setting fixed: however well intentioned, an
unlimited stack size is a potentially serious misconfiguration whose negative
effects could extend beyond NCCL jobs.
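The check described above can be sketched as a small launch-script fragment. This is an illustration only: stack_limit_action is a hypothetical helper name, and the 8192 KB value is simply the default Linux limit mentioned above, not a constant defined by NCCL.

```shell
# Hypothetical helper: decide what to do with the value printed by `ulimit -s`.
stack_limit_action() {
    case "$1" in
        unlimited) echo "reset" ;;   # glibc would give new threads only 2 MB of stack
        *)         echo "keep"  ;;   # a finite limit is left untouched
    esac
}

if [ "$(stack_limit_action "$(ulimit -s)")" = "reset" ]; then
    echo "stack limit is unlimited; lowering the soft limit to 8192 KB for this shell"
    ulimit -s 8192
fi
# Launch the NCCL job from this same shell so that it inherits the setting.
```

The change applies only to the current shell and the processes it starts, so on multi-node jobs the same fragment would have to run on every node, for example from the job script.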
Unified Memory (UVM)
Starting with version 2.23, NCCL utilizes CUDA memory pools to optimize graph capturing. This feature relies on UVM being available. While UVM may not be on by default in some virtual machine (VM) setups, it can typically be enabled through a configuration change.
File Descriptors
NCCL uses a considerable number of file descriptors when running at scale, so the limits may need to be raised. For example, a 144-rank job using 16 GIN contexts may require over 32K file descriptors per process.
There is the system-wide limit:
cat /proc/sys/fs/file-max
Default values in the millions are common, and systemd may set it even higher. If, however, the limit has been
artificially lowered (e.g., by a file under /etc/sysctl.d/), then it may need to be increased again:
sysctl -w fs.file-max=2097152
There is also the per-process limit that can be queried using ulimit -n. To raise it permanently, create a
new file under /etc/security/limits.d/ (or edit an existing one), adding a line such as:
* - nofile 131072
This sets both the soft and hard limit for all users to 128K.
Note that raising the system-wide limit or the per-process hard limit needs to be done by the system administrator.
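To see whether the per-process limits are likely to be a problem before launching a job, a check along the following lines can help. It is a sketch under stated assumptions: fd_limit_sufficient is a hypothetical helper, and the required count of 32768 is taken from the 144-rank example above and should be adjusted to the actual job.

```shell
# Hypothetical helper: succeed if a limit (as printed by `ulimit -n`) covers the required count.
# $1 = limit value or "unlimited", $2 = required number of file descriptors
fd_limit_sufficient() {
    [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]
}

required=32768            # assumption: from the 144-rank example; adjust for your job
soft=$(ulimit -S -n)      # current soft limit
hard=$(ulimit -H -n)      # current hard limit

if fd_limit_sufficient "$soft" "$required"; then
    echo "soft limit ($soft) is sufficient"
elif fd_limit_sufficient "$hard" "$required"; then
    echo "soft limit ($soft) is too low; raise it with: ulimit -n $required"
else
    echo "hard limit ($hard) is too low; ask the system administrator to raise it"
fi
```

A user can raise the soft limit up to the hard limit without administrator rights, which is why the two cases are reported separately.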
MPI
Before running NCCL with MPI (e.g. mpirun <my_application>), running a simple MPI test can help verify whether the nodes are able to communicate properly.
You can do this in two steps. First, make sure an application can be launched in parallel:
# Open MPI-based implementations:
mpirun -np <number of processes> -N <processes per node> "hostname"
# MPICH-based implementations:
mpirun -np <number of processes> -ppn <processes per node> "hostname"
Second, make sure MPI can be initialized and run a simple reduction:
wget https://raw.githubusercontent.com/pmodels/mpich/main/examples/cpi.c
mpicc -o cpi cpi.c
# Open MPI syntax shown; with MPICH-based implementations, use -ppn instead of -N
mpirun -np <number of processes> -N <processes per node> ./cpi
Open MPI-based MPIs (e.g. NVIDIA HPC-X)
Many NCCL-based applications are compiled with MPI to utilize its parallel launcher and broadcast mechanisms during startup. In cluster environments, if MPI is not correctly configured, the mpirun command may fail to start applications, hang, or produce errors. The following guidelines will help you troubleshoot common MPI-related startup and connectivity issues. These settings assume an environment in which variables are automatically forwarded to each MPI rank (e.g. a SLURM cluster). If you are unsure, you can explicitly forward the variables through mpirun -x VARIABLE_NAME=<variable_value> instead of export VARIABLE_NAME=<variable_value>.
These settings do not affect NCCL performance, but they may reduce application performance if the application frequently communicates through MPI.
Network interface selection
If the application hangs at startup or crashes with a segmentation fault in libmpi.so, MPI may be selecting an incorrect network interface. You can list active and connected interfaces with:
ip -br link | grep LOWER_UP | grep ' UP '
Usually, only a subset of the interfaces (such as eth*, en*, or ib*) is connected to the network. Loopback (lo) and container-related interfaces are typically not suitable. If your administrator has specified NCCL_SOCKET_IFNAME, use the same interface with MPI by setting:
export OMPI_MCA_btl_tcp_if_include=<interface-name>
Alternatively, to exclude interfaces that are usually not connected to the network (used for loopback or containers):
export OMPI_MCA_btl_tcp_if_exclude=lo,docker0,virbr0
Note: Do not use include and exclude options simultaneously.
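If no interface has been specified by the administrator, the listing and filtering steps above can be combined to produce a shortlist of candidates. The following is a sketch, not a definitive selection rule: candidate_interfaces is a hypothetical helper, and the excluded name patterns (lo, docker*, virbr*) are only the examples used above, so other container or virtual interfaces may still need to be removed by hand.

```shell
# Hypothetical helper: read `ip -br link` output on stdin and print candidate interface names.
# Keeps interfaces whose state is UP and whose flags include LOWER_UP, then drops
# loopback and the container/bridge name patterns used as examples above.
candidate_interfaces() {
    awk '$2 == "UP" && /LOWER_UP/ {
        name = $1
        sub(/@.*/, "", name)                          # drop "@ifN" suffixes seen inside containers
        if (name !~ /^(lo$|docker|virbr)/) print name
    }'
}

if command -v ip >/dev/null 2>&1; then
    ip -br link | candidate_interfaces
fi
```

One of the printed names can then be used as the value of OMPI_MCA_btl_tcp_if_include.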
PMIx Data Store selection
There has been an issue in the past (see https://github.com/open-mpi/ompi/issues/7516) with a PMIx component in Open MPI. This has since been fixed, but it can still occur if your MPI stack is based on an older version. If the application reports an error similar to:
PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c
you can force a different GDS component through export PMIX_MCA_gds=hash.
UCX and HPC-X considerations
HPC-X commonly utilizes the Unified Communication X (UCX) library. If you encounter UCX warnings such as:
UCX WARN network device 'XXX' is not available, please use one or more of: YYY, ...
set the device explicitly:
export UCX_NET_DEVICES=YYY
For UCX error messages like:
UCX ERROR no active messages transport to <no debug data>: Unsupported operation
Error: Failed to resolve UCX endpoint
try simplifying the UCX transport selection:
export UCX_TLS=self,sm,tcp
If necessary, you can disable UCX components and revert to basic TCP communication:
export OMPI_MCA_pml=^ucx
export OMPI_MCA_coll_hcoll_enable=0
export OMPI_MCA_coll=^ucc
export OMPI_MCA_btl=self,tcp