Runtime and MPI issues

Errors

NCCL calls may return a variety of return codes. Always check that the return code is equal to ncclSuccess. If a call fails and returns a different value, setting NCCL_DEBUG to “WARN” will make NCCL print an explicit warning message before returning the error.

Errors are grouped into different categories.

ncclUnhandledCudaError and ncclSystemError indicate that a call to an external library failed.
ncclInvalidArgument and ncclInvalidUsage indicate there was a programming error in the application using NCCL.

In either case, refer to the NCCL warning message to understand how to resolve the problem.
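When a job runs on many ranks, the same warning is often printed once per rank. The following is a minimal sketch for summarizing the warnings in a captured log; the function name nccl_warnings and the log file are hypothetical, and it only assumes that warning lines contain the text "NCCL WARN" as in the examples in this document.

```shell
#!/usr/bin/env bash
# Print the distinct "NCCL WARN" messages found in a log file,
# most frequent first, each preceded by its number of occurrences.
# Usage: nccl_warnings <logfile>
nccl_warnings() {
    local logfile="$1"
    # Keep only the text starting at "NCCL WARN" so that per-rank prefixes
    # do not make otherwise identical warnings look different.
    grep -o 'NCCL WARN.*' "$logfile" | sort | uniq -c | sort -rn
}
```

A log suitable for this function could be captured by running the application with NCCL_DEBUG=WARN and redirecting its output to a file.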

Memory issues

Shared memory

To communicate between processes and even between threads of a process, NCCL creates shared memory segments, traditionally in /dev/shm. The operating system’s limits on these resources may need to be increased accordingly. Please see your system’s documentation for details.

If insufficient shared memory is available, NCCL will fail to initialize. Running with NCCL_DEBUG=WARN will show a message similar to this:

NCCL WARN Error: failed to extend /dev/shm/nccl-03v824 to 4194660 bytes

Docker

In particular, Docker containers default to limited shared and pinned memory resources. When using NCCL inside a container, make sure to raise these limits, for example by adding the following arguments to the docker launch command line:

--shm-size=1g --ulimit memlock=-1
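To verify from inside the container that enough shared memory is actually available, a check along the following lines may help. This is a sketch only: the function names shm_free_kib and shm_has_room are hypothetical, and the threshold to use depends on the job.

```shell
#!/usr/bin/env bash
# Print the free space, in KiB, of the filesystem holding the given path
# (default: /dev/shm). Uses POSIX-format df output (-P) in 1024-byte units (-k).
shm_free_kib() {
    local path="${1:-/dev/shm}"
    # Line 2 of the output describes the filesystem; column 4 is "Available".
    df -Pk "$path" | awk 'NR == 2 { print $4 }'
}

# Succeed if at least <min_kib> KiB are free at <path> (default: /dev/shm).
# Usage: shm_has_room <min_kib> [path]
shm_has_room() {
    local min_kib="$1"
    local path="${2:-/dev/shm}"
    local free_kib
    free_kib="$(shm_free_kib "$path")"
    [ -n "$free_kib" ] && [ "$free_kib" -ge "$min_kib" ]
}
```

For example, shm_has_room 1048576 succeeds only if at least 1 GiB is free in /dev/shm, which corresponds to the --shm-size=1g value shown above.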

Systemd

When running jobs using mpirun or SLURM, systemd may remove files in shared memory when it detects that the corresponding user is not logged in, in an attempt to clean up old temporary files. This can cause NCCL to crash during init with an error like:

NCCL WARN unlink shared memory /dev/shm/nccl-d5rTd0 failed, error: No such file or directory

Given that mpirun and SLURM jobs can run on a node without the user being seen as logged in by systemd, system administrators need to disable that clean-up mechanism; the clean-up can be performed by SLURM epilogue scripts instead. To do this, set the following line in /etc/systemd/logind.conf:

RemoveIPC=no

Once updated, restart the daemon with:

sudo systemctl restart systemd-logind
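To check how a node is currently configured, the setting can be read back from the configuration file. The sketch below uses a hypothetical function name, logind_remove_ipc, and only inspects the single file passed to it; it does not look at drop-in directories such as /etc/systemd/logind.conf.d/, which may override the value.

```shell
#!/usr/bin/env bash
# Print the RemoveIPC value set in a logind configuration file, or "unset"
# if the file is missing or has no uncommented RemoveIPC= line.
# Usage: logind_remove_ipc [conf_file]
logind_remove_ipc() {
    local conf="${1:-/etc/systemd/logind.conf}"
    local value
    # Extract the value of every uncommented assignment and keep the last one.
    value="$(sed -n 's/^[[:space:]]*RemoveIPC[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$conf" 2>/dev/null | tail -n 1)"
    echo "${value:-unset}"
}
```

If the function prints anything other than no, the clean-up mechanism described above may still be active on that node.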

cuMem host allocations

Starting with version 2.23, NCCL supports an alternative shared memory mechanism using cuMem host allocations. From NCCL 2.24, if CUDA driver >= 12.6 and CUDA runtime >= 12.2, it is enabled by default in favor of /dev/shm.

However, cuMem host allocations rely on correctly configured and working NUMA support, which may not be available in some VM and containerization scenarios. In particular, Docker by default disables NUMA support (it can be enabled by invoking Docker with --cap-add SYS_NICE). From version 2.26.5, NCCL checks if cuMem host allocations work and, if needed, automatically falls back to the /dev/shm code. In prior versions, the same outcome can be achieved by manually specifying NCCL_CUMEM_HOST_ENABLE=0. We still recommend configuring the underlying system to ensure that cuMem host allocations work, as they provide improved reliability during communicator aborts.

cuMem host allocations may fail on systems without CUDA P2P connectivity if a CUDA driver version prior to 13.0 is used. Furthermore, the CUDA Forward Compatibility feature can affect NCCL’s ability to accurately determine the current driver version, resulting in cuMem host allocations being enabled on older drivers than intended. We continue to investigate additional mechanisms to detect such circumstances; in the meantime, use NCCL_CUMEM_HOST_ENABLE=0 to deactivate this feature if it causes issues.

Stack size

NCCL’s graph search algorithm is highly recursive and, especially on MNNVL systems where many ranks are reachable via CUDA P2P, may temporarily require more than 2 MB of thread stack during communicator creation. While the default Linux stack size limit (8 MB) is known to be sufficient, we’ve seen crashes if the limit is changed to unlimited. Due to an idiosyncrasy of GNU libc (see the man page of pthread_create(3)), such a setting results in a decrease of the stack size of NCCL’s background threads to just 2 MB, which may not be sufficiently large.

Use ulimit -s in bash to print the current limit; if needed, reset it to 8192 KB using ulimit -s 8192. When launching a multi-node NCCL job, also ensure that the new setting is propagated to the other nodes.

Starting with version 2.28, NCCL queries the default stack size for newly launched threads and, if necessary, changes it to a safe value for the current job. We still recommend that users on affected systems attempt to get the system-wide setting fixed: however well intentioned, it is a potentially serious misconfiguration that could have negative effects extending beyond NCCL jobs.
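The check described above can be sketched as a small shell function. The name reset_unlimited_stack is hypothetical. As a deliberate choice in this sketch, it passes -S to ulimit so that only the soft limit is changed, leaving the hard limit untouched.

```shell
#!/usr/bin/env bash
# If the soft stack size limit is "unlimited", reset it to 8192 KB.
# Any other value is left as it is.
reset_unlimited_stack() {
    if [ "$(ulimit -S -s)" = "unlimited" ]; then
        # Lowering a soft limit never requires special privileges.
        ulimit -S -s 8192
    fi
}
```

The new limit applies to the current shell and to processes started from it afterwards. For a multi-node job, the function would therefore have to run on every node, for instance from a wrapper script that is started by the launcher and then executes the application.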

Unified Memory (UVM)

Starting with version 2.23, NCCL utilizes CUDA memory pools to optimize graph capturing. This feature relies on UVM being available. While UVM may not be on by default in some virtual machine (VM) setups, it can typically be enabled through a configuration change.

File Descriptors

NCCL uses a considerable number of file descriptors when running at scale, so the limits may need to be raised. E.g., a 144-rank job using 16 GIN contexts may require over 32K file descriptors per process.

First, there is a system-wide limit, which can be queried with:

cat /proc/sys/fs/file-max

Default values in the millions are common, and systemd may set it even higher. If, however, the limit has been artificially lowered (e.g., by a file under /etc/sysctl.d/), then it may need to be increased again:

sysctl -w fs.file-max=2097152

There is also the per-process limit that can be queried using ulimit -n. To raise it permanently, create a new file under /etc/security/limits.d/ (or edit an existing one), adding a line such as:

* - nofile 131072

This sets both the soft and hard limit for all users to 128K.

Note that raising the system-wide limit or the per-process hard limit needs to be done by the system administrator.
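A job script can check the per-process limit before starting the application. The following is a sketch under stated assumptions: the function names fd_limit_sufficient and raise_fd_soft_limit are hypothetical, and the number of descriptors a given job needs has to be estimated by the user.

```shell
#!/usr/bin/env bash
# Succeed if the soft per-process file descriptor limit is at least <needed>.
# The limit defaults to the current shell's value and can be passed
# explicitly as a second argument.
# Usage: fd_limit_sufficient <needed> [limit]
fd_limit_sufficient() {
    local needed="$1"
    local limit="${2:-$(ulimit -S -n)}"
    [ "$limit" = "unlimited" ] || [ "$limit" -ge "$needed" ]
}

# Raise the soft limit to <needed> if it is currently lower.
# This fails if <needed> exceeds the hard limit.
# Usage: raise_fd_soft_limit <needed>
raise_fd_soft_limit() {
    local needed="$1"
    fd_limit_sufficient "$needed" || ulimit -S -n "$needed"
}
```

For the 144-rank example above, raise_fd_soft_limit 32768 would raise the soft limit for the current shell and its children, provided the hard limit allows it.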

MPI

Before running NCCL with MPI (e.g. mpirun <my_application>), running a simple MPI test can help verify whether the nodes are able to communicate properly.

You can do this in two steps. First, make sure an application can be launched in parallel:

# Open MPI-based implementations:
mpirun -np <number of processes> -N <processes per node> "hostname"

# MPICH-based implementations:
mpirun -np <number of processes> -ppn <processes per node> "hostname"

Second, make sure MPI can be initialized and run a simple reduction:

wget https://raw.githubusercontent.com/pmodels/mpich/main/examples/cpi.c
mpicc -o cpi cpi.c
mpirun -np <number of processes> -N <processes per node> ./cpi

Open MPI-based MPIs (e.g. NVIDIA HPC-X)

Many NCCL-based applications are compiled with MPI to utilize its parallel launcher and broadcast mechanisms during startup. In cluster environments, if MPI is not correctly configured, the mpirun command may fail to start applications, hang, or produce errors. The following guidelines will help you troubleshoot common MPI-related startup and connectivity issues. These settings assume an environment in which variables are automatically forwarded to each MPI rank (e.g. a SLURM cluster). If you are unsure, you can explicitly forward the variables through mpirun -x VARIABLE_NAME=<variable_value> instead of export VARIABLE_NAME=<variable_value>.

These settings do not affect NCCL performance, but if the application also uses MPI frequently for communication, overall application performance may be affected.

Network interface selection

If the application hangs at startup or crashes with a segmentation fault in libmpi.so, MPI may be selecting an incorrect network interface. You can list active and connected interfaces with:

ip -br link | grep LOWER_UP | grep ' UP '

Usually, only a subset of interfaces (such as eth*, en*, or ib*) are connected to the network. Loopback (lo) and container-related interfaces are typically not suitable. If your administrator has specified NCCL_SOCKET_IFNAME, use the same interface with MPI by setting:

export OMPI_MCA_btl_tcp_if_include=<interface-name>

Alternatively, to exclude interfaces that are usually not connected to the network (used for loopback or containers):

export OMPI_MCA_btl_tcp_if_exclude=lo,docker0,virbr0

Note: Do not use include and exclude options simultaneously.
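If no interface has been specified by the administrator, a candidate list can be derived from the ip output. The sketch below uses a hypothetical function name, candidate_ifaces, and reads ip -br link output on standard input. The name patterns it skips (lo, docker*, virbr*) generalize the exclude example above and are an assumption that may not match every system.

```shell
#!/usr/bin/env bash
# Read `ip -br link` output on stdin and print a comma-separated list of
# interfaces whose state is UP and that report a carrier (LOWER_UP),
# skipping loopback and common container or virtual bridge names.
candidate_ifaces() {
    awk '
        $2 == "UP" && /LOWER_UP/ {
            name = $1
            sub(/@.*/, "", name)    # "eth0@if12" -> "eth0"
            if (name ~ /^(lo|docker[0-9]*|virbr[0-9]*)$/) next
            list = list (list == "" ? "" : ",") name
        }
        END { if (list != "") print list }
    '
}
```

The result of ip -br link | candidate_ifaces can be reviewed and then used as the value of OMPI_MCA_btl_tcp_if_include. The list should be inspected by hand first, since an interface being up does not guarantee that it is reachable from the other nodes.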

PMIx Data Store selection

There has been an issue in the past (see https://github.com/open-mpi/ompi/issues/7516) with a PMIx component in Open MPI. This has since been fixed, but it can still occur if your MPI stack is based on an older version. If the application reports an error similar to:

PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c

you can force a different GDS component through export PMIX_MCA_gds=hash.

UCX and HPC-X considerations

HPC-X commonly utilizes the Unified Communication X (UCX) library. If you encounter UCX warnings such as:

UCX  WARN  network device 'XXX' is not available, please use one or more of: YYY, ...

set the device explicitly:

export UCX_NET_DEVICES=YYY

For UCX error messages like:

UCX  ERROR   no active messages transport to <no debug data>: Unsupported operation
Error: Failed to resolve UCX endpoint

try simplifying the UCX transport selection:

export UCX_TLS=self,sm,tcp

If necessary, you can disable UCX components and revert to basic TCP communication:

export OMPI_MCA_pml=^ucx
export OMPI_MCA_coll_hcoll_enable=0
export OMPI_MCA_coll=^ucc
export OMPI_MCA_btl=self,tcp