Networking Troubleshooting
Networking issues
IP network interfaces
NCCL auto-detects which network interfaces to use for inter-node communication. If some interfaces are in the UP state but are not able to communicate between nodes, NCCL may try to use them anyway and therefore fail during the init functions or even hang.
For information about how to specify which interfaces to use, see the
Environment Variables chapter, particularly NCCL_SOCKET_IFNAME.
IP ports
NCCL opens TCP ports to connect processes together and exchange connection
information. To restrict the range of ports used by NCCL, one can set the
net.ipv4.ip_local_port_range property of the Linux kernel.
This example restricts NCCL ports to 50000-51000:
echo 50000 51000 > /proc/sys/net/ipv4/ip_local_port_range
Or, to make this permanent, add a line to /etc/sysctl.conf:
echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf
Restricting the port range can be useful to open a corresponding range in the firewall, for example on Google Cloud:
gcloud compute --project=myproject firewall-rules create ncclnet0-ingress --direction=INGRESS --priority=1 --network=ncclnet --action=ALLOW --rules=tcp:50000-51000,22,1024-1039 --destination-ranges=0.0.0.0/0 --target-tags=ncclnet
NIC-level diagnostics
Before running NCCL over InfiniBand, it is often useful to run low-level
network tests to confirm that the ports are up and operating in the expected
configuration (link layer, link rate, and active state). The ibstat and
ibstatus tools can be used for a quick sanity check:
# Per-port summary view.
ibstatus
# Detailed per-HCA and per-port view.
ibstat
Look for ports reporting a healthy link, for example:
Port state is
Active(notDown,Init, orArmed)Physical state indicates link up (often
LinkUp)The expected link layer is in use (for example,
InfiniBandversusEthernetfor RoCE setups) and is consistent across all NICs used for transportThe reported
Ratematches what you expect for your fabric, with no unexpected downshift
Bandwidth testing with ib_write_bw
If no obvious network issues are present, ib_write_bw can be used to test
bandwidth between nodes.
Choose two compute nodes to act as server and client. On the server node run:
ib_write_bw -d <device> -a
On the client node run:
ib_write_bw -d <device> <server_hostname_or_ip> -a
Replace <device> with the HCA device name (for example, mlx5_0) shown
in ibstat.
When troubleshooting GPUDirect RDMA, it is often useful to compare host-memory
results with GPU-memory results. Depending on how perftest was built, the
GPU-memory options may include commands such as:
# GPU memory path
ib_write_bw -d mlx5_0 --use_cuda=<gpu_id> <server_hostname_or_ip> -a
# DMA-BUF path, when supported by the kernel, driver, and perftest build
ib_write_bw -d mlx5_0 --use_cuda=<gpu_id> --use_cuda_dmabuf <server_hostname_or_ip> -a
Platforms such as GB300 may also expose additional data-direct modes in certain
perftest builds. Check ib_write_bw --help on your system for the exact
GPU-memory options supported by that build before comparing results.
Latency testing with ib_write_lat
Bandwidth can look healthy while tail latency is still poor. If the issue looks like stalls, long tails, or intermittent slowdowns, also measure point-to-point latency:
# Server node
ib_write_lat -d <device>
# Client node
ib_write_lat -d <device> <server_hostname_or_ip>
High or unstable latency, especially if bandwidth also fluctuates, can indicate congestion, retransmissions, or a degraded link.
RDMA protocol statistics
rdma statistic can help identify retransmissions or packet-loss-like
symptoms at the RDMA layer:
rdma statistic
Look for counters such as rnr_nak_retry_err, packet_seq_err,
implied_nak_seq_err, or local_ack_timeout_err increasing while a test
is running. Non-zero or growing values here often indicate loss, retries, or
other transport-level problems.
Link health with mlxlink
If the issue looks like degraded PCIe, C2C, or physical port health, check the
link directly with mlxlink:
mlxlink -d <mst_device>
Use this to confirm the expected link width, link speed, error state, and overall port health. It is especially useful when bandwidth or latency has degraded without an obvious software configuration problem.
The next sections cover network troubleshooting steps specific to InfiniBand and RoCE fabrics.
InfiniBand
Subnet Manager
InfiniBand fabrics require a Subnet Manager (SM) to be running. Check SM status with:
sudo sminfo
If sminfo fails or shows no SM, ensure the SM is running on at least one
node (for example, the opensm service).
Port error counters
Use perfquery to check for link errors, symbol errors, or packet discards
that may indicate cable or switch issues:
sudo perfquery -x <lid>
Non-zero values in error counters, especially SymbolErrorCounter,
LinkErrorRecoveryCounter, and LinkDownedCounter, may indicate hardware
or cabling problems.
Connectivity testing
If ibstat and ibstatus do not indicate issues but a connectivity
problem is suspected, ibping can be used to test connectivity between
nodes. ibping requires running a server on the remote node first:
# On the remote node, start the ibping server.
sudo ibping -S
# On the local node, ping the server using its LID.
sudo ibping <lid from ibstat>
Comprehensive diagnostics
For more thorough fabric-wide diagnostics, which usually require SM access, use:
sudo ibdiagnet
This checks topology, routing, and reports errors across the entire fabric.
For example, if NCCL works on some rails but repeatedly fails or slows down on
one specific rail, run ibdiagnet and inspect the generated report for
missing links, bad ports, or routing inconsistencies affecting the switch ports
connected to that rail before retrying the NCCL job.
Troubleshooting NCCL errors
A common issue seen with InfiniBand is the library not being able to register sufficient pinned memory. In such cases you may see an error like:
NCCL WARN Call to ibv_create_qp failed
or:
NCCL WARN Call to ibv_reg_mr failed
The solution is to remove the user limits on registering pinned memory. This can be done by adding these lines:
* soft memlock unlimited
* hard memlock unlimited
to the /etc/security/limits.conf configuration file or equivalent on your
Linux distribution. Note that changes to limits.conf typically require the
user to log out and back in, reboot, or, for Slurm clusters, ensure that jobs
are launched with the updated limits. Verify the new limits with ulimit -l.
RDMA over Converged Ethernet (RoCE)
RoCE fabrics require different diagnostic tools than InfiniBand to diagnose inter-node problems.
Port error counters
You can use ethtool to check port counters:
ethtool -S <nic_name>
Check for errors, drops, pause frames, or PFC-related counters. Non-zero values in these counters may indicate hardware problems, cabling issues, or lossless fabric misconfiguration.
Congestion indicators
RoCE performance problems are often tied to congestion control or lossless
fabric settings rather than outright link failure. In addition to ethtool,
inspect the switch and NIC counters that your environment exposes for PFC, ECN,
CNP, or queue drops. If these counters increase while latency tails grow or
throughput becomes unstable, investigate congestion control and lossless-fabric
configuration before changing NCCL.
Connectivity testing
If port-level checks do not indicate any issues but a connectivity problem is
still suspected, rping can be used to test connectivity between nodes. Note
that rping requires a server to be running on the remote node first:
# On the remote node, start the rping server.
rping -s -a <ip_address_of_the_server_nic> -V -C 10
# On the local node, ping the server using its IP.
rping -c -a <ip_address_of_the_server_nic> -S <ip_address_of_the_client_nic> -V -C 10
Troubleshooting NCCL errors
A common issue seen with RoCE is the incorrect GID index being selected for the RoCE v2 NICs. This can result in the following error:
NCCL WARN Call to ibv_modify_qp failed with error Invalid argument
With NCCL 2.21 and later the GID index is dynamically selected, but with prior versions the user would need to run:
show_gids
and then set NCCL_IB_GID_INDEX to the GID index for the RoCE v2 GID. With
NCCL 2.21 and later releases, this environment variable should not be set.
Users may also need to set NCCL_IB_TC when using RoCE-based fabrics. Refer
to your vendor’s documentation for the values this should be set to.