Networking Troubleshooting

Networking issues

IP network interfaces

NCCL auto-detects which network interfaces to use for inter-node communication. If some interfaces are in the UP state but are not able to communicate between nodes, NCCL may try to use them anyway and therefore fail during the init functions or even hang.

For information about how to specify which interfaces to use, see the Environment Variables chapter, particularly NCCL_SOCKET_IFNAME.
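As a minimal sketch of that approach, the commands below list the interfaces that are currently UP and then pin NCCL to one of them. The interface name eth0 is a placeholder chosen for illustration, and list_up_ifaces is a hypothetical helper name; substitute an interface that you have confirmed can reach the other nodes.

```shell
# Print the names of interfaces the kernel reports as UP. These are the
# candidates NCCL may pick up on its own.
list_up_ifaces() {
    ip -o link show up | awk -F': ' '{ print $2 }'
}

# Restrict NCCL to one interface. A leading "=" requests an exact match;
# "eth0" is a placeholder, not a recommendation.
export NCCL_SOCKET_IFNAME="=eth0"

# Alternative: exclude interfaces by prefix instead of naming one.
# export NCCL_SOCKET_IFNAME="^docker,lo"

# Ask NCCL to log its interface selection during init.
export NCCL_DEBUG=INFO
```

Set these variables identically on every node before launching the job, then check the NCCL log to confirm which interface was used.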

IP ports

NCCL opens TCP ports to connect processes together and exchange connection information. To restrict the range of ports used by NCCL, one can set the net.ipv4.ip_local_port_range property of the Linux kernel.

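This example restricts the range to 50000-51000. The setting is system-wide, so it affects every process on the node, not only NCCL, and changing it requires root: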

echo 50000 51000 > /proc/sys/net/ipv4/ip_local_port_range

Or, to make this permanent, add a line to /etc/sysctl.conf:

echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf
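To confirm that the setting took effect, the current range can be read back from the same /proc file. The sketch below does that and reports how many ports the range contains; port_range_size is a hypothetical helper name introduced here for illustration. After editing /etc/sysctl.conf, running sysctl -p as root reloads that file without a reboot.

```shell
# Count the ports in a "low high" range string, the format used by
# /proc/sys/net/ipv4/ip_local_port_range. (Hypothetical helper.)
port_range_size() {
    # Intentional word splitting: $1 holds two whitespace-separated numbers.
    set -- $1
    echo $(( $2 - $1 + 1 ))
}

range_file=/proc/sys/net/ipv4/ip_local_port_range
if [ -r "$range_file" ]; then
    current=$(cat "$range_file")
    echo "local port range: $current ($(port_range_size "$current") ports)"
fi
```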

Restricting the port range makes it possible to open only the matching range in the firewall, for example on Google Cloud:

gcloud compute --project=myproject firewall-rules create ncclnet0-ingress --direction=INGRESS --priority=1 --network=ncclnet --action=ALLOW --rules=tcp:50000-51000,22,1024-1039 --destination-ranges=0.0.0.0/0 --target-tags=ncclnet

NIC-level diagnostics

Before running NCCL over InfiniBand, it is often useful to run low-level network tests to confirm that the ports are up and operating in the expected configuration (link layer, link rate, and active state). The ibstat and ibstatus tools can be used for a quick sanity check:

# Per-port summary view.
ibstatus

# Detailed per-HCA and per-port view.
ibstat

Look for ports reporting a healthy link, for example:

  • Port state is Active (not Down, Init, or Armed)

  • Physical state indicates link up (often LinkUp)

  • The expected link layer is in use (for example, InfiniBand versus Ethernet for RoCE setups) and is consistent across all NICs used for transport

  • The reported Rate matches what you expect for your fabric, with no unexpected downshift
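The first two checks in the list above can be scripted as a rough filter over ibstat output. This is a sketch that assumes the usual ibstat text layout (a CA '<name>' line followed by Port N: blocks containing State: and Physical state: fields). check_ports is a hypothetical helper name, and the Rate and link layer still need to be compared by eye against what your fabric should provide.

```shell
# Read ibstat-style text on stdin and report every port that is not
# Active/LinkUp. Exits non-zero if any such port is found.
check_ports() {
    awk '
        /^CA / { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        /^[ \t]*Port [0-9]+:/ { port = $2; sub(/:/, "", port) }
        /^[ \t]*State:/ { state = $2 }
        /^[ \t]*Physical state:/ {
            phys = $3
            if (state != "Active" || phys != "LinkUp") {
                printf "%s port %s: state=%s physical=%s\n", ca, port, state, phys
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Usage:
#   ibstat | check_ports && echo "all ports Active/LinkUp"
```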

Bandwidth testing with ib_write_bw

If no obvious network issues are present, ib_write_bw can be used to test bandwidth between nodes.

Choose two compute nodes to act as server and client. On the server node run:

ib_write_bw -d <device> -a

On the client node run:

ib_write_bw -d <device> <server_hostname_or_ip> -a

Replace <device> with the HCA device name (for example, mlx5_0) shown in ibstat.

When troubleshooting GPUDirect RDMA, it is often useful to compare host-memory results with GPU-memory results. Depending on how perftest was built, the GPU-memory options may include commands such as:

# GPU memory path (client side shown)
ib_write_bw -d <device> --use_cuda=<gpu_id> <server_hostname_or_ip> -a

# DMA-BUF path (client side shown), when supported by the kernel, driver, and perftest build
ib_write_bw -d <device> --use_cuda=<gpu_id> --use_cuda_dmabuf <server_hostname_or_ip> -a

Platforms such as GB300 may also expose additional data-direct modes in certain perftest builds. Check ib_write_bw --help on your system for the exact GPU-memory options supported by that build before comparing results.
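When comparing the two runs, it can help to express the GPU-memory result as a percentage of the host-memory result for the same message size. The helper below is a sketch of that arithmetic only. bw_pct is a hypothetical name, and the numbers in the usage comment are placeholders, not measurements.

```shell
# Express one bandwidth figure as a percentage of another.
# $1 = reference (host-memory) bandwidth, $2 = GPU-memory bandwidth,
# both in the same unit as reported by ib_write_bw.
bw_pct() {
    awk -v ref="$1" -v val="$2" 'BEGIN {
        if (ref + 0 <= 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * val / ref
    }'
}

# Usage with placeholder values:
#   bw_pct 200 150
```

A GPU-memory figure far below the host-memory figure on the same link is a hint to look at the GPUDirect RDMA path before looking at the fabric.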

Latency testing with ib_write_lat

Bandwidth can look healthy while tail latency is still poor. If the issue looks like stalls, long tails, or intermittent slowdowns, also measure point-to-point latency:

# Server node
ib_write_lat -d <device>

# Client node
ib_write_lat -d <device> <server_hostname_or_ip>

High or unstable latency, especially if bandwidth also fluctuates, can indicate congestion, retransmissions, or a degraded link.

RDMA protocol statistics

The rdma statistic command can help identify retransmissions or packet-loss-like symptoms at the RDMA layer:

rdma statistic

Look for counters such as rnr_nak_retry_err, packet_seq_err, implied_nak_seq_err, or local_ack_timeout_err increasing while a test is running. Non-zero or growing values here often indicate loss, retries, or other transport-level problems.
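Because what matters is whether counters grow during a test, one approach is to snapshot them before and after and print only the differences. The sketch below reads the per-port hw_counters directory in sysfs, which is an assumption about your driver (mlx5 devices expose it). snapshot_hw_counters and diff_counters are hypothetical helper names, and mlx5_0 is a placeholder device.

```shell
# Print "name value" for every hardware counter of one port.
# $1 = device (for example mlx5_0), $2 = port number (default 1).
snapshot_hw_counters() {
    dir="/sys/class/infiniband/$1/ports/${2:-1}/hw_counters"
    for f in "$dir"/*; do
        if [ -r "$f" ]; then
            printf '%s %s\n' "${f##*/}" "$(cat "$f")"
        fi
    done
}

# Compare two snapshots and print the counters that grew.
# $1 = snapshot taken before the test, $2 = snapshot taken after.
diff_counters() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && ($2 + 0) > (before[$1] + 0) {
             printf "%s +%d\n", $1, $2 - before[$1]
         }' "$1" "$2"
}

# Usage:
#   snapshot_hw_counters mlx5_0 1 > before.txt
#   ... run ib_write_bw or the NCCL job ...
#   snapshot_hw_counters mlx5_0 1 > after.txt
#   diff_counters before.txt after.txt
```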

InfiniBand

Subnet Manager

InfiniBand fabrics require a Subnet Manager (SM) to be running. Check SM status with:

sudo sminfo

If sminfo fails or shows no SM, ensure the SM is running on at least one node (for example, the opensm service).

Port error counters

Use perfquery to check for link errors, symbol errors, or packet discards that may indicate cable or switch issues:

sudo perfquery -x <lid>

Non-zero values in error counters, especially SymbolErrorCounter, LinkErrorRecoveryCounter, and LinkDownedCounter, may indicate hardware or cabling problems.
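To pick the three counters named above out of a long perfquery listing, a small filter can help. This sketch assumes the dotted Name:......value layout that perfquery normally prints; nonzero_errors is a hypothetical helper name.

```shell
# Read perfquery-style text on stdin and print the listed error counters
# whose value is greater than zero, as name=value.
nonzero_errors() {
    awk -F'[:.]+' '
        /^(SymbolErrorCounter|LinkErrorRecoveryCounter|LinkDownedCounter):/ && (($2 + 0) > 0) {
            print $1 "=" $2
        }
    '
}

# Usage:
#   sudo perfquery -x <lid> | nonzero_errors
```

Empty output means none of the three counters is above zero. Extend the name list in the pattern if you want to watch other counters as well.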

Connectivity testing

If ibstat and ibstatus do not indicate issues but a connectivity problem is suspected, ibping can be used to test connectivity between nodes. ibping requires running a server on the remote node first:

# On the remote node, start the ibping server.
sudo ibping -S

# On the local node, ping the server using the server's LID (the "Base lid" value that ibstat reports on the remote node).
sudo ibping <server_lid>

Comprehensive diagnostics

For more thorough fabric-wide diagnostics, which usually require SM access, use:

sudo ibdiagnet

This checks topology and routing, and reports errors across the entire fabric.

For example, if NCCL works on some rails but repeatedly fails or slows down on one specific rail, run ibdiagnet and inspect the generated report for missing links, bad ports, or routing inconsistencies affecting the switch ports connected to that rail before retrying the NCCL job.

Troubleshooting NCCL errors

A common issue seen with InfiniBand is the library not being able to register sufficient pinned memory. In such cases you may see an error like:

NCCL WARN Call to ibv_create_qp failed

or:

NCCL WARN Call to ibv_reg_mr failed

The solution is to remove the user limits on registering pinned memory. This can be done by adding these lines:

* soft memlock unlimited
* hard memlock unlimited

to the /etc/security/limits.conf configuration file or equivalent on your Linux distribution. Changes to limits.conf apply only to new sessions, so log out and back in (or reboot) before retrying. On Slurm clusters, also make sure that jobs are launched with the updated limits. Verify the new limits with ulimit -l.
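The limit that matters is the one seen by the process that loads NCCL, so it is worth checking from inside the same launch environment (for example, inside a job step on a Slurm cluster). The following is a minimal sketch; memlock_is_unlimited is a hypothetical helper name.

```shell
# Succeeds only when the given `ulimit -l` value is "unlimited".
memlock_is_unlimited() {
    [ "$1" = "unlimited" ]
}

limit=$(ulimit -l)
if memlock_is_unlimited "$limit"; then
    echo "memlock limit: unlimited"
else
    echo "memlock limit is $limit KiB; NCCL may fail to register memory" >&2
fi
```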

RDMA over Converged Ethernet (RoCE)

Diagnosing inter-node problems on RoCE fabrics requires different tools than on InfiniBand.

Port error counters

You can use ethtool to check port counters:

ethtool -S <nic_name>

Check for errors, drops, pause frames, or PFC-related counters. Non-zero values in these counters may indicate hardware problems, cabling issues, or lossless fabric misconfiguration.
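Since ethtool -S can print hundreds of counters, a filter that keeps only the non-zero ones with suspicious names makes the output easier to scan. Counter names are vendor-specific, so the pattern below is an assumption to adapt, and suspicious_counters is a hypothetical helper name.

```shell
# Read `ethtool -S` output on stdin and print non-zero counters whose
# name mentions errors, drops, discards, pause frames, or PFC.
suspicious_counters() {
    awk -F': *' '
        tolower($1) ~ /err|drop|discard|pause|pfc/ && (($2 + 0) > 0) {
            name = $1
            gsub(/^[ \t]+/, "", name)
            print name ": " $2
        }
    '
}

# Usage:
#   ethtool -S <nic_name> | suspicious_counters
```

Running the filter before and after a test shows which of these counters are actually increasing, which matters more than their absolute values.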

Congestion indicators

RoCE performance problems are often tied to congestion control or lossless fabric settings rather than outright link failure. In addition to ethtool, inspect the switch and NIC counters that your environment exposes for PFC, ECN, CNP, or queue drops. If these counters increase while latency tails grow or throughput becomes unstable, investigate congestion control and lossless-fabric configuration before changing NCCL.

Connectivity testing

If port-level checks do not indicate any issues but a connectivity problem is still suspected, rping can be used to test connectivity between nodes. Note that rping requires a server to be running on the remote node first:

# On the remote node, start the rping server.
rping -s -a <ip_address_of_the_server_nic> -V -C 10

# On the local node, ping the server using its IP. -I binds the client to the given source address.
rping -c -a <ip_address_of_the_server_nic> -I <ip_address_of_the_client_nic> -V -C 10

Troubleshooting NCCL errors

A common issue seen with RoCE is an incorrect GID index being selected for the RoCE v2 NICs. This can result in the following error:

NCCL WARN Call to ibv_modify_qp failed with error Invalid argument

With NCCL 2.21 and later the GID index is dynamically selected, but with prior versions the user would need to run:

show_gids

and then set NCCL_IB_GID_INDEX to the GID index for the RoCE v2 GID. With NCCL 2.21 and later releases, this environment variable should not be set.
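For releases before 2.21 only, the manual lookup can be sketched as a filter over show_gids output. This assumes the column order DEV, PORT, INDEX, GID, IPv4, VER, DEV and picks the first RoCE v2 entry that has an IPv4 address, which may not be the right choice on every system. rocev2_gid_index is a hypothetical helper name and mlx5_0 is a placeholder device.

```shell
# Read show_gids-style rows on stdin and print the GID index of the first
# RoCE v2 entry with an IPv4 address.
# $1 = device name, $2 = port number (default 1).
rocev2_gid_index() {
    awk -v dev="$1" -v port="${2:-1}" '
        $1 == dev && $2 == port && $5 ~ /^[0-9]+\.[0-9]+\./ && $6 == "v2" {
            print $3
            exit
        }
    '
}

# Usage on a system running NCCL older than 2.21:
#   idx=$(show_gids | rocev2_gid_index mlx5_0 1)
#   [ -n "$idx" ] && export NCCL_IB_GID_INDEX="$idx"
```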

Users may also need to set NCCL_IB_TC when using RoCE-based fabrics. Refer to your vendor’s documentation for the values this should be set to.