##########################
Networking Troubleshooting
##########################

.. highlight:: shell

*****************
Networking issues
*****************

IP network interfaces
=====================

NCCL auto-detects which network interfaces to use for inter-node communication. If some interfaces are in the UP state but are not able to communicate between nodes, NCCL may try to use them anyway and therefore fail during the init functions or even hang.

For information about how to specify which interfaces to use, see the Environment Variables chapter, particularly ``NCCL_SOCKET_IFNAME``.

IP ports
========

NCCL opens TCP ports to connect processes together and exchange connection information. To restrict the range of ports used by NCCL, set the ``net.ipv4.ip_local_port_range`` property of the Linux kernel.

This example restricts NCCL ports to ``50000-51000``:

.. code:: shell

  echo 50000 51000 > /proc/sys/net/ipv4/ip_local_port_range

Or, to make this permanent, add a line to ``/etc/sysctl.conf``:

.. code:: shell

  echo "net.ipv4.ip_local_port_range = 50000 51000" >> /etc/sysctl.conf

Restricting the port range makes it possible to open a corresponding range in the firewall, for example on Google Cloud:

.. code:: shell

  gcloud compute --project=myproject firewall-rules create ncclnet0-ingress --direction=INGRESS --priority=1 --network=ncclnet --action=ALLOW --rules=tcp:50000-51000,22,1024-1039 --destination-ranges=0.0.0.0/0 --target-tags=ncclnet

NIC-level diagnostics
=====================

Before running NCCL over InfiniBand, it is often useful to run low-level network tests to confirm that the ports are up and operating in the expected configuration (link layer, link rate, and active state). The ``ibstat`` and ``ibstatus`` tools can be used for a quick sanity check:

.. code:: shell

  # Per-port summary view.
  ibstatus

  # Detailed per-HCA and per-port view.
  ibstat

Look for ports reporting a healthy link, for example:

* Port state is ``Active`` (not ``Down``, ``Init``, or ``Armed``)
* Physical state indicates link up (often ``LinkUp``)
* The expected link layer is in use (for example, ``InfiniBand`` versus ``Ethernet`` for RoCE setups) and is consistent across all NICs used for transport
* The reported ``Rate`` matches what you expect for your fabric, with no unexpected downshift

Bandwidth testing with ``ib_write_bw``
--------------------------------------

If no obvious network issues are present, ``ib_write_bw`` can be used to test bandwidth between nodes. Choose two compute nodes to act as server and client.

On the server node run:

.. code:: shell

  ib_write_bw -d <device> -a

On the client node run:

.. code:: shell

  ib_write_bw -d <device> -a <server>

Replace ``<device>`` with the HCA device name (for example, ``mlx5_0``) shown in ``ibstat``, and ``<server>`` with the hostname or IP address of the server node.

When troubleshooting GPUDirect RDMA, it is often useful to compare host-memory results with GPU-memory results. Depending on how ``perftest`` was built, the GPU-memory options may include commands such as:

.. code:: shell

  # GPU memory path
  ib_write_bw -d mlx5_0 --use_cuda=<gpu_index> -a

  # DMA-BUF path, when supported by the kernel, driver, and perftest build
  ib_write_bw -d mlx5_0 --use_cuda=<gpu_index> --use_cuda_dmabuf -a

Platforms such as GB300 may also expose additional data-direct modes in certain ``perftest`` builds. Check ``ib_write_bw --help`` on your system for the exact GPU-memory options supported by that build before comparing results.

Latency testing with ``ib_write_lat``
-------------------------------------

Bandwidth can look healthy while tail latency is still poor. If the issue looks like stalls, long tails, or intermittent slowdowns, also measure point-to-point latency:

.. code:: shell

  # Server node
  ib_write_lat -d <device>

  # Client node
  ib_write_lat -d <device> <server>

High or unstable latency, especially if bandwidth also fluctuates, can indicate congestion, retransmissions, or a degraded link.
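
Before starting the bandwidth and latency tests, the port health criteria listed under NIC-level diagnostics can be checked in one pass. The following is a minimal sketch, not part of NCCL or of the InfiniBand tools: ``summarize_ibstat`` is a hypothetical helper name, and it assumes the usual ``ibstat`` text layout (a ``CA '<name>'`` line followed by indented ``Port N:``, ``State:``, ``Physical state:``, ``Rate:``, and ``Link layer:`` lines).

```shell
# Hypothetical helper: summarize ibstat-style text read from stdin,
# printing one line per port. Returns non-zero if any port is not
# Active with a LinkUp physical state.
summarize_ibstat() {
    awk '
        # Print the port collected so far, if any, and flag unhealthy ones.
        function emit() {
            if (port == "") return
            verdict = (state == "Active" && phys == "LinkUp") ? "OK" : "CHECK"
            if (verdict != "OK") bad = 1
            printf "%s port %s: state=%s phys=%s rate=%s layer=%s [%s]\n",
                   ca, port, state, phys, rate, layer, verdict
            port = ""
        }
        # A new adapter starts: remember its name without the quotes.
        /^CA / { emit(); ca = $2; gsub("\047", "", ca) }
        # A new port starts: reset the per-port fields.
        /^[ \t]+Port [0-9]+:/ {
            emit()
            port = $2; sub(/:$/, "", port)
            state = phys = rate = layer = "?"
        }
        /^[ \t]+State:/          { state = $2 }
        /^[ \t]+Physical state:/ { phys  = $3 }
        /^[ \t]+Rate:/           { rate  = $2 }
        /^[ \t]+Link layer:/     { layer = $3 }
        END { emit(); exit (bad + 0) }
    '
}
```

A typical invocation would be ``ibstat | summarize_ibstat``. The helper only checks state and physical state; whether the reported rate and link layer are the expected ones still has to be judged against your fabric.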

RDMA protocol statistics
------------------------

``rdma statistic`` can help identify retransmissions or packet-loss-like symptoms at the RDMA layer:

.. code:: shell

  rdma statistic

Look for counters such as ``rnr_nak_retry_err``, ``packet_seq_err``, ``implied_nak_seq_err``, or ``local_ack_timeout_err`` increasing while a test is running. Non-zero or growing values here often indicate loss, retries, or other transport-level problems.

Link health with ``mlxlink``
----------------------------

If the issue looks like degraded PCIe, C2C, or physical port health, check the link directly with ``mlxlink``:

.. code:: shell

  mlxlink -d <device>

Use this to confirm the expected link width, link speed, error state, and overall port health. It is especially useful when bandwidth or latency has degraded without an obvious software configuration problem.

The next sections cover network troubleshooting steps specific to InfiniBand and RoCE fabrics.

InfiniBand
==========

Subnet Manager
--------------

InfiniBand fabrics require a Subnet Manager (SM) to be running. Check SM status with:

.. code:: shell

  sudo sminfo

If ``sminfo`` fails or shows no SM, ensure the SM is running on at least one node (for example, the ``opensm`` service).

Port error counters
-------------------

Use ``perfquery`` to check for link errors, symbol errors, or packet discards that may indicate cable or switch issues:

.. code:: shell

  sudo perfquery -x

Non-zero values in error counters, especially ``SymbolErrorCounter``, ``LinkErrorRecoveryCounter``, and ``LinkDownedCounter``, may indicate hardware or cabling problems.

Connectivity testing
--------------------

If ``ibstat`` and ``ibstatus`` do not indicate issues but a connectivity problem is suspected, ``ibping`` can be used to test connectivity between nodes. ``ibping`` requires running a server on the remote node first:

.. code:: shell

  # On the remote node, start the ibping server.
  sudo ibping -S

  # On the local node, ping the server using its LID.
  sudo ibping <lid>

Comprehensive diagnostics
-------------------------

For more thorough fabric-wide diagnostics, which usually require SM access, use:

.. code:: shell

  sudo ibdiagnet

This checks topology and routing, and reports errors across the entire fabric.

For example, if NCCL works on some rails but repeatedly fails or slows down on one specific rail, run ``ibdiagnet`` and inspect the generated report for missing links, bad ports, or routing inconsistencies affecting the switch ports connected to that rail before retrying the NCCL job.

Troubleshooting NCCL errors
---------------------------

A common issue seen with InfiniBand is the library not being able to register sufficient pinned memory. In such cases you may see an error like:

.. code:: shell

  NCCL WARN Call to ibv_create_qp failed

or:

.. code:: shell

  NCCL WARN Call to ibv_reg_mr failed

The solution is to remove the user limits on registering pinned memory. This can be done by adding these lines:

.. code:: shell

  * soft memlock unlimited
  * hard memlock unlimited

to the ``/etc/security/limits.conf`` configuration file or equivalent on your Linux distribution.

Note that changes to ``limits.conf`` typically take effect only after the user logs out and back in or the node is rebooted. On Slurm clusters, also ensure that jobs are launched with the updated limits. Verify the new limits with ``ulimit -l``.

RDMA over Converged Ethernet (RoCE)
===================================

Diagnosing inter-node problems on RoCE fabrics requires different tools than on InfiniBand.

Port error counters
-------------------

You can use ``ethtool`` to check port counters:

.. code:: shell

  ethtool -S <interface>

Check for errors, drops, pause frames, or PFC-related counters. Non-zero values in these counters may indicate hardware problems, cabling issues, or lossless fabric misconfiguration.

Congestion indicators
---------------------

RoCE performance problems are often tied to congestion control or lossless fabric settings rather than outright link failure. In addition to ``ethtool``, inspect the switch and NIC counters that your environment exposes for PFC, ECN, CNP, or queue drops.

If these counters increase while latency tails grow or throughput becomes unstable, investigate congestion control and lossless-fabric configuration before changing NCCL.

Connectivity testing
--------------------

If port-level checks do not indicate any issues but a connectivity problem is still suspected, ``rping`` can be used to test connectivity between nodes. Note that ``rping`` requires a server to be running on the remote node first:

.. code:: shell

  # On the remote node, start the rping server.
  rping -s -a <server_ip> -V -C 10

  # On the local node, ping the server using its IP.
  rping -c -a <server_ip> -S <size> -V -C 10

Here ``<server_ip>`` is the IP address of the remote node's RoCE interface and ``<size>`` is the message size in bytes.

Troubleshooting NCCL errors
---------------------------

A common issue seen with RoCE is an incorrect GID index being selected for the RoCE v2 NICs. This can result in the following error:

.. code:: shell

  NCCL WARN Call to ibv_modify_qp failed with error Invalid argument

With NCCL 2.21 and later the GID index is dynamically selected, but with prior versions the user would need to run:

.. code:: shell

  show_gids

and then set ``NCCL_IB_GID_INDEX`` to the GID index for the RoCE v2 GID. With NCCL 2.21 and later releases, this environment variable should *not* be set.

Users may also need to set ``NCCL_IB_TC`` when using RoCE-based fabrics. Refer to your vendor's documentation for the values this should be set to.
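
Several of the RoCE checks above depend on whether a counter is growing during a test rather than on its absolute value. One way to see this is to snapshot the counters before and after a run and compare them. The following is a minimal sketch under stated assumptions: ``growing_counters`` is a hypothetical helper name, and both snapshot files are assumed to hold ``name: value`` lines in the layout printed by ``ethtool -S``.

```shell
# Hypothetical helper: print counters whose value grew between two
# snapshots of "name: value" lines.
# Usage: growing_counters BEFORE_FILE AFTER_FILE
growing_counters() {
    awk -F': *' '
        # First file: remember the starting value of every numeric counter.
        NR == FNR {
            if ($2 ~ /^[0-9]+$/) {
                name = $1; sub(/^[ \t]+/, "", name)
                before[name] = $2
            }
            next
        }
        # Second file: report counters that are now larger than before.
        $2 ~ /^[0-9]+$/ {
            name = $1; sub(/^[ \t]+/, "", name)
            if ((name in before) && ($2 + 0 > before[name] + 0))
                printf "%s: %s -> %s\n", name, before[name], $2
        }
    ' "$1" "$2"
}
```

For example, save ``ethtool -S <interface>`` to ``before.txt``, run the NCCL job or ``rping`` test, save the output again to ``after.txt``, and then run ``growing_counters before.txt after.txt``. Which counter names relate to PFC, ECN, or drops depends on the NIC driver, so the output still has to be interpreted against your vendor's documentation.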