###################
GPU troubleshooting
###################

.. highlight:: shell

**********
GPU Direct
**********

NCCL relies heavily on GPU Direct for inter-GPU communication. GPU Direct is the ability of a GPU to communicate directly with another device, such as another GPU or a network card, using direct point-to-point PCI messages.

Direct point-to-point PCI messages can fail or perform poorly for a variety of reasons, such as missing components, a misconfigured virtual machine or container, or certain BIOS settings.

GPU-to-GPU communication
========================

**Peer-to-peer GPU memory access.** For GPU-to-GPU traffic, NCCL favors *peer-to-peer* transport when CUDA reports that GPUs can access each other's memory directly (typically over NVLink, or over PCIe when the topology and driver allow it).

**Checking peer-to-peer GPU memory access.** You can use ``nvidia-smi topo -p2p`` followed by a capability letter to print a matrix of P2P status between GPU pairs. The letter is ``p`` for PCIe and ``n`` for NVLink. For example, on a healthy PCI Express-based 8-GPU system with full peer-to-peer GPU memory access, the matrix from ``nvidia-smi topo -p2p p`` typically looks like the example below:

.. code-block:: text

         GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
   GPU0  X     OK    OK    OK    OK    OK    OK    OK
   GPU1  OK    X     OK    OK    OK    OK    OK    OK
   GPU2  OK    OK    X     OK    OK    OK    OK    OK
   GPU3  OK    OK    OK    X     OK    OK    OK    OK
   GPU4  OK    OK    OK    OK    X     OK    OK    OK
   GPU5  OK    OK    OK    OK    OK    X     OK    OK
   GPU6  OK    OK    OK    OK    OK    OK    X     OK
   GPU7  OK    OK    OK    OK    OK    OK    OK    X

If peer-to-peer GPU memory access does not work as expected with NCCL even when ``nvidia-smi topo -p2p`` shows ``OK`` for peer access on the GPU pairs you use, one common cause is **PCI Access Control Services (ACS)**; see :ref:`troubleshooting_acs`.

**CUDA sample.** The ``cuda-samples`` repository includes a useful program for checking peer-to-peer GPU memory access. Follow the instructions there to build the sample.

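When many nodes need checking, the P2P matrix printed by ``nvidia-smi topo -p2p`` can be scanned by a script instead of by eye. The sketch below is only an illustration and is not part of NCCL or ``nvidia-smi``: the helper name ``p2p_bad_pairs`` is hypothetical, and it assumes the layout of the example matrix above (a header row of GPU names followed by one row per GPU). Any cell other than ``OK`` or ``X`` is reported.

```shell
# p2p_bad_pairs (hypothetical helper): read a P2P matrix on stdin and print
# "ROW_GPU COLUMN_GPU STATUS" for every cell that is neither OK nor X (self).
p2p_bad_pairs() {
    awk '
        # Header row: the first line whose first field is a GPU name.
        !have_header && $1 ~ /^GPU[0-9]+$/ {
            for (i = 1; i <= NF; i++) col[i] = $i
            ncols = NF
            have_header = 1
            next
        }
        # Data rows: a GPU name followed by one status per header column.
        have_header && $1 ~ /^GPU[0-9]+$/ && NF == ncols + 1 {
            for (i = 2; i <= NF; i++)
                if ($i != "OK" && $i != "X")
                    print $1, col[i - 1], $i
        }
    '
}
```

Under those assumptions, ``nvidia-smi topo -p2p p | p2p_bad_pairs`` prints nothing when every pair reports ``OK``, and one line per affected GPU pair otherwise.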
**simpleP2P** (``Samples/0_Introduction/simpleP2P``) checks the transferred data after each transfer to confirm that the copy completed successfully. A successful run prints additional lines earlier in the log; the excerpt below is truncated and shows only the final lines:

.. code-block:: text

   ⋮
   Preparing host buffer and memcpy to GPU0...
   Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
   Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
   Copy data back to host from GPU0 and verify results...
   Disabling peer access...
   Shutting down...
   Test passed

The last line of the run should read ``Test passed``.

To measure the available bandwidth between GPUs, we recommend using ``nvbandwidth`` rather than the CUDA sample above. Download and build it following the instructions at https://github.com/NVIDIA/nvbandwidth.

**Disabling peer-to-peer GPU memory access for testing.** To see whether a problem is related to the P2P transport, compare runs with peer-to-peer GPU memory access disabled or restricted using :ref:`env_NCCL_P2P_DISABLE` and :ref:`env_NCCL_P2P_LEVEL` (see the :doc:`../env` chapter).

GPU-to-NIC communication
========================

GPUs can also communicate directly with network cards using GPU Direct RDMA (GDRDMA). This requires compatible network cards and drivers, plus an extra kernel module called ``nvidia-peermem``. The ``nvidia-peermem`` module is now supplied with the CUDA drivers; however, it must be loaded on each node boot with:

.. code:: shell

   sudo modprobe nvidia-peermem

To verify that the module has been loaded, which does not require ``sudo``, run the command below. Note that ``lsmod`` lists the module name with an underscore:

.. code:: shell

   lsmod | grep nvidia_peermem

GDRDMA can also be enabled by using the DMA-BUF feature of recent Linux kernels combined with the open source NVIDIA GPU driver. In this case, NCCL will automatically detect and enable DMA-BUF, so the ``nvidia-peermem`` module will not be necessary.

.. _troubleshooting_acs:

PCI Access Control Services (ACS)
=================================

**Bare-metal systems**

IO virtualization (also known as VT-d or IOMMU) can interfere with GPU Direct by redirecting all PCI point-to-point traffic to the CPU root complex, causing a significant performance reduction or even a hang. You can check whether ACS is enabled on PCI bridges by running:

.. code:: shell

   sudo lspci -vvv | grep ACSCtl

If any line shows "SrcValid+", ACS might be enabled. The full output of ``lspci`` shows which PCI bridges have ACS enabled:

.. code:: shell

   sudo lspci -vvv

If PCI switches have ACS enabled, it needs to be disabled. On some systems this can be done from the BIOS by disabling IO virtualization or VT-d. For Broadcom PLX devices, it can be done from the OS but must be done again after each reboot.

Use the command below to find the PCI bus IDs of PLX PCI bridges:

.. code:: shell

   sudo lspci | grep PLX

Next, use ``setpci`` to disable ACS with the command below, replacing ``03:00.0`` with the PCI bus ID of each PCI bridge:

.. code:: shell

   sudo setpci -s 03:00.0 ECAP_ACS+0x6.w=0000

Alternatively, you can use a script similar to this:

.. code:: shell

   # Walk every PCI device and clear the ACS control register where present.
   for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
       # Skip devices that do not expose the ACS capability.
       if ! sudo setpci -v -s "${BDF}" ECAP_ACS+0x6.w > /dev/null 2>&1; then
           continue
       fi
       sudo setpci -v -s "${BDF}" ECAP_ACS+0x6.w=0000
   done

**Virtual machines**

Virtual machines require ACS to function, so disabling ACS is not an option. To run with maximum performance inside virtual machines, Address Translation Services (ATS) needs to be enabled in network adapters.

******************
Topology detection
******************

NCCL relies on ``/sys`` to discover the PCI topology of GPUs and network cards. When running inside a virtual machine or container, make sure ``/sys`` is properly mounted. Having ``/sys`` expose a virtual PCI topology can result in sub-optimal performance.
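
A quick sanity check inside a container or virtual machine is to confirm that ``/sys`` is present and lists PCI devices. The sketch below is an illustration rather than an NCCL tool: the helper name ``check_sysfs_pci`` is hypothetical, and its optional argument (an alternate sysfs root) exists only to make the function easy to try against a test directory. A passing check shows that PCI devices are visible; it cannot tell whether the exposed topology is the physical one or a virtual one.

```shell
# check_sysfs_pci (hypothetical helper): verify that a sysfs tree lists PCI
# devices. The optional first argument overrides the sysfs root (default /sys).
check_sysfs_pci() {
    root="${1:-/sys}"
    devdir="$root/bus/pci/devices"
    if [ ! -d "$devdir" ]; then
        echo "missing: $devdir"
        return 1
    fi
    # Each entry in this directory corresponds to one PCI device.
    count=$(ls "$devdir" | wc -l | tr -d ' ')
    if [ "$count" -eq 0 ]; then
        echo "no PCI devices under $devdir"
        return 1
    fi
    echo "ok: $count PCI devices under $devdir"
}
```

Calling ``check_sysfs_pci`` with no argument inspects ``/sys``; a non-zero return status suggests that ``/sys`` is not mounted or exposes no PCI devices in the current environment.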