GPU troubleshooting

GPU Direct

NCCL heavily relies on GPU Direct for inter-GPU communication. This refers to the ability for a GPU to directly communicate with another device, such as another GPU or a network card, using direct point-to-point PCI messages.

Direct point-to-point PCI messages can fail or perform poorly for a variety of reasons, like missing components, a bad configuration of a virtual machine or a container, or some BIOS settings.

GPU-to-GPU communication

Peer-to-peer GPU memory access. For GPU-to-GPU traffic, NCCL favors peer-to-peer transport when CUDA reports that GPUs can access each other’s memory directly (typically over NVLink, or over PCIe when the topology and driver allow it).

Checking peer-to-peer GPU memory access. Run nvidia-smi topo -p2p <capability> to print a matrix of P2P status between GPU pairs. Set <capability> to p for PCIe or n for NVLink.

For example, on a healthy PCI Express-based 8-GPU system with full peer-to-peer GPU memory access, the matrix from nvidia-smi topo -p2p p typically looks like the example below:

          GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
GPU0      X       OK      OK      OK      OK      OK      OK      OK
GPU1      OK      X       OK      OK      OK      OK      OK      OK
GPU2      OK      OK      X       OK      OK      OK      OK      OK
GPU3      OK      OK      OK      X       OK      OK      OK      OK
GPU4      OK      OK      OK      OK      X       OK      OK      OK
GPU5      OK      OK      OK      OK      OK      X       OK      OK
GPU6      OK      OK      OK      OK      OK      OK      X       OK
GPU7      OK      OK      OK      OK      OK      OK      OK      X
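
On larger systems the matrix can be tedious to scan by eye. The sketch below is one way to list the GPU pairs whose status is anything other than OK. It assumes the layout shown above (a header row of GPU names followed by one row per GPU); the function name list_p2p_failures is hypothetical and is not part of nvidia-smi or NCCL.

```shell
# list_p2p_failures: read a P2P status matrix on stdin and print one line
# "ROW_GPU COLUMN_GPU STATUS" for every pair that is neither "X" nor "OK".
# Hypothetical helper; assumes the matrix layout shown above.
list_p2p_failures() {
  awk '
    # The first line that starts with a GPU name is the column header.
    !have_header && $1 ~ /^GPU[0-9]+$/ {
      for (i = 1; i <= NF; i++) { col[i] = $i }
      have_header = 1
      next
    }
    # Later lines that start with a GPU name are matrix rows.
    have_header && $1 ~ /^GPU[0-9]+$/ {
      for (i = 2; i <= NF; i++) {
        if ($i != "X" && $i != "OK") { print $1, col[i - 1], $i }
      }
    }
  '
}

# Usage (requires nvidia-smi on the node):
#   nvidia-smi topo -p2p p | list_p2p_failures
```

For a matrix like the one above, where every pair reports OK, the function prints nothing.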

If peer-to-peer GPU memory access does not work as expected with NCCL even when nvidia-smi topo -p2p shows OK for the GPU pairs you use, one common cause is PCI Access Control Services (ACS); see the PCI Access Control Services (ACS) section below.

CUDA sample. The cuda-samples repository includes a useful program for checking peer-to-peer GPU memory access. Follow the instructions there to build the sample.

simpleP2P (Samples/0_Introduction/simpleP2P) checks the transferred data after each transfer to confirm that the copy completed successfully. A successful run prints additional lines earlier in the log; the excerpt below is truncated and shows only the final lines:

⋮
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

The last line of the run should read Test passed.

To measure the available bandwidth between GPUs, we recommend using nvbandwidth rather than the CUDA sample above. Download and build it following the instructions at https://github.com/NVIDIA/nvbandwidth.

Disabling peer-to-peer GPU memory access for testing. To see whether a problem is related to the P2P transport, compare runs with peer-to-peer GPU memory access disabled or restricted using NCCL_P2P_DISABLE and NCCL_P2P_LEVEL (see the Environment Variables chapter).
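
A minimal sketch of such a comparison is shown below. The wrapper names run_without_p2p and run_with_p2p_level are hypothetical; they only set the NCCL environment variable for a single command, so the rest of the shell session is unaffected. Valid NCCL_P2P_LEVEL values are listed in the Environment Variables chapter.

```shell
# run_without_p2p CMD [ARGS...]
# Run CMD with the NCCL P2P transport disabled (hypothetical helper).
run_without_p2p() {
  env NCCL_P2P_DISABLE=1 "$@"
}

# run_with_p2p_level LEVEL CMD [ARGS...]
# Run CMD with P2P restricted to the given NCCL_P2P_LEVEL (hypothetical helper).
run_with_p2p_level() {
  level=$1
  shift
  env NCCL_P2P_LEVEL="$level" "$@"
}

# Usage, where ./my_nccl_app stands for your own program:
#   ./my_nccl_app                          # baseline
#   run_without_p2p ./my_nccl_app          # P2P transport disabled
#   run_with_p2p_level NVL ./my_nccl_app   # P2P only over NVLink
```

If the problem disappears when the P2P transport is disabled, the P2P path is a likely suspect.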

GPU-to-NIC communication

GPUs can also communicate directly with network cards using GPU Direct RDMA (GDRDMA). This requires compatible network cards and drivers, plus an extra kernel module called nvidia-peermem. The nvidia-peermem module is now supplied with the CUDA drivers, but it must be loaded after every boot on each node with:

sudo modprobe nvidia-peermem

To verify that the module is loaded, which does not require sudo, run the command below. Note that lsmod reports module names with underscores rather than dashes:

lsmod | grep nvidia_peermem
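
The same check can be scripted, for example in a job prologue. The sketch below reads /proc/modules, which is the data that lsmod formats, and converts dashes in the requested name to underscores because that is how the kernel lists loaded modules. The function name module_loaded and its optional second argument are hypothetical and exist only for illustration.

```shell
# module_loaded NAME [FILE]
# Succeed if kernel module NAME is listed in FILE (default: /proc/modules).
# Hypothetical helper, not part of NCCL or the NVIDIA driver.
module_loaded() {
  # Loaded modules are listed with underscores, so map "-" to "_".
  name=$(printf '%s' "$1" | sed 's/-/_/g')
  # The module name is the first field of each line.
  awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage:
#   module_loaded nvidia-peermem || echo "nvidia-peermem is not loaded"
```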

GDRDMA can also be enabled by using the DMA-BUF feature of recent Linux kernels combined with the open-source NVIDIA GPU driver. In this case, NCCL automatically detects and enables DMA-BUF, so the nvidia-peermem module is not needed.

PCI Access Control Services (ACS)

Baremetal systems

IO virtualization (also known as VT-d or IOMMU) can interfere with GPU Direct by redirecting all PCI point-to-point traffic to the CPU root complex, causing a significant performance reduction or even a hang. You can check whether ACS is enabled on PCI bridges by running:

sudo lspci -vvv | grep ACSCtl

If any line shows “SrcValid+”, ACS might be enabled. To see which PCI bridge each of those lines belongs to, look at the full lspci output:

sudo lspci -vvv
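
Matching each ACSCtl line to its device by hand is error-prone on systems with many bridges. The sketch below is one way to print only the bus IDs of devices whose ACSCtl line contains SrcValid+. It assumes the usual lspci -vvv layout, in which each device starts with its bus ID in the first column and its capabilities follow on indented lines; the function name list_acs_enabled is hypothetical.

```shell
# list_acs_enabled: read "lspci -vvv" output on stdin and print the bus ID
# of every device whose ACSCtl line shows SrcValid+ (hypothetical helper).
list_acs_enabled() {
  awk '
    # Device header lines start in the first column with the bus ID.
    /^[0-9a-fA-F]/ { dev = $1 }
    # ACSCtl holds the control bits; ACSCap lines are ignored on purpose
    # because they only describe what the device is capable of.
    /ACSCtl:/ && /SrcValid\+/ { print dev }
  '
}

# Usage:
#   sudo lspci -vvv | list_acs_enabled
```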

If ACS is enabled on PCI switches, disable it. On some systems this can be done from the BIOS by disabling IO virtualization or VT-d. For Broadcom PLX devices, it can be done from the OS, but the setting does not persist and must be reapplied after each reboot.

Use the command below to find the PCI bus IDs of PLX PCI bridges:

sudo lspci | grep PLX

Next, use setpci to disable ACS with the command below, replacing 03:00.0 with the PCI bus ID of each PCI bridge.

sudo setpci -s 03:00.0 ECAP_ACS+0x6.w=0000

Or you can use a script similar to this:

for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
  # Skip devices that do not expose the ACS extended capability
  if ! sudo setpci -v -s "${BDF}" ECAP_ACS+0x6.w > /dev/null 2>&1; then
    continue
  fi
  # Clear the ACS control register to disable ACS on this device
  sudo setpci -v -s "${BDF}" ECAP_ACS+0x6.w=0000
done

Virtual machines

Virtual machines require ACS to function, so disabling ACS is not an option. To run with maximum performance inside virtual machines, Address Translation Services (ATS) must be enabled in the network adapters.

Topology detection

NCCL relies on /sys to discover the PCI topology of GPUs and network cards. When running inside a virtual machine or container, make sure /sys is properly mounted. Having /sys expose a virtual PCI topology can result in sub-optimal performance.
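
One way to inspect what /sys exposes for a given device is to resolve its entry under /sys/bus/pci/devices, as in the sketch below. The resolved path lists the chain of bridges between the root complex and the device. The function name pci_sysfs_path and its optional second argument are hypothetical, and this is only a manual check, not a description of how NCCL itself walks the topology.

```shell
# pci_sysfs_path BDF [SYSROOT]
# Print the resolved sysfs path of PCI device BDF (for example 0000:03:00.0).
# SYSROOT defaults to /sys. Hypothetical helper for manual inspection.
pci_sysfs_path() {
  readlink -f "${2:-/sys}/bus/pci/devices/$1"
}

# Usage, where 0000:03:00.0 is a placeholder bus ID:
#   pci_sysfs_path 0000:03:00.0
```

If the command fails, /sys is probably not mounted or the device is not visible there. If a GPU and a network card that share a PCI switch on the host show no common bridge in their paths, /sys may be exposing a virtual topology.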