PCIe Multi-GPU systems

On multi-GPU systems, the Triton server uses peer-to-peer memory copy to transfer data between GPUs whenever this feature is available, i.e., when cudaDeviceCanAccessPeer() reports that peer access is possible.

However, on bare-metal Linux systems with a PCIe topology, peer-to-peer memory copy is not supported while IOMMU is enabled (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#iommu-on-linux). When IOMMU is enabled, it is recommended that it be set to passthrough by adding the Linux kernel parameter iommu=pt.
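Besides the kernel log, the command line that the running kernel was booted with can be read from /proc/cmdline. The following is a minimal sketch of such a check; the function name has_iommu_pt is a hypothetical helper introduced here for illustration and is not part of Triton or CUDA.

```shell
#!/bin/sh
# has_iommu_pt is a hypothetical helper name (an assumption for this sketch).
# It succeeds (exit status 0) when the kernel command line passed as $1
# contains iommu=pt as a whole, space-separated parameter.
has_iommu_pt() {
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Inspect the command line of the currently running kernel.
if [ -r /proc/cmdline ] && has_iommu_pt "$(cat /proc/cmdline)"; then
    echo "iommu=pt is set"
else
    echo "iommu=pt is not set"
fi
```

This only checks for the boot parameter itself. It does not tell whether IOMMU is enabled in the first place, which is what the first step below checks.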

The following are the steps to set IOMMU to passthrough in GRUB:

  1. Run the following command and check if it displays any output. If the command produces output, it indicates that IOMMU is enabled, and you should follow the steps below. If there is no output, no further action is required.

    dmesg | grep -e DMAR -e IOMMU
    
  2. Check if IOMMU is set to passthrough by running the following command:

    dmesg | grep -i -e iommu=pt -e 'iommu.*passthrough'

    If the command produces output, IOMMU is already set to passthrough, and no further action is required. If there is no output, follow the steps below.
    
  3. Open the /etc/default/grub file for editing and add iommu=pt to the GRUB_CMDLINE_LINUX option:

    e.g.,
    
    .....
    
    GRUB_CMDLINE_LINUX="crashkernel=auto quiet iommu=pt"
    
    .....
    
  4. Use grub-mkconfig or grub2-mkconfig, depending on the system’s OS, to generate the configuration file:

    • On systems with BIOS:

      grub-mkconfig -o /boot/grub/grub.cfg #on ubuntu, debian
      
      grub2-mkconfig -o /boot/grub2/grub.cfg #on centos, rockylinux
      
    • On systems with UEFI:

      grub-mkconfig -o /boot/efi/EFI/<os_name>/grub.cfg #on ubuntu, debian
      
      grub2-mkconfig -o /boot/efi/EFI/<os_name>/grub.cfg #on centos, rockylinux
      
      #replace <os_name> with ubuntu, centos, debian, or rocky
      
  5. On Ubuntu and Debian, if grub-mkconfig is not available, install it before running step 4:

    apt install grub-common
    
  6. Reboot the system:

    systemctl reboot
    
  7. Verify that IOMMU is set to passthrough by repeating the command from step 2.
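
The edit in step 3 can also be scripted. The following is a minimal sketch, assuming that GRUB_CMDLINE_LINUX is written on a single line with a double-quoted value; add_iommu_pt is a hypothetical helper name introduced here for illustration. The demonstration runs against a temporary file rather than the real /etc/default/grub.

```shell
#!/bin/sh
# add_iommu_pt is a hypothetical helper name (an assumption for this sketch).
# It appends iommu=pt to a double-quoted GRUB_CMDLINE_LINUX="..." line in
# the file given as $1, unless that line already contains iommu=pt.
add_iommu_pt() {
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX=.*[" ]iommu=pt[" ]' "$file"; then
        return 0    # already present, nothing to do
    fi
    cp "$file" "$file.bak"    # keep a backup of the original file
    # Insert " iommu=pt" just before the closing quote of the value.
    sed 's/^\(GRUB_CMDLINE_LINUX="[^"]*\)"/\1 iommu=pt"/' "$file.bak" > "$file"
}

# Demonstration on a temporary file with sample content.
tmp=$(mktemp)
printf '%s\n' 'GRUB_TIMEOUT=5' 'GRUB_CMDLINE_LINUX="crashkernel=auto quiet"' > "$tmp"
add_iommu_pt "$tmp"
cat "$tmp"
rm -f "$tmp" "$tmp.bak"
```

Running the helper on /etc/default/grub itself requires root privileges, and the change only takes effect after the configuration file is regenerated (step 4) and the system is rebooted (step 6).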