Configuring Virtual Machines#
There are two ways you can accurately mirror the underlying hardware topology inside the Virtual Machine:
Customize the virtual machine configuration with stub PCIe switches, presenting the devices within the PCIe groups with their matching components and in the correct sequence, or
Customize the NCCL topology file (XML) so that the virtual machine's resources, in their natural enumeration order, match the underlying real topology
VM Configuration Pre-requisites#
Install the base operating system
Disable the nouveau module and lock the devices to avoid resource conflicts and to prevent the GPUs from being used by the host OS.
sudo bash -c "echo blacklist nouveau > /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo bash -c "echo options nouveau modeset=0 >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf"
sudo update-initramfs -u
sudo grubby --args="pci-stub.ids=10de:22a4,10de:2336,15b3:1021" --update-kernel=ALL
Note
This example was created with the devices below - check your system as the device IDs may vary.
10de:22a4 -- 3rd Gen. NVLink switches
10de:2336 -- H200 SXM
15b3:1021 -- ConnectX-7
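The device IDs present on your own system can be confirmed with lspci; for example (the vendor filter patterns below are illustrative and cover the NVIDIA and Mellanox/ConnectX devices used in this example):
lspci -nn | grep -Ei 'nvidia|mellanox'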
Important
Make sure you are not connected through a ConnectX-7 card; otherwise, access could be lost after the reboot. This example assumes the use of a LOM card.
Alternatively, for distributions without Grubby:
vi /etc/default/grub
Add these parameters to GRUB_CMDLINE_LINUX:
GRUB_CMDLINE_LINUX="pci-stub.ids=10de:22a4,10de:2336,15b3:1021"
Then update:
sudo update-grub
Install KVM, libvirt, and dependencies
sudo apt install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils
Install UEFI support
sudo apt install ovmf
Check KVM support
kvm-ok
If KVM isn’t working, check your system firmware settings and make sure all virtualization features are enabled.
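As an additional sanity check, you can confirm that the CPU virtualization extensions are exposed to the OS (a quick check only, not a replacement for kvm-ok; a non-zero count is expected):
grep -Ec '(vmx|svm)' /proc/cpuinfo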
Reboot to apply GRUB changes.
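After the reboot, you can optionally verify that the stub parameters took effect and that the target devices are no longer claimed by their regular drivers (illustrative checks; the vendor IDs match the earlier example):
cat /proc/cmdline
lspci -nnk -d 10de:
lspci -nnk -d 15b3: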
Create the base virtual machine (example below). Many other methods are available for this step, such as virt-manager and Cockpit. Make sure the OS variant, memory size, and vCPU count are appropriate for your system and the OS of your choice.
# Do not allocate all the available memory, as it will be fully reserved,
# and do not assign all the cores, as 1-to-1 pinning is applied later.
sudo virt-install \
  -n vm \
  --description "Test VM with full GPU passthrough" \
  --os-variant=ubuntu24.04 \
  --ram=1966080 \
  --vcpus=192 \
  --cpu host \
  --boot uefi \
  --disk path=/var/lib/libvirt/images/vm.img,bus=virtio,size=400 \
  --location /var/lib/libvirt/images/ubuntu-24.04.1-live-server-amd64.iso,kernel=casper/vmlinuz,initrd=casper/initrd \
  --graphics none \
  --network network=default
Power up the VM and install the VM OS
Disable nouveau on the Guest VM OS
On a full passthrough setup, that is, when all the GPUs of an HGX board are mapped to a single VM, the NVLink switches should be passed along to the virtual machine and nvidia-fabricmanager should be running inside the VM.
If the VM works and the Guest OS is running as intended, power off the VM to prepare it for the XML customizations.
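Inside the guest, a quick way to confirm this state is to check that nouveau is not loaded and that the fabric manager service is active (illustrative commands; they assume the NVIDIA driver and nvidia-fabricmanager packages are already installed in the guest):
lsmod | grep nouveau        # should return nothing
systemctl status nvidia-fabricmanager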
Editing a VM’s XML definition#
To achieve the correct vCPU, NUMA, and PCIe configuration for the guest VM, the VM’s domain XML requires editing. This is achieved by using the virsh command.
First, save a copy of the VM's domain XML to a file using 'virsh dumpxml'. For example, for the VM named 'rocky9':
sudo virsh dumpxml rocky9 > rocky9-cfg.xml
To edit the VM’s domain XML, use ‘virsh edit’. For example:
sudo virsh edit rocky9
This opens the XML in a text editor (defined by the $EDITOR or $VISUAL environment variables, and defaults to ‘vi’). Upon exiting the editor, the domain XML is verified and then applied, to take effect at the next VM boot.
Alternatively, a domain XML file may be edited independently and then applied to the VM configuration using ‘virsh define’:
sudo virsh define rocky9-cfg.xml --validate
In this case the domain that is updated is defined by the <name> element within the XML file.
Scenario: Full passthrough - All GPUs, NICs, Storage Controllers, NVSwitches#
In this configuration, all GPUs, NVswitch devices, GPUDirect-capable NICs and Storage Controllers are assigned to a single guest VM. The guest VM is configured with the majority of the server’s CPU and memory resources. Copy and edit the VM’s domain XML to set up the vCPU configuration, pinning, and PCI Express configuration.
Virtual CPU configuration#
Virtual CPU sockets and NUMA nodes#
Edit the <cpu> element in the XML to configure the number of vCPU cores / hyperthreads exposed to the guest, the number of CPU sockets they occupy, and the NUMA nodes / memory associated with each socket. For a single VM utilizing the full resources of the server, CPUs and memory from all sockets will be assigned to the VM, so the virtual socket / NUMA configuration should match that of the underlying physical configuration (which can be determined using lstopo or lscpu).
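For reference, the physical socket / NUMA layout that the virtual configuration mirrors can be read from the host (a quick check; the exact fields and counts will differ between systems):
lscpu | grep -E 'Socket|NUMA|Core'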
This example XML configuration creates two virtual CPU sockets with 40 cores / 40 threads per socket (no hyperthreads), and ~1TB of RAM per socket organized in a single NUMA node:
...
<cpu mode='host-passthrough' check='none' migratable='on'>
  <topology sockets='2' dies='1' clusters='1' cores='40' threads='1'/>
  <numa>
    <cell id='0' cpus='0-39' memory='939524096' unit='KiB'/>
    <cell id='1' cpus='40-79' memory='939524096' unit='KiB'/>
  </numa>
</cpu>
...
The resulting topology (lstopo -sv) within the guest VM:
Machine (2015GB total)
  Package L#0
    NUMANode L#0 (P#0 1007GB)
    L3 L#0 (105MB)
      L2 L#0 (2048KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (2048KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
      ...
      L2 L#39 (2048KB) + L1d L#39 (48KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#78)
  Package L#1
    NUMANode L#1 (P#1 1008GB)
    L3 L#1 (105MB)
      L2 L#40 (2048KB) + L1d L#40 (48KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#1)
      L2 L#41 (2048KB) + L1d L#41 (48KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#3)
      ...
      L2 L#79 (2048KB) + L1d L#79 (48KB) + L1i L#79 (32KB) + Core L#79 + PU L#79 (P#79)
  HostBridge
  ...
Virtual CPU pinning#
To take full advantage of NUMA configurations, it's essential that the guest's vCPUs are scheduled consistently on the same physical CPU socket / NUMA node. This can be achieved by pinning virtual CPUs to specific physical CPU cores using the <cputune> element. Use lscpu or virsh capabilities to determine how physical CPU cores on the server are numbered (this frequently differs between server models).
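A convenient way to see this numbering is the parsable output of lscpu, which lists each logical CPU with its NUMA node, socket, and core (a quick check; column availability may vary slightly between lscpu versions):
lscpu -p=cpu,node,socket,core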
This example XML configuration pins the 40 vCPUs on virtual socket 0 (vCPU numbers 0-39) to physical CPUs on socket 0, and the 40 vCPUs on virtual socket 1 (vCPU numbers 40-79) to physical CPUs on socket 1. (Note that on this server the physical CPU numbering has even-numbered CPUs on socket 0 / NUMA node 0, and odd-numbered CPUs on socket 1 / NUMA node 1).
...
<vcpu placement='static'>80</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='94'/>
  <vcpupin vcpu='1' cpuset='92'/>
  <vcpupin vcpu='2' cpuset='90'/>
  <vcpupin vcpu='3' cpuset='88'/>
  <vcpupin vcpu='4' cpuset='86'/>
  <vcpupin vcpu='5' cpuset='84'/>
  <vcpupin vcpu='6' cpuset='82'/>
  <vcpupin vcpu='7' cpuset='80'/>
  <vcpupin vcpu='8' cpuset='78'/>
  <vcpupin vcpu='9' cpuset='76'/>
  <vcpupin vcpu='10' cpuset='74'/>
  <vcpupin vcpu='11' cpuset='72'/>
  <vcpupin vcpu='12' cpuset='70'/>
  <vcpupin vcpu='13' cpuset='68'/>
  <vcpupin vcpu='14' cpuset='66'/>
  <vcpupin vcpu='15' cpuset='64'/>
  <vcpupin vcpu='16' cpuset='62'/>
  <vcpupin vcpu='17' cpuset='60'/>
  <vcpupin vcpu='18' cpuset='58'/>
  <vcpupin vcpu='19' cpuset='56'/>
  <vcpupin vcpu='20' cpuset='54'/>
  <vcpupin vcpu='21' cpuset='52'/>
  <vcpupin vcpu='22' cpuset='50'/>
  <vcpupin vcpu='23' cpuset='48'/>
  <vcpupin vcpu='24' cpuset='46'/>
  <vcpupin vcpu='25' cpuset='44'/>
  <vcpupin vcpu='26' cpuset='42'/>
  <vcpupin vcpu='27' cpuset='40'/>
  <vcpupin vcpu='28' cpuset='38'/>
  <vcpupin vcpu='29' cpuset='36'/>
  <vcpupin vcpu='30' cpuset='34'/>
  <vcpupin vcpu='31' cpuset='32'/>
  <vcpupin vcpu='32' cpuset='30'/>
  <vcpupin vcpu='33' cpuset='28'/>
  <vcpupin vcpu='34' cpuset='26'/>
  <vcpupin vcpu='35' cpuset='24'/>
  <vcpupin vcpu='36' cpuset='22'/>
  <vcpupin vcpu='37' cpuset='20'/>
  <vcpupin vcpu='38' cpuset='18'/>
  <vcpupin vcpu='39' cpuset='16'/>
  <vcpupin vcpu='40' cpuset='95'/>
  <vcpupin vcpu='41' cpuset='93'/>
  <vcpupin vcpu='42' cpuset='91'/>
  <vcpupin vcpu='43' cpuset='89'/>
  <vcpupin vcpu='44' cpuset='87'/>
  <vcpupin vcpu='45' cpuset='85'/>
  <vcpupin vcpu='46' cpuset='83'/>
  <vcpupin vcpu='47' cpuset='81'/>
  <vcpupin vcpu='48' cpuset='79'/>
  <vcpupin vcpu='49' cpuset='77'/>
  <vcpupin vcpu='50' cpuset='75'/>
  <vcpupin vcpu='51' cpuset='73'/>
  <vcpupin vcpu='52' cpuset='71'/>
  <vcpupin vcpu='53' cpuset='69'/>
  <vcpupin vcpu='54' cpuset='67'/>
  <vcpupin vcpu='55' cpuset='65'/>
  <vcpupin vcpu='56' cpuset='63'/>
  <vcpupin vcpu='57' cpuset='61'/>
  <vcpupin vcpu='58' cpuset='59'/>
  <vcpupin vcpu='59' cpuset='57'/>
  <vcpupin vcpu='60' cpuset='55'/>
  <vcpupin vcpu='61' cpuset='53'/>
  <vcpupin vcpu='62' cpuset='51'/>
  <vcpupin vcpu='63' cpuset='49'/>
  <vcpupin vcpu='64' cpuset='47'/>
  <vcpupin vcpu='65' cpuset='45'/>
  <vcpupin vcpu='66' cpuset='43'/>
  <vcpupin vcpu='67' cpuset='41'/>
  <vcpupin vcpu='68' cpuset='39'/>
  <vcpupin vcpu='69' cpuset='37'/>
  <vcpupin vcpu='70' cpuset='35'/>
  <vcpupin vcpu='71' cpuset='33'/>
  <vcpupin vcpu='72' cpuset='31'/>
  <vcpupin vcpu='73' cpuset='29'/>
  <vcpupin vcpu='74' cpuset='27'/>
  <vcpupin vcpu='75' cpuset='25'/>
  <vcpupin vcpu='76' cpuset='23'/>
  <vcpupin vcpu='77' cpuset='21'/>
  <vcpupin vcpu='78' cpuset='19'/>
  <vcpupin vcpu='79' cpuset='17'/>
</cputune>
...
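After the VM has been defined (and optionally while it is running), the applied pinning can be reviewed from the host (illustrative commands; 'rocky9' is the example VM name used earlier):
sudo virsh vcpupin rocky9
sudo virsh vcpuinfo rocky9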
Virtual PCI Express configuration#
To create a virtual PCIe hierarchy that’s the equivalent of the physical PCIe hierarchy on the HGX host system, it is necessary to create two separate hierarchies, each associated with one of the two virtual NUMA nodes in the system. This is achieved using virtual PCIe Expander Bus devices, PCIe root ports, and PCIe switches. The expander bus devices are each associated with a virtual NUMA node; everything below them in the hierarchy is then automatically associated with the same virtual NUMA node.
Please refer to the sample libvirt domain XML for the HGX H200 8-GPU platform in the Appendix section of this document.
Virtual PCIe expander bus#
Create virtual PCIe Expander Bus devices, each associated with one of the VM's NUMA nodes. Note that the number of NUMA nodes per socket depends not only on the CPU model; on some CPUs it can also be changed through a firmware parameter, usually referred to as NUMA Per Socket (NPS).
In this example two buses are created, one for each CPU socket / NUMA node in the virtual platform.
...
<!-- PCI expander bus NUMA node 0 -->
<controller type='pci' index='10' model='pcie-expander-bus'>
  <target busNr='0x20'>
    <node>0</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x6'/>
</controller>

<!-- PCI expander bus NUMA node 1 -->
<controller type='pci' index='11' model='pcie-expander-bus'>
  <target busNr='0x40'>
    <node>1</node>
  </target>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x7'/>
</controller>
...
index: a unique identifier for this PCIe device definition. The index value may be used in the bus attribute of subsequent PCIe device definitions, to associate them with this expansion bus.
model: set to 'pcie-expander-bus' to define a PCIe expansion bus root.
busNr: sets the root PCIe bus number for the hierarchy. Choose bus numbers with sufficient separation to allow the required number of PCIe devices within each hierarchy (each device occupies a subordinate bus number, as do intervening PCIe switches). The example shown here, with bus numbers 0x20 and 0x40, allows for ~32 devices / switches in each hierarchy.
node: indicates the NUMA node the hierarchy is associated with. The value used here must be established in an earlier <numa> definition in the VM's virtual CPU settings.
address: defines where the expansion bus roots exist in the virtual PCIe hierarchy. In this example, the buses are created as separate functions (set by the function attributes 0x6, 0x7) on a single device (set by the slot attribute 0x2), on bus 0x0, the root bus of the virtual hierarchy.
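Inside the guest, the NUMA node that each virtual PCIe device ends up associated with can be checked through sysfs, which is a simple way to confirm that the expander bus definitions took effect (a minimal check; a value of -1 indicates no NUMA association):
for d in /sys/bus/pci/devices/*; do echo "$d $(cat $d/numa_node)"; done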
Virtual PCIe root ports#
To plug additional virtual PCIe devices into an expander bus, create one or more virtual PCIe root ports. In this example, four virtual root ports are created to host four virtual PCIe switches, as the physical platform implements four PCIe switches on each CPU socket.
...
<!-- 4 root ports on bus 10 (index of upstream expander bus) NUMA node 0 -->
<controller type='pci' index='15' model='pcie-root-port'>
  <address type='pci' bus='10' slot='0x00' function='0x0' multifunction='on'/>
</controller>
<controller type='pci' index='16' model='pcie-root-port'>
  <address type='pci' bus='10' slot='0x00' function='0x1'/>
</controller>
<controller type='pci' index='17' model='pcie-root-port'>
  <address type='pci' bus='10' slot='0x00' function='0x2'/>
</controller>
<controller type='pci' index='18' model='pcie-root-port'>
  <address type='pci' bus='10' slot='0x00' function='0x3'/>
</controller>
...
index: a unique identifier for this PCIe device definition. The index value may be used in the bus attribute of a subsequent PCIe device definition, to effectively "plug" it into this root port.
model: set to 'pcie-root-port' to define a PCIe root port.
address: defines where the root ports exist in the virtual PCIe hierarchy. In this example, four root ports are created as separate functions (set by the function attributes 0x0, 0x1, 0x2, 0x3) on a single device (set by the slot attribute 0x0). The bus attribute is set to the index of the PCIe device that these ports are connected to; in this case bus='10' refers to the PCI expander bus earlier defined with index='10'.
Virtual PCIe switches#
To create a virtual PCIe switch under a PCIe root port, define an upstream port connected to the root port and one downstream port for each device that will be attached to the switch. In this example, a switch with four downstream ports is created:
...
<!-- 4 port PCIe switch on bus 15 (index of upstream root port) / func 0 -->
<controller type='pci' index='25' model='pcie-switch-upstream-port'>
  <address type='pci' bus='15' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='26' model='pcie-switch-downstream-port'>
  <address type='pci' bus='25' slot='0x00' function='0x0'/>
</controller>
<controller type='pci' index='27' model='pcie-switch-downstream-port'>
  <address type='pci' bus='25' slot='0x01' function='0x0'/>
</controller>
<controller type='pci' index='28' model='pcie-switch-downstream-port'>
  <address type='pci' bus='25' slot='0x02' function='0x0'/>
</controller>
<controller type='pci' index='29' model='pcie-switch-downstream-port'>
  <address type='pci' bus='25' slot='0x03' function='0x0'/>
</controller>
...
index: a unique identifier for a PCIe device definition. The index value of the downstream ports may be used in the bus attribute of a subsequent PCIe device definition, to effectively "plug" that device into the downstream port.
model: set to 'pcie-switch-upstream-port' to define a PCIe switch's upstream port. Set to 'pcie-switch-downstream-port' to define a PCIe switch downstream port.
address: defines where the switch's upstream/downstream ports exist in the virtual PCIe hierarchy. For the upstream port, the bus number should be the index value of the root port that the switch is connected to; in this case bus='15' refers to the PCIe root port earlier defined with index='15'. For the four downstream ports, the bus number should be set to the index value of the upstream port. The slot number should be set to an incrementing value starting at 0x0.
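Once the VM is booted, the resulting hierarchy can be inspected from inside the guest with the tree view of lspci, which should show each upstream port with its downstream ports and, later, the devices plugged into them (an illustrative check):
lspci -tv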
Passthrough device assignment#
Passthrough devices may be assigned to the VM using a management tool, or by directly editing the VM’s domain XML. Each method results in a <hostdev> element in the domain XML. For example, passthrough of a full PCIe device results in the following form of <hostdev> element:
...
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0xbc' slot='0x00' function='0x0'/>
  </source>
  <alias name='hostdev0'/>
  <address type='pci' domain='0x0000' bus='57' slot='0x00' function='0x0'/>
</hostdev>
...
source: contains an <address> element that defines the physical device to be passed through, by its domain, bus, device ('slot'), and function number. (This is reported by 'lspci' on the host.)
address: defines where the passthrough device should be located in the VM's virtual PCIe hierarchy. The 'bus' value used here should refer to the 'index' value of a PCIe root port or switch downstream port defined earlier in the domain XML.
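The host-side addresses needed for the <source> element can be listed on the host before editing the XML (illustrative commands; the 10de vendor ID matches the NVIDIA GPUs and NVSwitch devices in this example):
lspci -Dnn -d 10de:
virsh nodedev-list --cap pci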
To achieve a virtual PCIe topology that’s equivalent to the physical topology, edit the address element of each passthrough device so it’s assigned to the same virtual PCIe switch as peer devices that are also assigned to the VM. An example of this is in the Appendix, Example libvirt domain XML for HGX H200 8-GPU platform.
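After the VM boots with all devices assigned and the NVIDIA driver installed in the guest, the topology seen by applications can be compared against the physical host (an optional check; it assumes the driver and, for a full passthrough setup, nvidia-fabricmanager are running in the guest):
nvidia-smi topo -m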