Introduction#

About this paper#

This white paper provides detailed guidance on configuring virtual machines (VMs) to support AI/ML workloads when a hypervisor layer is deployed on top of HGX systems. Its goal is to equip datacenter administrators and platform operators with actionable best practices so that their virtualized AI/ML workloads run efficiently from both a performance and a cost perspective. The recommendations preserve topology awareness inside the VM in order to achieve near bare-metal performance for distributed ML training and AI inference. They are based on 8-GPU H200 HGX systems running a single VM configured for full GPU and NIC passthrough and deployed on a Red Hat Enterprise Linux KVM hypervisor. The methodology can also be extended to multi-VM scenarios, but that requires additional considerations.

Problem Statement#

AI frameworks such as PyTorch need precise information about the underlying hardware and the device mappings of the Graphics Processing Units (GPUs). This includes how the GPUs are arranged, where the Network Interface Cards (NICs) are placed, the non-uniform memory access (NUMA) configuration, whether GPUs are interconnected through PCIe switches or via NVLink, and more. This information enables AI frameworks to form optimized communication paths for collective operations. Without accurate topology information, the frameworks risk defaulting to suboptimal data paths, which increases latency and degrades overall performance.
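
To make this concrete, the sketch below uses the NVML Python bindings (the pynvml module from the nvidia-ml-py package, which must be installed in addition to the NVIDIA driver) to report how each GPU pair reaches GPU 0 over PCIe; `nvidia-smi topo -m` prints a similar matrix that also covers NVLink connectivity. This is an illustrative query only, not part of the reference configuration described later.

```python
# Illustrative sketch: query how GPU pairs are connected, using NVML via the
# pynvml bindings (requires the NVIDIA driver and the nvidia-ml-py package).
import pynvml

LEVELS = {
    pynvml.NVML_TOPOLOGY_INTERNAL:   "same board",
    pynvml.NVML_TOPOLOGY_SINGLE:     "single PCIe switch",
    pynvml.NVML_TOPOLOGY_MULTIPLE:   "multiple PCIe switches",
    pynvml.NVML_TOPOLOGY_HOSTBRIDGE: "same host bridge",
    pynvml.NVML_TOPOLOGY_NODE:       "same NUMA node",
    pynvml.NVML_TOPOLOGY_SYSTEM:     "across NUMA nodes",
}

pynvml.nvmlInit()
try:
    count = pynvml.nvmlDeviceGetCount()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
    # Report the closest common ancestor between GPU 0 and every peer;
    # frameworks and communication libraries rely on the same information
    # to choose communication paths.
    for j in range(1, count):
        level = pynvml.nvmlDeviceGetTopologyCommonAncestor(handles[0], handles[j])
        print(f"GPU0 <-> GPU{j}: {LEVELS.get(level, level)}")
finally:
    pynvml.nvmlShutdown()
```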

In a virtualized environment, VMs can incorrectly perceive all underlying resources as uniformly accessible, losing visibility into precise GPU and NIC placement and NUMA alignment. This abstraction can directly impact the performance of communication libraries such as NCCL (NVIDIA Collective Communication Library), which relies on accurate topology information to optimize communication routes for distributed model training. In multi-node AI systems with 4-8 GPUs per node, this improper mapping can incur significant communication overhead and result in substantially longer training and inference times.
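
One way to observe what NCCL actually sees inside the guest is to enable its topology logging when a distributed job starts. The sketch below assumes a single-node PyTorch job launched with torchrun; the dump-file path is an arbitrary example, and only standard NCCL environment variables (NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_TOPO_DUMP_FILE) are used.

```python
# Illustrative sketch: surface the topology NCCL detects inside the VM when a
# PyTorch distributed job starts (single node, launched with torchrun).
# NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and NCCL_TOPO_DUMP_FILE are standard NCCL
# environment variables; the dump-file path below is an arbitrary example.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,GRAPH")
os.environ.setdefault("NCCL_TOPO_DUMP_FILE", "/tmp/nccl_topo.xml")

import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# A single all-reduce forces NCCL to build its communication rings/trees;
# the log output and the dumped XML show the paths it chose.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
dist.destroy_process_group()
```

Launched with, for example, `torchrun --nproc_per_node=8 <script>.py`, the INIT/GRAPH log lines and the dumped XML show which rings and trees NCCL built from the topology visible to the VM.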

Target Audience#

This paper is intended for datacenter administrators, platform operators, and any individual, group, or organization responsible for designing, deploying, and maintaining virtualized infrastructure stacks for AI/ML workloads running on NVIDIA-Certified Systems in their data centers. It assumes the reader has Linux expertise and familiarity with server setup, hypervisor installation, and VM configuration.

Environment#

This white paper references an 8-GPU H200 HGX system running a single VM configured with full GPU and NIC passthrough, deployed on a Red Hat Enterprise Linux KVM hypervisor. The hypervisor provides a virtualization layer for deploying VMs that can host application containers running AI frameworks and services such as TensorFlow, PyTorch, NVIDIA NIM, or custom AI applications.

The benefits of introducing a hypervisor include:

  • The hypervisor layer abstracts the underlying GPUs, NICs, and NUMA resources. This abstraction simplifies future changes, such as hardware modifications/upgrades, adjusting NUMA configurations, or introducing new GPUs, without re-architecting the entire environment.

  • The hypervisor provides a stable, standardized runtime for AI frameworks and applications and ensures consistent, reliable conditions for software stack changes (e.g., driver updates or application upgrades).

  • The hypervisor enables independent updates to the host OS without directly impacting the VM’s runtime environment. Critical patches and driver updates can be applied to the host while the VM remains stable, reducing downtime and complexity.

  • The hypervisor establishes the foundation for future scalability. As use cases evolve, additional VMs can be introduced rapidly on the existing virtualization infrastructure, easing the transition toward concurrent multi-tenant deployments that consolidate resources and optimize capital expenditure.

To leverage these benefits when running training or inference workloads in virtualized environments without incurring performance penalties, the virtual machine(s) must be configured in a topology-aware manner.
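
As a quick sanity check that this awareness actually reaches the guest, the short sketch below (illustrative only, not part of the reference configuration) reads the NUMA node that each passed-through NVIDIA PCI device reports inside the VM from sysfs; a value of -1 means the guest has no NUMA placement for that device.

```python
# Illustrative sketch: check whether the guest OS exposes a per-GPU NUMA
# placement at all. On a topology-aware VM each NVIDIA device reports a
# valid NUMA node; -1 means the guest has no NUMA information for it.
from pathlib import Path

NVIDIA_VENDOR_ID = "0x10de"  # PCI vendor ID for NVIDIA devices

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    vendor = (dev / "vendor").read_text().strip()
    if vendor != NVIDIA_VENDOR_ID:
        continue
    numa_node = (dev / "numa_node").read_text().strip()
    print(f"{dev.name}: NUMA node {numa_node}")
```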