Performance Requirements#

This section describes the implementation details required to meet performance requirements for AI training and inference workloads while hosting multiple virtual machines on the same worker host.

To meet the performance demands of multi-node AI workloads, the workload must have native access to networking, GPUs, and storage, regardless of whether it runs on bare metal, as K8s on Linux, or within a VM. It is therefore important to implement the hardware passthrough techniques discussed here to achieve optimal performance.

This RA also outlines how to run multi-GPU and multi-node AI inference and training workloads on either dedicated physical worker hosts or shared worker hosts to maximize hardware utilization. NVIDIA software such as NIM and NeMo Microservices is performance-optimized for the GPU instances it runs on. Furthermore, NVIDIA GPUDirect® technologies (RDMA and Storage) accelerate data flow between GPUs, between GPU and memory, and between GPU and storage.

The NVIDIA® Hardware Reference Architecture for NCPs is built with NVIDIA-Certified systems tailored for optimal AI performance. These systems leverage NVIDIA NVLink Fabric and an optimized PCIe topology, which must be preserved when virtualizing the system. Virtual machines therefore require a virtual PCIe topology that mirrors the host's optimized PCIe topology inside the guest.

Refer to supporting documentation for information on optimal PCIe topology and other performance optimizations such as vCPU pinning and memory placement to appropriate NUMA nodes when using virtual machines.
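As a minimal sketch of those optimizations, vCPU pinning and NUMA memory placement can be applied to a libvirt guest with virsh. The domain name (`tenant-vm`), CPU list, PCIe address, and node IDs below are illustrative assumptions, not values from this RA:

```shell
# Hypothetical domain "tenant-vm": pin vCPUs and memory to NUMA node 0,
# assumed to be the node hosting the VM's passthrough GPU and NIC.

# Check which NUMA node a passthrough device belongs to (illustrative address)
cat /sys/bus/pci/devices/0000:3b:00.0/numa_node

# Pin vCPUs 0-3 to physical cores 4-7 on that node
for vcpu in 0 1 2 3; do
    virsh vcpupin tenant-vm "$vcpu" "$((vcpu + 4))" --config
done

# Restrict guest memory allocation strictly to NUMA node 0
virsh numatune tenant-vm --mode strict --nodeset 0 --config
```

Pinning with `--config` persists across guest restarts; verify the resulting placement with `virsh vcpuinfo tenant-vm`.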

Virtual Machine Networking#

Networking is one of the critical performance vectors in AI. When running containers or VMs, it is imperative that operators connect a fully provisioned SR-IOV virtual NIC to either the container or the VM. Depending on your SDN system, there are a few well-known ways to do this. One standard workflow for assigning an SR-IOV NIC to a VM is:

  1. VMaaS makes an intent-based request to the SDN to add VM X to VPC Y.

  2. SDN does the mapping (assigns VPCs, configures resources, and so on).

  3. SDN triggers a PCIe hotplug event, causing the newly defined SR-IOV VF to be exposed to the host Linux kernel.

  4. SDN controller binds the VF to vfio-pci for QEMU passthrough.

Then, when the VM is spun up, the passthrough networking device is already present, and the VM communicates directly with the NIC hardware with no host software in the data path. This is done for any high-performance networking path. For other pod services that are less performance critical, it may be acceptable to use a more standard CNI-connected path (or other SR-IOV NIC resources, depending on what the operator wants to do).
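Steps 3 and 4 of the workflow above can be sketched manually with sysfs and virsh. The PF name, PCIe address, domain name, and hostdev XML file are illustrative assumptions:

```shell
# 1. Create VFs on the physical function (illustrative PF name)
echo 4 > /sys/class/net/enp59s0f0/device/sriov_numvfs

# 2. Locate the PCIe address of the VF to hand to the VM
lspci -nn | grep -i "Virtual Function"

# 3. Rebind the chosen VF from its host driver to vfio-pci
VF=0000:3b:00.2
echo "${VF}" > "/sys/bus/pci/devices/${VF}/driver/unbind"
echo vfio-pci > "/sys/bus/pci/devices/${VF}/driver_override"
echo "${VF}" > /sys/bus/pci/drivers_probe

# 4. Attach the VF to the guest as a hostdev (hypothetical device XML)
virsh attach-device tenant-vm vf-hostdev.xml --config
```

In production the SDN controller performs these steps; doing them by hand is mainly useful for bring-up and debugging.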

Additionally, two performance functions should be considered. First, think through how to configure the networking services to avoid extra encapsulation of the payload; it should be possible to enable a single tunnel for the tenant overlay network. Second, the VMaaS orchestrator should be topology-aware, so that it can place jobs to optimize connectivity for operations such as collectives.

Virtual Machine Traffic Acceleration#

To meet the performance demands of multi-node AI workloads in virtualized environments, virtual machines require hardware-accelerated networking to ensure optimal performance and isolation.

The NCP reference architecture defines three networking types:

  • Compute Network: High-bandwidth, low-latency GPU-GPU or GPU-CPU connectivity using ConnectX and BlueField SuperNICs (InfiniBand/Ethernet).

  • Converged Network: High-performance storage and in-band management using BlueField DPUs (Ethernet).

  • Out-of-band (OOB) Management Network: Low-speed management connectivity (1 Gbps ports).

For high-performance network connectivity with virtual machines, the hypervisor must allocate hardware-accelerated networking functions to tenant virtual machines. These functions ensure virtual machines can access both compute and converged networks efficiently.

Two key technologies enable virtual machine traffic acceleration and isolation:

  • Single Root I/O Virtualization (SR-IOV): This technology allows network packets to bypass the hypervisor CPU/kernel and be directly forwarded through the NIC hardware. This reduces latency and offloads processing overhead from the CPU.

  • Accelerated Switching and Packet Processing (ASAP²): Building on SR-IOV, this offloads advanced capabilities such as Software-Defined Networking (SDN) and Virtual Private Cloud (VPC) processing into the NIC hardware, ensuring greater flexibility and scalability for tenant workloads.

Once these accelerated network functions are assigned to the tenant virtual machines, NVIDIA® Network Operator can provision them as NVIDIA networking resources within the user’s K8s clusters.
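As one hedged example, the SR-IOV network operator deployed alongside the NVIDIA Network Operator can advertise VFs as a named Kubernetes resource through a SriovNetworkNodePolicy. The policy name, namespace, node selector, PF name, and VF count below are illustrative assumptions:

```shell
# Advertise RDMA-capable VFs as a Kubernetes resource (illustrative values)
kubectl apply -f - <<'EOF'
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: compute-rdma-policy
  namespace: nvidia-network-operator
spec:
  resourceName: compute_rdma_vfs
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  nicSelector:
    vendor: "15b3"
    pfNames: ["enp59s0f0"]
  isRdma: true
  deviceType: netdevice
EOF

# Tenant pods then request the advertised resource, for example:
#   resources:
#     limits:
#       nvidia.com/compute_rdma_vfs: 1
```

The resource prefix and name shown in the pod spec comment depend on how the operator is configured in a given deployment.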

Virtualizing a GPU#

Like the networking function, it is critical to give the VM or container direct access to the GPU hardware for optimized AI performance. There are four ways to expose a GPU:

Types of GPU Virtualization#

| Use Case | Method | Comments |
|---|---|---|
| Exclusive GPU for a VM | VFIO/QEMU | Standard PCIe passthrough mechanism |
| Exclusive GPU for a container | NVIDIA Container Toolkit | Bind-mounts the device nodes and injects driver libraries |
| Multi-Instance GPU (MIG) for a VM/container | MIG manager + VFIO/QEMU | Hardware partitioning by the GPU; still less secure than an exclusive GPU |
| Time-sliced "fractional" GPU | NVIDIA vGPU Manager, passthrough of the vGPU slice | Least secure model |

The method used is normally defined by the VMaaS or BMaaS layer. In most cases, the primary method will be to allocate a full GPU to a VM, and that VM will decide how to expose it to any containers within.
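The first and third rows of the table above can be sketched as follows; the GPU bus address and MIG profile ID are illustrative assumptions:

```shell
# --- Exclusive GPU for a VM (VFIO/QEMU) ---
GPU=0000:17:00.0
# Detach the GPU from its host driver and hand it to vfio-pci for passthrough
echo "${GPU}" > "/sys/bus/pci/devices/${GPU}/driver/unbind"
echo vfio-pci > "/sys/bus/pci/devices/${GPU}/driver_override"
echo "${GPU}" > /sys/bus/pci/drivers_probe

# --- Multi-Instance GPU (MIG manager + VFIO/QEMU) ---
# Enable MIG mode on GPU 0, then create a GPU instance with its compute instance
nvidia-smi -i 0 -mig 1
nvidia-smi mig -i 0 -cgi 9 -C   # profile ID 9 is illustrative (e.g., 3g.40gb on A100)
nvidia-smi mig -lgi             # list the resulting GPU instances
```

In Kubernetes clusters the MIG manager and GPU Operator automate the MIG steps; the manual commands are shown only to make the partitioning visible.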

Refer to the virtualization and vGPU-for-compute sections for the NVIDIA software components used for virtualization.

Enabling Multitenancy and Isolation#

In multitenant environments, hypervisor and virtual machine orchestration solutions must provide strict isolation and performance guarantees, even when workloads from multiple tenants coexist on the same physical host. The virtual machine network can achieve the performance and isolation needed for multitenant AI deployments by utilizing hardware acceleration on NVIDIA BlueField DPUs and ConnectX SuperNICs, which incorporate technologies such as SR-IOV and ASAP². These technologies enable direct, hardware-enforced partitioning of network resources, minimizing CPU overhead and reducing latency. By offloading critical networking and security functions to the hardware, they ensure strict tenant isolation while maintaining high throughput and predictable performance across workloads.

Storage Connectivity#

Storage should be external to the solution and provided to the tenant as a multitenant-capable service. This storage service pattern applies to both control plane storage (for example, centralized storage for virtual machines) and tenant storage (for example, block storage as a service) that can be connected to pods for workload launch and runtime data access. Refer to the High-Performance Storage Reference Architecture for detailed requirements regarding external storage implementation.

Virtual Storage#

Storage is equally critical to AI workload performance. A machine, VM, or container may need access to high-performance storage. There are several practical options available:

  • Ephemeral Storage: A local ephemeral NVMe drive can provide good performance. Depending on the infrastructure, this may only be available for bare metal. There are many uses, but the main performance driver is local AI applications (inference or training) caching data and model images. To support this, the NVMe drive (or possibly a partition) should be exposed as a local volume such as /dev/nvme0n1.

  • High-Performance Parallel Storage: Vendors such as VAST and WEKA provide high-performance file system and object storage, while others such as DDN focus specifically on file systems. Different choices make sense for different use cases. When exposing high-performance storage as a file system, the storage path must be exposed to the user. In the VM case, the VMaaS layer may simply provide the SR-IOV storage NIC to the VM, and the VM is responsible for installing the storage client (for example, the NFS client for VAST/WEKA/DDN) and mounting the drive.

  • GPU-Cluster-Local Storage: Storage local to the GPU cluster should be considered as a solution for high-performance, low-latency inference workloads.
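A minimal sketch of the first two options above, assuming illustrative device paths, server addresses, and mount options:

```shell
# Ephemeral NVMe: format a local namespace and mount it as a model/data cache
mkfs.ext4 /dev/nvme0n1
mkdir -p /mnt/local-cache
mount /dev/nvme0n1 /mnt/local-cache

# Parallel storage inside the VM: install a client and mount over the
# SR-IOV storage NIC (NFS shown as one example protocol)
apt-get install -y nfs-common
mkdir -p /mnt/datasets
mount -t nfs -o vers=3,nconnect=16 192.0.2.10:/export/datasets /mnt/datasets
```

Vendor-specific clients (WEKA, DDN EXAScaler, and so on) replace the NFS mount in deployments that use their native protocols.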

A low-performance file system or boot drive path can be exposed using standard K8s CSI mechanisms.
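For example, such a path can be requested with a standard PersistentVolumeClaim against a CSI-backed StorageClass; the claim name, class name, and size below are illustrative assumptions:

```shell
# Request a boot/scratch volume through a CSI driver (illustrative values)
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-volume
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard-csi
  resources:
    requests:
      storage: 50Gi
EOF
```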