
# Performance Requirements

This section describes the implementation details required to meet
performance requirements for AI training and inference workloads while
hosting multiple virtual machines on the same worker host.

To meet the performance demands of multi-node AI workloads, the workload
must have native access to networking, GPUs, and storage, regardless of
whether it runs on bare metal, as K8s on Linux, or within a VM.
Therefore, implement the hardware passthrough techniques discussed here
to achieve optimal performance.

This RA also outlines how to run multi-GPU and multi-node AI inference
and training workloads on either dedicated physical worker hosts or
shared worker hosts to maximize hardware utilization. NVIDIA software
such as NIM and NeMo Microservices is performance-optimized for the GPU
instances it runs on. Furthermore, NVIDIA [GPUDirect®](https://developer.nvidia.com/gpudirect) (RDMA and Storage)
accelerates data flow between GPUs, between GPU and memory, and between
GPU and storage.

The NVIDIA® Hardware Reference Architecture for NCPs is built with
NVIDIA-Certified systems tailored for optimal AI performance. These
systems leverage NVIDIA NVLink Fabric and an optimized PCIe topology,
which must be preserved when virtualizing the system: virtual machines
require a virtual PCIe topology that mirrors the optimized physical
topology.

Refer to supporting [documentation](https://docs.nvidia.com/ai-enterprise/planning-resource/optimizing-vm-configuration-ai-inference/latest/introduction.html) for information on optimal PCIe
topology and other performance optimizations such as vCPU pinning and
memory placement to appropriate NUMA nodes when using virtual machines.
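As a sketch of the NUMA-aware placement mentioned above, the snippet below picks host CPUs on the same NUMA node as a given GPU so that vCPUs can be pinned for locality. The topology dictionaries are illustrative assumptions; on a real host they would be derived from sysfs, `lscpu`, and `nvidia-smi topo`.

```python
# Illustrative sketch: choose host CPUs local to a GPU's NUMA node so a
# hypervisor can pin vCPUs there. Topology data below is hypothetical.

def cpus_for_gpu(gpu_pci_addr, gpu_numa_node, numa_cpus, vcpu_count):
    """Return `vcpu_count` host CPUs on the same NUMA node as the GPU."""
    node = gpu_numa_node[gpu_pci_addr]
    local = numa_cpus[node]
    if len(local) < vcpu_count:
        raise ValueError(f"not enough local CPUs on NUMA node {node}")
    return local[:vcpu_count]

# Hypothetical two-socket host: node 0 has CPUs 0-15, node 1 has 16-31.
NUMA_CPUS = {0: list(range(0, 16)), 1: list(range(16, 32))}
GPU_NODE = {"0000:17:00.0": 0, "0000:b3:00.0": 1}

print(cpus_for_gpu("0000:b3:00.0", GPU_NODE, NUMA_CPUS, 8))
```

A real configuration would then express this pinning in the hypervisor (for example, libvirt `<cputune>` and `<numatune>` elements).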

## Virtual Machine Networking

Networking is one of the critical performance vectors in AI. When
running containers or VMs, operators must connect a fully provisioned
SR-IOV virtual NIC to the container or VM. Depending on your SDN system,
there are a few well-known ways to do this. One standard way to assign
an SR-IOV NIC to a VM:

1. VMaaS makes an intent-based request to the SDN to add VM X to VPC Y.
2. SDN does the mapping (assigns VPCs, configures resources, and so on).
3. SDN triggers a PCIe hotplug event, causing the newly defined SR-IOV
   VF to be exposed to the host Linux kernel.
4. SDN controller binds the VF to VFIO for QEMU passthrough.

Then, when the VM is spun up, the passthrough networking device is
already present, and the VM can communicate directly with the NIC
hardware without any host software in the middle. This approach is
used for any high-performance networking path. For other pod
services that are less performance critical, it may be acceptable to
use a more standard CNI-connected path (or other SR-IOV NIC
resources, depending on what the operator wants to do).
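The host-side mechanics behind steps 3 and 4 can be sketched as the sysfs writes an SDN agent would issue: create the VF, then rebind it from its host driver to `vfio-pci` for QEMU passthrough. The paths follow the standard Linux PCI sysfs layout; the PCI addresses are illustrative.

```python
# Sketch of the host-side actions behind steps 3-4 above: expose an
# SR-IOV VF and rebind it to vfio-pci. Returns the shell commands an
# SDN agent might run; addresses are hypothetical examples.

def vf_passthrough_cmds(pf_addr, vf_addr, num_vfs=1):
    sysfs = "/sys/bus/pci/devices"
    return [
        # Step 3: create VFs on the physical function (PCIe hotplug).
        f"echo {num_vfs} > {sysfs}/{pf_addr}/sriov_numvfs",
        # Step 4: detach the VF from its host driver, bind to vfio-pci.
        f"echo {vf_addr} > {sysfs}/{vf_addr}/driver/unbind",
        f"echo vfio-pci > {sysfs}/{vf_addr}/driver_override",
        f"echo {vf_addr} > /sys/bus/pci/drivers_probe",
    ]

for cmd in vf_passthrough_cmds("0000:3b:00.0", "0000:3b:00.2"):
    print(cmd)
```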

Additionally, there are two performance considerations. First, think
through how to configure the networking services to avoid extra
encapsulation of the payload; it should be possible to enable a single
tunnel for the tenant overlay network. Second, the VMaaS orchestrator
should be topology-aware, so that it can place jobs in a way that
optimizes connectivity for operations such as collectives.
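Topology-aware placement can be sketched as preferring to land all nodes of a job under one leaf switch, keeping collective traffic a single hop away. The host-to-leaf map below is a hypothetical example; a real VMaaS would learn the topology from the SDN.

```python
# Hedged sketch of topology-aware job placement: keep all nodes of a
# job under one leaf switch when possible. Topology is hypothetical.
from collections import defaultdict

def place_job(host_leaf, free_hosts, nodes_needed):
    """Return hosts for the job, all under one leaf, or None if no fit."""
    by_leaf = defaultdict(list)
    for h in free_hosts:
        by_leaf[host_leaf[h]].append(h)
    # Prefer the leaf with the most free hosts that can fit the job.
    for leaf, hosts in sorted(by_leaf.items(), key=lambda kv: -len(kv[1])):
        if len(hosts) >= nodes_needed:
            return hosts[:nodes_needed]
    return None  # no single leaf fits; caller falls back to multi-leaf

HOST_LEAF = {"h1": "leaf-a", "h2": "leaf-a", "h3": "leaf-b", "h4": "leaf-a"}
print(place_job(HOST_LEAF, ["h1", "h2", "h3", "h4"], 3))
```

A production scheduler would also weigh GPU availability and multi-leaf fallback cost, but the locality preference is the core idea.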

## Virtual Machine Traffic Acceleration

To meet the performance demands of multi-node AI workloads in
virtualized environments, virtual machines require hardware-accelerated
networking to ensure optimal performance and isolation.

The NCP reference architecture defines three networking types:

* **Compute Network**: High-bandwidth, low-latency GPU-GPU or GPU-CPU
  connectivity using ConnectX and BlueField SuperNICs
  (InfiniBand/Ethernet).
* **Converged Network**: High-performance storage and in-band management
  using BlueField DPUs (Ethernet).
* **Out-of-band (OOB) Management Network**: Low-speed management
  connectivity (1 Gbps ports).

For high-performance network connectivity with virtual machines, the
hypervisor must allocate hardware-accelerated networking functions to
tenant virtual machines. These functions ensure virtual machines can
access both compute and converged networks efficiently.

Two key technologies enable virtual machine traffic acceleration and
isolation:

* **Single Root I/O Virtualization (SR-IOV)**: This technology allows
  network packets to bypass the hypervisor CPU/kernel and be directly
  forwarded through the NIC hardware. This reduces latency and offloads
  processing overhead from the CPU.
* **Accelerated Switching and Packet Processing (ASAP2)**: Building on
  SR-IOV, this adds advanced capabilities such as Software-Defined
  Networking (SDN) and Virtual Private Cloud (VPC) support, ensuring
  greater flexibility and scalability for tenant workloads.

Once these accelerated network functions are assigned to the tenant
virtual machines, NVIDIA® Network Operator can provision them as NVIDIA
networking resources within the user's K8s clusters.
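A tenant pod can then request such a resource like any other extended K8s resource. The sketch below builds a minimal pod manifest as a Python dict; the resource name `nvidia.com/sriov_vf`, the network name `sriov-net`, and the image are hypothetical placeholders, since the real names come from the operator's `SriovNetworkNodePolicy` and `NetworkAttachmentDefinition` objects.

```python
# Minimal sketch of a tenant pod requesting an SR-IOV VF resource.
# Resource/network names and the image are hypothetical placeholders.
import json

def sriov_pod_spec(name, image,
                   resource="nvidia.com/sriov_vf", network="sriov-net"):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": name,
            # Attach the secondary SR-IOV network via a Multus annotation.
            "annotations": {"k8s.v1.cni.cncf.io/networks": network},
        },
        "spec": {
            "containers": [{
                "name": name,
                "image": image,
                "resources": {
                    "requests": {resource: "1"},
                    "limits": {resource: "1"},
                },
            }],
        },
    }

print(json.dumps(sriov_pod_spec("trainer", "example.com/ai-workload:latest"),
                 indent=2))
```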

## Virtualizing a GPU

As with the networking function, it is critical to give the VM or
container direct access to the GPU hardware for optimized AI
performance. There are four ways to expose a GPU:

**Types of GPU Virtualization**

| Use Case                                  | Method                                             | Comments                                                       |
| ----------------------------------------- | -------------------------------------------------- | -------------------------------------------------------------- |
| Exclusive GPU for a VM                    | VFIO/QEMU                                          | Standard SR-IOV mechanism                                      |
| Exclusive GPU for a container             | NVIDIA Container Toolkit                           | Bind mounts the device nodes and injects libraries             |
| Multi-instance GPU (MIG) for VM/container | MIG manager + VFIO/QEMU                            | Hardware-partitioned GPU; still less secure than exclusive.    |
| Time-sliced "fractional" GPU              | NVIDIA vGPU Manager, passthrough of the vGPU slice | Least secure model                                             |

This is normally defined by the VMaaS or the BMaaS. For most cases, the
primary method will be to allocate a full GPU to a VM, and that VM will
decide how to expose it to any containers within.
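The selection logic implied by the table can be sketched as a simple dispatch over the target (VM or container) and the sharing requirement. This is purely illustrative; the actual decision is made by the VMaaS/BMaaS policy.

```python
# Illustrative dispatch over the GPU-exposure options in the table above.

def gpu_exposure_method(target, sharing):
    """target: 'vm' or 'container'; sharing: 'exclusive'|'mig'|'timeslice'."""
    if sharing == "exclusive":
        return "VFIO/QEMU" if target == "vm" else "NVIDIA Container Toolkit"
    if sharing == "mig":
        return "MIG manager + VFIO/QEMU"
    if sharing == "timeslice":
        return "NVIDIA vGPU Manager (vGPU slice passthrough)"
    raise ValueError(f"unknown sharing mode: {sharing}")

print(gpu_exposure_method("vm", "exclusive"))  # VFIO/QEMU
```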

Refer to the [virtualization](/dsx/part-2-software-components/nvidia-software-for-infrastructure-as-a-service#virtualization) section, and
[vGPU for compute section](/dsx/part-2-software-components/nvidia-software-for-infrastructure-as-a-service#nvidia-vgpu-for-compute) for NVIDIA software components for
virtualization.

## Enabling Multitenancy and Isolation

In multitenant environments, hypervisor and virtual machine
orchestration solutions must provide strict isolation and performance
guarantees, even when workloads from multiple tenants coexist on the
same physical host. The virtual machine network can achieve the
performance and isolation needed for multitenant AI deployments by
utilizing hardware acceleration on NVIDIA BlueField DPUs and ConnectX
SuperNICs, which incorporate technologies such as SR-IOV and ASAP2. These
technologies enable direct, hardware-enforced partitioning of network
resources, minimizing CPU overhead and reducing latency. By offloading
critical networking and security functions to the hardware, they ensure
strict tenant isolation while maintaining high throughput and
predictable performance across workloads.

## Storage Connectivity

Storage should be external to the solution and provided to the tenant as
a multitenant-capable service. This storage service pattern applies to
both control plane storage (for example, centralized storage for virtual
machines) and tenant storage (for example, block storage as a service)
that can be connected to pods for workload launch and runtime data
access. Refer to the High-Performance Storage Reference Architecture for
detailed requirements regarding external storage implementation.

## Virtual Storage

Storage is equally critical to AI workload performance. A machine, VM,
or container may need access to high performance storage. There are
several practical options available:

* **Ephemeral Storage**: A local ephemeral drive can provide good
  performance. Depending on the infrastructure, this may only be
  available on bare metal. There are many uses, but the main performance
  driver is local AI applications (inference or training) caching data
  and model images. To support this, the NVMe drive (or possibly a
  partition) should be exposed as a local volume such as /dev/nvme0n1.
* **High-Performance Parallel Storage**: Vendors such as VAST and WEKA
  provide high-performance file system and object storage, while others
  like DDN focus specifically on file systems. Different choices make
  sense for different use cases. When exposing the high-performance
  storage as a file system, the storage path must be exposed to the
  user. In the VM case, the VMaaS layer may simply provide the SR-IOV
  storage NIC to the VM, and the VM is responsible for installing the
  storage client (for example, the NFS client for VAST/WEKA/DDN) and
  mounting the drive.
* **GPU Cluster-Local Storage**: Cluster-local storage should be
  considered for high-performance, low-latency inference workloads.
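The in-VM responsibility described above can be sketched as composing the mount command a guest would run once it has the SR-IOV storage NIC and a storage client installed. The server address, export path, and mount options below are hypothetical examples (`proto=rdma` assumes an RDMA-capable fabric).

```python
# Illustrative sketch: compose an NFS mount command a guest VM might
# run after installing its storage client. All values are examples.

def nfs_mount_cmd(server, export, mountpoint,
                  options=("vers=4.1", "proto=rdma")):
    return f"mount -t nfs -o {','.join(options)} {server}:{export} {mountpoint}"

print(nfs_mount_cmd("192.0.2.10", "/datasets", "/mnt/datasets"))
```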

A low performance file system or boot drive path can be exposed using
standard K8s CSI mechanisms.