Performance Requirements#

This section describes the implementation details required to meet performance requirements for AI training and inference workloads while offering concurrent, multi-tenant Kubernetes as a Service.

We have developed this reference architecture to run multi-GPU and multi-node AI inference and training workloads on either dedicated physical worker hosts or shared worker hosts to maximize hardware utilization. NVIDIA software such as NIM and NeMo Microservices is performance-optimized for the GPU instances it runs on. Furthermore, NVIDIA provides technologies such as GPUDirect (RDMA and Storage) to accelerate data flow between GPUs, between GPU and memory, and between GPU and storage.

The NVIDIA Hardware Reference Architecture for NCPs is built with NVIDIA-Certified Systems that are tailored for optimal AI performance. These systems leverage NVIDIA’s NVLink Fabric and an optimized PCIe topology, both of which must be preserved when the system is virtualized: each virtual machine needs a virtual PCIe topology that mirrors the optimized physical topology.

See Supporting Documentation for information on optimal PCIe topology and other performance optimizations, such as pinning vCPUs and memory to the appropriate NUMA nodes within virtual machines.
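
As a quick way to sanity-check this from inside a guest, the minimal Python sketch below (an illustration added here, not part of the reference architecture) walks the standard Linux PCI sysfs entries and prints the NUMA node and local CPU list reported for each NVIDIA device, which should match the intended vCPU and memory pinning:

    # Minimal sketch: list NVIDIA PCI devices with the NUMA node and local CPUs
    # they report, to verify that the virtual PCIe/NUMA topology exposed to the
    # guest matches the intended placement. Uses only standard Linux sysfs files.
    from pathlib import Path

    PCI_DEVICES = Path("/sys/bus/pci/devices")
    NVIDIA_VENDOR_ID = "0x10de"  # NVIDIA's PCI vendor ID

    for dev in sorted(PCI_DEVICES.iterdir()):
        if (dev / "vendor").read_text().strip() != NVIDIA_VENDOR_ID:
            continue
        numa_node = (dev / "numa_node").read_text().strip()   # -1 means no affinity exposed
        local_cpus = (dev / "local_cpulist").read_text().strip()
        print(f"{dev.name}: numa_node={numa_node} local_cpus={local_cpus}")

A numa_node value of -1 means no NUMA affinity is exposed to the guest, which is worth flagging when validating the virtual topology.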

Virtual Machine Traffic Acceleration#

To meet the performance demands of multi-node AI workloads in virtualized environments, virtual machines require hardware-accelerated networking to ensure optimal performance and isolation.

The NCP reference architecture defines three networking types:

  • Compute Network: High-bandwidth, low-latency GPU-GPU or GPU-CPU connectivity using ConnectX and BlueField SuperNICs (InfiniBand/Ethernet).

  • Converged Network: High-performance storage and in-band management using BlueField DPUs (Ethernet).

  • Out-of-Band (OOB) Management Network: Low-speed management connectivity (1Gbps ports).

For high-performance network connectivity with virtual machines, the hypervisor must allocate hardware-accelerated networking functions to tenant virtual machines. These functions ensure the virtual machines can access both the compute and converged networks efficiently.

Technologies for Virtual Machine Network Acceleration#

Two key technologies enable virtual machine traffic acceleration and isolation:

  • SR-IOV (Single Root I/O Virtualization): This technology allows network packets to bypass the hypervisor CPU/kernel and be directly forwarded through the NIC hardware. This reduces latency and offloads processing overhead from the CPU.

  • Accelerated Switching and Packet Processing (ASAP2): Building on SR-IOV, this adds advanced capabilities like Software-Defined Networking (SDN) and Virtual Private Cloud (VPC), ensuring greater flexibility and scalability for tenant workloads.

Once these accelerated network functions are assigned to the tenant virtual machines, NVIDIA Network Operator can provision them as NVIDIA networking resources within the user’s Kubernetes clusters.
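
As an illustration of how such a resource is consumed, the hedged sketch below uses the Kubernetes Python client to create a pod that attaches a secondary accelerated network and requests an SR-IOV virtual function by its extended resource name. The namespace, network attachment name (compute-net), resource name (nvidia.com/sriov_vf), and container image are placeholders that depend on how the Network Operator and device plugin are configured in a given cluster:

    # Hedged sketch: request an SR-IOV virtual function for a pod through the
    # Kubernetes API. The resource and network names below are illustrative
    # placeholders; the actual names come from the Network Operator configuration.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when run inside a pod

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name="sriov-test",
            annotations={
                # Secondary network defined by a NetworkAttachmentDefinition.
                "k8s.v1.cni.cncf.io/networks": "compute-net",
            },
        ),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="app",
                    image="nvcr.io/nvidia/pytorch:24.05-py3",  # example image
                    command=["sleep", "infinity"],
                    resources=client.V1ResourceRequirements(
                        requests={"nvidia.com/sriov_vf": "1"},
                        limits={"nvidia.com/sriov_vf": "1"},
                    ),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)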

Enabling Multi-Tenancy and Isolation#

In multi-tenant environments, hypervisor and virtual machine orchestration solutions must provide strict isolation and performance guarantees, even when workloads from multiple tenants coexist on the same physical host.

The virtual machine network can achieve the performance and isolation needed for multi-tenant AI deployments by utilizing hardware acceleration on NVIDIA BlueField DPUs and ConnectX SuperNICs, which incorporate technologies such as SR-IOV and ASAP2.

Asset Security for the Control Plane and Worker Hosts#

Figure 2 Shared responsibility model - NCP responsibility#

This architecture recommends isolating physical hosts from each other and from the underlying control plane by grouping hosts with similar functions and placing these groups in separate virtual private clouds (VPCs). Typically, this involves creating at least two VPCs or equivalent groupings: one for the CPU-only control plane hosts and one for the GPU worker hosts. Network fabric controls should be enforced to block unauthorized traffic between the groups and among hosts within each group.

Within a VPC, the hosts should be separated from each other with network-fabric-level stateful packet filters (e.g., “security groups”).

All hosts should run an operating system and/or hypervisor configuration aligned with NVIDIA’s performance optimization requirements and hardened to meet or exceed industry best practices for the chosen type 1 or type 2 hypervisor architecture.

Security on the Control Plane and Worker Nodes#

Figure 3 Shared responsibility model - tenant responsibility#

The security practices used to separate control plane and worker nodes, that is, the virtual machines (VMs) running on the host servers, must align with NVIDIA’s performance requirements. The hypervisor should be configured to schedule workloads in a way that prevents noisy neighbors, data leakage, or active data-manipulation attacks between VMs. Additionally, robust memory management and Input-Output Memory Management Unit (IOMMU) isolation controls must be leveraged to ensure users cannot interfere with each other, the host, or its hardware.
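
As a minimal illustration of checking this isolation boundary on a host, the sketch below enumerates the Linux IOMMU groups exposed under /sys/kernel/iommu_groups; devices passed through to different tenants should not share a group:

    # Minimal sketch: enumerate IOMMU groups on a Linux host. Devices passed
    # through to different virtual machines should not share an IOMMU group,
    # since the group is the smallest unit the IOMMU can isolate.
    from pathlib import Path

    IOMMU_GROUPS = Path("/sys/kernel/iommu_groups")

    if not IOMMU_GROUPS.exists():
        raise SystemExit("No IOMMU groups exposed; check that the IOMMU is enabled")

    for group in sorted(IOMMU_GROUPS.iterdir(), key=lambda p: int(p.name)):
        devices = sorted(d.name for d in (group / "devices").iterdir())
        print(f"group {group.name}: {', '.join(devices)}")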

Physical devices passed into a virtual machine’s control domain, such as GPUs, must be properly reset and sanitized according to the manufacturer’s guidelines for the specific device. Devices should only be reassigned to the host after successful sanitization has been verified.
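
The exact sanitization steps are device- and vendor-specific. As a hedged sketch only, the snippet below shows a reset-and-verify step for an NVIDIA GPU using nvidia-smi; the reset requires that no processes hold the device and is supported only on certain GPUs:

    # Hedged sketch: reset a GPU and confirm no compute processes remain before
    # the device is reassigned. This illustrates only a reset/verify step; the
    # full sanitization procedure must follow the manufacturer's guidelines.
    import subprocess

    def reset_and_verify(gpu_index: int) -> bool:
        # Attempt a GPU reset (typically requires root and an idle device).
        subprocess.run(["nvidia-smi", "--gpu-reset", "-i", str(gpu_index)], check=True)
        # Verify that no compute processes are left on the device.
        result = subprocess.run(
            ["nvidia-smi", "--query-compute-apps=pid", "-i", str(gpu_index),
             "--format=csv,noheader"],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip() == ""

    if __name__ == "__main__":
        print("GPU 0 verified idle:", reset_and_verify(0))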

Virtual machine images and their configurations should adhere to industry best practices for security hardening and align with NVIDIA’s recommendations for GPU driver setup and performance optimization.

Workload Orchestration#

This reference architecture uses Kubernetes as the workload orchestration engine and is built around a Kubernetes-based design.

Storage Connectivity#

Storage should be external to the solution and should be provided as a multi-tenant-capable Storage as a Service (PaaS) product. This Storage Service pattern should be used to provide storage for control plane assets (e.g., virtual machines stored on centralized storage) as well as tenant assets (e.g., block storage as a service), which can be connected to pods to launch workloads and make the associated data available at runtime.
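
To illustrate how tenant block storage is typically surfaced to workloads, the hedged sketch below creates a PersistentVolumeClaim against an assumed CSI-backed storage class (block-saas, a placeholder name) using the Kubernetes Python client; pods then mount the claim to access the data at runtime:

    # Hedged sketch: request tenant block storage through a PersistentVolumeClaim.
    # "block-saas" and the namespace are placeholders; the real storage class is
    # whatever the external Storage-as-a-Service platform's CSI driver exposes.
    from kubernetes import client, config

    config.load_kube_config()

    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name="training-data"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteOnce"],
            storage_class_name="block-saas",
            resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="tenant-a", body=pvc
    )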

Refer to High Performance Storage for detailed requirements for the external storage implementation.

NVIDIA AI Enterprise#

NVIDIA AI Enterprise is a cloud-native software platform that streamlines development and deployment of production-grade AI solutions, including AI agents, computer vision, speech AI, and more. Easy-to-use microservices optimize model performance with enterprise-grade security, support, and stability, ensuring a smooth transition from prototype to production for enterprises that run their businesses on AI. Here are some of the key components of NVIDIA AI Enterprise that we have leveraged for this reference architecture.

  • NVIDIA GPU Operator automates the lifecycle management of the software required to use GPUs with Kubernetes. It takes care of the complexity that arises from managing the lifecycle of special resources like GPUs. It also handles all the configuration steps required to provision NVIDIA GPUs, making them as easy to scale as other resources. Advanced features of GPU Operator allow for better performance, higher utilization, and access to GPU telemetry. Certified and validated for compatibility with industry leading Kubernetes solutions, GPU Operator allows organizations to focus on building applications, rather than managing Kubernetes infrastructure. A minimal example of requesting a GPU through the resource advertised by GPU Operator is sketched after this list.

  • The NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster. The operator automatically installs the required host networking software, bringing together all the components needed to provide high-speed network connectivity with NVIDIA BlueField and ConnectX adapters. These components include the NVIDIA networking driver, Kubernetes device plugin, CNI plugins, IP address management (IPAM) plugin, and others. The NVIDIA Network Operator works in conjunction with the NVIDIA GPU Operator to deliver high-throughput, low-latency networking for scale-out GPU computing clusters. A Helm chart easily deploys the Network Operator in a cluster to provision the host software on NVIDIA-enabled nodes.

  • GPUDirect® RDMA (GDR) technology is an NVIDIA BlueField and ConnectX feature that unlocks high-throughput, low-latency network connectivity to feed GPUs with data. GPUDirect RDMA allows efficient, zero-copy data transfers between GPUs using the hardware engines in the BlueField and ConnectX ASICs.

  • GPUDirect Storage (GDS) provides a direct data path between local or remote storage (such as NVMe or NVMe-oF) and GPU memory. NVIDIA ConnectX and BlueField adapters enable this direct communication within a distributed environment, when the GPU and storage media are not hosted in the same enclosure. GDS provides increased bandwidth, lower latency, and increased capacity between storage and GPUs. This is especially important as dataset sizes no longer fit into system memory and data I/O to the GPUs becomes the runtime bottleneck. Enabling a direct path alleviates this bottleneck for scale-out AI and data science workloads. An application-level GDS sketch appears after this list.

  • NVIDIA NIM™, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across workstations, data centers, and the cloud. Supporting a wide range of AI models, including open-source community and NVIDIA AI Foundation models, NVIDIA NIM ensures seamless, scalable AI inferencing, on premises or in the cloud, leveraging industry-standard APIs. An example call against a NIM endpoint is sketched after this list.

  • NVIDIA NeMo is a set of microservices that help enterprise AI developers easily curate data at scale, customize LLMs with popular fine-tuning techniques, evaluate models on standard and custom benchmarks, and guardrail them for appropriate and grounded outputs.

    • NeMo Curator: A powerful microservice for enterprise developers to efficiently curate high-quality datasets for training LLMs, thereby enhancing model performance and accelerating the deployment of AI solutions.

    • NeMo Customizer: A high-performance, scalable microservice that simplifies the fine-tuning and alignment of LLMs with popular parameter-efficient fine-tuning and alignment techniques, including LoRA and DPO.

    • NeMo Evaluator: An enterprise-grade microservice that provides industry-standard benchmarking of generative AI models, synthetic data generation, and end-to-end RAG pipelines.

    • NeMo Guardrails: A microservice for developers to implement robust safety and security measures in LLM-based applications, ensuring that these applications remain reliable and aligned with organizational policies and guidelines.
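
As referenced above, once the GPU Operator has prepared the worker nodes, workloads consume GPUs through the nvidia.com/gpu extended resource. The minimal sketch below (the container image is an illustrative placeholder) schedules a single-GPU smoke-test pod with the Kubernetes Python client:

    # Minimal sketch: schedule a pod onto a GPU node by requesting the
    # nvidia.com/gpu resource advertised by the GPU Operator's device plugin.
    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            containers=[
                client.V1Container(
                    name="cuda",
                    image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",  # example image
                    command=["nvidia-smi"],
                    resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
                )
            ],
        ),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)

For GPUDirect Storage, the application-side path is the cuFile API. As a hedged sketch, the snippet below uses the RAPIDS KvikIO Python bindings (the file path is a placeholder) to read data directly into GPU memory; whether the transfer actually uses the GPUDirect path or a compatibility fallback depends on the driver, filesystem, and cuFile configuration:

    # Hedged sketch: read a file straight into GPU memory via KvikIO's cuFile
    # (GPUDirect Storage) bindings. Falls back to a bounce buffer when no
    # direct path is available.
    import cupy
    import kvikio

    gpu_buffer = cupy.empty(1 << 20, dtype=cupy.uint8)  # 1 MiB destination in GPU memory

    f = kvikio.CuFile("/data/shard-000.bin", "r")  # example path
    nbytes = f.read(gpu_buffer)                    # blocking read into GPU memory
    f.close()
    print(f"read {nbytes} bytes into GPU memory")

Finally, because NIM microservices expose industry-standard, OpenAI-compatible APIs, a deployed NIM can be called with a stock client. In the hedged sketch below, the base URL, API key handling, and model identifier are deployment-specific placeholders:

    # Hedged sketch: call a deployed NIM through its OpenAI-compatible endpoint.
    # base_url, api_key, and model below are placeholders for an actual deployment.
    from openai import OpenAI

    nim = OpenAI(
        base_url="http://nim.example.internal:8000/v1",  # example NIM endpoint
        api_key="not-used",  # local deployments often do not require a key
    )

    response = nim.chat.completions.create(
        model="meta/llama-3.1-8b-instruct",  # example model identifier
        messages=[{"role": "user", "content": "Summarize GPUDirect RDMA in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)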