NVIDIA Software for Infrastructure as a Service#
The following NVIDIA software supports the IaaS management domains described in Part 1: Network Management, Compute Management, and Storage Management.
Network Management#
NVIDIA Unified Fabric Manager (UFM)#
InfiniBand fabrics used for high-performance GPU-to-GPU communication require centralized management to monitor health, detect congestion, isolate tenants, and resolve issues. The NVIDIA UFM® platform revolutionizes data center networking management by combining enhanced, real-time network telemetry with AI-powered cyber intelligence and analytics to support scale-out, InfiniBand-connected data centers. UFM Telemetry provides network validation tools to monitor network performance and conditions. UFM Enterprise combines the benefits of UFM Telemetry with enhanced network monitoring and management. UFM Cyber-AI builds on UFM Telemetry and UFM Enterprise, adding preventive maintenance and cybersecurity.
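As an example of interacting with UFM programmatically, the following Python sketch polls the UFM Enterprise REST API for the fabric systems inventory. The host, credentials, and response fields are placeholders to adapt to your deployment, and the endpoint path should be verified against the UFM REST API reference for your release.

```python
import requests

# Placeholder UFM Enterprise host and credentials -- replace for your site.
UFM_HOST = "https://ufm.example.com"
AUTH = ("admin", "password")  # UFM also supports token-based authentication

# Query the fabric systems inventory (endpoint path per UFM Enterprise REST
# API conventions; verify against your UFM version's API reference).
resp = requests.get(
    f"{UFM_HOST}/ufmRest/resources/systems",
    auth=AUTH,
    verify=False,  # self-signed certs are common on appliances; prefer a CA bundle
)
resp.raise_for_status()

for system in resp.json():
    # Field names are assumptions; inspect the actual response for your release.
    print(system.get("system_name"), system.get("state"))
```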
NVIDIA User Experience (NVUE)#
Ethernet switches require consistent, programmable configuration across large leaf-spine fabrics. Manual CLI configuration does not scale and is error-prone. NVIDIA User Experience (NVUE) is a configuration utility included with Cumulus Linux that delivers a schema-driven model of a Cumulus Linux system, covering management of both hardware and software. NVUE offers multiple methods of interaction, including a command line, a REST API, and an object model, and runs on each Spectrum switch. SDN controllers can interact with NVUE's REST API to push configurations, and operators can use the CLI for direct access. Reference commands can be found in the documentation.
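For example, an SDN controller can drive NVUE's revision workflow over REST: stage a candidate revision, patch configuration into it, and apply it. The Python sketch below follows that pattern; the switch address, credentials, and interface values are placeholders, and the payload schema should be checked against the NVUE API reference for your Cumulus Linux release.

```python
from urllib.parse import quote

import requests

SWITCH = "https://leaf01.example.com:8765/nvue_v1"  # default NVUE REST port
session = requests.Session()
session.auth = ("cumulus", "password")  # placeholder credentials
session.verify = False  # lab setting; use proper certificates in production

# 1. Stage a new candidate revision; the response is keyed by the revision ID.
rev = session.post(f"{SWITCH}/revision").json()
rev_id = next(iter(rev))

# 2. Patch desired state into the candidate revision (illustrative values).
payload = {"interface": {"swp1": {"ip": {"address": {"192.0.2.1/24": {}}}}}}
session.patch(f"{SWITCH}/", params={"rev": rev_id}, json=payload)

# 3. Apply the revision so NVUE renders and commits the configuration.
session.patch(f"{SWITCH}/revision/{quote(rev_id, safe='')}", json={"state": "apply"})
```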
NVIDIA NetQ#
NVIDIA NetQ™ is a highly scalable, modern network operations toolset that provides visibility, troubleshooting, and validation of Cumulus fabrics in real time. NetQ utilizes telemetry and delivers actionable insights about the health of the data center network, integrating the fabric into the DevOps ecosystem.
| Feature | Description |
|---|---|
| Network management | Access powerful tools to manage your NVIDIA® Cumulus Linux™ and SONiC environments at the push of a button. |
| Advanced telemetry | Collect real-time data that enables deep troubleshooting, visibility, and automated workflows from a single GUI. |
| Snapshot and compare | Easily compare previous network configurations to configurations after network changes are made to eliminate risk of disruption. |
| Network-wide visibility | See real-time visualizations of the health of your network in the NetQ rich GUI. |
| Flow telemetry | Analyze fabric-wide latency and buffer occupancy data across all paths of a 4-tuple or 5-tuple flow to identify congestion points. |
| Preventive validation | Reduce manual errors before they're rolled into production. |
| Diagnostic troubleshooting | Diagnose the root cause of state deviations with advanced diagnostic tools. |
| RoCE support | Monitor remote direct memory access (RDMA) over Converged Ethernet (RoCE) environments with NetQ to gain actionable insights into high-performance networks. |
NVIDIA NMX#
GB200 and GB300 rack-scale systems use NVSwitch-based NVLink interconnects to provide high-bandwidth GPU-to-GPU communication within a rack. These NVLink domains require dedicated management, separate from traditional Ethernet and InfiniBand fabrics, for configuration, health monitoring, and coordination across multi-rack deployments. NVIDIA NMX is the management and analytics platform for NVSwitch-based interconnects. It consists of the following components, which work together to manage the NVLink scale-up network within GPU racks.
| Component | Name | Where it runs | Function |
|---|---|---|---|
| NMX-C | NMX Controller | On each NVLink switch tray | Per-rack controller that programs individual NVSwitches |
| NMX-T | NMX Telemetry | On each NVLink switch tray | Collects telemetry and metrics from NVLink switches |
| NMX-M | NMX Manager | Centralized (control plane) | Aggregates management across multiple NVLink switch racks |
| NMX Oasis | Oasis | Centralized | API gateways, ETL, and dashboards for telemetry data |
DOCA / HBN#
Multi-tenant cloud environments require network isolation at the compute edge. Traditional software-based overlays consume host CPU cycles and add latency. By offloading SDN functions to a DPU, NVIDIA Cloud Partners (NCPs) can enforce tenant isolation without impacting workload performance, while also providing a hardware-rooted security boundary between tenants and infrastructure.
DOCA is the SDK and runtime for NVIDIA® BlueField® DPUs. The DOCA-OFED driver provides optimized networking on the host. Host-Based Networking (HBN) offloads L2/L3 overlay networking (VXLAN, EVPN) to the DPU, enabling tenant VPC isolation at the compute edge.
HBN also enables the Shared Services Network. The SDN Controller programs the DPU with minimal route leaks, allowing tenants to reach shared services while remaining isolated from each other.
Compute Management#
NVIDIA Bare Metal Manager#
NVIDIA Bare Metal Manager (BMM) provides bare metal infrastructure lifecycle management. Operating as a site-local component on Kubernetes, BMM automates the complete lifecycle from hardware discovery to tenant-ready bare metal, a capability essential for operators and NCPs supporting multi-tenant AI cloud platforms on rack-scale GB200/GB300 systems. Traditional cluster management tools require one deployment per tenant, creating linear scaling challenges. BMM addresses this with:
DPU-based isolation — Hardware-enforced tenant separation via BlueField DPUs
Shared infrastructure — Single NVIDIA Bare Metal Manager instance manages all tenants
API-first architecture — Enables ISV cloud portals with admin + tenant views
NVIDIA Bare Metal Manager is designed as an API-first platform, exposing a gRPC API that NCPs and ISVs integrate into their Cloud Control Plane and orchestration systems (see the client sketch after this list). This enables:
Programmatic control of all bare metal lifecycle operations
Integration with existing NCP provisioning workflows
JWT token authentication with IdP integration (for example, Keycloak) for RBAC
Flexibility for NCPs to build differentiated services on top of NVIDIA Bare Metal Manager primitives
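To show the shape of such an integration, the following Python sketch is a hypothetical gRPC client that attaches an IdP-issued JWT as a bearer token on each call. The service, method, and message names (`MachineLifecycleStub`, `AllocateMachineRequest`, `AllocateMachine`) are invented for illustration and are not the actual BMM API; consult the BMM API reference for the real proto definitions.

```python
import grpc

# Hypothetical generated stubs -- the real BMM proto definitions differ.
# from bmm_pb2 import AllocateMachineRequest
# from bmm_pb2_grpc import MachineLifecycleStub


def allocate_machine(endpoint: str, jwt_token: str, machine_id: str, tenant_id: str):
    """Call a hypothetical BMM lifecycle RPC with JWT bearer authentication."""
    # Attach the IdP-issued JWT (for example, from Keycloak) to every call;
    # BMM's RBAC then scopes the operation to the caller's admin or tenant role.
    call_creds = grpc.access_token_call_credentials(jwt_token)
    channel_creds = grpc.composite_channel_credentials(
        grpc.ssl_channel_credentials(), call_creds
    )
    with grpc.secure_channel(endpoint, channel_creds) as channel:
        stub = MachineLifecycleStub(channel)      # hypothetical service
        request = AllocateMachineRequest(         # hypothetical message
            machine_id=machine_id, tenant_id=tenant_id
        )
        return stub.AllocateMachine(request)
```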
| Feature | Description |
|---|---|
| Network isolation | |
| Machine lifecycle management | |
| DPU lifecycle management | |
| Security | |
At the completion of the BMM workflow, the NCP has a validated, firmware-current, network-isolated bare metal host, ready for direct allocation to a tenant, installation of a hypervisor for virtualized workloads, or handoff to a workload management platform such as Base Command Manager. This completes the bare metal provisioning stage and provides the foundation for all compute services above.
While BMM is not required for hardware management, it is NVIDIA-provided software that simplifies multi-tenant hardware management on rack-scale systems.
Virtualization#
For virtual machine-based deployment, the NCP/ISV deploys a virtualization layer on a cloud-native provisioning layer like NVIDIA Bare Metal Manager, or on native bare metal. The primary enabler of virtualization is a hypervisor. NVIDIA does not provide hypervisor software; however, NCPs and ISVs can select offerings from the ecosystem, such as the KVM hypervisor, based on operational requirements.
Request the NVIDIA Grace™ I/O Virtualization Guide (PID 1144496) from your NVIDIA Sales Representative.
NVIDIA vGPU for Compute#
Multi-tenant AI infrastructure often requires sharing GPU resources across virtual machines while maintaining isolation and performance. NCPs need to virtualize GPUs to maximize hardware utilization, support diverse tenant workloads, and provide flexible GPU allocation without dedicating physical GPUs to each VM. NVIDIA vGPU software enables GPU virtualization on leading hypervisor platforms.
Virtual GPU Manager – Runs on the hypervisor host, manages GPU resource allocation
Guest Drivers – Installed in each VM to enable GPU access
vGPU allows multiple VMs to share a single physical GPU, with configurable profiles that define the GPU memory and compute allocated to each VM. This enables NCPs to offer fractional GPU instances to tenants. For more information, see NVIDIA AI Enterprise and NVIDIA vGPU for Compute.
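On KVM-based hypervisors, the Virtual GPU Manager exposes vGPU profiles through the standard Linux mediated-device (mdev) sysfs interface. The following Python sketch enumerates the profile types available on one physical GPU; the PCI address is a placeholder, and the exact sysfs layout (for example, SR-IOV-based vGPU uses per-VF paths) should be verified against the vGPU documentation for your driver release.

```python
from pathlib import Path

# Placeholder PCI address of a vGPU-capable physical GPU.
GPU_PCI_ADDR = "0000:3b:00.0"

mdev_types = Path(f"/sys/bus/pci/devices/{GPU_PCI_ADDR}/mdev_supported_types")

# Each subdirectory is one vGPU profile the Virtual GPU Manager exposes;
# 'name' is the profile type and 'available_instances' is how many more
# VMs can currently be backed by it on this GPU.
for mdev_type in sorted(mdev_types.iterdir()):
    name = (mdev_type / "name").read_text().strip()
    avail = (mdev_type / "available_instances").read_text().strip()
    print(f"{mdev_type.name}: {name} (available instances: {avail})")
```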
Base Command Manager#
Base Command Manager (BCM) is software that streamlines provisioning, workload management, and infrastructure monitoring. It addresses many of the requirements associated with managing AI infrastructure and the workloads running on it; only a subset is covered in this section, and the complete list is available at: Base Command Manager Feature Matrix. This software is included in the NVIDIA AI Enterprise suite. BCM operates as a workload management layer on top of provisioned infrastructure. Compute nodes and hypervisors can be provisioned via platform-specific tools or through Base Command Manager, which enables operating system deployment using PXE. Base Command Manager can also manage network infrastructure, and works exclusively with Cumulus Linux as the networking operating system.
Base Command Manager is not an operations platform; it is used for provisioning and installing operating systems (the guest OS) on compute nodes. It can also deploy Slurm and Kubernetes on these nodes to build full orchestration stacks, and it can establish multiple networks and manage power control and telemetry.
While Base Command Manager is not required by this document, it is an NVIDIA-provided solution that simplifies the provisioning of workloads on compute nodes. BCM is intended for single-tenant deployments and is not used for multi-tenant NCP bare-metal management; for that, software components such as NVIDIA Bare Metal Manager can be considered.
Base Command Manager with Virtual Machines#
This document leverages virtual machines to provide secure, high-performance multi-tenancy. Orchestrating and managing virtual machines in the cluster is therefore required.
Base Command Manager cannot provision VMs directly. Instead, it creates and uses templates in a cloud platform, and once the VMs are fully launched, they can be added to the Base Command Manager Essentials management domain.
The Administrator Manual section 5.7 – Adding New Nodes describes how to add virtual machines to Base Command Manager at scale. The Installation Manual section 1.3 – Booting Regular Nodes provides additional guidance on allocating virtual machines locally.
While Base Command Manager is not required by this document for provisioning and managing virtual machines, it is an NVIDIA-provided solution that makes virtual machine management easier. For more information, refer to the Base Command Manager documentation.
NVIDIA Mission Control#
NVIDIA Mission Control (NMC) provides provisioning and workload management similar to BCM, with GB200/GB300 NVL72-specific support: Redfish APIs for leak detection, liquid-cooling and hardware monitoring, and NVLink topology and management. NMC is the recommended management option for DGX B200/B300 and DGX GB200/GB300 NVL72 systems. NMC is intended for single-tenant or workload-layer use, not multi-tenant NCP bare-metal management.
The software components included with Mission Control can be found in the Mission Control SBOM.
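Because Redfish is an open DMTF standard, the style of telemetry NMC consumes can be illustrated generically. The Python sketch below polls a chassis sensor collection over Redfish and filters for leak- or coolant-related readings; the BMC address, credentials, and chassis ID are placeholders, sensor naming varies by platform, and this is an illustrative query rather than NMC's internal implementation.

```python
import requests

BMC = "https://bmc.example.com"  # placeholder BMC endpoint
AUTH = ("admin", "password")     # placeholder credentials
CHASSIS_ID = "Chassis_0"         # placeholder; enumerate /redfish/v1/Chassis first

# Fetch the standard Redfish sensor collection for a chassis.
sensors = requests.get(
    f"{BMC}/redfish/v1/Chassis/{CHASSIS_ID}/Sensors",
    auth=AUTH, verify=False,
).json()

# Walk each sensor and surface anything leak- or coolant-related.
for member in sensors.get("Members", []):
    sensor = requests.get(
        f"{BMC}{member['@odata.id']}", auth=AUTH, verify=False
    ).json()
    name = sensor.get("Name", "")
    if "leak" in name.lower() or "coolant" in name.lower():
        print(name, sensor.get("Reading"), sensor.get("Status", {}).get("Health"))
```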
Storage Management#
GPUDirect Storage (GDS)#
NVIDIA® GPUDirect® Storage (GDS) enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.
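As a concrete illustration of that data path, the sketch below uses the RAPIDS KvikIO library, a Python binding over the cuFile API through which GDS is consumed. It assumes KvikIO and CuPy are installed and that the filesystem and driver stack are GDS-enabled; where they are not, KvikIO falls back to a CPU bounce-buffer path. The file path is a placeholder.

```python
import cupy as cp
import kvikio

# Write a GPU-resident array straight to disk, then read it back into
# GPU memory. With GDS enabled, the DMA runs directly between the storage
# device and GPU memory, bypassing a CPU bounce buffer.
data = cp.arange(1 << 20, dtype=cp.float32)

with kvikio.CuFile("/mnt/nvme/tensor.bin", "w") as f:
    f.write(data)  # blocking write from device memory

result = cp.empty_like(data)
with kvikio.CuFile("/mnt/nvme/tensor.bin", "r") as f:
    f.read(result)  # blocking read into device memory

assert bool((data == result).all())
```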
GPUDirect RDMA (GDR)#
GPUDirect RDMA is a technology that enables a direct path for data exchange between the GPU and a third-party peer device, such as a network adapter, using standard features of PCI Express.