
# NVIDIA Software for Infrastructure as a Service

The following NVIDIA software supports the IaaS management domains
described in Part 1: Network Management, Compute Management, and Storage
Management.

## Network Management

### NVIDIA Unified Fabric Manager (UFM)

InfiniBand fabrics used for high-performance GPU-to-GPU communication
require centralized management to monitor health, detect congestion,
isolate tenants, and resolve issues. The [NVIDIA UFM®](https://docs.nvidia.com/networking/software/management-software/index.html#ufm-enterprise) platform
revolutionizes data center networking management by combining enhanced,
real-time network telemetry with AI-powered cyber intelligence and
analytics to support scale-out, InfiniBand-connected data centers. UFM
Telemetry provides network validation tools to monitor network
performance and conditions. UFM Enterprise combines the benefits of UFM
Telemetry with enhanced network monitoring and management. UFM Cyber-AI
enhances the benefits of UFM Telemetry and UFM Enterprise, providing
preventive maintenance and cybersecurity.
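
UFM Enterprise can be managed through its GUI and also exposes a REST API, so NCP management software can poll fabric inventory and health programmatically. The following is a minimal sketch, assuming a UFM Enterprise server reachable over HTTPS and the `/ufmRest/resources/systems` endpoint; confirm paths, authentication, and response fields against the UFM Enterprise REST API documentation for the deployed release.

```python
import requests

UFM_HOST = "ufm.example.net"   # placeholder UFM Enterprise server
AUTH = ("admin", "password")   # placeholder credentials; prefer token-based auth in production

# Query the fabric inventory (switches and HCAs) that UFM manages.
# The /ufmRest/resources/systems path is assumed from the UFM Enterprise REST API.
resp = requests.get(
    f"https://{UFM_HOST}/ufmRest/resources/systems",
    auth=AUTH,
    verify=False,  # lab-only: UFM often ships with a self-signed certificate
    timeout=30,
)
resp.raise_for_status()

for system in resp.json():
    # Field names are illustrative; inspect the payload returned by your UFM release.
    print(system.get("system_name"), system.get("state"))
```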

### NVIDIA User Experience (NVUE)

Ethernet switches require consistent, programmable configuration across
large leaf-spine fabrics. Manual CLI configuration does not scale and is
error-prone. [NVIDIA User Experience (NVUE)](https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-513/System-Configuration/NVIDIA-User-Experience-NVUE/) is a configuration utility
included with [Cumulus Linux](https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-515/) that runs on each Spectrum switch. It
delivers a schema-driven model of the entire Cumulus Linux system,
covering both hardware and software, and offers multiple methods of
interaction: a command line, a REST API, and an object model. SDN
controllers can interact with NVUE's REST API to push configurations,
and operators can use the CLI for direct access. Reference commands can
be found in the [documentation](https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-515/Whats-New/New-and-Changed-NVUE-Commands/#all-new-nvue-commands).
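
As a concrete illustration of the REST-based interaction, the sketch below stages and applies a small configuration change through the NVUE API. It assumes the default NVUE REST endpoint on the switch (`https://<switch>:8765/nvue_v1`) and the revision-based apply workflow; verify paths, payloads, and authentication against the NVUE API documentation for the Cumulus Linux release in use.

```python
from urllib.parse import quote

import requests

SWITCH = "leaf01.example.net"             # placeholder Spectrum switch running Cumulus Linux
BASE = f"https://{SWITCH}:8765/nvue_v1"   # default NVUE REST endpoint (assumed)

session = requests.Session()
session.auth = ("cumulus", "password")    # placeholder credentials
session.verify = False                    # lab-only: switches commonly use self-signed certificates

# 1. Create a candidate revision to stage changes against.
rev = session.post(f"{BASE}/revision").json()
rev_id = next(iter(rev))                  # the revision ID is the key of the returned object (assumed)

# 2. Stage a change: assign an address to swp1 (illustrative values).
session.patch(
    f"{BASE}/interface/swp1",
    params={"rev": rev_id},
    json={"ip": {"address": {"10.0.0.1/31": {}}}},
)

# 3. Apply the staged revision, the REST equivalent of `nv config apply` on the CLI.
session.patch(f"{BASE}/revision/{quote(rev_id, safe='')}", json={"state": "apply"})
```

An SDN controller would typically wrap this staging-and-apply cycle per switch, while operators use the equivalent `nv set` and `nv config apply` CLI commands directly.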

### NVIDIA NetQ

[NVIDIA NetQ™](https://www.nvidia.com/en-us/networking/ethernet-switching/netq/) is a highly scalable, modern network operations toolset
that provides visibility, troubleshooting, and validation of Cumulus
fabrics in real time. NetQ utilizes telemetry and delivers actionable
insights about the health of the data center network, integrating the
fabric into the DevOps ecosystem.

**NVIDIA NetQ Features**

| Feature                    | Description                                                                                                                                                 |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Network management         | Access powerful tools to manage your NVIDIA® Cumulus Linux™ and SONiC environments at the push of a button.                                                 |
| Advanced telemetry         | Collect real-time data that enables deep troubleshooting, visibility, and automated workflows from a single GUI.                                            |
| Snapshot and compare       | Easily compare network configurations before and after changes to reduce the risk of disruption.                            |
| Network-wide visibility    | See real-time visualizations about the health of your network with the NetQ rich GUI.                                                                       |
| Flow telemetry             | Analyze fabric-wide latency and buffer occupancy data of all the paths of a 4-tuple or 5-tuple flow to identify congestion points.                          |
| Preventive validation      | Reduce manual errors before they're rolled into production.                                                                                                 |
| Diagnostic troubleshooting | Diagnose the root cause of state deviations with advanced diagnostic tools.                                                                                 |
| RoCE support               | Monitor remote direct memory access (RDMA) over Converged Ethernet (RoCE) environments with NetQ to gain actionable insights into high-performance networks. |

### NVIDIA NMX

GB200 and GB300 rack-scale systems use [NVSwitch-based NVLink](https://www.nvidia.com/en-us/data-center/nvlink/)
interconnects to provide high-bandwidth GPU-to-GPU communication within
a rack. These NVLink domains require dedicated management, separate from
traditional Ethernet and InfiniBand fabrics, for configuration, health
monitoring, and coordination across multi-rack deployments. [NVIDIA NMX](https://docs.nvidia.com/networking/display/nmxcv11/nmx+introduction)
is the management and analytics platform for NVSwitch-based
interconnects. It consists of the following components, which work
together to manage the NVLink scale-up network within GPU racks.

**NVIDIA NMX Components**

| Component | Name           | Where it runs               | Function                                                  |
| --------- | -------------- | --------------------------- | --------------------------------------------------------- |
| NMX-C     | NMX Controller | On each NVLink switch tray  | Per-rack controller that programs individual NVSwitches   |
| NMX-T     | NMX Telemetry  | On each NVLink switch tray  | Collects telemetry and metrics from NVLink switches       |
| NMX-M     | NMX Manager    | Centralized (control plane) | Aggregates management across multiple NVLink switch racks |
| NMX Oasis | Oasis          | Centralized                 | API gateways, ETL, and dashboards for telemetry data      |

### DOCA / HBN

Multi-tenant cloud environments require network isolation at the compute
edge. Traditional software-based overlays consume host CPU cycles and
add latency. By offloading SDN functions to a DPU, NCPs can enforce
tenant isolation without impacting workload performance, while also
providing a hardware-rooted security boundary between tenants and
infrastructure.

[DOCA](https://docs.nvidia.com/doca/sdk/doca-hbn-service-guide/index.html) is the SDK and runtime for NVIDIA® BlueField® DPUs.
The DOCA-OFED driver provides optimized networking on the host.
Host-Based Networking (HBN) offloads L2/L3 overlay networking (VXLAN,
EVPN) to the DPU, enabling tenant VPC isolation at the compute edge.

HBN also enables the Shared Services Network. The SDN Controller programs
the DPU with minimal route leaks, allowing tenants to reach shared
services while remaining isolated from each other.

## Compute Management

### NVIDIA Infra Controller

[NVIDIA Infra Controller](https://github.com/NVIDIA/ncx-infra-controller-core/tree/main) provides bare metal infrastructure
lifecycle management. Operating as a site-local component on Kubernetes,
NVIDIA Infra Controller automates the complete lifecycle from hardware
discovery to tenant-ready bare metal, which is essential for operators
and NCPs supporting multi-tenant AI cloud platforms on rack-scale
GB200/GB300 systems.
Traditional cluster management tools require one deployment per tenant,
creating linear scaling challenges. NVIDIA Infra Controller addresses this by:

* DPU-based isolation — Hardware-enforced tenant separation via
  BlueField DPUs
* Shared infrastructure — Single NVIDIA Infra Controller instance
  manages all tenants
* API-first architecture — Enables ISV cloud portals with admin + tenant
  views

NVIDIA Infra Controller is designed as an API-first platform, exposing
a gRPC API that NCPs and ISVs integrate into their Cloud Control Plane
and orchestration systems; a minimal client sketch follows the list
below. This enables:

* Programmatic control of all bare metal lifecycle operations
* Integration with existing NCP provisioning workflows
* JWT token authentication with IdP integration (for example, Keycloak) for RBAC
* Flexibility for NCPs to build differentiated services on top of NVIDIA Infra Controller primitives
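
The actual service and message definitions ship with NVIDIA Infra Controller; the sketch below is a hypothetical gRPC client that shows only the JWT-bearing call pattern. The `machine_pb2`/`machine_pb2_grpc` module names and the `MachineService` stub are illustrative placeholders, not the real API; only the `grpcio` credential plumbing is standard.

```python
import grpc

# Hypothetical generated stubs; the real protobuf packages are distributed with
# NVIDIA Infra Controller and will have different names.
# import machine_pb2, machine_pb2_grpc


def bearer_credentials(jwt: str) -> grpc.CallCredentials:
    """Attach a JWT (for example, one issued by Keycloak) to every RPC as metadata."""
    def add_token(context, callback):
        callback((("authorization", f"Bearer {jwt}"),), None)
    return grpc.metadata_call_credentials(add_token)


def make_channel(endpoint: str, jwt: str) -> grpc.Channel:
    # Composite credentials: TLS on the channel plus the JWT on each call.
    creds = grpc.composite_channel_credentials(
        grpc.ssl_channel_credentials(),
        bearer_credentials(jwt),
    )
    return grpc.secure_channel(endpoint, creds)


# Usage sketch (service, method, and message names are hypothetical):
# channel = make_channel("infra-controller.example.net:443", jwt)
# stub = machine_pb2_grpc.MachineServiceStub(channel)
# for machine in stub.ListMachines(machine_pb2.ListMachinesRequest()).machines:
#     print(machine.id, machine.state)
```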

| Feature                      | Description                                                                                                                                                                                                                                                                                            |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Network isolation            | Ethernet: Configures VXLAN/EVPN overlays for tenant VPCs via HBN on the DPU<br>InfiniBand: Programs partition keys (P\_Keys) via UFM for CIN isolation<br>NVLink: Orchestrates NVLink partition creation via NMX-M for rack-scale GPU isolation                                                         |
| Machine lifecycle management | Hardware discovery: Automated discovery of new hardware via Redfish APIs<br>Host-DPU pairing: Associates hosts with their attached BlueField DPUs<br>Machine validation: Burn-in tests to confirm host functionality<br>Firmware management: Automated updates for BMC, UEFI, and DPU firmware          |
| DPU lifecycle management     | DPU provisioning: Installs the DPU OS onto BlueField DPUs<br>HBN: Deploys the containerized networking stack                                                                                                                                                                                            |
| Security                     | Measured boot and host attestation<br>BMC/UEFI lockdown to prevent tenant access to management interfaces<br>Tenant sanitization between allocations                                                                                                                                                    |
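
Hardware discovery in the table above relies on the DMTF Redfish APIs exposed by each BMC. As a rough sketch of what the controller automates, a client can enumerate compute systems from a BMC as follows; the address and credentials are placeholders, and the exact properties returned vary by vendor.

```python
import requests

BMC = "10.0.10.21"             # placeholder BMC address
AUTH = ("root", "password")    # placeholder credentials

session = requests.Session()
session.auth = AUTH
session.verify = False          # lab-only: BMCs commonly use self-signed certificates

# /redfish/v1/Systems is the standard Redfish collection of compute systems.
systems = session.get(f"https://{BMC}/redfish/v1/Systems", timeout=30).json()

for member in systems.get("Members", []):
    system = session.get(f"https://{BMC}{member['@odata.id']}", timeout=30).json()
    # Manufacturer, Model, and SerialNumber are standard ComputerSystem properties.
    print(system.get("Manufacturer"), system.get("Model"), system.get("SerialNumber"))
```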

At the completion of the NVIDIA Infra Controller workflow, the NCP has
a validated, firmware-current, network-isolated bare metal host, ready
for direct allocation to a tenant, for installation of a hypervisor for
virtualized workloads, or for handoff to a workload management platform
such as Base Command Manager. This completes the bare metal provisioning
stage, providing the foundation for all compute services above.

While NVIDIA Infra Controller is not required for hardware management,
it is NVIDIA-provided software that simplifies multi-tenant hardware
management on rack-scale systems.

### Virtualization

For virtual machine-based deployments, the NCP/ISV deploys a
virtualization layer either on a cloud-native provisioning layer such as
NVIDIA Infra Controller or directly on bare metal. The primary enabler
of virtualization is a hypervisor. NVIDIA does not provide hypervisor
software; however, NCPs and ISVs can select an offering from the
ecosystem, such as the KVM hypervisor, based on operational requirements.

Request the NVIDIA Grace™ I/O Virtualization Guide (PID 1144496) from your NVIDIA
Sales Representative.

### NVIDIA vGPU for Compute

Multi-tenant AI infrastructure often requires sharing GPU resources
across virtual machines while maintaining isolation and performance.
NCPs need to virtualize GPUs to maximize hardware utilization, support
diverse tenant workloads, and provide flexible GPU allocation without
dedicating physical GPUs to each VM. NVIDIA vGPU software enables GPU
virtualization on leading hypervisor platforms.

* **Virtual GPU Manager** – Runs on the hypervisor host, manages GPU
  resource allocation
* **Guest Drivers** – Installed in each VM to enable GPU access

vGPU allows multiple VMs to share a single physical GPU with
configurable profiles that define the GPU memory and compute allocated
to each VM. This enables NCPs to offer fractional GPU instances to
tenants. For more information, read:
[NVIDIA AI Enterprise and NVIDIA vGPU for Compute](https://docs.nvidia.com/ai-enterprise/release-7/latest/infra-software/vgpu.html)
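
On a vGPU-enabled hypervisor host, the Virtual GPU Manager exposes the available profiles through NVML. The sketch below assumes the `nvidia-ml-py` (pynvml) bindings and NVML's vGPU management calls are present on the host driver; function names such as `nvmlDeviceGetSupportedVgpus` should be confirmed against the NVML documentation for the installed vGPU release.

```python
import pynvml  # nvidia-ml-py; vGPU queries only work on a vGPU-enabled host driver

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Enumerate the vGPU profiles this physical GPU supports (assumed NVML vGPU calls).
    for vgpu_type in pynvml.nvmlDeviceGetSupportedVgpus(handle):
        name = pynvml.nvmlVgpuTypeGetName(vgpu_type)
        if isinstance(name, bytes):
            name = name.decode()
        fb_mib = pynvml.nvmlVgpuTypeGetFramebufferSize(vgpu_type) // (1024 * 1024)
        print(f"{name}: {fb_mib} MiB framebuffer per instance")
finally:
    pynvml.nvmlShutdown()
```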

### Base Command Manager

[Base Command Manager](https://docs.nvidia.com/base-command-manager/index.html) is software that streamlines provisioning, workload
management, and infrastructure monitoring. It addresses many of the
requirements associated with managing AI infrastructure and the
workloads running on it. Only some of these capabilities are covered in
this section; the complete list is available in the
[Base Command Manager Feature Matrix](https://support.brightcomputing.com/feature-matrix/).
This software is included in the NVIDIA AI Enterprise suite. BCM operates as a workload
management layer on top of provisioned infrastructure. Compute nodes and
hypervisors can be provisioned via platform-specific tools, or through
Base Command Manager, which enables operating system deployment using
PXE. Base Command Manager can also manage [network infrastructure](https://support.brightcomputing.com/manuals/10/admin-manual.pdf), and
works exclusively with Cumulus Linux as the networking operating system.

Base Command Manager is not an operations platform; it is used for
provisioning and installing the operating system (Guest OS) on compute
nodes. It can also deploy Slurm and Kubernetes on these nodes to help
build full orchestration stacks, establish multiple networks, and manage
power control and telemetry.

While Base Command Manager is not required by this document,
it is an NVIDIA-provided solution that simplifies the provisioning of
workloads on compute nodes. BCM is intended for single-tenant
deployments and is not used for multi-tenant NCP bare metal management.
For that use case, software components such as [NVIDIA Infra Controller](#nv-software-components-iaas-ncx-infra-controller) can be considered instead.

### Base Command Manager with Virtual Machines

This document leverages virtual machines to provide secure and
high-performance multitenancy. Thus, orchestrating and managing virtual
machines in the cluster is required.

Base Command Manager cannot
provision VMs directly. Instead, it creates and uses templates in a
cloud platform, and once the VMs are fully launched, they can be added
into the Base Command Manager Essentials management domain.

The
[Administrator Manual section 5.7 – Adding New Nodes](https://support.brightcomputing.com/manuals/10/admin-manual.pdf) describes how to
add virtual machines to Base Command Manager at scale. Additional
documentation is referenced in the
[Installation Manual section 1.3 – Booting Regular Nodes](https://support.brightcomputing.com/manuals/10/installation-manual.pdf), where virtual machines can be allocated locally.

While Base Command Manager is not required in this document for
provisioning and managing virtual machines, it is an NVIDIA-provided
solution that makes virtual machine management easier.
For more information, visit the following links:

* [NVIDIA Base Command Manager Documentation](https://docs.nvidia.com/base-command-manager/index.html)
* [NVIDIA Base Command Manager Administration Manual](https://support.brightcomputing.com/manuals/10/admin-manual.pdf)
* [NVIDIA Base Command Manager Installation Manual](https://support.brightcomputing.com/manuals/10/installation-manual.pdf)

### NVIDIA Mission Control

[NVIDIA Mission Control (NMC)](https://www.nvidia.com/en-us/data-center/mission-control/) provides provisioning and workload
management similar to BCM, with GB200/GB300 NVL72-specific support:
Redfish API for leak detection, liquid cooling and hardware monitoring,
and NVLink topology and management. NMC is the recommended management
option for DGX B200/B300 and DGX GB200/GB300 NVL72 systems. NMC is
intended for single-tenant or workload-layer use, not multi-tenant NCP
bare metal management.

The software components included with Mission Control can be found in
the [Mission Control SBOM](https://docs.nvidia.com/pdf/sbom-2-2-0.pdf).

## Software References for Storage Management

### GPUDirect Storage (GDS)

[NVIDIA® GPUDirect® Storage (GDS)](https://docs.nvidia.com/gpudirect-storage/getting-started/contents.html) enables a direct data path for direct
memory access (DMA) transfers between GPU memory and storage, which
avoids a bounce buffer through the CPU. This direct path increases
system bandwidth and decreases the latency and utilization load on the
CPU.
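
In application code, GDS is consumed through the cuFile API. One accessible path from Python is the RAPIDS KvikIO bindings; the sketch below assumes KvikIO and CuPy are installed and that the target filesystem is GDS-capable (KvikIO is documented to fall back to a CPU bounce-buffer path when it is not).

```python
import cupy as cp
import kvikio  # RAPIDS KvikIO: Python bindings around the cuFile (GDS) API

# Allocate a GPU buffer, write it straight to storage, then read it back into GPU memory.
src = cp.arange(1_000_000, dtype=cp.float32)
dst = cp.empty_like(src)

f = kvikio.CuFile("/mnt/nvme/sample.bin", "w")   # path is a placeholder
f.write(src)    # DMA from GPU memory to storage when GDS is available
f.close()

f = kvikio.CuFile("/mnt/nvme/sample.bin", "r")
f.read(dst)     # DMA from storage directly into GPU memory
f.close()

assert bool((src == dst).all())
```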

### GPUDirect RDMA (GDR)

[GPUDirect RDMA](https://docs.nvidia.com/cuda/gpudirect-rdma/) is a technology that enables a direct path for data
exchange between the GPU and a third-party peer device using standard
features of PCI Express.