NVIDIA Software Components#

This section provides detailed information on NVIDIA-provided software components that address the capabilities described in the NCP Software Reference Guide. Each component is mapped to the architectural layer it supports. Use of NVIDIA software is optional and depends on architectural decisions made by the NCP or ISV. NCPs can work with ecosystem partners to integrate these components or implement alternative solutions.

This section is organized by functional area, mirroring the structure of the Software Reference Architecture section:

  • Infrastructure Platform:

    • Network Management – Software for managing Ethernet, InfiniBand, and NVLink fabrics

    • Compute Management – Software for bare metal lifecycle, GPU virtualization, and observability

    • Storage – Software for high-performance GPU-to-storage connectivity

  • Container Platform – Software for GPU-accelerated containers and Kubernetes

  • AI Platforms – Software for training and inference workload management

Key Software Components#

Key software components provided by NVIDIA are listed in the following table.

Key NVIDIA Software Components#

| Component | Description |
| --- | --- |
| Virtual GPU Software | Communicates with platform hardware to allocate GPU resources between host and guest. |
| Fabric Manager | Programs NVSwitch for high-performance multi-GPU workloads. |
| NVIDIA DOCA™ software | DOCA-OFED drivers and DOCA acceleration libraries and services that enable accelerated networking for AI workloads. |
| NVIDIA Data Center GPU Manager (DCGM) | Provides GPU monitoring, diagnostics, and telemetry. Enables automated break-fix and infrastructure observability. |
| NVIDIA Bare Metal Manager | NVIDIA’s cloud-native bare metal provisioning platform, which provides hardware lifecycle management orchestrated by the DPU. |
| Base Command Manager | Manages AI infrastructure through workload provisioning. |
| Container Toolkit | Enables container runtimes to access GPU hardware within containers. |
| NVIDIA K8s Operators | GPU Operator standardizes GPU management in K8s and enables better GPU performance, utilization, and telemetry. Network Operator simplifies the provisioning and management of NVIDIA networking resources in a K8s cluster. NIM Operator automates the lifecycle of NVIDIA NIM™ microservices for Generative AI applications. NVIDIA GPU drivers allow the GPU to run on K8s. |
| Run:ai | Optimizes workload deployment by leveraging K8s orchestration. |
| NVIDIA Cloud Functions (NVCF) | A serverless API that allows users to deploy and manage AI workloads on GPUs, providing scalability, security, and reliability, accessible via HTTP polling, streaming, or gRPC protocols. K8s integration can be achieved with the NVIDIA Cluster Agent (NVCA). |
| NVIDIA Inference Microservices (NIM) | A set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations. |
| NVIDIA NeMo™ microservices | Provide an end-to-end workflow for model customization, enabling enterprises to adapt large language models to their specific needs efficiently. |
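As a rough illustration of how the GPU Operator and Container Toolkit entries above surface to a tenant, the sketch below builds a Kubernetes Pod manifest that requests one GPU. It assumes a cluster where the GPU Operator (or the NVIDIA device plugin on its own) is already installed, so that the `nvidia.com/gpu` extended resource is advertised by the nodes. The helper name `gpu_pod_manifest`, the pod and container names, and the image are hypothetical placeholders, not values taken from this guide.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Pod manifest that requests `gpus` NVIDIA GPUs.

    Hypothetical helper for illustration only.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "workload",
                    "image": image,
                    # Assumes nvidia-smi is available inside the container.
                    "command": ["nvidia-smi"],
                    # GPUs are requested through the extended resource
                    # advertised by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder names; substitute a real image from your own registry.
manifest = gpu_pod_manifest("gpu-smoke-test", "registry.example.com/cuda-base:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.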

The additional software shown in the following table can be used for full infrastructure management, including the networking components. These components are detailed in NVIDIA Software for Infrastructure as a Service.

Additional NVIDIA Software for Infrastructure Components#

| Component | Layer | Function |
| --- | --- | --- |
| Unified Fabric Manager (UFM) | Network Management | Manages Quantum InfiniBand switches through MLNX-OS |
| NVIDIA User Experience (NVUE) | Network Management | Manages Spectrum Ethernet switches through Cumulus Linux |
| NetQ | Monitoring and visibility | Provides network and host visibility |
| NVIDIA Air | Deployment validation | Simulation environment that provides deployment validation |
| NMX | Network Management | Manages NVSwitch-based NVLink interconnects. NMX has three components: NMX-C, NMX-M, and NMX-T. |

The following infrastructure components complement NVIDIA software to complete the stack. They can be provided by NCPs, ISVs, or the open-source ecosystem:

Infrastructure Software Components#

| Component | Layer | Description |
| --- | --- | --- |
| Operating System | IaaS | Linux distribution for compute hosts |
| Hypervisor | IaaS | Allocates physical host resources to guest virtual machines |
| Cloud Control Plane | IaaS | Tenant-facing control plane providing API/UI to provision compute, networking, and storage |
| SDN controller | IaaS | Network intent translation to hardware |
| Storage System | IaaS | Block, file, and object storage |
| Identity and Access Management | IaaS | Tenant authentication and authorization |
| Kubernetes | CaaS | Container orchestration platform |
| Slurm | CaaS | HPC workload manager for job scheduling |
| PyTorch | CaaS | GPU-accelerated tensor computational framework with a Python front end |
| AI Platform | SaaS | Tenant-facing platform for training and inference workloads |
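The layer mapping in the table above can also be treated as data, for example to check which components a planned stack needs at each layer. The following is a minimal sketch, not part of any NVIDIA tooling: the `Component` class and the `group_by_layer` helper are hypothetical names, and the entries simply restate the table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    layer: str  # "IaaS", "CaaS", or "SaaS", as in the table
    description: str


# Entries restate the Infrastructure Software Components table.
COMPONENTS = [
    Component("Operating System", "IaaS", "Linux distribution for compute hosts"),
    Component("Hypervisor", "IaaS", "Allocates physical host resources to guest virtual machines"),
    Component("Cloud Control Plane", "IaaS", "Tenant-facing control plane for compute, networking, and storage"),
    Component("SDN controller", "IaaS", "Network intent translation to hardware"),
    Component("Storage System", "IaaS", "Block, file, and object storage"),
    Component("Identity and Access Management", "IaaS", "Tenant authentication and authorization"),
    Component("Kubernetes", "CaaS", "Container orchestration platform"),
    Component("Slurm", "CaaS", "HPC workload manager for job scheduling"),
    Component("PyTorch", "CaaS", "GPU-accelerated tensor framework with a Python front end"),
    Component("AI Platform", "SaaS", "Tenant-facing platform for training and inference workloads"),
]


def group_by_layer(components):
    """Return {layer: [component names]}, preserving table order."""
    grouped = defaultdict(list)
    for component in components:
        grouped[component.layer].append(component.name)
    return dict(grouped)


by_layer = group_by_layer(COMPONENTS)
for layer, names in by_layer.items():
    print(f"{layer}: {', '.join(names)}")
```

Keeping the mapping in one list makes it straightforward to add an owner field (NCP, ISV, or open source) per component if a deployment plan needs one.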