NVIDIA Software Components#
This section provides detailed information on NVIDIA-provided software components that address the capabilities described in the NCP Software Reference Guide. Each component is mapped to the architectural layer it supports. The use of NVIDIA software is optional and depends on architectural decisions made by the NCP or ISV. NCPs can work with ecosystem partners to integrate these components or implement alternative solutions.
This section is organized by functional area, mirroring the structure of the Software Reference Architecture section:
- Infrastructure Platform:
  - Network Management – Software for managing Ethernet, InfiniBand, and NVLink fabrics
  - Compute Management – Software for bare metal lifecycle, GPU virtualization, and observability
  - Storage – Software for high-performance GPU-to-storage connectivity
  - Container Platform – Software for GPU-accelerated containers and Kubernetes
- AI Platforms – Software for training and inference workload management
Key Software Components#
Key software components provided by NVIDIA are listed in the following table.
| Component | Description |
|---|---|
| Virtual GPU Software | Communicates with platform hardware to allocate GPU resources between host and guest |
| Fabric Manager | Programs NVSwitch for high-performance multi-GPU workloads |
| NVIDIA DOCA™ software | DOCA-OFED drivers and DOCA acceleration libraries and services that enable accelerated networking for AI workloads |
| NVIDIA Data Center GPU Manager (DCGM) | Provides GPU monitoring, diagnostics, and telemetry; enables automated break-fix and infrastructure observability |
| NVIDIA Bare Metal Manager | NVIDIA's cloud-native bare metal provisioning platform providing hardware lifecycle management, orchestrated by the DPU |
| Base Command Manager | Manages AI infrastructure through workload provisioning |
| Container Toolkit | Enables container runtimes to access GPU hardware within containers |
| NVIDIA K8s Operators | GPU Operator standardizes GPU management in K8s and improves GPU performance, utilization, and telemetry. Network Operator simplifies the provisioning and management of NVIDIA networking resources in a K8s cluster. NIM Operator automates the lifecycle of NVIDIA NIM™ microservices for generative AI applications. NVIDIA GPU Drivers allow the GPU to run on K8s. |
| Run:ai | Optimizes workload deployment by leveraging K8s orchestration |
| NVIDIA Cloud Functions (NVCF) | A serverless API for deploying and managing AI workloads on GPUs, providing scalability, security, and reliability; accessible via HTTP polling, streaming, or gRPC. K8s integration is available through the NVIDIA Cluster Agent (NVCA). |
| NVIDIA Inference Microservices (NIM) | Easy-to-use microservices for secure, reliable deployment of high-performance AI model inference across clouds, data centers, and workstations |
| NVIDIA NeMo™ microservices | Provide an end-to-end workflow for model customization, enabling enterprises to adapt large language models to their specific needs efficiently |
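As one concrete example from the table above, NIM LLM microservices expose an OpenAI-compatible HTTP API once deployed. The sketch below builds a chat-completions request for a locally deployed NIM endpoint; the URL, port, and model name are illustrative assumptions, not values prescribed by this guide.

```python
import json

# Hypothetical local NIM deployment; NIM LLM microservices serve an
# OpenAI-compatible chat-completions endpoint under /v1.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "meta/llama-3.1-8b-instruct") -> dict:
    """Return an OpenAI-style chat-completions payload for a NIM endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

payload = json.dumps(build_request("What is a DPU?"))
# Send with any HTTP client, e.g.:
#   requests.post(NIM_URL, data=payload,
#                 headers={"Content-Type": "application/json"})
```

Because the API follows the OpenAI schema, existing OpenAI-compatible client libraries can typically be pointed at a NIM endpoint by changing only the base URL.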
The additional software in the following table can be used for full infrastructure management, including the networking components. These are detailed in NVIDIA Software for Infrastructure as a Service.
| Component | Layer | Function |
|---|---|---|
| NVIDIA UFM | Network Management | Manages Quantum InfiniBand switches through MLNX-OS |
|  | Network Management | Manages Spectrum Ethernet switches through Cumulus Linux |
| NVIDIA NetQ | Monitoring and visibility | Provides network and host visibility |
| NVIDIA Air | Deployment validation | Simulation environment that provides deployment validation |
| NMX | Network Management | Manages NVSwitch-based NVLink interconnects. NMX has three components: NMX-C, NMX-M, and NMX-T. |
The components in the following table complement NVIDIA software and are selected to complete the stack. They can be provided by NCPs, ISVs, or the open-source ecosystem:
| Component | Layer | Description |
|---|---|---|
| Operating System | IaaS | Linux distribution for compute hosts |
| Hypervisor | IaaS | Allocates physical host resources to guest virtual machines |
| Cloud Control Plane | IaaS | Tenant-facing control plane providing an API/UI to provision compute, networking, and storage |
| SDN Controller | IaaS | Translates network intent to hardware |
| Storage System | IaaS | Block, file, and object storage |
| Identity and Access Management | IaaS | Tenant authentication and authorization |
| Kubernetes | CaaS | Container orchestration platform |
| Slurm | CaaS | HPC workload manager for job scheduling |
| PyTorch | CaaS | GPU-accelerated tensor computation framework with a Python front end |
| AI Platform | SaaS | Tenant-facing platform for training and inference workloads |
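Tying the Kubernetes row back to the GPU Operator above: once the operator (or the device plugin it deploys) is installed, pods request GPUs through the extended resource `nvidia.com/gpu`. The sketch below builds such a Pod manifest as a plain dictionary; the pod name and container image are hypothetical placeholders.

```python
import json

def gpu_pod_spec(name: str, image: str, gpus: int = 1) -> dict:
    """Return a Kubernetes Pod manifest (as a dict) requesting NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The GPU Operator's device plugin advertises this
                    # extended resource on GPU-capable nodes.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
            "restartPolicy": "Never",
        },
    }

# Hypothetical image reference; any CUDA-capable image would do.
manifest = gpu_pod_spec("cuda-test", "nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04")
print(json.dumps(manifest, indent=2))
```

The same `nvidia.com/gpu` request works whether scheduling is handled by vanilla Kubernetes or by a workload orchestrator such as Run:ai layered on top of it.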