NVIDIA Software Components#
This section provides detailed information on NVIDIA-provided software components that address the capabilities described in the NCP Software Reference Guide. Each component is mapped to the architectural layer it supports. The use of NVIDIA software is optional and depends on architectural decisions made by the NCP or ISV. NCPs can work with ecosystem partners to integrate these components or implement alternative solutions.
This section is organized by functional area, mirroring the structure of the Software Reference Architecture section:
- Infrastructure Platform:
  - Network Management – Software for managing Ethernet, InfiniBand, and NVLink fabrics
  - Compute Management – Software for bare metal lifecycle, GPU virtualization, and observability
  - Storage – Software for high-performance GPU-to-storage connectivity
  - Container Platform – Software for GPU-accelerated containers and Kubernetes
- AI Platforms – Software for training and inference workload management
Key Software Components#
Key software components provided by NVIDIA are listed in the following table.
| Component | Description |
|---|---|
| Virtual GPU Software | Communicates with platform hardware to allocate GPU resources between host and guest |
| Fabric Manager | Programs NVSwitch for high-performance multi-GPU workloads |
| NVIDIA DOCA™ software | DOCA-OFED drivers and DOCA acceleration libraries and services that enable accelerated networking for AI workloads |
| NVIDIA Data Center GPU Manager (DCGM) | Provides GPU monitoring, diagnostics, and telemetry; enables automated break-fix and infrastructure observability |
| NVIDIA Bare Metal Manager | NVIDIA's cloud-native bare metal provisioning platform providing hardware lifecycle management, orchestrated by the DPU |
| Base Command Manager | Manages AI infrastructure through workload provisioning |
| Container Toolkit | Enables container runtimes to access GPU hardware within containers |
| NVIDIA K8s Operators | GPU Operator standardizes GPU management in K8s and improves GPU performance, utilization, and telemetry. Network Operator simplifies the provisioning and management of NVIDIA networking resources in a K8s cluster. NIM Operator automates the lifecycle of NVIDIA NIM™ microservices for generative AI applications. NVIDIA GPU Drivers allow the GPU to run on K8s. |
| Run:ai | Optimizes workload deployment by leveraging K8s orchestration |
| NVIDIA Cloud Functions (NVCF) | A serverless API for deploying and managing AI workloads on GPUs, providing scalability, security, and reliability; accessible via HTTP polling, streaming, or gRPC. K8s integration is available through the NVIDIA Cluster Agent (NVCA). |
| NVIDIA Inference Microservices (NIM) | Easy-to-use microservices for secure, reliable deployment of high-performance AI model inference across clouds, data centers, and workstations |
| NVIDIA NeMo™ microservices | Provide an end-to-end workflow for model customization, enabling enterprises to adapt large language models to their specific needs efficiently |
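As one concrete example from the table above, NIM LLM microservices expose an OpenAI-compatible HTTP API once deployed. The sketch below builds a chat-completions request for a locally deployed NIM endpoint; the URL, port, and model name are illustrative assumptions, not values prescribed by this guide.

```python
import json

# Hypothetical local NIM deployment; NIM LLM microservices serve an
# OpenAI-compatible chat-completions endpoint under /v1.
NIM_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "meta/llama-3.1-8b-instruct") -> dict:
    """Return an OpenAI-style chat-completions payload for a NIM endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }

payload = json.dumps(build_request("What is a DPU?"))
# Send with any HTTP client, e.g.:
#   requests.post(NIM_URL, data=payload,
#                 headers={"Content-Type": "application/json"})
```

Because the API follows the OpenAI schema, existing OpenAI-compatible client libraries can typically be pointed at a NIM endpoint by changing only the base URL.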
The additional software in the following table can be used for full infrastructure management, including the networking components. These are detailed in NVIDIA Software for Infrastructure as a Service.
| Component | Layer | Function |
|---|---|---|
| NVIDIA UFM | Network Management | Manages Quantum InfiniBand switches through MLNX-OS |
|  | Network Management | Manages Spectrum Ethernet switches through Cumulus Linux |
| NVIDIA NetQ | Monitoring and visibility | Provides network and host visibility |
| NVIDIA Air | Deployment validation | Simulation environment that provides deployment validation |
| NMX | Network Management | Manages NVSwitch-based NVLink interconnects. NMX has three components: NMX-C, NMX-M, and NMX-T. |
The components in the following table complement NVIDIA software and are selected to complete the stack. They can be provided by NCPs, ISVs, or the open-source ecosystem:
| Component | Layer | Description |
|---|---|---|
| Operating System | IaaS | Linux distribution for compute hosts |
| Hypervisor | IaaS | Allocates physical host resources to guest virtual machines |
| Cloud Control Plane | IaaS | Tenant-facing control plane providing an API/UI to provision compute, networking, and storage |
| SDN Controller | IaaS | Translates network intent to hardware |
| Storage System | IaaS | Block, file, and object storage |
| Identity and Access Management | IaaS | Tenant authentication and authorization |
| Kubernetes | CaaS | Container orchestration platform |
| Slurm | CaaS | HPC workload manager for job scheduling |
| PyTorch | CaaS | GPU-accelerated tensor computation framework with a Python front end |
| AI Platform | SaaS | Tenant-facing platform for training and inference workloads |
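Tying the Kubernetes row back to the GPU Operator above: once the operator (or the device plugin it deploys) is installed, pods request GPUs through the extended resource `nvidia.com/gpu`. The sketch below builds such a Pod manifest as a plain dictionary; the pod name and container image are hypothetical placeholders.

```python
import json

def gpu_pod_spec(name: str, image: str, gpus: int = 1) -> dict:
    """Return a Kubernetes Pod manifest (as a dict) requesting NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # The GPU Operator's device plugin advertises this
                    # extended resource on GPU-capable nodes.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
            "restartPolicy": "Never",
        },
    }

# Hypothetical image reference; any CUDA-capable image would do.
manifest = gpu_pod_spec("cuda-test", "nvcr.io/nvidia/cuda:12.4.0-base-ubuntu22.04")
print(json.dumps(manifest, indent=2))
```

The same `nvidia.com/gpu` request works whether scheduling is handled by vanilla Kubernetes or by a workload orchestrator such as Run:ai layered on top of it.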