NVIDIA Software Components

This section provides detailed information on NVIDIA-provided software components that address the capabilities described in the Software Reference Guide. Each component is mapped to the architectural layer it supports. The use of NVIDIA software is optional and depends on architectural decisions made by the NCP or ISV. NCPs can work with ecosystem partners to integrate these components or implement alternative solutions.

This section is organized by functional area, mirroring the structure of the Software Reference Architecture section:

  • Infrastructure Platform:
    • Network Management – Software for managing Ethernet, InfiniBand, and NVLink fabrics
    • Compute Management – Software for bare metal lifecycle, GPU virtualization, and observability
    • Storage – Software for high-performance GPU-to-storage connectivity
  • Container Platform – Software for GPU-accelerated containers and Kubernetes
  • AI Platforms – Software for training and inference workload management

Key Software Components

Key software components provided by NVIDIA are listed in the following table.

Key NVIDIA Software Components

| Component | Description |
| --- | --- |
| Virtual GPU Software | Communicates with platform hardware to allocate GPU resources between host and guest. |
| Fabric Manager | Programs NVSwitch for high-performance multi-GPU workloads. |
| NVIDIA DOCA™ software | Provides DOCA-OFED drivers and DOCA acceleration libraries and services that enable accelerated networking for AI workloads. |
| NVIDIA Data Center GPU Manager (DCGM) | Provides GPU monitoring, diagnostics, and telemetry. Enables automated break-fix and infrastructure observability. |
| NVIDIA Infra Controller | NVIDIA’s cloud-native bare metal provisioning platform, which provides hardware lifecycle management orchestrated by the DPU. |
| Base Command Manager | Manages AI infrastructure through workload provisioning. |
| Container Toolkit | Enables container runtimes to access GPU hardware within containers. |
| NVIDIA K8s Operators | GPU Operator standardizes GPU management in K8s and enables better GPU performance, utilization, and telemetry. Network Operator simplifies the provisioning and management of NVIDIA networking resources in a K8s cluster. NIM Operator automates the lifecycle of NVIDIA NIM™ microservices for generative AI applications. NVIDIA GPU drivers allow GPUs to be used on K8s. |
| Run:ai | Optimizes workload deployment by leveraging K8s orchestration. |
| NVIDIA Cloud Functions (NVCF) | A serverless API that allows users to deploy and manage AI workloads on GPUs, providing scalability, security, and reliability. It is accessible via HTTP polling, streaming, or gRPC protocols. K8s integration can be achieved with the NVIDIA Cluster Agent (NVCA). |
| NVIDIA Inference Microservices (NIM) | A set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations. |
| NVIDIA NeMo™ microservices | Provide an end-to-end workflow for model customization, enabling enterprises to adapt large language models to their specific needs efficiently. |
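
To make the role of the Container Toolkit and GPU Operator rows more concrete, the following is a minimal sketch of how a tenant workload might request a GPU once those components are installed. It assumes the cluster advertises GPUs under the `nvidia.com/gpu` extended resource name; the pod name, the image reference, and the helper function `gpu_pod_manifest` are hypothetical placeholders rather than anything defined by this guide.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a minimal Kubernetes Pod manifest that requests whole GPUs.

    Assumption: the cluster exposes GPUs as the `nvidia.com/gpu`
    extended resource (for example, through the GPU Operator).
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "cuda",
                    "image": image,  # hypothetical image reference
                    "command": ["nvidia-smi"],
                    # Extended resources such as GPUs are requested under limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Placeholder names; substitute an image that ships nvidia-smi.
    manifest = gpu_pod_manifest(
        "gpu-smoke-test", "registry.example.com/cuda-base:latest"
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML manifests. Whether the pod is scheduled depends on the operators and drivers listed above being deployed on the target nodes.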

Additional software shown in the following table can be used for full infrastructure management that includes the networking components. These are detailed in NVIDIA Software for Infrastructure as a Service.

Additional NVIDIA Software for Infrastructure Components

| Component | Layer | Function |
| --- | --- | --- |
| Unified Fabric Manager (UFM) | Network Management | Manages Quantum InfiniBand switches through MLNX-OS. |
| NVIDIA User Experience (NVUE) | Network Management | Manages Spectrum Ethernet switches through Cumulus Linux. |
| NetQ | Monitoring and visibility | Provides network and host visibility. |
| NVIDIA Air | Deployment validation | Simulation environment that provides deployment validation. |
| NMX | Network Management | Manages NVSwitch-based NVLink interconnects. NMX has three components: NMX-C, NMX-M, and NMX-T. |

The following infrastructure components complement NVIDIA software to complete the stack. They can be provided by NCPs, ISVs, or the open-source ecosystem:

Infrastructure Software Components

| Component | Layer | Description |
| --- | --- | --- |
| Operating System | IaaS | Linux distribution for compute hosts |
| Hypervisor | IaaS | Allocates physical host resources to guest virtual machines |
| Cloud Control Plane | IaaS | Tenant-facing control plane providing API/UI to provision compute, networking, and storage |
| SDN controller | IaaS | Network intent translation to hardware |
| Storage System | IaaS | Block, file, object storage |
| Identity and Access Management | IaaS | Tenant authentication and authorization |
| Kubernetes | CaaS | Container orchestration platform |
| Slurm | CaaS | HPC workload manager for job scheduling |
| PyTorch | CaaS | GPU-accelerated tensor computational framework with a Python front end |
| AI Platform | SaaS | Tenant-facing platform for training and inference workloads |
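
When planning which of these components to source from partners, it can help to view the table grouped by layer. The sketch below transcribes the component and layer columns from the table above and groups them; the names `INFRASTRUCTURE_COMPONENTS` and `group_by_layer` are illustrative only and are not part of any NVIDIA tooling.

```python
from collections import defaultdict

# (component, layer) pairs transcribed from the table above.
INFRASTRUCTURE_COMPONENTS = [
    ("Operating System", "IaaS"),
    ("Hypervisor", "IaaS"),
    ("Cloud Control Plane", "IaaS"),
    ("SDN controller", "IaaS"),
    ("Storage System", "IaaS"),
    ("Identity and Access Management", "IaaS"),
    ("Kubernetes", "CaaS"),
    ("Slurm", "CaaS"),
    ("PyTorch", "CaaS"),
    ("AI Platform", "SaaS"),
]


def group_by_layer(components):
    """Return a dict mapping each layer to its components, in table order."""
    grouped = defaultdict(list)
    for name, layer in components:
        grouped[layer].append(name)
    return dict(grouped)


if __name__ == "__main__":
    for layer, names in group_by_layer(INFRASTRUCTURE_COMPONENTS).items():
        print(f"{layer}: {', '.join(names)}")
```

A grouping like this makes it easy to check, layer by layer, which components a given NCP deployment already covers and which still need a provider.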