Software Stack#
A sample software stack is provided as a fully supported, hardware-agnostic starting point that is independent of the underlying NVIDIA Certified Systems (hardware). Other supported software can be found in the NVIDIA AI Enterprise Software Support Matrix. The sample stack covers the operating system, orchestration platform, container runtime, and NVIDIA infrastructure software.
Platform Software#
The following platform software is used as a hardware-agnostic starting point for running NVIDIA AI Enterprise workloads.
| Component | Software |
|---|---|
| Operating System | Ubuntu |
| Orchestration Platform | Upstream Kubernetes |
| Container Runtime | Containerd |
Supported versions of platform software for a given NVIDIA AI Enterprise Release can be found in the NVIDIA AI Enterprise Software Support Matrix.
The Role of Kubernetes#
This architecture leverages Kubernetes, an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications across on-premises, cloud, and hybrid environments. Containers package applications with all required dependencies, enabling lightweight, portable, and consistent execution. As organizations began running many containers at scale, Kubernetes emerged as the de facto standard for allowing a single administrator to manage large clusters efficiently. Kubernetes namespaces provide multi-tenancy within clusters, while NVIDIA GPU support and device plug-ins enable accelerated workloads such as machine learning and data processing.
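As a minimal sketch of these two ideas together, the manifests below place a GPU workload in its own namespace and request a GPU through the extended resource that the NVIDIA device plugin advertises. The namespace name and container image tag are illustrative placeholders, not values prescribed by this architecture:

```yaml
# Hypothetical namespace providing tenant isolation within the cluster
apiVersion: v1
kind: Namespace
metadata:
  name: team-a
---
# Pod requesting one GPU via the nvidia.com/gpu extended resource,
# which the NVIDIA device plugin exposes to the scheduler
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
  namespace: team-a
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # placeholder tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

The scheduler will only bind this pod to a node whose device plugin reports at least one allocatable GPU.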
NVIDIA Infrastructure Software#
NVIDIA delivers infrastructure software for running workloads in development and production environments. This software is hardware-agnostic; that is, the same software can be used regardless of the underlying hardware, networking, or NVIDIA reference architecture used for enterprise deployments.
The NVIDIA Datacenter Driver, Container Toolkit, and Kubernetes device plugin must be provisioned before GPU resources become available to the cluster. Together, these components enable Kubernetes workloads to consume GPUs.
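Once these components are in place, one way to confirm that GPUs have been advertised to the scheduler is to inspect each node's allocatable resources (assuming `kubectl` access to the cluster):

```shell
# List the GPU count the NVIDIA device plugin has advertised per node;
# a non-empty nvidia.com/gpu value means pods can request GPUs.
# The dot in the resource name must be escaped in the JSONPath expression.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```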
NVIDIA GPU Operator automates the lifecycle management of the software required to use GPUs with Kubernetes. It takes care of the complexity that arises from managing the lifecycle of special resources like GPUs. It also handles all the configuration steps required to provision NVIDIA GPUs, making them as easy to scale as other resources. Advanced features of GPU Operator allow for better performance, higher utilization, and access to GPU telemetry. Certified and validated for compatibility with industry leading Kubernetes solutions, GPU Operator allows organizations to focus on building applications, rather than managing Kubernetes infrastructure.
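The GPU Operator is typically deployed with Helm. A sketch of the installation, using the chart coordinates NVIDIA publishes on NGC (the release name and namespace are conventional, not required):

```shell
# Add the NVIDIA Helm repository and install the GPU Operator;
# the operator then deploys the driver, container toolkit, and
# device plugin on GPU nodes automatically.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator \
  nvidia/gpu-operator \
  -n gpu-operator --create-namespace
```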
The NVIDIA Network Operator simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster. The operator automatically installs the required host networking software - bringing together all the needed components to provide high-speed network connectivity. These components include the NVIDIA networking driver, Kubernetes device plugin, CNI plugins, IP address management (IPAM) plugin and others. The NVIDIA Network Operator works in conjunction with the NVIDIA GPU Operator to deliver high-throughput, low-latency networking for scale-out, GPU computing clusters. A Helm chart easily deploys the Network operator in a cluster to provision the host software on NVIDIA-enabled nodes.
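A sketch of the Helm-based deployment mentioned above, using the chart coordinates published on NGC (release name and namespace are illustrative, and production deployments normally supply a site-specific values file):

```shell
# Deploy the Network Operator, which provisions the networking driver,
# device plugin, CNI plugins, and IPAM plugin on NVIDIA-enabled nodes.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install network-operator nvidia/network-operator \
  -n nvidia-network-operator --create-namespace --wait
```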
The NVIDIA DOCA Driver for Networking is provisioned before network resources are available to the cluster. These software components allow NICs, SmartNICs, and DPUs to be used with Kubernetes.
GPUDirect® RDMA (GDR) technology is a BlueField-3 feature that unlocks high-throughput, low-latency network connectivity to feed GPUs with data. GPUDirect RDMA allows efficient, zero-copy data transfers between GPUs using the hardware engines in the BlueField-3 ASIC.
GPUDirect Storage (GDS) provides a direct data path between local or remote storage (such as NVMe or NVMe-oF) and GPU memory. BlueField-3 enables this direct communication in a distributed environment, where the GPU and the storage media are not hosted in the same enclosure. BlueField-3 GDS provides increased bandwidth, lower latency, and greater capacity between storage and GPUs. This is especially important as dataset sizes no longer fit into system memory and data I/O to the GPUs becomes the runtime bottleneck. Enabling a direct path alleviates this bottleneck for scale-out AI and data science workloads.
NVIDIA NIM™, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across workstations, data centers, and the cloud. Supporting a wide range of AI models, including open-source community and NVIDIA AI Foundation models, NVIDIA NIM ensures seamless, scalable AI inferencing, on-premises or in the cloud, leveraging industry-standard APIs.
The NVIDIA NIM Operator automates the deployment and lifecycle management of generative AI applications built with NVIDIA NIM microservices on Kubernetes. The NIM Operator delivers a better MLOps/LLMOps experience and improves performance by abstracting the deployment, configuration, and management of NIM microservices, allowing users to focus on the end-to-end application.
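As an illustrative sketch of this declarative model, a NIM microservice can be described as a custom resource that the operator reconciles. The exact CRD fields depend on the NIM Operator version, and the model repository, tag, and secret names below are placeholders:

```yaml
# Hypothetical NIMService custom resource; field names follow the
# NIM Operator CRD but should be checked against the installed version.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct  # placeholder
    tag: "1.0.0"                                     # placeholder
    pullSecrets:
      - ngc-secret          # placeholder NGC image-pull secret
  authSecret: ngc-api-secret  # placeholder secret holding the NGC API key
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
```

Applying a resource like this lets the operator create and manage the underlying Deployment, Service, and scaling behavior instead of the user wiring them up by hand.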
Supported versions of NVIDIA Infrastructure software for a given NVIDIA AI Enterprise Release can be found in the NVIDIA AI Enterprise Software Support Matrix.
Note
NVIDIA AI Enterprise includes additional software for building and running applications that is intended to run on top of the NVIDIA AI Enterprise Infrastructure software. A complete list of NVIDIA AI Enterprise-supported software can be found on NGC.
Deployment Software#
The NVIDIA Cloud Native Stack (CNS) provides the tooling necessary for the rapid deployment of NVIDIA AI Enterprise Infrastructure Software, such as NVIDIA Operators running on upstream Kubernetes.
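A hedged sketch of bootstrapping Cloud Native Stack from its public playbooks (the repository path is the one published on GitHub; the branch, values files, and installer options vary by CNS release):

```shell
# Fetch the Cloud Native Stack playbooks and run the installer;
# review the values files for your environment before installing.
git clone https://github.com/NVIDIA/cloud-native-stack.git
cd cloud-native-stack/playbooks
bash setup.sh install
```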
Cloud Native Stack enables developers to build, test, and run GPU-accelerated containerized applications. These applications can work seamlessly in production on enterprise Kubernetes-based platforms, such as NVIDIA Fleet Command, Red Hat OpenShift, and VMware vSphere with Tanzu.
Additionally, enterprises can leverage Base Command Manager Essentials for software deployment, management, and lifecycle operations. The software is consistent across hardware platforms and scales from a single node to many.