# NVIDIA Software for Container as a Service

## Provisioning Kubernetes

There are multiple ways to install K8s. NVIDIA provides two methods,
but numerous do-it-yourself open-source and third-party vendor
solutions for installing K8s on a cluster are also available; these are
documented in the upstream
[Kubernetes documentation](https://kubernetes.io/docs/setup/production-environment/tools/) or in the vendors' official documentation. K8s can
also be installed on provisioned compute nodes using Base Command
Manager Essentials, as documented in the
[Containerization Manual, Section 4.2 – Kubernetes Setup](https://support.brightcomputing.com/manuals/10/containerization-manual.pdf). Base Command
Manager Essentials handles deployment of all K8s components, including:

* Networking fundamentals
* Container Network Interface (CNI)
* `kubeadm` components
* CoreDNS
* NGINX ingress controllers
* Dashboard
* Metrics server

Base Command Manager also installs Helm as a package manager to
streamline workload orchestration.
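
In Base Command Manager, that installation is typically driven from the
head node with the `cm-kubernetes-setup` wizard. The sketch below is
illustrative only; the exact options and defaults depend on the BCM
release, so follow the Containerization Manual for your version.

```bash
# Run on the BCM head node (illustrative; consult the Containerization
# Manual for the options that match your BCM release and network layout).
cm-kubernetes-setup
```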

Using Base Command Manager (BCM) to install Kubernetes can create challenges when building hosting or multi-tenant environments for NCPs. BCM does not provide separation between the operator and tenant roles: a single management plane is used for both cluster provisioning and tenant workloads. BCM-managed Kubernetes is therefore well suited to single-tenant or dedicated clusters. For hosting use cases where operator and tenant boundaries must be enforced, the architecture described in the [Data Center View](/dsx/guides/ncp-software-reference-guide/data-center-architecture#data-center-view) section of this document must guide deployment choices.

NVIDIA provides a public repository called [Cloud Native Stack (CNS)](https://github.com/NVIDIA/cloud-native-stack/),
which installs K8s along with specific components that work with NVIDIA
GPUs. It is one of the easiest and fastest ways to deploy an AI-ready
K8s stack in test and PoC environments. CNS is not recommended for
production use, but because it lists all components validated together,
it can serve as a reference architecture and as a specification for
production deployments.
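
For illustration, a CNS installation on a test or PoC system generally
follows the pattern below. The playbook layout and `setup.sh` script
name are taken from the CNS repository and may change between releases,
so treat this as a sketch rather than an exact procedure.

```bash
# Hedged sketch of a CNS install on a test/PoC system; verify the
# current steps in the cloud-native-stack repository before running.
git clone https://github.com/NVIDIA/cloud-native-stack.git
cd cloud-native-stack/playbooks

# Edit the "hosts" inventory to point at the target node(s), then run
# the Ansible-driven installer.
bash setup.sh install
```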

## Dynamic Resource Allocation (DRA)

Dynamic Resource Allocation (DRA), generally available in Kubernetes
v1.34+, manages the allocation of devices such as GPUs (with RoCE NICs)
that require virtualization support, drivers, and sharing capabilities.
NVIDIA technologies such as Multi-Instance GPU (MIG) and Multi-Process
Service (MPS) further enhance GPU utilization by partitioning a single
physical GPU into independent instances, allowing resources to be
shared efficiently.
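
As an illustrative sketch, a workload requests a GPU through a
ResourceClaimTemplate instead of the classic `nvidia.com/gpu` extended
resource. The example assumes an NVIDIA DRA driver is installed and
publishes a device class named `gpu.nvidia.com`; device class names,
image tags, and API versions may differ in your environment.

```bash
# Hedged sketch of a DRA GPU request (resource.k8s.io/v1, Kubernetes v1.34+).
# Assumes an NVIDIA DRA driver exposing a "gpu.nvidia.com" device class.
kubectl apply -f - <<'EOF'
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  resourceClaims:
  - name: gpu                          # pod-level claim backed by the template above
    resourceClaimTemplateName: single-gpu
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative image tag
    command: ["nvidia-smi"]
    resources:
      claims:
      - name: gpu                      # bind the allocated device to this container
EOF
```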

## IMEX (Multi-Node NVLink)

The Kubernetes scheduler is evolving to better support GPU sharing and
multi-tenancy, with APIs for remote (non-node-local) resources such as
multi-node NVLink, accessed via IMEX. IMEX enables Kubernetes to schedule
workloads that span multiple nodes connected via NVLink (for example,
GB200 NVL72).
This is critical for:

* Large model training requiring more GPUs than a single node
* Disaggregated inference with GPU pooling
* Topology-aware gang scheduling for collective operations

IMEX requires NMX-M for NVLink partition management in multi-tenant environments.
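
In Kubernetes terms, a multi-node NVLink domain is expressed through
the NVIDIA DRA driver for GPUs. The sketch below assumes that driver's
`ComputeDomain` custom resource, whose API group, version, and fields
are taken from the driver's documentation and may change between
releases; workload pods join the IMEX domain by referencing the
generated ResourceClaimTemplate.

```bash
# Hedged sketch: a ComputeDomain spanning two NVLink-connected nodes
# (for example, GB200 NVL72). Assumes the NVIDIA DRA driver for GPUs is
# installed; the CRD group/version shown here may differ in your release.
kubectl apply -f - <<'EOF'
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: training-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: training-domain-channel   # workload pods reference this template to join the IMEX domain
EOF
```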

## GPU Operator

To ensure containers have all required drivers, libraries, and runtimes,
NVIDIA provides the [GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html) as part of the K8s framework to simplify
deployment and lifecycle management. The GPU Operator is required to
use these clusters with NVCF and, by extension, NIMs in K8s.

The GPU Operator can be installed automatically by Base Command Manager
Essentials on a new cluster ([Containerization Manual section 4.3.2](https://support.brightcomputing.com/manuals/10/containerization-manual.pdf)) or
added afterward to an existing cluster ([Containerization Manual
section 4.3.3](https://support.brightcomputing.com/manuals/10/containerization-manual.pdf)). The operator can also be installed using traditional
Kubernetes methods such as [Helm charts](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/25.3.2/getting-started.html).
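
For reference, a Helm-based installation typically looks like the
following; the release name and namespace are illustrative, and the
getting-started guide linked above covers version pinning and
cluster-specific options.

```bash
# Hedged sketch of a Helm install of the GPU Operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Deploys the driver, container toolkit, device plugin, DCGM exporter,
# and related components managed by the operator.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
```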

## Container Toolkit

For container environments that use Docker for orchestration, the
NVIDIA® Container Toolkit enables containers to be deployed with access
to GPUs. The NVIDIA Container Toolkit is [installed](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) using traditional
Linux package managers.
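
For example, on an apt-based distribution the toolkit is installed from
NVIDIA's package repository and then registered as a Docker runtime.
The steps below follow the install guide linked above and should be
adapted for other distributions and container runtimes.

```bash
# Hedged sketch for Debian/Ubuntu; see the install guide for other distributions.
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```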

## Network Operator

NVIDIA Network Operator simplifies the provisioning and management of
NVIDIA networking resources in a Kubernetes cluster. It offers [support](https://docs.nvidia.com/networking/display/kubernetes2570/index.html#networking-features)
for RDMA, SR-IOV, and driver management. The network operator
specifically targets the [NVIDIA® ConnectX®-6, ConnectX®-7, and BlueField®](https://docs.nvidia.com/networking/display/kubernetes2570/platform-support.html)
families of NICs.
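
As with the GPU Operator, a Helm-based deployment is the common path.
The sketch below uses the chart and namespace names from the Network
Operator documentation; NIC cluster policy configuration and other
values are omitted.

```bash
# Hedged sketch of a Helm install of the Network Operator.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install network-operator nvidia/network-operator \
  --namespace nvidia-network-operator --create-namespace --wait
```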