# Container-as-a-Service

## Kubernetes

[Kubernetes](https://kubernetes.io/) (K8s) is used as the workload orchestration
engine, and this reference architecture (RA) is built around a K8s architecture.

Kubernetes is a container orchestration tool that has become a de facto
standard for operating cloud environments at scale due to its
flexibility and scalability. As defined by the Cloud Native Computing
Foundation (CNCF), Cloud Native technologies enable organizations to
operate scalable applications within modern, dynamic environments using
containers, service meshes, microservices, immutable infrastructure, and
declarative APIs. This approach yields resilient, manageable, and
observable systems, facilitating predictable adaptability with minimal
operational overhead. For AI infrastructure providers, Kubernetes offers
the foundational abstraction that can support the demanding scale of
ML/AI inference and training workloads.

Kubernetes is used both to train and to deploy AI models packaged as
containers that follow Open Container Initiative (OCI)-compliant
standards, such as NVIDIA NIM™ microservices. Containerization is
particularly vital for AI workloads because different models require
distinct, potentially conflicting dependencies; isolating those
dependencies within containers keeps model deployments flexible.

By default, K8s uses the `containerd` container runtime. The operator
may further choose to support a deployment option in which workloads
that require the strictest isolation run on dedicated nodes backed by
dedicated physical hosts.
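
One common way to realize such dedicated-node isolation is the standard
Kubernetes taint/toleration pattern. The following is a minimal sketch;
the `dedicated` taint key and `tenant-a` value follow the example
convention in the upstream Kubernetes documentation and are illustrative,
not part of this RA.

```yaml
# Sketch: pinning strict-isolation workloads to dedicated hosts.
# The operator would first taint the dedicated nodes, e.g.:
#   kubectl taint nodes <node> dedicated=tenant-a:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
spec:
  # Only land on nodes labeled for this tenant...
  nodeSelector:
    dedicated: tenant-a
  # ...and tolerate the taint that keeps other tenants' pods off them.
  tolerations:
    - key: dedicated
      operator: Equal
      value: tenant-a
      effect: NoSchedule
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative image
```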

The tenant's K8s cluster is configured to retrieve code and data from
the NVIDIA NGC™ Container Registry content that the respective tenant
has access to. To enable this, the tenant provides its NVIDIA AI
Enterprise access credentials to the Operator's K8s cluster
provisioning process.
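
As a minimal sketch, the provisioning process can materialize those
credentials as a standard image pull Secret for `nvcr.io`; NGC API keys
use the literal username `$oauthtoken`. The Secret name, namespace, and
the `<NGC_API_KEY>` placeholder below are illustrative.

```yaml
# Sketch: NGC credentials as an image pull Secret (illustrative names).
apiVersion: v1
kind: Secret
metadata:
  name: ngc-registry-secret     # hypothetical name
  namespace: tenant-workloads   # hypothetical namespace
type: kubernetes.io/dockerconfigjson
stringData:
  .dockerconfigjson: |
    {"auths": {"nvcr.io": {"username": "$oauthtoken", "password": "<NGC_API_KEY>"}}}
```

Pods in the tenant cluster then reference this Secret through their
`imagePullSecrets` field.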

Kubernetes (CaaS) offerings come in three flavors:

* **NCP-managed K8s**: The NCP operates a dedicated K8s cluster for each
  tenant, handling control plane lifecycle, upgrades, and scaling.
  Tenants receive kubeconfig access to their cluster.
* **ISV-managed K8s**: An ISV management platform provisions and
  manages K8s clusters on cloud-natively provisioned infrastructure or
  infrastructure managed by NVIDIA Infra Controller. The ISV handles
  multi-tenant orchestration, providing both operator and tenant portals.
* **Tenant-managed K8s**: The NCP provides bare metal or VMs; tenants
  install and manage their own K8s clusters. This model offers maximum
  flexibility but shifts operational burden to tenants.

Kubernetes satisfies two main use cases for GPU service providers:

* **Hosting K8s-native control planes that use Custom Resource Definitions (CRDs) for APIs**: Operators extend Kubernetes
  functionality, leveraging CRDs to manage services; the SDN and SDS
  controllers are typically hosted in K8s-native control planes. The
  overall orchestration should run in a dedicated per-tenant K8s cluster
  or behind a similar use-case-appropriate isolation mechanism.
* **Hosting observable, serviceable, and secure GPU workloads (training and inference)**:
  * **Observability**: Cloud Native tools such as OpenTelemetry and
    Prometheus are critical for monitoring load, access rates, response
    latency, and model performance to detect drift and ensure
    reliability.
  * **Serviceability** (Node Health detection and break/fix
    remediation): A healthy Kubernetes node reports `Ready=true` and is
    not cordoned (`unschedulable=false`); cordoning a node for repair
    marks it unschedulable. NCPs are expected to support break-fix
    procedures and sparing strategies, whether for individual GPU nodes
    or entire rack-scale fault domains.
  * **Security** (DevSecOps and Policy Enforcement): Containerizing AI
    models as OCI artifacts enables software supply chain best practices,
    including artifact signing, validation, and attestation. Policy
    enforcement tools like Kyverno ensure containerized workloads run
    with least privilege and comply with security policies (see the
    policy sketch after this list).
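
As a minimal sketch of such policy enforcement, the following Kyverno
ClusterPolicy rejects Pods that do not declare a non-root user. It is a
simplified variant of Kyverno's published non-root sample policy; the
policy name and message are illustrative.

```yaml
# Sketch: Kyverno validate policy requiring non-root Pods (simplified).
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-run-as-nonroot   # illustrative name
spec:
  validationFailureAction: Enforce   # block non-compliant Pods
  rules:
    - name: run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Pods must set spec.securityContext.runAsNonRoot to true."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```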

Kubernetes runs on compute resources provisioned by the infrastructure
(IaaS) layer. For managed-K8s offerings, the NCP or ISV operates the K8s
control plane and provides tenants with kubeconfig access to their
dedicated clusters.

### Capabilities Required

A GPU-optimized managed-Kubernetes offering provides the following capabilities:

* Abstracts K8s control-plane (CP) nodes so that cloud consumers need
  only specify the required high availability and/or scalability for
  the K8s control plane.
* Supports k8s version 1.34 or later, which enables Dynamic Resource
  Allocation (DRA) for flexible GPU sharing and allocation, plus support
  for IMEX in rack-scale GPU clusters.
* Allows cloud consumers to bring their own GPU-optimized node OS, or
  offers to provide one that integrates with other managed services.
* Supports managed node groups and/or cluster node autoscalers such as
  Karpenter.sh (see the NodePool sketch after this list).
* Provides industry-standard integration with high-performance storage
  (CSI) and networking (CNI) optimized for the cloud provider's storage
  and network services.
* Supports topology discovery for k8s worker nodes in rail-aligned
  clusters. This enables topology-aware gang-scheduling for distributed
  training and disaggregated inference workloads.
* Aligns with the CNCF, specifically:
  * Complies with CNCF certification for the K8s distribution
  * Complies with the CNCF's emerging Cloud Native AI conformance initiative
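
As a sketch of the node-autoscaling capability, a Karpenter NodePool for
GPU capacity might look like the following. The `nodeClassRef` shown
assumes the AWS provider purely for illustration; other clouds
substitute their own node class, and all names here are hypothetical.

```yaml
# Sketch: a Karpenter NodePool for GPU capacity (karpenter.sh/v1).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-pool                 # illustrative name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      # Provider-specific node class; EC2NodeClass assumes AWS and is
      # shown only as an example.
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-nodes
      # Keep general-purpose pods off expensive GPU nodes.
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
  # Cap the total GPU capacity this pool may provision.
  limits:
    nvidia.com/gpu: 64
```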

### K8s-Native ML/AI Frameworks and Tools

[NVIDIA AI
Enterprise](https://docs.nvidia.com/ai-enterprise/index.html) stands out
as a prime example of a Cloud Native AI (CNAI) tool for MLOps and
agentic apps, leveraging Kubernetes principles like declarative APIs,
composability, and portability. It implements individual microservices
for each stage of the ML lifecycle, using components like the Kubeflow
Training Operator for distributed training and the K8s-native NVIDIA
Dynamo for model serving.
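
For example, a distributed training job under the Kubeflow Training
Operator can be declared as a `PyTorchJob`. This is a minimal sketch;
the image tag, replica counts, and GPU counts are illustrative.

```yaml
# Sketch: Kubeflow Training Operator PyTorchJob (kubeflow.org/v1).
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: distributed-train        # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch      # the operator expects this container name
              image: nvcr.io/nvidia/pytorch:24.08-py3   # illustrative tag
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:24.08-py3
              resources:
                limits:
                  nvidia.com/gpu: 8
```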

For efficient ML/AI, advanced scheduling support is evolving through
projects like NVIDIA KAI and Grove. KEDA (Kubernetes Event-Driven
Autoscaling) is well suited for event-driven hosting, optimizing
resource usage and cost. Furthermore, general-purpose distributed
computing engines such as Ray, together with KubeRay, provide a unified
ML platform that complements the Cloud Native ecosystem by focusing on
computation; these projects collaborate extensively with the Kubernetes
community to enhance production ML pipeline performance and
significantly reduce inference costs. The integration of JupyterLab
with Kubernetes allows AI practitioners to iterate more quickly within
a familiar environment, abstracting away complex Kubernetes details.
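
As a sketch of event-driven scaling with KEDA, a ScaledObject can scale
an inference Deployment on request rate. The Deployment name, Prometheus
address, and query below are assumptions made for illustration.

```yaml
# Sketch: KEDA ScaledObject scaling on a Prometheus metric.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-autoscale        # illustrative name
spec:
  scaleTargetRef:
    name: inference-deployment     # hypothetical Deployment
  minReplicaCount: 1
  maxReplicaCount: 16
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # assumed endpoint
        query: sum(rate(http_requests_total{app="inference"}[2m]))
        threshold: "100"           # target requests/sec per replica
```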

### Kubernetes Architecture for ML/AI

The architectural strength of Kubernetes for ML/AI lies in its inherent
ability to orchestrate complex, distributed workflows efficiently. GPU
service providers must support the distinct needs of Generative AI,
which demands extremely high computational power from specialized
hardware, massive and diverse datasets for training, complex iterative
training, and highly scalable and elastic infrastructure for model
serving.

Dynamic Resource Allocation (DRA), now a GA API in K8s v1.34, offers
greater flexibility in managing specialized hardware.
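
As a minimal sketch of DRA usage, a workload claims a GPU through a
ResourceClaimTemplate rather than an extended resource. The device class
name `gpu.nvidia.com` follows the NVIDIA DRA driver's convention but
should be treated as an assumption here, as should the object names.

```yaml
# Sketch: requesting a GPU via DRA (resource.k8s.io/v1, GA in K8s 1.34).
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu                 # illustrative name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.nvidia.com   # assumed DeviceClass name
---
apiVersion: v1
kind: Pod
metadata:
  name: dra-gpu-pod
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu
  containers:
    - name: app
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # illustrative
      resources:
        claims:
          - name: gpu              # bind the claim to this container
```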

Kubernetes manages the allocation of GPUs (with RoCE NICs) that require
virtualization support, drivers, and sharing capabilities. NVIDIA®
technologies such as Multi-Instance GPU (MIG) and Multi-Process Service
(MPS) further enhance GPU utilization by partitioning a single physical
GPU into independent instances, thus allowing efficient sharing of
resources. For cost control and sustainability, right-sized resource
allocation and reactive scheduling are paramount, especially for
expensive and highly contested GPU accelerators.
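
For instance, with MIG enabled through the GPU Operator's mixed
strategy, partitions surface as named extended resources that pods can
request directly. This is a sketch; the exact resource name, such as
`nvidia.com/mig-1g.5gb`, depends on the GPU model and MIG profile.

```yaml
# Sketch: a Pod requesting one MIG slice exposed as an extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference              # illustrative name
spec:
  containers:
    - name: app
      image: nvcr.io/nvidia/tritonserver:24.08-py3   # illustrative tag
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # profile name varies by GPU model
```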

The Kubernetes scheduler is evolving to better support GPU sharing and
multi-tenancy, with APIs for remote (non-node-local) resources such as
Multi-Node NVLink accessed through IMEX.

High-performance storage is essential for Generative AI, handling
diverse data types and providing low-latency access. High-bandwidth and
low-latency networking is crucial for data transfer and model
synchronization during distributed training. The Kubernetes Container
Storage Interface (CSI) and Container Network Interface (CNI) provide
standard abstractions for integrating these services.
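
As a minimal sketch, tenants consume such storage through an ordinary
PersistentVolumeClaim; the StorageClass name is a placeholder for
whatever high-performance CSI driver the provider exposes.

```yaml
# Sketch: claiming shared high-performance storage via CSI.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data              # illustrative name
spec:
  accessModes:
    - ReadWriteMany                # shared across distributed workers
  storageClassName: high-perf-fs   # hypothetical CSI StorageClass
  resources:
    requests:
      storage: 10Ti
```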

By embracing Kubernetes, CSPs and NCPs can offer a robust, scalable, and
cost-efficient platform that addresses the unique and evolving demands
of the ML/AI landscape, providing a strategic advantage in a competitive
market.

## NVIDIA Software for Container-as-a-Service

NVIDIA provides container tools and Kubernetes operators to enable GPU
workloads:

### Container Platform

The [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html) is a collection of libraries and utilities
enabling users to build and run GPU-accelerated containers. It provides
the foundation for GPU access in containerized environments.

### Kubernetes Operators

NVIDIA provides Kubernetes operators that extend cluster functionality
through Custom Resource Definitions (CRDs):

**NVIDIA Kubernetes Operators**

| Kubernetes Operator | Function |
| ------------------- | -------- |
| GPU Operator        | Automates GPU driver, runtime, and device plugin deployment |
| Network Operator    | Provisions RDMA, SR-IOV, and GPUDirect networking resources |
| DPU Operator        | Manages BlueField DPU lifecycle, firmware, and DOCA runtime; coordinates DPU provisioning with the Network Operator |
| NIM Operator        | Automates deployment and lifecycle of NVIDIA NIM™ microservices for generative AI inference workloads |

These operators enable Kubernetes-native features such as Dynamic
Resource Allocation (DRA) for GPU sharing and IMEX for multi-node NVLink
scheduling. For detailed descriptions of each component, see [Part 2: NVIDIA Software for Container as a Service](/dsx/part-2-software-components/nvidia-software-for-container-as-a-service).