Container-as-a-Service#

Kubernetes#

Kubernetes (K8s) is used as the workload orchestration engine, and this RA is built around a K8s architecture.

Kubernetes is a container orchestration tool that has become a de facto standard for operating cloud environments at scale due to its flexibility and scalability. As defined by the Cloud Native Computing Foundation (CNCF), Cloud Native technologies enable organizations to operate scalable applications within modern, dynamic environments using containers, service meshes, microservices, immutable infrastructure, and declarative APIs. This approach yields resilient, manageable, and observable systems, facilitating predictable adaptability with minimal operational overhead. For AI infrastructure providers, Kubernetes offers the foundational abstraction that can support the demanding scale of ML/AI inference and training workloads.

Kubernetes is utilized for both training and deploying AI models when they are packaged into Open Container Initiative (OCI) compliant containers, such as NVIDIA NIM™ inference microservices. Containerization is particularly vital for AI workloads because different models require distinct, and potentially conflicting, dependencies; isolating these dependencies within containers provides flexibility in model deployments.
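As a minimal sketch of this pattern, a containerized model can be scheduled onto a GPU node with a standard Pod spec. The image reference below is an illustrative placeholder (not a published NIM image), and the `nvidia.com/gpu` extended resource assumes the NVIDIA device plugin or GPU Operator is installed in the cluster:

```yaml
# Sketch: scheduling a containerized inference microservice onto a GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
spec:
  containers:
    - name: nim
      image: nvcr.io/nim/example-model:latest   # illustrative placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU exposed by the NVIDIA device plugin / GPU Operator
```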

By default, the K8s clusters use the containerd container runtime. The operator may further choose to support a deployment option in which workloads requiring the strictest isolation run on dedicated nodes on dedicated physical hosts.

The tenant’s K8s cluster is configured to retrieve code and data from the NVIDIA NGC™ Container Registry that the respective tenant has access to. To enable this, the tenant provides their NVIDIA AI Enterprise access credentials to the operator’s K8s cluster provisioning process.
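One common way to wire these credentials in is a Kubernetes image pull secret for `nvcr.io`; NGC authenticates with the literal username `$oauthtoken` and the tenant’s NGC API key as the password. The secret and namespace names below are illustrative, and the payload is a placeholder:

```yaml
# Sketch: an image pull secret for the NGC container registry (nvcr.io).
# The data value is a placeholder; in practice it is the base64-encoded
# .dockerconfigjson containing username "$oauthtoken" and the NGC API key.
apiVersion: v1
kind: Secret
metadata:
  name: ngc-registry-secret      # illustrative name
  namespace: tenant-workloads    # illustrative namespace
type: kubernetes.io/dockerconfigjson
data:
  .dockerconfigjson: <base64-encoded docker config for nvcr.io>
```

Pods in the tenant namespace then reference the secret via `imagePullSecrets` to pull NGC-hosted images.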

For Kubernetes (CaaS) offerings, there are three flavors:

  • NCP-managed K8s: The NCP operates a dedicated K8s cluster for each tenant, handling control plane lifecycle, upgrades, and scaling. Tenants receive kubeconfig access to their cluster.

  • ISV-managed K8s: An ISV management platform that provisions and manages K8s clusters on cloud-native provisioned infrastructure or infrastructure managed by NVIDIA Bare Metal Manager. The ISV handles multi-tenant orchestration, providing both operator and tenant portals.

  • Tenant-managed K8s: The NCP provides bare metal or VMs; tenants install and manage their own K8s clusters. This model offers maximum flexibility but shifts operational burden to tenants.

Kubernetes satisfies two main use cases for GPU service providers:

  • Hosting K8s-native control planes that use Custom Resource Definitions (CRDs) for APIs: Operators extend Kubernetes’ functionality, leveraging CRDs to manage services. The SDN and SDS controllers are typically hosted by K8s-native control planes. The overall orchestration should run in a dedicated per-tenant K8s cluster or a similar use-case-appropriate isolation mechanism.

  • Hosting observable, serviceable, and secure GPU workloads (training and inference):

    • Observability: Cloud Native tools such as OpenTelemetry and Prometheus are critical for monitoring load, access rates, response latency, and model performance to detect drift and ensure reliability.

    • Serviceability (Node Health detection and break/fix remediation): Kubernetes nodes signal readiness through the Ready node condition, and unhealthy nodes can be cordoned (marked unschedulable) for repair. NCPs are expected to support break-fix procedures and sparing strategies, whether for individual GPU nodes or entire rack-scale fault domains.

    • Security (DevSecOps and Policy Enforcement): Containerizing AI models as OCI artifacts enables software supply chain best practices including artifact signing, validation, and attestation. Policy enforcement tools like Kyverno ensure containerized workloads run with least privilege and comply with security policies.
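As an illustration of the policy-enforcement point above, a Kyverno ClusterPolicy can reject privileged containers cluster-wide. This is a minimal sketch (the policy name is illustrative), not a complete least-privilege profile:

```yaml
# Sketch: a Kyverno validation policy denying privileged containers.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-gpu-workloads   # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not permitted."
        pattern:
          spec:
            containers:
              # anchor syntax: only validated if securityContext/privileged is set
              - =(securityContext):
                  =(privileged): "false"
```

A production profile would typically add companion rules (for example, requiring non-root users and dropped capabilities) following the same pattern.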

Kubernetes runs on compute resources provisioned by the infrastructure (IaaS) layer. For managed-K8s offerings, the NCP or ISV operates the K8s control plane and provides tenants with kubeconfig access to their dedicated clusters.

Capabilities Required#

A GPU-optimized managed Kubernetes offering has the following capabilities:

  • Abstracts K8s control plane nodes such that cloud consumers need only specify the required high availability and/or scalability for the K8s control plane.

  • Supports K8s version 1.34 or later, which enables Dynamic Resource Allocation (DRA) for flexible GPU sharing and allocation, plus support for IMEX in rack-scale GPU clusters.

  • Allows cloud consumers to bring their own GPU-optimized node OS, or offers to provide one that integrates with other managed services.

  • Supports managed node groups and/or cluster autoscaling, such as with Karpenter.

  • Provides industry-standard integration with high-performance storage (CSI) and networking (CNI) optimized for the cloud provider’s storage and network services.

  • Supports topology discovery for K8s worker nodes in rail-aligned clusters. This enables topology-aware gang scheduling for distributed training and disaggregated inference workloads.

  • Aligns with the Cloud Native Computing Foundation, specifically:

    • Complies with the CNCF Certified Kubernetes conformance program for the K8s distribution

    • Complies with the CNCF’s emerging Cloud Native AI conformance initiative
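As a sketch of the DRA capability listed above, a workload can request a GPU through a `ResourceClaimTemplate` (`resource.k8s.io/v1`, GA in K8s 1.34). The device class, claim, and image names below are illustrative assumptions; the classes actually available depend on the installed DRA driver (for example, NVIDIA’s DRA driver):

```yaml
# Sketch: requesting a GPU via Dynamic Resource Allocation.
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  name: single-gpu-claim   # illustrative name
spec:
  spec:
    devices:
      requests:
        - name: gpu
          exactly:
            deviceClassName: gpu.example.com   # illustrative device class
---
apiVersion: v1
kind: Pod
metadata:
  name: training-worker
spec:
  resourceClaims:
    - name: gpu
      resourceClaimTemplateName: single-gpu-claim
  containers:
    - name: trainer
      image: nvcr.io/example/trainer:latest   # illustrative image
      resources:
        claims:
          - name: gpu   # bind the DRA claim to this container
```

Unlike the fixed-count `nvidia.com/gpu` extended resource, DRA lets the driver express device attributes and sharing semantics, which the scheduler can match against claim requirements.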

K8s-Native ML/AI Frameworks and Tools#

NVIDIA AI Enterprise stands out as a prime example of a CNAI tool for MLOps and agentic apps, leveraging Kubernetes principles like declarative APIs, composability, and portability. It implements individual microservices for each stage of the ML lifecycle, using components like the Kubeflow Training Operator for distributed training and the Kubernetes-native NVIDIA Dynamo for model serving.

For efficient ML/AI, advanced scheduling support is evolving through projects such as NVIDIA KAI and Grove. KEDA (Kubernetes Event-Driven Autoscaling) is well suited for event-driven hosting, optimizing resource usage and cost. General-purpose distributed computing engines such as Ray, together with KubeRay, provide a unified ML platform that complements the Cloud Native ecosystem: Ray focuses on computation and collaborates closely with the Kubernetes community to improve production ML pipeline performance and significantly reduce inference costs. The integration of JupyterLab with Kubernetes lets AI practitioners iterate quickly within a familiar environment while abstracting away complex Kubernetes details.
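The KEDA event-driven hosting pattern mentioned above can be sketched with a `ScaledObject` that scales an inference Deployment on a Prometheus request-rate metric; the Deployment name, Prometheus address, query, and replica bounds are illustrative assumptions:

```yaml
# Sketch: event-driven autoscaling of an inference service with KEDA.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-autoscale   # illustrative name
spec:
  scaleTargetRef:
    name: inference-deployment   # illustrative Deployment to scale
  minReplicaCount: 0   # scale to zero when idle to free contested GPU capacity
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # illustrative address
        query: sum(rate(http_requests_total{app="inference"}[2m]))   # illustrative query
        threshold: "50"
```

Scale-to-zero is the main cost lever here: GPU replicas exist only while request traffic justifies them.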

Kubernetes Architecture for ML/AI#

The architectural strength of Kubernetes for ML/AI lies in its inherent ability to orchestrate complex, distributed workflows efficiently. GPU service providers must support the distinct needs of Generative AI, which demands extremely high computational power from specialized hardware, massive and diverse datasets for training, complex iterative training, and highly scalable and elastic infrastructure for model serving.

Dynamic Resource Allocation (DRA), now a GA API in K8s v1.34, offers greater flexibility in managing specialized hardware.

Kubernetes manages the allocation of GPUs (and their RoCE NICs), which require driver, virtualization, and sharing support. NVIDIA® technologies such as Multi-Instance GPU (MIG) and Multi-Process Service (MPS) further enhance GPU utilization by partitioning a single physical GPU into independent instances, allowing resources to be shared efficiently. For cost control and sustainability, right-sizing and responsive scheduling are paramount, especially for expensive and highly contested GPU accelerators.
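When MIG is enabled through the GPU Operator, each slice is exposed as its own extended resource that a Pod can request directly. This is a sketch: the exact resource name depends on the configured MIG profile and strategy (`nvidia.com/mig-1g.5gb` is one common example), and the image is an illustrative placeholder:

```yaml
# Sketch: consuming a single MIG slice exposed as an extended resource.
apiVersion: v1
kind: Pod
metadata:
  name: mig-consumer
spec:
  containers:
    - name: worker
      image: nvcr.io/example/worker:latest   # illustrative image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice, not a whole GPU
```

Several such Pods can then share one physical GPU, each isolated within its own MIG instance.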

The Kubernetes scheduler is evolving to better support GPU sharing and multi-tenancy, with APIs for remote (non-node-local) resources such as Multi-Node NVLink accessed through IMEX.

High-performance storage is essential for Generative AI, handling diverse data types and providing low-latency access. High-bandwidth, low-latency networking is crucial for data transfer and model synchronization during distributed training. The Kubernetes Container Storage Interface (CSI) and Container Network Interface (CNI) provide standard abstractions for this integration.
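On the storage side, workloads consume a CSI-backed storage class through an ordinary PersistentVolumeClaim; the class name below is an illustrative assumption that would map to the provider’s high-performance CSI driver:

```yaml
# Sketch: claiming shared high-performance storage for training data.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-dataset
spec:
  accessModes:
    - ReadWriteMany   # shared read access across distributed training workers
  storageClassName: high-perf-rdma   # illustrative storage class name
  resources:
    requests:
      storage: 10Ti
```

The same claim can be mounted by every worker Pod in a distributed training job, keeping dataset access uniform across the cluster.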

By embracing Kubernetes, CSPs and NCPs can offer a robust, scalable, and cost-efficient platform that addresses the unique and evolving demands of the ML/AI landscape, providing a strategic advantage in a competitive market.

NVIDIA Software for Container-as-a-Service#

NVIDIA provides container tools and Kubernetes operators to enable GPU workloads:

Container Platform#

The NVIDIA Container Toolkit is a collection of libraries and utilities enabling users to build and run GPU-accelerated containers. It provides the foundation for GPU access in containerized environments.

Kubernetes Operators#

NVIDIA provides Kubernetes operators that extend cluster functionality through Custom Resource Definitions (CRDs):

NVIDIA Kubernetes Operators#

| Kubernetes Operator | Function |
| --- | --- |
| GPU Operator | Automates GPU driver, runtime, and device plugin deployment |
| Network Operator | Provisions RDMA, SR-IOV, and GPUDirect networking resources |
| DPU Operator | Manages BlueField DPU lifecycle, firmware, and DOCA runtime; coordinates DPU provisioning with the Network Operator |
| NIM Operator | Automates deployment and lifecycle of NVIDIA NIM™ microservices for generative AI inference workloads |

These operators enable Kubernetes-native features such as Dynamic Resource Allocation (DRA) for GPU sharing and IMEX for multi-node NVLink scheduling. For detailed descriptions of each component, see Part 2: NVIDIA Software for Container as a Service.