Introduction
This document outlines a reference architecture for a cost-effective, performant, multi-tenant Kubernetes-as-a-Service infrastructure stack for large-scale AI training and inference workloads.
The reference architecture is designed to provide each tenant with a Kubernetes cluster for deploying AI workloads. Built-in concurrent multi-tenancy through physical host sharing improves resource utilization, thereby optimizing cost. The architecture incorporates NVIDIA's best practices for deploying performant, cost-effective AI inference and training workloads.
This reference software architecture provides guidance for optimizing AI inference and training performance in a virtualized environment where workloads can be deployed on either shared or dedicated physical hosts. NVIDIA Cloud Partners (NCPs) can work with their preferred vendors or open-source tools to implement security and workload-isolation best practices, operating a concurrent multi-tenant private cloud on the performance-optimized stack outlined in this reference architecture.
NVIDIA will certify the performance of third-party solutions so that NCPs can choose their partners with confidence.
Key NVIDIA AI Enterprise components included in this reference architecture:

- NVIDIA NIM, a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations.
- NVIDIA NeMo microservices, which provide an end-to-end workflow for model customization, enabling enterprises to adapt large language models to their specific needs efficiently.
- NVIDIA Kubernetes Operators:
  - GPU Operator, which standardizes GPU management in Kubernetes and enables better GPU performance, utilization, and telemetry.
  - Network Operator, which simplifies the provisioning and management of NVIDIA networking resources in a Kubernetes cluster.
  - NIM Operator, which automates the deployment, configuration, and management of NVIDIA NIM microservices for generative AI applications.
- NVIDIA GPU Drivers, which enable GPU-accelerated workloads to run on Kubernetes nodes.
- NVIDIA DOCA Software, including DOCA-OFED drivers and DOCA acceleration libraries and services, which enable accelerated networking for AI workloads.
- NVIDIA Cloud Functions (NVCF), a serverless API for deploying and managing AI workloads on GPUs with scalability, security, and reliability. Functions are accessible via HTTP polling, HTTP streaming, or gRPC, and Kubernetes integration is provided by the NVIDIA Cluster Agent (NVCA).
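As a concrete illustration of what the GPU Operator enables, a tenant workload in one of these clusters can request GPUs through the standard Kubernetes resource API; the operator's device plugin advertises the `nvidia.com/gpu` resource on each GPU node. The sketch below is a minimal example, not part of this reference architecture's manifests: the pod name is illustrative and the container image tag is an assumption to be replaced with an image available in your environment.

```yaml
# Minimal sketch: a pod requesting one GPU on a cluster managed by the
# NVIDIA GPU Operator. Pod name and image tag are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test          # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # example image tag
      command: ["nvidia-smi"]   # prints visible GPUs if scheduling succeeded
      resources:
        limits:
          nvidia.com/gpu: 1     # resource advertised by the operator's device plugin
```

The Kubernetes scheduler places the pod only on a node with a free GPU; no hostPath mounts or privileged settings are needed, because the GPU Operator handles driver and container-toolkit setup on the node.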