Kubernetes is an open-source container orchestration platform that makes the job of a DevOps engineer easier. Applications can be deployed on Kubernetes as logical units that are easy to manage, upgrade, and deploy with zero downtime (rolling upgrades) and high availability using replication. The NVIDIA GPU Operator is used to simplify the management of GPU resources in the Kubernetes cluster.
Upstream Kubernetes stops at the VM guest OS: the underlying NVIDIA GPU hardware is abstracted away, and the resulting restrictions on declaring GPU resources add complexity. VMware vSphere with Tanzu integrates Kubernetes directly with vSphere, which provides a complete orchestration solution. Because Tanzu is declarative, creating and interacting with a GPU-enabled cluster often requires fewer steps than with upstream Kubernetes, and it can be done on demand. Tanzu lets you create and operate Tanzu Kubernetes clusters natively in vSphere.
Helm is an application package manager that runs on top of Kubernetes. Helm is to Kubernetes what DEB/RPM packages are to Linux or what JAR/WAR archives are to Java-based applications. Helm charts help you define, install, and upgrade even the most complex Kubernetes applications. The NVIDIA Operators (GPU and Network) are provided as Helm charts.
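As a rough illustration of what installing an operator from a Helm chart involves, the sketch below assembles the usual Helm invocations as argument lists without running them. The repository URL, chart name, release name, and namespace are assumptions based on NVIDIA's public Helm repository and are not taken from this document. An NVIDIA AI Enterprise deployment may need additional chart values that are not shown here, so treat this as a starting point only.

```python
import shlex

# Assumed values for illustration; verify them against the NVIDIA
# documentation for your release before use.
REPO_NAME = "nvidia"
REPO_URL = "https://helm.ngc.nvidia.com/nvidia"
CHART = "nvidia/gpu-operator"
NAMESPACE = "gpu-operator"


def helm_commands(release="gpu-operator", namespace=NAMESPACE):
    """Return the Helm invocations as argv lists, without executing them."""
    return [
        # Register the chart repository and refresh the local index.
        ["helm", "repo", "add", REPO_NAME, REPO_URL],
        ["helm", "repo", "update"],
        # Install the chart into its own namespace and wait until it is ready.
        ["helm", "install", release, CHART,
         "--namespace", namespace, "--create-namespace", "--wait"],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for argv in helm_commands():
        print(shlex.join(argv))
```

Building the commands as lists rather than as one shell string keeps each argument separate, which makes it easier to review or adjust individual flags before anything is executed against a cluster.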
The GPU Operator allows DevOps engineers to manage GPU nodes in a Kubernetes cluster just like CPU nodes. Instead of providing a special OS image for GPU nodes, administrators can deploy a standard OS image for both CPU and GPU nodes and then rely on the GPU Operator to provide the required software components for GPUs.
The GPU Operator is packaged as a Helm chart. It installs and manages the lifecycle of the software components needed to run GPU-accelerated applications on Kubernetes.
The components are as follows:
GPU Feature Discovery, which labels each worker node based on its GPU specifications. This lets customers select the GPU resources their applications require at a finer granularity.
The NVIDIA AI Enterprise Guest Driver.
Kubernetes Device Plugin, which advertises the GPU to the Kubernetes scheduler.
NVIDIA Container Toolkit, which allows users to build and run GPU-accelerated containers. The toolkit includes a container runtime library and utilities that automatically configure containers to use NVIDIA GPUs.
DCGM Monitoring allows monitoring of GPUs on Kubernetes.
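To show how two of these components surface to an application, the following minimal sketch builds a Pod manifest that requests a GPU through the resource advertised by the device plugin and, optionally, pins the Pod to a GPU model using a label written by GPU Feature Discovery. The resource name nvidia.com/gpu and the label key nvidia.com/gpu.product follow NVIDIA's published conventions. The pod name, image, and product value are hypothetical placeholders, not values from this document.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"               # resource advertised by the device plugin
GPU_PRODUCT_LABEL = "nvidia.com/gpu.product"  # node label set by GPU Feature Discovery


def gpu_pod_manifest(name, image, gpus=1, gpu_product=None):
    """Build a Pod manifest (as a dict) that requests `gpus` NVIDIA GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    manifest = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                # GPUs are requested under limits, like any extended resource.
                "resources": {"limits": {GPU_RESOURCE: gpus}},
            }],
        },
    }
    if gpu_product is not None:
        # Only schedule onto nodes whose GPU model label matches.
        manifest["spec"]["nodeSelector"] = {GPU_PRODUCT_LABEL: gpu_product}
    return manifest


if __name__ == "__main__":
    # Hypothetical name, image, and product label value, for illustration only.
    pod = gpu_pod_manifest(
        "cuda-test",
        "example.com/cuda-sample:latest",
        gpu_product="NVIDIA-A100-SXM4-40GB",
    )
    print(json.dumps(pod, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML. The actual label values present on a node depend on the installed GPUs and can be checked with kubectl describe node.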
How does the NVIDIA GPU Operator help IT Infrastructure Teams?
The NVIDIA GPU Operator enables DevOps teams to manage the lifecycle of GPUs used with Kubernetes at the cluster level, so there is no need to manage each node individually.
Without the GPU Operator, infrastructure teams had to maintain two operating system images, one for GPU nodes and one for CPU nodes. With the GPU Operator, they can use the CPU image on GPU worker nodes as well.
It allows customers to run GPU-accelerated applications on immutable operating systems.
Node provisioning is faster because the GPU Operator detects newly added GPU-accelerated Kubernetes worker nodes and then automatically installs all software components required to run GPU-accelerated applications.
The GPU Operator is a single tool for managing all the Kubernetes components for GPUs (GPU Device Plugin, GPU Feature Discovery, GPU Monitoring Tools, NVIDIA Runtime).
It is important to note that the GPU Operator installs the NVIDIA AI Enterprise Guest Driver.
The NVIDIA Network Operator leverages Kubernetes custom resources and the Operator framework to configure fast networking, RDMA, and GPUDirect. The Network Operator’s goal is to install the host networking components required to enable RDMA and GPUDirect in a Kubernetes cluster. It configures a high-speed data path for IO-intensive workloads on a secondary network in each cluster node.