Your Accelerate Spark 3 LaunchPad instance was deployed using Kubernetes and NVIDIA’s Cloud Native Core, a collection of software for running cloud-native workloads on NVIDIA GPUs. If you are new to Kubernetes, this appendix provides a primer.
Kubernetes is an open-source container orchestration platform that simplifies the work of a DevOps engineer. Applications are deployed on Kubernetes as logical units that are easy to manage and upgrade, with zero-downtime rolling upgrades and high availability through replication. NVIDIA AI Enterprise applications are distributed as containers and can be deployed cloud-natively on Kubernetes; deploying Triton Inference Server this way offers these same benefits to AI in the Enterprise. The NVIDIA GPU Operator is leveraged to easily manage GPU resources in the cluster.
The GPU Operator allows Kubernetes cluster administrators to manage GPU nodes just like CPU nodes in the cluster. AI practitioners do not need to install the GPU Operator themselves; that is handled by the DevOps administrator maintaining the cluster. The GPU Operator has already been installed on your cluster for this lab.
Kubernetes does not run containers directly; instead, it wraps one or more containers into a higher-level construct called a Pod. A Kubernetes Pod is a group of one or more containers with shared storage and network resources, plus a specification for how to run those containers. Pods, in turn, are typically managed through a layer of abstraction called a Deployment. With a Deployment, you do not have to handle Pods manually: the Deployment creates and destroys Pods dynamically, managing a set of Pods as a replica set.
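As a minimal sketch, a Pod manifest might look like the following. The names and image here are illustrative placeholders, not resources from this lab:

```yaml
# Hypothetical minimal Pod: one container with shared Pod networking.
apiVersion: v1
kind: Pod
metadata:
  name: web-pod          # placeholder name
  labels:
    app: web             # label used later to select this Pod
spec:
  containers:
  - name: web
    image: nginx:1.25    # illustrative container image
    ports:
    - containerPort: 80
```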
Multiple replicas of the same Pod provide high availability. Because the Deployment is replicated, if one Triton Inference Server Pod fails, the other replica Pods in the Deployment can continue serving end users. Rolling updates let a Deployment be updated, for example to upgrade the application, with zero downtime.
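A replicated Deployment of this kind can be sketched as a manifest such as the one below; the name, labels, image, and replica count are hypothetical, chosen only to illustrate the structure:

```yaml
# Illustrative Deployment keeping 3 identical Pod replicas running.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3              # Kubernetes replaces failed Pods to maintain this count
  selector:
    matchLabels:
      app: web             # must match the Pod template labels below
  template:                # the Pod template the Deployment stamps out
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: nginx:1.25  # illustrative image
        ports:
        - containerPort: 80
```

Updating the image in this manifest and re-applying it triggers a rolling update: new Pods are brought up before old ones are terminated.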
Each Pod gets its own IP address. However, the set of Pods backing an application at one moment can differ from the set running a moment later: when a Pod fails, Kubernetes replaces it with a new Pod that has a different IP address. This leads to a problem: if some set of Pods (call them “backends”) provides functionality to other Pods (call them “frontends”) inside your cluster, how do the frontends discover and keep track of which IP address to connect to, so that the frontend can use the backend part of the workload?
Kubernetes solves this with Services. A Service inside a Kubernetes cluster maintains a stable IP address, so clients can always point at it, and the Service relays each request to one of the backing Pods.
There are several types of Kubernetes Services, which differ in how they expose the service:
ClusterIP: A ClusterIP service exposes the service on a cluster-internal IP. As a result, the service is only reachable from inside the cluster.
NodePort: A NodePort service exposes the service on each Node’s IP at a static port. The NodePort service can be contacted from outside the cluster.
LoadBalancer: A LoadBalancer service exposes the service outside of the cluster through an external load balancer, typically provisioned by a cloud provider.
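For example, a NodePort Service that selects backing Pods by label might be declared as follows; the name, label, and port numbers are illustrative, not from this lab:

```yaml
# Illustrative NodePort Service routing traffic to Pods labeled app: web.
apiVersion: v1
kind: Service
metadata:
  name: web-service
spec:
  type: NodePort
  selector:
    app: web           # forwards to any Pod carrying this label
  ports:
  - port: 80           # stable port on the Service's cluster IP
    targetPort: 80     # container port on the backing Pods
    nodePort: 30080    # static port opened on every node (30000-32767 range)
```

Because the Service matches Pods by label rather than by IP, frontends keep a single stable address even as the Deployment replaces Pods underneath.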
An application deployed on Kubernetes can comprise multiple Pods, Services, and Deployment objects (microservices). For example, this lab has a training Jupyter notebook pod, a Triton Inference Server pod, and a client application pod, along with a Service for each pod and objects for Ingress, NGC secrets, and so on. All of these individual parts of an application can be neatly packaged into a Helm chart and deployed on a Kubernetes cluster as a single-step install. A Helm chart is to Kubernetes what an apt package is to Ubuntu.
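To make the analogy concrete, a Helm chart is essentially a directory of templated Kubernetes manifests plus metadata. The layout below is a generic sketch, not this lab’s actual chart:

```
mychart/
  Chart.yaml        # chart metadata: name, version, description
  values.yaml       # default configuration values users can override
  templates/        # Kubernetes manifests with Go templating
    deployment.yaml
    service.yaml
    ingress.yaml
```

The whole application can then be installed with a single command, such as `helm install my-release ./mychart`, and removed just as easily with `helm uninstall my-release`.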