AKS (Azure Kubernetes Service)#

NVIDIA AI Enterprise 4.1 or later

Overview#

Azure Kubernetes Service (AKS) is a managed service for running Kubernetes in the Azure cloud. NVIDIA AI Enterprise, the end-to-end software of the NVIDIA AI platform, is supported to run on AKS. In the cloud, AKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks. This guide provides details for deploying and running NVIDIA AI Enterprise on AKS clusters with GPU-accelerated nodes.

Note

The NVIDIA Terraform Modules offer an easy way to deploy Managed Kubernetes clusters that can be supported by NVIDIA AI Enterprise when used with supported OS and GPU Operator versions.
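For reference, a typical workflow with the Terraform modules looks like the sketch below. The repository URL and directory layout shown are assumptions; consult the NVIDIA Terraform Modules documentation for the exact module inputs.

git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
cd nvidia-terraform-modules/aks
# Review and set the module variables (for example in terraform.tfvars), then:
terraform init
terraform plan
terraform apply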

Prerequisites#

Create Azure Kubernetes Service (AKS) Cluster#

Log in to the Azure CLI:

az login --use-device-code

Navigate to the Azure Portal to find your Azure Subscription ID: go to Subscriptions, and the Subscription ID appears next to the subscription name. Then set the active subscription:

_images/cloud-aks-01.png

az account set --subscription <Subscription ID>
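If you prefer the CLI to the portal, you can also look up the Subscription ID directly; for example:

az account list --output table
az account show --query id --output tsv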

Install the Azure CLI AKS plugins, which will install kubectl.

az aks install-cli

Create a resource group in the location where you plan to create the AKS cluster. For the list of available locations, see https://learn.microsoft.com/en-us/azure/aks/availability-zones.

az group create --name nvidia-aks-cluster-rg --location <location>

Example:

az group create --name nvidia-aks-cluster-rg --location westus2

Create the AKS cluster, replacing <location> below with your chosen location.

az aks create -g nvidia-aks-cluster-rg -n aks-nvaie -l <location> --enable-node-public-ip --node-count 1 --generate-ssh-keys --node-vm-size Standard_NC4as_T4_v3 --nodepool-tags SkipGPUDriverInstall=true

Note

The example above uses a T4 node, but you can choose any NVIDIA GPU node from the sizes listed at https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu.
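If you prefer to keep GPU capacity in a dedicated node pool, you can also add one to an existing cluster. The pool name below is an example; the VM size and SkipGPUDriverInstall tag mirror the cluster-creation command above:

az aks nodepool add --resource-group nvidia-aks-cluster-rg --cluster-name aks-nvaie --name gpunp --node-count 1 --node-vm-size Standard_NC4as_T4_v3 --tags SkipGPUDriverInstall=true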

Verify that the Kubernetes service has been created in the Azure Portal.

_images/cloud-aks-02.png

Now get the Azure Kubernetes cluster credentials. Navigate to Kubernetes services, select your cluster, and click the Connect button shown below; the Connect pane explains how to retrieve the credentials.

_images/cloud-aks-03.png

Run the below command to retrieve the kubeconfig credentials to your local system.

az aks get-credentials --resource-group nvidia-aks-cluster-rg --name aks-nvaie

Run the below command to verify the node information.

kubectl get nodes -o wide

Example output:

NAME                                STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP    OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-nodepool1-21142003-vmss000000   Ready    agent   2m50s   v1.26.6   10.224.0.4    20.114.32.13   Ubuntu 22.04.3 LTS   5.15.0-1049-azure   containerd://1.7.5-1
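You can also confirm that the node is using the expected GPU VM size by displaying the standard instance-type label; for example:

kubectl get nodes -L node.kubernetes.io/instance-type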

Run the below command to verify that all pods are running:

kubectl get pods -A
NAMESPACE     NAME                                  READY   STATUS    RESTARTS   AGE
kube-system   azure-ip-masq-agent-j4bgl             1/1     Running   0          6m17s
kube-system   cloud-node-manager-nfw89              1/1     Running   0          6m17s
kube-system   coredns-76b9877f49-bv74d              1/1     Running   0          5m19s
kube-system   coredns-76b9877f49-rrlj7              1/1     Running   0          6m36s
kube-system   coredns-autoscaler-85f7d6b75d-vt47r   1/1     Running   0          6m36s
kube-system   csi-azuredisk-node-bxcd2              3/3     Running   0          6m17s
kube-system   csi-azurefile-node-vlnqm              3/3     Running   0          6m17s
kube-system   konnectivity-agent-75fb8dbd69-5lw87   1/1     Running   0          6m36s
kube-system   konnectivity-agent-75fb8dbd69-xxgkb   1/1     Running   0          6m36s
kube-system   kube-proxy-9q55z                      1/1     Running   0          6m17s
kube-system   metrics-server-c456c67cb-f72kc        2/2     Running   0          5m15s
kube-system   metrics-server-c456c67cb-lw8pn        2/2     Running   0          5m15s

Deploy the GPU Operator#

Run the below command to create a namespace on the AKS cluster.

kubectl create ns gpu-operator

Add the NVIDIA AI Enterprise Helm repository and update it with the below commands.

helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>
helm repo update
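To confirm that the repository was added and to list the NVIDIA AI Enterprise charts it provides, you can search it; for example:

helm search repo nvaie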

Create an NGC secret with your NGC API key in the “gpu-operator” namespace as shown below.

kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken --docker-password=<NGC-API-KEY> --docker-email=<your_email_id> -n gpu-operator

Create an empty gridd.conf file, then create a ConfigMap from it and your NVIDIA vGPU license client configuration token file, as shown below.
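If you do not already have a gridd.conf, create the empty file first; for example:

touch gridd.conf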

kubectl create configmap licensing-config   -n gpu-operator --from-file=./client_configuration_token.tok --from-file=./gridd.conf

Install the GPU Operator from the NGC Catalog, referencing the licensing ConfigMap and the NVIDIA AI Enterprise driver repository.

helm install gpu-operator nvaie/gpu-operator-4-0 --version 23.6.1 --set driver.repository=nvcr.io/nvaie,driver.licensingConfig.configMapName=licensing-config --namespace gpu-operator
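Once the command returns, you can confirm that the Helm release was deployed before moving on to verification; for example:

helm list -n gpu-operator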

Verify the GPU Operator Installation#

Verify that the NVIDIA GPU driver has loaded by using the below commands.

kubectl get pods -n gpu-operator
NAME                                                          READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-tgk44                                   1/1     Running     0          5m11s
gpu-operator-74759dfc4b-kk5ks                                 1/1     Running     0          6m13s
gpu-operator-node-feature-discovery-gc-7c8b8d65fd-kf4dz       1/1     Running     0          6m13s
gpu-operator-node-feature-discovery-master-56874d94b9-7qdmz   1/1     Running     0          6m13s
gpu-operator-node-feature-discovery-worker-plcg5              1/1     Running     0          6m13s
nvidia-container-toolkit-daemonset-48p26                      1/1     Running     0          5m11s
nvidia-cuda-validator-pxbt4                                   0/1     Completed   0          97s
nvidia-dcgm-exporter-qrwdp                                    1/1     Running     0          5m11s
nvidia-device-plugin-daemonset-9grxf                          1/1     Running     0          5m11s
nvidia-driver-daemonset-696m4                                 1/1     Running     0          5m36s
nvidia-operator-validator-zqvgh                               1/1     Running     0          5m11s

kubectl exec -it -n gpu-operator nvidia-driver-daemonset-696m4 -- nvidia-smi
Wed Oct 25 16:25:42 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.01             Driver Version: 535.129.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000001:00:00.0 Off |                  Off |
| N/A   34C    P8              15W /  70W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Note

The nvidia-driver-daemonset-xxxxx pod name will differ in your environment; substitute the name from your cluster in the above command when verifying the NVIDIA vGPU driver.
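You can also confirm that the GPU is exposed as an allocatable resource on the node; for example:

kubectl describe node | grep -i "nvidia.com/gpu"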

Run Sample NVIDIA AI Enterprise Container#

Create a docker-registry secret. This will be used in a custom YAML manifest to pull containers from the NGC Catalog.

kubectl create secret docker-registry regcred --docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken --docker-password=<YOUR_NGC_KEY> --docker-email=<your_email_id> -n default

Create a custom YAML file to deploy an NVIDIA AI Enterprise container and run sample training code.

nano pytorch-mnist.yaml

Paste the below contents into the file and save.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
  template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
        - name: pytorch-container
          image: nvcr.io/nvaie/pytorch-2-0:22.02-nvaie-2.0-py3
          command:
            - python
          args:
            - /workspace/examples/upstream/mnist/main.py
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
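Apply the manifest to create the deployment, using the file created in the previous step:

kubectl apply -f pytorch-mnist.yaml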

Check the status of the pod.

kubectl get pods

View the output of the sample MNIST training job.

kubectl logs -l app=pytorch-mnist

The output will look similar to this.

Train Epoch: 4 [55680/60000 (93%)]  Loss: 0.007223
Train Epoch: 4 [56320/60000 (94%)]  Loss: 0.029804
Train Epoch: 4 [56960/60000 (95%)]  Loss: 0.018922
Train Epoch: 4 [57600/60000 (96%)]  Loss: 0.037932
Train Epoch: 4 [58240/60000 (97%)]  Loss: 0.044342
Train Epoch: 4 [58880/60000 (98%)]  Loss: 0.046980
Train Epoch: 4 [59520/60000 (99%)]  Loss: 0.057098


Test set: Average loss: 0.0319, Accuracy: 9897/10000 (99%)
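When you are finished with the sample, you can optionally remove the deployment before tearing down the cluster:

kubectl delete -f pytorch-mnist.yaml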

Delete the AKS Cluster#

Run the below command to delete the AKS cluster.

az aks delete --resource-group nvidia-aks-cluster-rg --name aks-nvaie --yes
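If the resource group created earlier is no longer needed, you can delete it as well; note that this removes every resource it contains:

az group delete --name nvidia-aks-cluster-rg --yes --no-wait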