Azure Kubernetes Service (AKS) is a managed Kubernetes service for running Kubernetes in the Azure cloud. NVIDIA AI Enterprise, the end-to-end software of the NVIDIA AI platform, is supported on AKS. In the cloud, Azure AKS automatically manages the availability and scalability of the Kubernetes control plane nodes responsible for scheduling containers, managing application availability, storing cluster data, and other key tasks. This guide provides details for deploying and running NVIDIA AI Enterprise on AKS clusters with GPU-accelerated nodes.
The NVIDIA Terraform Modules offer an easy way to deploy managed Kubernetes clusters that are supported by NVIDIA AI Enterprise when used with a supported OS and GPU Operator version.
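If you prefer Terraform over the manual steps in this guide, the flow is roughly as follows. This is a minimal sketch only; the repository URL, module directory name, and variable handling are assumptions about the NVIDIA Terraform Modules layout and may differ from the published modules.
# Clone the NVIDIA Terraform Modules and deploy the AKS module (assumed layout)
git clone https://github.com/NVIDIA/nvidia-terraform-modules.git
cd nvidia-terraform-modules/aks
terraform init
terraform plan    # review the resources that will be created
terraform apply   # provisions an AKS cluster with GPU nodes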
The following prerequisites are required:
NVIDIA AI Enterprise 4.1 or later
NVIDIA AI Enterprise License via BYOL or a Private Offer
Helm installed
Azure Owner/Admin access to create AKS resources
Log in to the Azure CLI:
az login --use-device-code
Navigate to the Azure Portal to find your Azure Subscription ID: go to Subscriptions, and the Subscription ID is listed next to the subscription name. Set the active subscription with the below command.
az account set --subscription <Subscription ID>
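To confirm that the expected subscription is now active, you can optionally check the current account context:
az account show --output table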
Install the Azure CLI Kubernetes tools, which include kubectl.
az aks install-cli
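As an optional sanity check that the CLI tooling is in place, you can print the installed versions:
az version
kubectl version --client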
Create a resource group in the location (Azure region) where you plan to create the AKS cluster. See https://learn.microsoft.com/en-us/azure/aks/availability-zones for details on regions and availability zones.
az group create --name nvidia-aks-cluster-rg --location <location>
Example:
az group create --name nvidia-aks-cluster-rg --location westus2
Create the AKS cluster, replacing <location> in the command below.
az aks create -g nvidia-aks-cluster-rg -n aks-nvaie -l <location> --enable-node-public-ip --node-count 1 --generate-ssh-keys --node-vm-size Standard_NC4as_T4_v3 --nodepool-tags SkipGPUDriverInstall=true
The command above uses a T4 node (Standard_NC4as_T4_v3) as an example, but you can choose any of the NVIDIA GPU node sizes listed here: https://learn.microsoft.com/en-us/azure/virtual-machines/sizes-gpu
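To check which GPU VM sizes are actually offered in your chosen region (and are therefore valid values for --node-vm-size), you can query the VM SKUs. The Standard_NC prefix used here is just an example filter:
az vm list-skus --location <location> --size Standard_NC --resource-type virtualMachines --output table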
Verify that the Kubernetes service has been created in the Azure Portal.
To get the Azure Kubernetes cluster credentials, navigate to Kubernetes services, select the cluster, and click Connect; the portal displays the commands for retrieving the credentials.
Run the below command to merge the cluster credentials into your local kubeconfig.
az aks get-credentials --resource-group nvidia-aks-cluster-rg --name aks-nvaie
Run the below command to verify the node information.
kubectl get nodes -o wide
Example output:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
aks-nodepool1-21142003-vmss000000 Ready agent 2m50s v1.26.6 10.224.0.4 20.114.32.13 Ubuntu 22.04.3 LTS 5.15.0-1049-azure containerd://1.7.5-1
Run the below command to verify that all pods are running:
kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system azure-ip-masq-agent-j4bgl 1/1 Running 0 6m17s
kube-system cloud-node-manager-nfw89 1/1 Running 0 6m17s
kube-system coredns-76b9877f49-bv74d 1/1 Running 0 5m19s
kube-system coredns-76b9877f49-rrlj7 1/1 Running 0 6m36s
kube-system coredns-autoscaler-85f7d6b75d-vt47r 1/1 Running 0 6m36s
kube-system csi-azuredisk-node-bxcd2 3/3 Running 0 6m17s
kube-system csi-azurefile-node-vlnqm 3/3 Running 0 6m17s
kube-system konnectivity-agent-75fb8dbd69-5lw87 1/1 Running 0 6m36s
kube-system konnectivity-agent-75fb8dbd69-xxgkb 1/1 Running 0 6m36s
kube-system kube-proxy-9q55z 1/1 Running 0 6m17s
kube-system metrics-server-c456c67cb-f72kc 2/2 Running 0 5m15s
kube-system metrics-server-c456c67cb-lw8pn 2/2 Running 0 5m15s
Run the below command to create a namespace on the AKS cluster.
kubectl create ns gpu-operator
Add the NVIDIA AI Enterprise Helm repository with your NGC API key and update it with the below commands.
helm repo add nvaie https://helm.ngc.nvidia.com/nvaie --username='$oauthtoken' --password=<YOUR API KEY>
helm repo update
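To confirm the repository was added correctly and to see which chart versions are published, you can optionally search it:
helm search repo nvaie --versions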
Create an NGC secret with your NGC API key in the gpu-operator namespace as shown below.
kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken --docker-password=<NGC-API-KEY> --docker-email=<your_email_id> -n gpu-operator
Create an empty gridd.conf file, then create a ConfigMap containing the NVIDIA vGPU license token file as shown below.
kubectl create configmap licensing-config -n gpu-operator --from-file=./client_configuration_token.tok --from-file=./gridd.conf
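To verify that both files were added to the ConfigMap, you can optionally inspect it:
kubectl describe configmap licensing-config -n gpu-operator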
Install the GPU Operator from the Enterprise Catalog with the licensing ConfigMap and the NVIDIA AI Enterprise driver repository.
helm install gpu-operator nvaie/gpu-operator-4-0 --version 23.6.1 --set driver.repository=nvcr.io/nvaie,driver.licensingConfig.configMapName=licensing-config --namespace gpu-operator
Verify that the NVIDIA GPU driver has loaded with the below commands.
kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-tgk44 1/1 Running 0 5m11s
gpu-operator-74759dfc4b-kk5ks 1/1 Running 0 6m13s
gpu-operator-node-feature-discovery-gc-7c8b8d65fd-kf4dz 1/1 Running 0 6m13s
gpu-operator-node-feature-discovery-master-56874d94b9-7qdmz 1/1 Running 0 6m13s
gpu-operator-node-feature-discovery-worker-plcg5 1/1 Running 0 6m13s
nvidia-container-toolkit-daemonset-48p26 1/1 Running 0 5m11s
nvidia-cuda-validator-pxbt4 0/1 Completed 0 97s
nvidia-dcgm-exporter-qrwdp 1/1 Running 0 5m11s
nvidia-device-plugin-daemonset-9grxf 1/1 Running 0 5m11s
nvidia-driver-daemonset-696m4 1/1 Running 0 5m36s
nvidia-operator-validator-zqvgh 1/1 Running 0 5m11s
kubectl exec -it -n gpu-operator nvidia-driver-daemonset-696m4 -- nvidia-smi
Wed Oct 25 16:25:42 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.01 Driver Version: 535.129.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000001:00:00.0 Off | Off |
| N/A 34C P8 15W / 70W | 0MiB / 16384MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The nvidia-driver-daemonset-xxxxx pod name will be different in your environment; substitute your own pod name in the nvidia-smi command above to verify the NVIDIA vGPU driver. A selector-based alternative is sketched below.
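To avoid copying the pod name by hand, you can look up the driver pod by label and exec into whichever pod matches. The app=nvidia-driver-daemonset label is the one the GPU Operator normally applies to driver pods; treat it as an assumption and check your pods' labels if the selector returns nothing.
kubectl exec -it -n gpu-operator $(kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o name | head -n 1) -- nvidia-smi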
Create a docker-registry secret. This will be used in a custom yaml to pull containers from the Enterprise Catalog.
kubectl create secret docker-registry regcred --docker-server=nvcr.io/nvaie --docker-username=\$oauthtoken --docker-password=<YOUR_NGC_KEY> --docker-email=<your_email_id> -n default
Create a custom YAML file to deploy an NVIDIA AI Enterprise container and run sample training code.
nano pytorch-mnist.yaml
Paste the below contents into the file and save.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pytorch-mnist
  labels:
    app: pytorch-mnist
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pytorch-mnist
  template:
    metadata:
      labels:
        app: pytorch-mnist
    spec:
      containers:
        - name: pytorch-container
          image: nvcr.io/nvaie/pytorch-2-0:22.02-nvaie-2.0-py3
          command:
            - python
          args:
            - /workspace/examples/upstream/mnist/main.py
          resources:
            requests:
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1
      imagePullSecrets:
        - name: regcred
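Apply the manifest to create the deployment, assuming the file name used above:
kubectl apply -f pytorch-mnist.yaml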
Check the status of the pod.
kubectl get pods
View the output of the sample mnist training job.
kubectl logs -l app=pytorch-mnist
The output will look similar to this.
Train Epoch: 4 [55680/60000 (93%)] Loss: 0.007223
Train Epoch: 4 [56320/60000 (94%)] Loss: 0.029804
Train Epoch: 4 [56960/60000 (95%)] Loss: 0.018922
Train Epoch: 4 [57600/60000 (96%)] Loss: 0.037932
Train Epoch: 4 [58240/60000 (97%)] Loss: 0.044342
Train Epoch: 4 [58880/60000 (98%)] Loss: 0.046980
Train Epoch: 4 [59520/60000 (99%)] Loss: 0.057098
Test set: Average loss: 0.0319, Accuracy: 9897/10000 (99%)
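When you are finished with the sample, you can optionally remove it before tearing down the cluster, again assuming the manifest file name used above:
kubectl delete -f pytorch-mnist.yaml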
Run the below command to delete the AKS cluster.
az aks delete --resource-group nvidia-aks-cluster-rg --name aks-nvaie --yes
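If the resource group was created only for this walkthrough, you can optionally delete it as well; note that this removes every resource in the group, so confirm nothing else lives there first.
az group delete --name nvidia-aks-cluster-rg --yes --no-wait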