AKS GPU Setup

View as Markdown

Kubernetes Version Requirement

AICR requires Kubernetes 1.34 or later on AKS. This is driven by DRA (Dynamic Resource Allocation), which is included in every AICR recipe.

The core DRA APIs (resource.k8s.io) graduated to GA (stable v1) in Kubernetes 1.34. No AKS-specific feature flag is needed — DRA is enabled out of the box once you’re on 1.34+.

$# Create a cluster on 1.34
$az aks create \
> --resource-group <rg> \
> --name <cluster> \
> --kubernetes-version 1.34 \
> --enable-oidc-issuer \
> --enable-workload-identity \
> --enable-managed-identity \
> --generate-ssh-keys
$
$# Upgrade an existing cluster to 1.34
$az aks upgrade \
> --resource-group <rg> \
> --name <cluster> \
> --kubernetes-version 1.34

You can verify DRA is available after the upgrade:

$kubectl api-resources --api-group=resource.k8s.io

Expected output includes deviceclasses, resourceclaims, resourceclaimtemplates, and resourceslices.

Note: Kubernetes version skipping is not allowed. If your cluster is on 1.32, you must upgrade to 1.33 first, then to 1.34.

Dynamic Resource Allocation (DRA)

All AICR recipes include the nvidia-dra-driver-gpu component, which advertises GPUs via the Kubernetes DRA API instead of the legacy device plugin. DRA provides structured GPU device advertisement, claim-based allocation, and integration with gang scheduling.

Feature Gate Details

Kubernetes VersionDRA StatusFeature Gate
1.26–1.29AlphaDynamicResourceAllocation — off by default
1.30–1.33BetaDynamicResourceAllocation — on by default
1.34+GA / Stableresource.k8s.io/v1 — always enabled, no feature gate needed

On AKS 1.34, DRA is GA. You do not need to pass any custom API server flags or register an AKS preview feature.

CLI Override

You can control DRA settings when bundling:

$# Enable GPU resource advertisement (default)
$aicr bundle -r recipe.yaml --set dradriver:gpuResourcesEnabledOverride=true
$
$# Disable DRA GPU allocation (fall back to device plugin)
$aicr bundle -r recipe.yaml \
> --set dradriver:gpuResourcesEnabledOverride=false \
> --set dradriver:resources.gpus.enabled=false

Device Plugin vs DRA

Both device-plugin and DRA are enabled by default, but only one should be used per node. Using both concurrently causes GPU over-admission — both systems advertise all GPUs independently, so the scheduler may admit more GPU pods than physical GPUs available.

For DRA-only (recommended):

$aicr bundle -r recipe.yaml --set gpuoperator:devicePlugin.enabled=false

For device-plugin-only (legacy):

$aicr bundle -r recipe.yaml \
> --set dradriver:gpuResourcesEnabledOverride=false \
> --set dradriver:resources.gpus.enabled=false

GPU Driver Setup

AKS GPU nodepools install NVIDIA drivers by default. This conflicts with the GPU Operator, which also installs drivers by default. Use one of the approaches below to avoid the conflict.

Create nodepools with --gpu-driver none so AKS skips its driver installation and the GPU Operator handles it:

$az aks nodepool add \
> --cluster-name <cluster> \
> --resource-group <rg> \
> --name gpupool \
> --node-vm-size Standard_ND96isr_H100_v5 \
> --gpu-driver none \
> --node-count 1

No changes to AICR recipes are needed — this is the default configuration.

Standard_ND96isr_H100_v5 is the 8-GPU ND H100 v5 SKU. The AKS Dynamo inference throughput gate (inference-throughput) is a fixed absolute full-node floor calibrated on an 8-GPU H100 node, so this SKU is the supported happy path for that gate. Smaller NCads H100 SKUs (Standard_NC80adis_H100_v5 = 2 GPUs, Standard_NC40ads_H100_v5 = 1 GPU) run fine for deployment but will false-fail the throughput floor; gate on inference-ttft-p99 only on those until the per-GPU normalization in #1254 lands.

Alternative: Use the AKS-Managed Driver

If you prefer the AKS-managed driver (e.g., for driver version pinning by AKS), disable the GPU Operator driver:

$aicr bundle -r recipe.yaml --set gpuoperator:driver.enabled=false

Or add to your values override file:

1driver:
2 enabled: false

References