# Azure

## Overview

This guide covers deploying NIM LLM on Microsoft Azure, which supports two deployment paths:
| Deployment Path | Description |
|---|---|
| AKS (Azure Kubernetes Service) | Self-managed Kubernetes deployment using the NIM LLM Helm chart. Provides full control over infrastructure, including GPU node pools and persistent storage. |
| AI Foundry | Managed deployment using Azure ML Managed Online Endpoints. Simplifies infrastructure management with built-in health checks. |
## AKS Deployment

### Prerequisites

Install the following tools before proceeding:

- Azure CLI (`az`)
- kubectl
- Helm
- Optional: aks-preview extension for preview AKS features

Log in to Azure:

```bash
az login --use-device-code
```

You also need an NGC API key with access to NIM LLM container images and Helm charts.
### Create an AKS Cluster with GPU Nodes

Create a resource group and AKS cluster:

```bash
export RESOURCE_GROUP="nim-resource-group"
export REGION="southcentralus"
export AKS_NAME="nim-cluster"

az group create --name $RESOURCE_GROUP --location $REGION
az aks create --resource-group $RESOURCE_GROUP --name $AKS_NAME \
  --location $REGION --generate-ssh-keys
```
For detailed steps, refer to the AKS quickstart guide.
Add a GPU node pool. Skip automatic GPU driver installation so the NVIDIA GPU Operator manages drivers later in this guide.

```bash
az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $AKS_NAME \
  --name gpupool --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 \
  --gpu-driver none --node-osdisk-size 2048 --max-pods 110
```
**Note**

Azure CLI 2.72.2 or later is required for `--gpu-driver none`. On older CLI versions, use `--skip-gpu-driver-install` instead (deprecated after August 2025).

**Tip**

Choose a GPU-enabled VM size that matches your model requirements. For supported options, refer to GPU workloads in AKS.
Get the cluster credentials:

```bash
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_NAME \
  --overwrite-existing
```
### Install the NVIDIA GPU Operator

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
helm repo update
helm install --create-namespace --namespace gpu-operator \
  nvidia/gpu-operator --wait --generate-name
```
For more information, refer to the NVIDIA GPU Operator documentation.
### Create Kubernetes Secrets

Create an image pull secret for `nvcr.io` and a generic secret holding your NGC API key. The Docker username `$oauthtoken` is a literal string; the single quotes prevent shell expansion.

```bash
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NAMESPACE="nim-llm"
export IMAGE_PULL_SECRET="ngc-secret"

kubectl create namespace $NAMESPACE

kubectl create secret docker-registry $IMAGE_PULL_SECRET \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
```
For gated Hugging Face models, create an additional secret:

```bash
export HF_TOKEN="${YOUR_HF_TOKEN}"

kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="$HF_TOKEN"
```
### Deploy NIM LLM with Helm

Fetch the Helm chart from NGC:

```bash
export NIM_LLM_CHART_VERSION="1.0.0"  # Set to your desired chart version

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY
```

Optional: View the default chart values to understand available configuration options:

```bash
helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz
```
**Tip**

For help choosing the right model configuration for your `values.yaml`, refer to Model Profiles and Selection.

Deploy using a custom values file:

```bash
helm install my-nim nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
  --namespace $NAMESPACE \
  -f path/to/your/custom-values.yaml
```
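As a starting point, a minimal custom values file might look like the sketch below. The key names shown here (`image`, `imagePullSecrets`, `model.ngcAPISecret`, `persistence`, `resources`) reflect common nim-llm chart conventions but can change between chart versions; always verify them against `helm show values` for your version, and substitute the image for your model.

```yaml
# custom-values.yaml — minimal sketch; verify key names with
# `helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz`.
image:
  repository: nvcr.io/nim/meta/llama3-8b-instruct  # example image; use your model's
  tag: "1.0.0"
imagePullSecrets:
  - name: ngc-secret        # created in "Create Kubernetes Secrets"
model:
  ngcAPISecret: ngc-api     # created in "Create Kubernetes Secrets"
persistence:
  enabled: true             # cache downloaded model weights on a PVC
resources:
  limits:
    nvidia.com/gpu: 1       # match your node pool's GPU count per pod
```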
### Verify the Deployment

Get the service endpoint:

```bash
kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
```

Check the health endpoint (set `NIM_EXTERNAL_IP` from the service `EXTERNAL-IP`):

```bash
export NIM_EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')

curl -s "http://${NIM_EXTERNAL_IP}:8000/v1/health/ready"
```

Send an inference request:

```bash
curl -X POST "http://${NIM_EXTERNAL_IP}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```
## AI Foundry Deployment

### Prerequisites

- Azure CLI with the Azure ML extension:

  ```bash
  az extension add -n ml -y
  ```

- An Azure Machine Learning workspace. Create one with the CLI:

  ```bash
  az ml workspace create --name ${WORKSPACE_NAME} --resource-group ${RESOURCE_GROUP}
  ```

- An NGC API key with access to NIM LLM container images.

- The Azure ML Python SDK:

  ```bash
  pip install azure-ai-ml azure-identity
  ```

**Note**

Some Azure regions have limited GPU quota. If you encounter quota errors, refer to Troubleshoot online endpoints: OutOfQuota.
### Deploy with the Python SDK

The following steps deploy a NIM container as an Azure ML managed online endpoint.

**Tip**

The `nim_model_profile` variable determines which model configuration is used. For help choosing the right profile, refer to Model Profiles and Selection.

Create the ML client and endpoint:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Environment,
    OnlineRequestSettings,
    ProbeSettings,
)
from azure.identity import DefaultAzureCredential

subscription_id = "<your-subscription-id>"
resource_group = "<your-resource-group>"
workspace_name = "<your-workspace-name>"
endpoint_name = "<your-endpoint-name>"
acr_image = "<your-acr-or-ngc-nim-image-uri>"
ngc_api_key = "<your-ngc-api-key>"
nim_model_profile = "<your-nim-model-profile>"
instance_type = "Standard_NC24ads_A100_v4"

client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)

endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description=f"NIM deployment: {acr_image}",
    auth_mode="key",
)
client.online_endpoints.begin_create_or_update(endpoint).result()
```
Define the environment and deployment:

```python
env = Environment(
    name=f"nim-env-{endpoint_name}",
    image=acr_image,
    inference_config={
        "liveness_route": {"port": 8000, "path": "/v1/health/live"},
        "readiness_route": {"port": 8000, "path": "/v1/health/ready"},
        "scoring_route": {"port": 8000, "path": "/"},
    },
)

deployment = ManagedOnlineDeployment(
    name="nim",
    endpoint_name=endpoint_name,
    environment=env,
    instance_type=instance_type,
    instance_count=1,
    environment_variables={
        "NGC_API_KEY": ngc_api_key,
        "NIM_MODEL_PROFILE": nim_model_profile,
    },
    request_settings=OnlineRequestSettings(
        request_timeout_ms=90000,
        max_concurrent_requests_per_instance=1,
    ),
    liveness_probe=ProbeSettings(
        initial_delay=600, period=30, timeout=10, failure_threshold=30
    ),
    readiness_probe=ProbeSettings(
        initial_delay=600, period=30, timeout=10, failure_threshold=30
    ),
)
client.online_deployments.begin_create_or_update(deployment).result()
```
Route traffic to the deployment:

```python
endpoint = client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"nim": 100}
client.online_endpoints.begin_create_or_update(endpoint).result()
```

Retrieve the endpoint URL and key:

```python
endpoint_info = client.online_endpoints.get(endpoint_name)
keys = client.online_endpoints.get_keys(endpoint_name)
endpoint_url = endpoint_info.scoring_uri.rstrip("/")
api_key = keys.primary_key
```
### Verify the Deployment

Send an inference request to the endpoint:

```python
import httpx

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
payload = {
    "model": "meta/llama3.1-8b-instruct",  # or your deployed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

with httpx.Client(timeout=60) as http:
    response = http.post(
        f"{endpoint_url}/v1/chat/completions",
        headers=headers,
        json=payload,
    )
    response.raise_for_status()
    print(response.json())
```
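The `/v1/chat/completions` route returns an OpenAI-compatible response body. Under that assumption, a small helper like the sketch below (the name `extract_reply` is illustrative, not part of any SDK) pulls the assistant's reply text out of the parsed JSON:

```python
def extract_reply(response_json: dict) -> str:
    """Return the assistant's message text from an
    OpenAI-style chat-completions response body."""
    return response_json["choices"][0]["message"]["content"]


# Example with the shape a chat-completions call returns:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # Hello! How can I help?
```

You could call `extract_reply(response.json())` on the response above instead of printing the full JSON.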