Azure#
This guide covers deploying NIM LLM on Microsoft Azure, which supports two deployment paths:
| Deployment Path | Description |
|---|---|
| AKS (Azure Kubernetes Service) | Self-managed Kubernetes deployment using the NIM LLM Helm chart. Provides full control over infrastructure, including GPU node pools and persistent storage. |
| Microsoft Foundry | Managed deployment using Azure ML Managed Online Endpoints. Simplifies infrastructure management with built-in health checks. |
AKS Deployment#
To deploy NIM LLM on Azure AKS, create an AKS cluster with GPU nodes and prepare the environment for running NIM LLM workloads. This section walks you through prerequisites, cluster creation, and initial setup.
Prerequisites#
Install the following tools before proceeding:
Tip
If you use preview AKS features, also install the aks-preview extension.
Log in to Azure:
```shell
az login --use-device-code
```
You also need an NGC API key with access to NIM LLM container images and Helm charts.
Create an AKS Cluster with GPU Nodes#
Complete the following steps to create the AKS cluster, add a GPU node pool, and get the cluster credentials.
Create a resource group and AKS cluster:
```shell
export RESOURCE_GROUP="nim-resource-group"
export REGION="southcentralus"
export AKS_NAME="nim-cluster"

az group create --name $RESOURCE_GROUP --location $REGION
az aks create --resource-group $RESOURCE_GROUP --name $AKS_NAME \
  --location $REGION --generate-ssh-keys
```
For detailed steps, refer to the AKS quickstart guide.
Add a GPU node pool. Skip automatic GPU driver installation so the NVIDIA GPU Operator manages drivers later in this guide.
```shell
az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $AKS_NAME \
  --name gpupool --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 \
  --gpu-driver none --node-osdisk-size 2048 --max-pods 110
```
Note
Azure CLI 2.72.2 or later is required for `--gpu-driver none`. On older CLI versions, use `--skip-gpu-driver-install` instead (deprecated after August 2025).
Tip
Choose a GPU-enabled VM size that matches your model requirements. For supported options, refer to GPU workloads in AKS.
Get the cluster credentials:
```shell
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_NAME \
  --overwrite-existing
```
Install the NVIDIA GPU Operator#
Install the NVIDIA GPU Operator after the AKS cluster is ready.
Add the NVIDIA Helm repository:
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
```
Update the local Helm repository index:
```shell
helm repo update
```
Install the GPU Operator:
```shell
helm install --create-namespace --namespace gpu-operator \
  nvidia/gpu-operator --wait --generate-name
```
For more information, refer to the NVIDIA GPU Operator documentation.
Create Kubernetes Secrets#
Create the namespace and secrets that the Helm deployment uses.
Set the required environment variables:
```shell
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NAMESPACE="nim-llm"
export IMAGE_PULL_SECRET="ngc-secret"
```
Create the namespace:
```shell
kubectl create namespace $NAMESPACE
```
Create the NGC image pull secret:
```shell
kubectl create secret docker-registry $IMAGE_PULL_SECRET \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
```
Create the NGC API key secret:
```shell
kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
```
Optional: For gated Hugging Face models, create an additional secret.
Set the Hugging Face token:
```shell
export HF_TOKEN="${YOUR_HF_TOKEN}"
```
Create the secret:
```shell
kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="$HF_TOKEN"
```
Deploy NIM LLM with Helm#
To deploy NVIDIA NIM for LLMs with Helm, follow these steps:
Fetch the Helm chart from NGC:
```shell
export NIM_LLM_CHART_VERSION="1.0.0"  # Set to your desired chart version
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY
```
Optional: View the default chart values to understand available configuration options:
```shell
helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz
```
Tip
For help choosing the right model configuration for your `values.yaml`, refer to Model Profiles and Selection.
Deploy using a custom values file:
```shell
helm install my-nim nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
  --namespace $NAMESPACE \
  -f path/to/your/custom-values.yaml
```
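As an illustrative starting point, a minimal custom values file might look like the following. This is a sketch, not the authoritative schema: the image repository and tag are assumptions for an example model, and the secret names simply match the `ngc-secret` and `ngc-api` secrets created earlier. Always check `helm show values` output for your chart version before deploying.

```yaml
# custom-values.yaml -- illustrative values only; verify keys against
# `helm show values` for your chart version.
image:
  repository: nvcr.io/nim/meta/llama3.1-8b-instruct  # assumed model image
  tag: "1.0.0"                                       # assumed tag
imagePullSecrets:
  - name: ngc-secret      # matches $IMAGE_PULL_SECRET created above
model:
  ngcAPISecret: ngc-api   # matches the NGC API key secret created above
persistence:
  enabled: true           # cache model weights across pod restarts
  size: 50Gi
resources:
  limits:
    nvidia.com/gpu: 1
```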
Verify the Deployment#
Complete the following steps to confirm that the service is reachable and serving inference requests.
Get the service endpoint:
```shell
kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
```
Check the health endpoint (set `NIM_EXTERNAL_IP` from the service EXTERNAL-IP):

```shell
export NIM_EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
curl -s "http://${NIM_EXTERNAL_IP}:8000/v1/health/ready"
```
Send an inference request:
```shell
curl -X POST "http://${NIM_EXTERNAL_IP}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```
Microsoft Foundry Deployment#
To deploy NIM on Microsoft Foundry, provision the necessary Azure Machine Learning resources and prepare your environment for running NIM workloads. This section walks you through prerequisites, workspace setup, and deployment of NIM as a managed online endpoint.
Prerequisites#
Before deploying NIM to Microsoft Foundry, make sure you have the following:
The Azure CLI and Azure ML extension:
```shell
az extension add -n ml -y
```
An Azure Machine Learning workspace:
```shell
az ml workspace create --name ${WORKSPACE_NAME} --resource-group ${RESOURCE_GROUP}
```
An NGC API key for pulling NIM container images and downloading model artifacts.
The Azure ML Python SDK packages:

```shell
pip install azure-ai-ml azure-identity
```
Note
Some Azure regions have limited GPU quota. If you encounter quota errors, refer to Troubleshoot online endpoints: OutOfQuota.
Deploy with the Python SDK#
The following steps deploy a NIM container as an Azure ML managed online endpoint.
Tip
The nim_model_profile variable determines which model configuration is used. For help choosing the right profile, refer to Model Profiles and Selection.
Create the ML client and endpoint:
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Environment,
    OnlineRequestSettings,
    ProbeSettings,
)
from azure.identity import DefaultAzureCredential

subscription_id = "<your-subscription-id>"
resource_group = "<your-resource-group>"
workspace_name = "<your-workspace-name>"
endpoint_name = "<your-endpoint-name>"
acr_image = "<your-acr-or-ngc-nim-image-uri>"
ngc_api_key = "<your-ngc-api-key>"
nim_model_profile = "<your-nim-model-profile>"
instance_type = "Standard_NC24ads_A100_v4"

client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)

endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description=f"NIM deployment: {acr_image}",
    auth_mode="key",
)
client.online_endpoints.begin_create_or_update(endpoint).result()
```
Define the environment and deployment:
```python
env = Environment(
    name=f"nim-env-{endpoint_name}",
    image=acr_image,
    inference_config={
        "liveness_route": {"port": 8000, "path": "/v1/health/live"},
        "readiness_route": {"port": 8000, "path": "/v1/health/ready"},
        "scoring_route": {"port": 8000, "path": "/"},
    },
)

deployment = ManagedOnlineDeployment(
    name="nim",
    endpoint_name=endpoint_name,
    environment=env,
    instance_type=instance_type,
    instance_count=1,
    environment_variables={
        "NGC_API_KEY": ngc_api_key,
        "NIM_MODEL_PROFILE": nim_model_profile,
    },
    request_settings=OnlineRequestSettings(
        request_timeout_ms=90000,
        max_concurrent_requests_per_instance=1,
    ),
    liveness_probe=ProbeSettings(
        initial_delay=600, period=30, timeout=10, failure_threshold=30
    ),
    readiness_probe=ProbeSettings(
        initial_delay=600, period=30, timeout=10, failure_threshold=30
    ),
)
client.online_deployments.begin_create_or_update(deployment).result()
```
Route traffic to the deployment:
```python
endpoint = client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"nim": 100}
client.online_endpoints.begin_create_or_update(endpoint).result()
```
Retrieve the endpoint URL and key:
```python
endpoint_info = client.online_endpoints.get(endpoint_name)
keys = client.online_endpoints.get_keys(endpoint_name)
endpoint_url = endpoint_info.scoring_uri.rstrip("/")
api_key = keys.primary_key
```
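Model weights are downloaded on first startup, so the deployment can take several minutes before it serves requests (the probes above allow up to 600 seconds of initial delay). A small polling helper can gate your first request on readiness. This is a sketch, not part of the Azure ML SDK; `wait_until_ready` and its parameters are illustrative names:

```python
import time

def wait_until_ready(probe, timeout_s=900, interval_s=15):
    """Poll `probe` (a zero-argument callable that returns True once the
    endpoint is ready) until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # transient network errors are expected during startup
        time.sleep(interval_s)
    return False

# Example probe (assumes httpx and the endpoint_url/api_key values
# retrieved above; adjust the route to whatever your endpoint exposes):
# import httpx
# ready = wait_until_ready(
#     lambda: httpx.get(
#         f"{endpoint_url}/v1/health/ready",
#         headers={"Authorization": f"Bearer {api_key}"},
#     ).status_code == 200
# )
```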
Verify the Deployment#
Send an inference request to the endpoint:
```python
import httpx

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
payload = {
    "model": "meta/llama3.1-8b-instruct",  # or your deployed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}
with httpx.Client(timeout=60) as http:
    response = http.post(
        f"{endpoint_url}/v1/chat/completions",
        headers=headers,
        json=payload,
    )
    response.raise_for_status()
    print(response.json())
```
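The response body follows the OpenAI-style chat-completions schema that NIM exposes, with the generated text nested under `choices[0].message.content`. A small helper (a sketch; `extract_reply` is an illustrative name, not part of any SDK) makes the extraction explicit:

```python
def extract_reply(completion: dict) -> str:
    """Return the assistant's message text from an OpenAI-style
    chat-completions response body (choices[0].message.content)."""
    return completion["choices"][0]["message"]["content"]

# Example with the shape of a typical response body:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # → Hello! How can I help?
```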