Azure#

Overview#

This guide covers deploying NIM LLM on Microsoft Azure. Two deployment paths are supported:

| Deployment Path | Description |
| --- | --- |
| AKS (Azure Kubernetes Service) | Self-managed Kubernetes deployment using the NIM LLM Helm chart. Provides full control over infrastructure, including GPU node pools and persistent storage. |
| AI Foundry | Managed deployment using Azure ML Managed Online Endpoints. Simplifies infrastructure management with built-in health checks. |

AKS Deployment#

Prerequisites#

Install the following tools before proceeding:

  • Azure CLI (az)

  • kubectl

  • Helm

Log in to Azure:

az login --use-device-code

You also need an NGC API key with access to NIM LLM container images and Helm charts.

Create an AKS Cluster with GPU Nodes#

  1. Create a resource group and AKS cluster:

    export RESOURCE_GROUP="nim-resource-group"
    export REGION="southcentralus"
    export AKS_NAME="nim-cluster"
    
    az group create --name $RESOURCE_GROUP --location $REGION
    az aks create --resource-group $RESOURCE_GROUP --name $AKS_NAME \
      --location $REGION --generate-ssh-keys
    

    For detailed steps, refer to the AKS quickstart guide.

  2. Add a GPU node pool. Skip automatic GPU driver installation so the NVIDIA GPU Operator manages drivers later in this guide.

    az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $AKS_NAME \
      --name gpupool --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 \
      --gpu-driver none --node-osdisk-size 2048 --max-pods 110
    

    Note

    Azure CLI 2.72.2 or later is required for --gpu-driver none. On older CLI versions, use --skip-gpu-driver-install instead (deprecated after August 2025).

    Tip

    Choose a GPU-enabled VM size that matches your model requirements. For supported options, refer to GPU workloads in AKS.

  3. Get the cluster credentials:

    az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_NAME \
      --overwrite-existing
    

Install the NVIDIA GPU Operator#

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
helm repo update
helm install --create-namespace --namespace gpu-operator \
  nvidia/gpu-operator --wait --generate-name

For more information, refer to the NVIDIA GPU Operator documentation.

Create Kubernetes Secrets#

export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NAMESPACE="nim-llm"
export IMAGE_PULL_SECRET="ngc-secret"

kubectl create namespace $NAMESPACE

kubectl create secret docker-registry $IMAGE_PULL_SECRET \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

For gated Hugging Face models, create an additional secret:

export HF_TOKEN="${YOUR_HF_TOKEN}"

kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="$HF_TOKEN"

Deploy NIM LLM with Helm#

  1. Fetch the Helm chart from NGC:

    export NIM_LLM_CHART_VERSION="1.0.0"   # Set to your desired chart version
    helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
      --username='$oauthtoken' --password=$NGC_API_KEY
    
  2. Optional: View the default chart values to understand available configuration options:

    helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz
    

    Tip

    For help choosing the right model configuration for your values.yaml, refer to Model Profiles and Selection.

  3. Deploy using a custom values file:

    helm install my-nim nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
      --namespace $NAMESPACE \
      -f path/to/your/custom-values.yaml
    

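The custom values file referenced in step 3 might look like the following minimal sketch. The image repository, tag, and resource sizes are assumptions for illustration; run `helm show values` as in step 2 for the authoritative schema of your chart version.

```yaml
# custom-values.yaml -- minimal sketch; adjust to your model and chart version
image:
  repository: nvcr.io/nim/meta/llama-3.1-8b-instruct   # assumed image path
  tag: "1.0.0"                                          # assumed tag
imagePullSecrets:
  - name: ngc-secret        # image pull secret created earlier in this guide
model:
  ngcAPISecret: ngc-api     # secret holding NGC_API_KEY, created earlier
resources:
  limits:
    nvidia.com/gpu: 1       # one GPU per pod
persistence:
  enabled: true             # cache model weights on a persistent volume
  size: 50Gi                # assumed size; check your model's footprint
```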
Verify the Deployment#

  1. Get the service endpoint:

    kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
    
  2. Check the health endpoint (set NIM_EXTERNAL_IP from the service EXTERNAL-IP):

    export NIM_EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
    curl -s "http://${NIM_EXTERNAL_IP}:8000/v1/health/ready"
    
  3. Send an inference request:

    curl -X POST "http://${NIM_EXTERNAL_IP}:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta/llama3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
      }'
    

AI Foundry Deployment#

Prerequisites#

  • Azure CLI with the Azure ML extension:

    az extension add -n ml -y
    
  • An Azure Machine Learning workspace. Create one with the CLI:

    az ml workspace create --name ${WORKSPACE_NAME} --resource-group ${RESOURCE_GROUP}
    
  • An NGC API key with access to NIM LLM container images.

  • The Azure AI ML Python SDK:

    pip install azure-ai-ml azure-identity
    

Note

Some Azure regions have limited GPU quota. If you encounter quota errors, refer to Troubleshoot online endpoints: OutOfQuota.

Deploy with the Python SDK#

The following steps deploy a NIM container as an Azure ML managed online endpoint.

Tip

The nim_model_profile variable determines which model configuration is used. For help choosing the right profile, refer to Model Profiles and Selection.

  1. Create the ML client and endpoint:

    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import (
        ManagedOnlineEndpoint,
        ManagedOnlineDeployment,
        Environment,
        OnlineRequestSettings,
        ProbeSettings,
    )
    from azure.identity import DefaultAzureCredential
    
    subscription_id = "<your-subscription-id>"
    resource_group = "<your-resource-group>"
    workspace_name = "<your-workspace-name>"
    endpoint_name = "<your-endpoint-name>"
    acr_image = "<your-acr-or-ngc-nim-image-uri>"
    ngc_api_key = "<your-ngc-api-key>"
    nim_model_profile = "<your-nim-model-profile>"
    instance_type = "Standard_NC24ads_A100_v4"
    
    client = MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name,
    )
    
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"NIM deployment: {acr_image}",
        auth_mode="key",
    )
    client.online_endpoints.begin_create_or_update(endpoint).result()
    
  2. Define the environment and deployment:

    env = Environment(
        name=f"nim-env-{endpoint_name}",
        image=acr_image,
        inference_config={
            "liveness_route": {"port": 8000, "path": "/v1/health/live"},
            "readiness_route": {"port": 8000, "path": "/v1/health/ready"},
            "scoring_route": {"port": 8000, "path": "/"},
        },
    )
    
    deployment = ManagedOnlineDeployment(
        name="nim",
        endpoint_name=endpoint_name,
        environment=env,
        instance_type=instance_type,
        instance_count=1,
        environment_variables={
            "NGC_API_KEY": ngc_api_key,
            "NIM_MODEL_PROFILE": nim_model_profile,
        },
        request_settings=OnlineRequestSettings(
            request_timeout_ms=90000,
            max_concurrent_requests_per_instance=1,
        ),
        liveness_probe=ProbeSettings(
            initial_delay=600, period=30, timeout=10, failure_threshold=30
        ),
        readiness_probe=ProbeSettings(
            initial_delay=600, period=30, timeout=10, failure_threshold=30
        ),
    )
    client.online_deployments.begin_create_or_update(deployment).result()
    
  3. Route traffic to the deployment:

    endpoint = client.online_endpoints.get(endpoint_name)
    endpoint.traffic = {"nim": 100}
    client.online_endpoints.begin_create_or_update(endpoint).result()
    
  4. Retrieve the endpoint URL and key:

    endpoint_info = client.online_endpoints.get(endpoint_name)
    keys = client.online_endpoints.get_keys(endpoint_name)
    endpoint_url = endpoint_info.scoring_uri.rstrip("/")
    api_key = keys.primary_key
    

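The `ProbeSettings` values in step 2 are deliberately generous because NIM can take many minutes to download and load model weights. With these settings, the platform tolerates roughly `initial_delay + period × failure_threshold` seconds before marking the deployment unhealthy:

```python
# Worst-case wait before a probe declares the deployment unhealthy,
# using the ProbeSettings values from the deployment above.
initial_delay = 600        # seconds before the first probe fires
period = 30                # seconds between probes
failure_threshold = 30     # consecutive failures tolerated
max_wait = initial_delay + period * failure_threshold
print(max_wait)       # 1500 seconds
print(max_wait / 60)  # 25.0 minutes
```

If your model loads faster or slower, scale these values accordingly rather than leaving the defaults.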
Verify the Deployment#

Send an inference request to the endpoint:

import httpx

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}

payload = {
    "model": "meta/llama3.1-8b-instruct",  # or your deployed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

with httpx.Client(timeout=60) as http:
    response = http.post(
        f"{endpoint_url}/v1/chat/completions",
        headers=headers,
        json=payload,
    )
    response.raise_for_status()
    print(response.json())
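The endpoint returns an OpenAI-style chat completions body. A small helper to pull out the assistant reply is shown below; this is a sketch assuming the standard response shape (a `choices` list whose first entry holds a `message` object):

```python
def extract_reply(response_json: dict) -> str:
    """Return the assistant message text from an OpenAI-style
    chat completions response body."""
    return response_json["choices"][0]["message"]["content"]

# Example using the shape a chat completions response typically has:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # Hello! How can I help?
```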