Azure#
This guide covers deploying NIM LLM on Microsoft Azure, which supports two deployment paths:
| Deployment Path | Description |
|---|---|
| AKS (Azure Kubernetes Service) | Self-managed Kubernetes deployment using the NIM LLM Helm chart. Provides full control over infrastructure, including GPU node pools and persistent storage. |
| Microsoft Foundry | Managed deployment using Azure ML Managed Online Endpoints. Simplifies infrastructure management with built-in health checks. |
AKS Deployment#
To deploy NIM LLM on Azure AKS, create an AKS cluster with GPU nodes and prepare the environment for running NIM LLM workloads. This section walks you through prerequisites, cluster creation, and initial setup.
Prerequisites#
Install the following tools before proceeding:
Tip
If you use preview AKS features, also install the aks-preview extension.
Log in to Azure:
```shell
az login --use-device-code
```
You also need an NGC API key with access to NIM LLM container images and Helm charts.
Create an AKS Cluster with GPU Nodes#
Complete the following steps to create the AKS cluster, add a GPU node pool, and get the cluster credentials.
Create a resource group and AKS cluster:
```shell
export RESOURCE_GROUP="nim-resource-group"
export REGION="southcentralus"
export AKS_NAME="nim-cluster"

az group create --name $RESOURCE_GROUP --location $REGION
az aks create --resource-group $RESOURCE_GROUP --name $AKS_NAME \
  --location $REGION --generate-ssh-keys
```
For detailed steps, refer to the AKS quickstart guide.
Add a GPU node pool. Skip automatic GPU driver installation so the NVIDIA GPU Operator manages drivers later in this guide.
```shell
az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $AKS_NAME \
  --name gpupool --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 \
  --gpu-driver none --node-osdisk-size 2048 --max-pods 110
```
Note
Azure CLI 2.72.2 or later is required for `--gpu-driver none`. On older CLI versions, use `--skip-gpu-driver-install` instead (deprecated after August 2025).
Tip
Choose a GPU-enabled VM size that matches your model requirements. For supported options, refer to GPU workloads in AKS.
Get the cluster credentials:
```shell
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_NAME \
  --overwrite-existing
```
Install the NVIDIA GPU Operator#
Install the NVIDIA GPU Operator after the AKS cluster is ready.
Add the NVIDIA Helm repository:
```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
```
Update the local Helm repository index:
```shell
helm repo update
```
Install the GPU Operator:
```shell
helm install --create-namespace --namespace gpu-operator \
  nvidia/gpu-operator --wait --generate-name
```
For more information, refer to the NVIDIA GPU Operator documentation.
Create Kubernetes Secrets#
Create the namespace and secrets that the Helm deployment uses.
Set the required environment variables:
```shell
export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NAMESPACE="nim-llm"
export IMAGE_PULL_SECRET="ngc-secret"
```
Create the namespace:
```shell
kubectl create namespace $NAMESPACE
```
Create the NGC image pull secret:
```shell
kubectl create secret docker-registry $IMAGE_PULL_SECRET \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"
```
Create the NGC API key secret:
```shell
kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
```
Optional: For gated Hugging Face models, create an additional secret.
Set the Hugging Face token:
```shell
export HF_TOKEN="${YOUR_HF_TOKEN}"
```
Create the secret:
```shell
kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="$HF_TOKEN"
```
Deploy NIM LLM with Helm#
To deploy NVIDIA NIM for LLMs with Helm, follow these steps:
Fetch the Helm chart from NGC:
```shell
export NIM_LLM_CHART_VERSION="1.0.0"  # Set to your desired chart version
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
  --username='$oauthtoken' --password=$NGC_API_KEY
```
Optional: View the default chart values to understand available configuration options:
```shell
helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz
```
Tip
For help choosing the right model configuration for your `values.yaml`, refer to Model Profiles and Selection.
Deploy using a custom values file:
```shell
helm install my-nim nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
  --namespace $NAMESPACE \
  -f path/to/your/custom-values.yaml
```
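As an illustrative starting point, a minimal custom values file might look like the following. This is a sketch, not the authoritative schema: the image repository and tag are assumptions for an example model, and the secret names simply match the `ngc-secret` and `ngc-api` secrets created earlier. Always check `helm show values` output for your chart version before deploying.

```yaml
# custom-values.yaml -- illustrative values only; verify keys against
# `helm show values` for your chart version.
image:
  repository: nvcr.io/nim/meta/llama3.1-8b-instruct  # assumed model image
  tag: "1.0.0"                                       # assumed tag
imagePullSecrets:
  - name: ngc-secret      # matches $IMAGE_PULL_SECRET created above
model:
  ngcAPISecret: ngc-api   # matches the NGC API key secret created above
persistence:
  enabled: true           # cache model weights across pod restarts
  size: 50Gi
resources:
  limits:
    nvidia.com/gpu: 1
```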
Verify the Deployment#
Complete the following steps to confirm that the service is reachable and serving inference requests.
Get the service endpoint:
```shell
kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
```
Check the health endpoint (set `NIM_EXTERNAL_IP` from the service EXTERNAL-IP):

```shell
export NIM_EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
curl -s "http://${NIM_EXTERNAL_IP}:8000/v1/health/ready"
```
Send an inference request:
```shell
curl -X POST "http://${NIM_EXTERNAL_IP}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'
```
Microsoft Foundry Deployment#
To deploy NIM on Microsoft Foundry, provision the necessary Azure Machine Learning resources and prepare your environment for running NIM workloads. This section walks you through prerequisites, workspace setup, and deployment of NIM as a managed online endpoint.
Prerequisites#
Before deploying NIM to Microsoft Foundry, make sure you have the following:
The Azure CLI and Azure ML extension:
```shell
az extension add -n ml -y
```
An Azure Machine Learning workspace:
```shell
az ml workspace create --name ${WORKSPACE_NAME} --resource-group ${RESOURCE_GROUP}
```
An NGC API key for pulling NIM container images and downloading model artifacts.
The Azure ML Python SDK packages:

```shell
pip install azure-ai-ml azure-identity
```
Note
Some Azure regions have limited GPU quota. If you encounter quota errors, refer to Troubleshoot online endpoints: OutOfQuota.
Deploy with the Python SDK#
The following steps deploy a NIM container as an Azure ML managed online endpoint.
Tip
The nim_model_profile variable determines which model configuration is used. For help choosing the right profile, refer to Model Profiles and Selection.
Create the ML client and endpoint:
```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Environment,
    OnlineRequestSettings,
    ProbeSettings,
)
from azure.identity import DefaultAzureCredential

subscription_id = "<your-subscription-id>"
resource_group = "<your-resource-group>"
workspace_name = "<your-workspace-name>"
endpoint_name = "<your-endpoint-name>"
acr_image = "<your-acr-or-ngc-nim-image-uri>"
ngc_api_key = "<your-ngc-api-key>"
nim_model_profile = "<your-nim-model-profile>"
instance_type = "Standard_NC24ads_A100_v4"

client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id=subscription_id,
    resource_group_name=resource_group,
    workspace_name=workspace_name,
)

endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description=f"NIM deployment: {acr_image}",
    auth_mode="key",
)
client.online_endpoints.begin_create_or_update(endpoint).result()
```
Define the environment and deployment:
```python
env = Environment(
    name=f"nim-env-{endpoint_name}",
    image=acr_image,
    inference_config={
        "liveness_route": {"port": 8000, "path": "/v1/health/live"},
        "readiness_route": {"port": 8000, "path": "/v1/health/ready"},
        "scoring_route": {"port": 8000, "path": "/"},
    },
)

deployment = ManagedOnlineDeployment(
    name="nim",
    endpoint_name=endpoint_name,
    environment=env,
    instance_type=instance_type,
    instance_count=1,
    environment_variables={
        "NGC_API_KEY": ngc_api_key,
        "NIM_MODEL_PROFILE": nim_model_profile,
    },
    request_settings=OnlineRequestSettings(
        request_timeout_ms=90000,
        max_concurrent_requests_per_instance=1,
    ),
    liveness_probe=ProbeSettings(
        initial_delay=600, period=30, timeout=10, failure_threshold=30
    ),
    readiness_probe=ProbeSettings(
        initial_delay=600, period=30, timeout=10, failure_threshold=30
    ),
)
client.online_deployments.begin_create_or_update(deployment).result()
```
Route traffic to the deployment:
```python
endpoint = client.online_endpoints.get(endpoint_name)
endpoint.traffic = {"nim": 100}
client.online_endpoints.begin_create_or_update(endpoint).result()
```
Retrieve the endpoint URL and key:
```python
endpoint_info = client.online_endpoints.get(endpoint_name)
keys = client.online_endpoints.get_keys(endpoint_name)
endpoint_url = endpoint_info.scoring_uri.rstrip("/")
api_key = keys.primary_key
```
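Model weights are downloaded on first startup, so the deployment can take several minutes before it serves requests (the probes above allow up to 600 seconds of initial delay). A small polling helper can gate your first request on readiness. This is a sketch, not part of the Azure ML SDK; `wait_until_ready` and its parameters are illustrative names:

```python
import time

def wait_until_ready(probe, timeout_s=900, interval_s=15):
    """Poll `probe` (a zero-argument callable that returns True once the
    endpoint is ready) until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # transient network errors are expected during startup
        time.sleep(interval_s)
    return False

# Example probe (assumes httpx and the endpoint_url/api_key values
# retrieved above; adjust the route to whatever your endpoint exposes):
# import httpx
# ready = wait_until_ready(
#     lambda: httpx.get(
#         f"{endpoint_url}/v1/health/ready",
#         headers={"Authorization": f"Bearer {api_key}"},
#     ).status_code == 200
# )
```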
Verify the Deployment#
Send an inference request to the endpoint:
```python
import httpx

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
payload = {
    "model": "meta/llama3.1-8b-instruct",  # or your deployed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}
with httpx.Client(timeout=60) as http:
    response = http.post(
        f"{endpoint_url}/v1/chat/completions",
        headers=headers,
        json=payload,
    )
    response.raise_for_status()
    print(response.json())
```
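The response body follows the OpenAI-style chat-completions schema that NIM exposes, with the generated text nested under `choices[0].message.content`. A small helper (a sketch; `extract_reply` is an illustrative name, not part of any SDK) makes the extraction explicit:

```python
def extract_reply(completion: dict) -> str:
    """Return the assistant's message text from an OpenAI-style
    chat-completions response body (choices[0].message.content)."""
    return completion["choices"][0]["message"]["content"]

# Example with the shape of a typical response body:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # → Hello! How can I help?
```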