Azure#

Overview#

This guide covers deploying NIM LLM on Microsoft Azure. Two deployment paths are supported:

| Deployment Path | Description |
| --- | --- |
| AKS (Azure Kubernetes Service) | Self-managed Kubernetes deployment using the NIM LLM Helm chart. Provides full control over infrastructure, including GPU node pools and persistent storage. |
| AI Foundry | Managed deployment using Azure ML Managed Online Endpoints. Simplifies infrastructure management with built-in health checks. |

AKS Deployment#

Prerequisites#

Install the following tools before proceeding:

  • Azure CLI (az)

  • kubectl

  • Helm

Log in to Azure:

az login --use-device-code

You also need an NGC API key with access to NIM LLM container images and Helm charts.

Create an AKS Cluster with GPU Nodes#

  1. Create a resource group and AKS cluster:

    export RESOURCE_GROUP="nim-resource-group"
    export REGION="southcentralus"
    export AKS_NAME="nim-cluster"
    
    az group create --name $RESOURCE_GROUP --location $REGION
    az aks create --resource-group $RESOURCE_GROUP --name $AKS_NAME \
      --location $REGION --generate-ssh-keys
    

    For detailed steps, refer to the AKS quickstart guide.

  2. Add a GPU node pool. Skip automatic GPU driver installation so the NVIDIA GPU Operator manages drivers later in this guide.

    az aks nodepool add --resource-group $RESOURCE_GROUP --cluster-name $AKS_NAME \
      --name gpupool --node-count 1 --node-vm-size Standard_ND96amsr_A100_v4 \
      --gpu-driver none --node-osdisk-size 2048 --max-pods 110
    

    Note

    Azure CLI 2.72.2 or later is required for --gpu-driver none. On older CLI versions, use --skip-gpu-driver-install instead (deprecated after August 2025).

    Tip

    Choose a GPU-enabled VM size that matches your model requirements. For supported options, refer to GPU workloads in AKS.

  3. Get the cluster credentials:

    az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_NAME \
      --overwrite-existing
    

Install the NVIDIA GPU Operator#

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia --pass-credentials
helm repo update
helm install --create-namespace --namespace gpu-operator \
  nvidia/gpu-operator --wait --generate-name

For more information, refer to the NVIDIA GPU Operator documentation.

Create Kubernetes Secrets#

export NGC_API_KEY="${YOUR_NGC_API_KEY}"
export NAMESPACE="nim-llm"
export IMAGE_PULL_SECRET="ngc-secret"

kubectl create namespace $NAMESPACE

kubectl create secret docker-registry $IMAGE_PULL_SECRET \
  --namespace $NAMESPACE \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api \
  --namespace $NAMESPACE \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

For gated Hugging Face models, create an additional secret:

export HF_TOKEN="${YOUR_HF_TOKEN}"

kubectl create secret generic hf-token \
  --namespace $NAMESPACE \
  --from-literal=HF_TOKEN="$HF_TOKEN"

Deploy NIM LLM with Helm#

  1. Fetch the Helm chart from NGC:

    export NIM_LLM_CHART_VERSION="1.0.0"   # Set to your desired chart version
    helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
      --username='$oauthtoken' --password=$NGC_API_KEY
    
  2. Optional: View the default chart values to understand available configuration options:

    helm show values nim-llm-${NIM_LLM_CHART_VERSION}.tgz
    

    Tip

    For help choosing the right model configuration for your values.yaml, refer to Model Profiles and Selection.

  3. Deploy using a custom values file:

    helm install my-nim nim-llm-${NIM_LLM_CHART_VERSION}.tgz \
      --namespace $NAMESPACE \
      -f path/to/your/custom-values.yaml
    

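The custom values file referenced in step 3 might look like the following minimal sketch. The image repository, tag, and resource sizes are assumptions for illustration; run `helm show values` as in step 2 for the authoritative schema of your chart version.

```yaml
# custom-values.yaml -- minimal sketch; adjust to your model and chart version
image:
  repository: nvcr.io/nim/meta/llama-3.1-8b-instruct   # assumed image path
  tag: "1.0.0"                                          # assumed tag
imagePullSecrets:
  - name: ngc-secret        # image pull secret created earlier in this guide
model:
  ngcAPISecret: ngc-api     # secret holding NGC_API_KEY, created earlier
resources:
  limits:
    nvidia.com/gpu: 1       # one GPU per pod
persistence:
  enabled: true             # cache model weights on a persistent volume
  size: 50Gi                # assumed size; check your model's footprint
```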
Verify the Deployment#

  1. Get the service endpoint:

    kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm
    
  2. Check the health endpoint (set NIM_EXTERNAL_IP from the service EXTERNAL-IP):

    export NIM_EXTERNAL_IP=$(kubectl -n $NAMESPACE get svc -l app.kubernetes.io/name=nim-llm -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
    curl -s "http://${NIM_EXTERNAL_IP}:8000/v1/health/ready"
    
  3. Send an inference request:

    curl -X POST "http://${NIM_EXTERNAL_IP}:8000/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta/llama3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128
      }'
    

AI Foundry Deployment#

Prerequisites#

  • Azure CLI with the Azure ML extension:

    az extension add -n ml -y
    
  • An Azure Machine Learning workspace. Create one with the CLI:

    az ml workspace create --name ${WORKSPACE_NAME} --resource-group ${RESOURCE_GROUP}
    
  • An NGC API key with access to NIM LLM container images.

  • The Azure AI ML Python SDK:

    pip install azure-ai-ml azure-identity
    

Note

Some Azure regions have limited GPU quota. If you encounter quota errors, refer to Troubleshoot online endpoints: OutOfQuota.

Deploy with the Python SDK#

The following steps deploy a NIM container as an Azure ML managed online endpoint.

Tip

The nim_model_profile variable determines which model configuration is used. For help choosing the right profile, refer to Model Profiles and Selection.

  1. Create the ML client and endpoint:

    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import (
        ManagedOnlineEndpoint,
        ManagedOnlineDeployment,
        Environment,
        OnlineRequestSettings,
        ProbeSettings,
    )
    from azure.identity import DefaultAzureCredential
    
    subscription_id = "<your-subscription-id>"
    resource_group = "<your-resource-group>"
    workspace_name = "<your-workspace-name>"
    endpoint_name = "<your-endpoint-name>"
    acr_image = "<your-acr-or-ngc-nim-image-uri>"
    ngc_api_key = "<your-ngc-api-key>"
    nim_model_profile = "<your-nim-model-profile>"
    instance_type = "Standard_NC24ads_A100_v4"
    
    client = MLClient(
        credential=DefaultAzureCredential(),
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name,
    )
    
    endpoint = ManagedOnlineEndpoint(
        name=endpoint_name,
        description=f"NIM deployment: {acr_image}",
        auth_mode="key",
    )
    client.online_endpoints.begin_create_or_update(endpoint).result()
    
  2. Define the environment and deployment:

    env = Environment(
        name=f"nim-env-{endpoint_name}",
        image=acr_image,
        inference_config={
            "liveness_route": {"port": 8000, "path": "/v1/health/live"},
            "readiness_route": {"port": 8000, "path": "/v1/health/ready"},
            "scoring_route": {"port": 8000, "path": "/"},
        },
    )
    
    deployment = ManagedOnlineDeployment(
        name="nim",
        endpoint_name=endpoint_name,
        environment=env,
        instance_type=instance_type,
        instance_count=1,
        environment_variables={
            "NGC_API_KEY": ngc_api_key,
            "NIM_MODEL_PROFILE": nim_model_profile,
        },
        request_settings=OnlineRequestSettings(
            request_timeout_ms=90000,
            max_concurrent_requests_per_instance=1,
        ),
        liveness_probe=ProbeSettings(
            initial_delay=600, period=30, timeout=10, failure_threshold=30
        ),
        readiness_probe=ProbeSettings(
            initial_delay=600, period=30, timeout=10, failure_threshold=30
        ),
    )
    client.online_deployments.begin_create_or_update(deployment).result()
    
  3. Route traffic to the deployment:

    endpoint = client.online_endpoints.get(endpoint_name)
    endpoint.traffic = {"nim": 100}
    client.online_endpoints.begin_create_or_update(endpoint).result()
    
  4. Retrieve the endpoint URL and key:

    endpoint_info = client.online_endpoints.get(endpoint_name)
    keys = client.online_endpoints.get_keys(endpoint_name)
    endpoint_url = endpoint_info.scoring_uri.rstrip("/")
    api_key = keys.primary_key
    

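The `ProbeSettings` values in step 2 are deliberately generous because NIM can take many minutes to download and load model weights. With these settings, the platform tolerates roughly `initial_delay + period × failure_threshold` seconds before marking the deployment unhealthy:

```python
# Worst-case wait before a probe declares the deployment unhealthy,
# using the ProbeSettings values from the deployment above.
initial_delay = 600        # seconds before the first probe fires
period = 30                # seconds between probes
failure_threshold = 30     # consecutive failures tolerated
max_wait = initial_delay + period * failure_threshold
print(max_wait)       # 1500 seconds
print(max_wait / 60)  # 25.0 minutes
```

If your model loads faster or slower, scale these values accordingly rather than leaving the defaults.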
Verify the Deployment#

Send an inference request to the endpoint:

import httpx

headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}

payload = {
    "model": "meta/llama3.1-8b-instruct",  # or your deployed model name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
}

with httpx.Client(timeout=60) as http:
    response = http.post(
        f"{endpoint_url}/v1/chat/completions",
        headers=headers,
        json=payload,
    )
    response.raise_for_status()
    print(response.json())
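The endpoint returns an OpenAI-style chat completions body. A small helper to pull out the assistant reply is shown below; this is a sketch assuming the standard response shape (a `choices` list whose first entry holds a `message` object):

```python
def extract_reply(response_json: dict) -> str:
    """Return the assistant message text from an OpenAI-style
    chat completions response body."""
    return response_json["choices"][0]["message"]["content"]

# Example using the shape a chat completions response typically has:
sample = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello! How can I help?"}}
    ]
}
print(extract_reply(sample))  # Hello! How can I help?
```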