Helm and Kubernetes#

This page describes how to deploy NVIDIA NIM for LLMs on Kubernetes using the NIM Helm chart.

Prerequisites#

Before deploying NIM with Helm, make sure you have the following:

  • A running Kubernetes cluster with GPU-capable nodes

  • Configured kubectl access to the target cluster

  • Helm 3.0.0 or later

  • An NGC API key for accessing the NIM Helm chart, pulling NIM container images, and downloading model artifacts

  • A storage class that supports persistent volumes for model caching

Note

To provision Kubernetes with NVIDIA Cloud Native Stack (CNS), refer to Using the Ansible Playbooks.

Fetch and extract the Helm chart before deployment. Go to the NGC Catalog and select the nim-llm Helm chart to pick a version; in most cases, select the latest version. Make sure the NGC_API_KEY environment variable is set before running the following commands, because the fetch command uses it for authentication.

export HELM_CHART_VERSION="<version_number>"
helm fetch "https://helm.ngc.nvidia.com/nim/charts/nim-llm-${HELM_CHART_VERSION}.tgz" \
  --username='$oauthtoken' \
  --password="${NGC_API_KEY}"
tar -xzf "nim-llm-${HELM_CHART_VERSION}.tgz"
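After extracting the archive, you can confirm the chart's metadata before proceeding. A quick sketch, assuming the archive extracted into a local nim-llm/ directory:

```shell
# List the extracted chart contents
ls nim-llm/

# Print the chart's name, version, and appVersion from Chart.yaml
helm show chart nim-llm/
```
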

Configure Helm#

The following Helm values are the most important settings for a NIM deployment:

  • image.repository: NIM container image to deploy.

  • image.tag: NIM container image tag.

  • model.ngcAPISecret and imagePullSecrets: credentials required to pull images and model artifacts.

  • persistence: storage settings for model cache.

  • resources: GPU limits based on model requirements.

  • env: optional advanced runtime configuration.
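As an illustration of the env setting, the fragment below raises log verbosity. The variable shown is one of the runtime environment variables documented for NIM for LLMs, but confirm the variables your specific NIM image supports against its configuration reference:

```yaml
env:
  - name: NIM_LOG_LEVEL   # runtime log verbosity; confirm supported values for your NIM image
    value: "INFO"
```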

Use the following commands to inspect chart documentation and defaults:

helm show readme nim-llm/
helm show values nim-llm/
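Rather than writing a values file from scratch, you can dump the chart's defaults to a file and edit that copy:

```shell
# Save the chart's default values as a starting point for customization
helm show values nim-llm/ > custom-values.yaml
```
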

Minimal Example#

Complete the following steps to deploy the minimal Helm example:

  1. Set your NGC API key as an environment variable so the following commands can use it:

    export NGC_API_KEY=<your_ngc_api_key>
    
  2. Create the image pull secret:

    kubectl create secret docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password="${NGC_API_KEY}"
    
  3. Create the NGC API key secret:

    kubectl create secret generic nvidia-nim-secrets \
      --from-literal=NGC_API_KEY="${NGC_API_KEY}"
    
  4. Create values.yaml with a minimal configuration:

    cat <<'EOF' > values.yaml
    image:
      repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
      tag: "2.0.1"
    model:
      ngcAPISecret: "nvidia-nim-secrets"
    persistence:
      enabled: true
      storageClass: "nfs-client"
      accessMode: ReadWriteMany
      size: 50Gi
    resources:
      limits:
        nvidia.com/gpu: 1
    imagePullSecrets:
      - name: "ngc-secret"
    EOF
    

    Note

    Set persistence.storageClass to a StorageClass that is available in your Kubernetes cluster.

    Tip

    Adjust persistence.size based on your model size and expected cache usage.

  5. Install the release:

    helm install my-nim nim-llm/ -f values.yaml
    

These values are intentionally minimal and work as a starting point in most clusters.
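After installing, it can help to confirm that both secrets exist and that the release deployed. A sketch, assuming the secret and release names used above:

```shell
# Confirm both secrets exist in the current namespace
kubectl get secret ngc-secret nvidia-nim-secrets

# Check the release status and the resources it rendered
helm status my-nim
kubectl get all -l app.kubernetes.io/instance=my-nim
```
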

Optional: Enable LoRA With Helm#

Complete the following steps to enable LoRA adapters with Helm:

  1. Create a dedicated PVC for the LoRA adapters:

    cat <<'EOF' > nvidia-nim-lora-pvc.yaml
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: nvidia-nim-lora-pvc
    spec:
      accessModes:
        - ReadWriteMany
      storageClassName: nfs-client
      resources:
        requests:
          storage: 10Gi
    EOF
    
    kubectl apply -f nvidia-nim-lora-pvc.yaml
    
  2. Add LoRA adapters to the PVC under /loras. For each adapter, create one directory that contains the adapter artifacts:

    /loras/
      adapter_name/
        adapter_config.json
        adapter_model.safetensors   # or adapter_model.bin
    

    Note

    NIM loads adapters from NIM_PEFT_SOURCE (/loras in this example). If the PVC is empty, no LoRA adapters are available at runtime.

  3. Update values.yaml:

    env:
      - name: NIM_PEFT_SOURCE
        value: /loras
    extraVolumes:
      lora-adapter:
        persistentVolumeClaim:
          claimName: nvidia-nim-lora-pvc
    extraVolumeMounts:
      lora-adapter:
        mountPath: /loras
    
  4. Apply the updated values:

    helm upgrade my-nim nim-llm/ -f values.yaml
    

For detailed LoRA configuration and runtime behavior, refer to Fine-Tuning with LoRA.
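One way to populate the LoRA PVC is to mount it in a temporary helper pod and copy adapters in with kubectl cp. This is a sketch; the pod name, busybox image, and local adapter directory are illustrative, not part of the chart:

```shell
# Launch a temporary pod that mounts the LoRA PVC at /loras
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: lora-copy-helper        # temporary helper pod; delete after copying
spec:
  containers:
    - name: copy
      image: busybox:1.36
      command: ["sleep", "3600"]
      volumeMounts:
        - name: lora
          mountPath: /loras
  volumes:
    - name: lora
      persistentVolumeClaim:
        claimName: nvidia-nim-lora-pvc
EOF

# Copy a local adapter directory into the PVC, then remove the helper pod
kubectl cp ./adapter_name lora-copy-helper:/loras/adapter_name
kubectl delete pod lora-copy-helper
```
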

Verify Deployment#

Verify that the pods and service are ready, and then test inference with port forwarding.

  1. Check that the pods are running:

    kubectl get pods -l app.kubernetes.io/instance=my-nim
    
  2. Check that the service is available:

    kubectl get svc -l app.kubernetes.io/instance=my-nim
    
  3. Port-forward the service for local testing:

    kubectl port-forward svc/my-nim-nim-llm 8000:8000
    
  4. Call the readiness endpoint to confirm that the service is ready:

    curl -sS http://127.0.0.1:8000/v1/health/ready
    

A healthy deployment returns an HTTP 200 response from the readiness endpoint.
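With the port-forward still active, you can also send a test inference request to the OpenAI-compatible API. List the served models first, then use one of the returned model IDs; the model name below is a placeholder:

```shell
# Discover the model ID served by this NIM
curl -s http://127.0.0.1:8000/v1/models

# Send a minimal chat completion request; replace <served_model_id>
# with an ID returned by /v1/models
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served_model_id>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 32
      }'
```
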