NIM Operator Deployment#

The NVIDIA NIM Operator is a Kubernetes operator that manages the lifecycle of NVIDIA Inference Microservices (NIMs) in Kubernetes environments, including deployment, scaling, model caching, and health monitoring.

This guide walks through deploying NIM LLM using the NIM Operator.

Overview#

The NIM Operator provides Kubernetes-native custom resources for managing NIMs:

  • NIMCache — Manages model artifact caching on persistent storage, so models are downloaded once and reused across pod restarts and scaling events.

  • NIMService — Manages the NIM deployment lifecycle, including pod creation, health probes, service exposure, and GPU resource scheduling.

Using the NIM Operator simplifies NIM deployment on Kubernetes by handling common concerns such as image pull secrets, NGC authentication, persistent storage provisioning, and probe configuration.

For operator source code and issue tracking, refer to the NIM Operator GitHub repository.

Prerequisites#

  • A Kubernetes cluster with kubectl access (Kubernetes 1.26+)

  • NVIDIA GPU Operator installed

  • Persistent storage provisioner available in the cluster (for model caching)

  • An NGC API key for downloading NIM containers and model artifacts

  • NIM Operator version 3.0.2+ (required for NIM LLM 2.0 compatibility)

Installation#

Refer to the NVIDIA NIM Operator installation guide to install the NIM Operator.

Deploy a NIM#

Step 1: Create a Namespace#

A Kubernetes namespace is a logical partition within a cluster that isolates resources — pods, services, secrets, and so on — from other workloads. Creating a dedicated namespace for NIM keeps its resources organized and avoids naming conflicts with other applications in the cluster.

kubectl create ns nim-service

Step 2: Create Secrets#

Create an image pull secret for NGC container registry access:

export NGC_API_KEY=<your-ngc-api-key>

kubectl create secret -n nim-service docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}

Create a secret with the NGC API key for model downloads:

kubectl create secret -n nim-service generic ngc-api-secret \
  --from-literal=NGC_API_KEY=${NGC_API_KEY}
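Before moving on, it can save a debugging round-trip to confirm both secrets landed in the right namespace. A quick sanity check:

```shell
# Confirm both secrets exist in the nim-service namespace.
kubectl get secret -n nim-service ngc-secret ngc-api-secret
```

If either name is missing, the NIMCache download job will fail with an authentication or image pull error later.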

Step 3: Create a NIMCache#

The NIMCache custom resource pre-downloads model artifacts to persistent storage. This avoids re-downloading the model every time a pod starts.

kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: "vllm"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      size: "80Gi"
      volumeAccessMode: ReadWriteOnce
EOF

Monitor the caching progress:

kubectl get nimcache -n nim-service -w

Wait until the NIMCache status shows Ready before proceeding.
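Instead of watching interactively, you can block in a script until the cache is ready. This is a sketch that assumes the NIMCache reports its state in `.status.state` (the same field shown by `kubectl get nimcache`); adjust the timeout to your model size and network:

```shell
# Block until the NIMCache reports Ready (or time out after 30 minutes).
kubectl wait -n nim-service nimcache/meta-llama3-8b-instruct \
  --for=jsonpath='{.status.state}'=Ready --timeout=30m
```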

Step 4: Create a NIMService#

After the cache is ready, deploy the NIM service:

kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
    tag: "2.0.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
EOF

Step 5: Verify the Deployment#

Check the NIMService status:

kubectl get nimservice -n nim-service

Check that the pod is running:

kubectl get pods -n nim-service

After the NIMService reports Ready, the model is being served. Test it with a port-forward:

kubectl port-forward -n nim-service svc/meta-llama3-8b-instruct 8000:8000

Query the /v1/models endpoint to discover the model name to use in inference requests:

curl -s http://localhost:8000/v1/models
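The endpoint returns an OpenAI-compatible model list: an object with a `data` array of model entries, each carrying an `id`. As a sketch, the model IDs can be pulled out with a few lines of Python; the sample payload below is illustrative, not a captured server response:

```python
import json

def extract_model_ids(payload: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]

# Illustrative payload shaped like an OpenAI-compatible /v1/models response.
sample = json.loads(
    '{"object": "list", "data": [{"id": "meta/llama3-8b-instruct", "object": "model"}]}'
)
print(extract_model_ids(sample))  # → ['meta/llama3-8b-instruct']
```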

Send a chat completion request using the model name returned by /v1/models:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
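The same request can be sent from Python using only the standard library. This is a minimal sketch; the helper names are ours, and the model ID should be whatever `/v1/models` returned for your deployment:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Assemble an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(url: str, body: dict) -> dict:
    """POST the request body and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

body = build_chat_request("meta/llama3-8b-instruct", "Hello!")
# With the port-forward active:
# reply = send_chat_request("http://localhost:8000/v1/chat/completions", body)
# print(reply["choices"][0]["message"]["content"])
```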

Passing Configuration to NIM#

The NIM Operator sets environment variables on the NIM pod. To pass additional NIM or vLLM configuration, use the NIMService env field.

For the full list of supported environment variables, refer to Environment Variables. For available API endpoints (inference and management), refer to the API Reference.

The following examples are independent options — each snippet shows a standalone configuration you can add to your NIMService spec. Apply only the ones relevant to your deployment; they are not sequential steps.

Pass additional vLLM arguments (for example, prefix caching) through NIM_PASSTHROUGH_ARGS:

spec:
  env:
    - name: NIM_PASSTHROUGH_ARGS
      value: "--enable-prefix-caching --max-num-batched-tokens 8192"

Select a specific model profile by setting spec.storage.nimCache.profile:

spec:
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: 'vllm-fp8-tp1-pp1'

The operator sets NIM_JSONL_LOGGING=1 and NIM_LOG_LEVEL=INFO by default. Override the log level:

spec:
  env:
    - name: NIM_LOG_LEVEL
      value: "DEBUG"

For logging configuration details, refer to Logging and Observability.

Monitoring and Autoscaling#

Metrics#

NIM LLM 2.0 exposes vLLM’s native Prometheus metrics at /v1/metrics. Use these metrics for autoscaling, alerting, and capacity planning in operator deployments.

For available metrics, scrape configuration, and dashboard setup, refer to Logging and Observability and the vLLM production metrics documentation.

ServiceMonitor#

The NIMService supports creating a Prometheus ServiceMonitor for automatic metric scraping. The operator hardcodes the metrics endpoint path to /v1/metrics:

spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels: {}

Autoscaling#

The NIM Operator supports horizontal pod autoscaling (HPA) based on custom Prometheus metrics. Standard CPU and memory metrics are not useful for scaling NIM — use inference-specific metrics instead.

Example HPA configuration in NIMService spec:

spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: vllm:num_requests_waiting
            target:
              type: AverageValue
              averageValue: "10"

This requires the Prometheus Adapter to be installed and configured to expose vLLM metrics as Kubernetes custom metrics.
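As a sketch of what that adapter configuration might look like, the following fragment (in the rules format used by the prometheus-adapter Helm chart's values.yaml) surfaces the queue-depth series as a per-pod custom metric; adjust the series labels and aggregation to your Prometheus install:

```yaml
# prometheus-adapter Helm values.yaml fragment (sketch).
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "vllm:num_requests_waiting"
        as: "vllm:num_requests_waiting"
      metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```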

Troubleshooting#

Note

The examples below use the resource names from this guide (for example, meta-llama3-8b-instruct). Replace them with the names of your own NIMCache and NIMService resources.

NIMCache Stuck in Pending or NotReady State#

  • Verify the storage class is available and can provision volumes:

    kubectl get storageclass
    
  • Check the PVC status:

    kubectl get pvc -n nim-service
    
  • The NIMCache controller creates a Kubernetes Job named <nimcache-name>-job to download model artifacts. Check the job status and its pod logs:

    kubectl get jobs -n nim-service
    kubectl logs -n nim-service -l job-name=meta-llama3-8b-instruct-job
    
  • If the job fails, it retries up to 5 times (the default backoffLimit). Check for authentication errors, network issues, or insufficient disk space in the job pod logs.

  • Ensure the NGC API key secret is correct and the image pull secret has access to the NIM container image.

  • Check the NIMCache status conditions for details:

    kubectl describe nimcache -n nim-service meta-llama3-8b-instruct
    
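After fixing the underlying problem (for example, a bad NGC API key), a simple way to re-trigger the download is to delete and recreate the NIMCache. This is a sketch; note that with `create: true` the operator-created PVC may be removed along with the resource, so the download starts fresh:

```shell
# Delete the failed NIMCache (and its download job), then re-apply
# the same manifest used in Step 3 to start a fresh download.
kubectl delete nimcache -n nim-service meta-llama3-8b-instruct
```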

NIMService Pod Fails to Start#

  • Check pod events:

    kubectl describe pod -n nim-service -l app=meta-llama3-8b-instruct
    
  • Check pod logs:

    kubectl logs -n nim-service -l app=meta-llama3-8b-instruct
    
  • Verify GPU resources are available:

    kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
    
  • Confirm the NIMCache is in Ready state before deploying the NIMService. The NIM pod mounts the cache PVC at /model-store and will fail if the model artifacts are not present.

  • If using a specific model profile via spec.storage.nimCache.profile, verify the profile name matches one available in the cached model manifest.

Health Probes Fail#

The NIM container serves health endpoints directly. /v1/health/live responds immediately (even during model loading), while /v1/health/ready only returns 200 after the model is fully loaded.

The operator configures a startup probe on /v1/health/ready with a default failureThreshold of 120 and a periodSeconds of 10, giving the model up to 20 minutes to load before the pod is killed. If readiness or startup probes are failing:

  • For large models that take longer to load, increase the startup probe failure threshold in the NIMService spec:

    spec:
      startupProbe:
        enabled: true
        probe:
          httpGet:
            path: /v1/health/ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 240
    
  • Check that spec.expose.service.port matches the port used by the NIM container (default 8000). The operator sets NIM_SERVER_PORT automatically from this value.
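With a port-forward to the pod, the probe endpoints can also be checked by hand, which helps distinguish "model still loading" (live returns 200, ready returns 503) from "server not responding at all":

```shell
# Liveness answers immediately; readiness returns 200 only once the model is loaded.
curl -s -o /dev/null -w "live: %{http_code}\n"  http://localhost:8000/v1/health/live
curl -s -o /dev/null -w "ready: %{http_code}\n" http://localhost:8000/v1/health/ready
```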

NIM Operator Pod Not Running#

If the NIMService or NIMCache resources are not being reconciled, verify the operator itself is healthy:

kubectl get pods -n nim-operator
kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator

Additional Resources#