NIM Operator Deployment#

The NVIDIA NIM Operator is a Kubernetes operator that manages the lifecycle of NIM microservices, including deployment, scaling, model caching, and health monitoring.

Deploy NIM LLM with the NVIDIA NIM Operator through these Kubernetes-native custom resources:

  • NIMCache — Manages model artifact caching on persistent storage, so models are downloaded during the initial cache population and reused across pod restarts and scaling events.

  • NIMService — Manages the NIM deployment lifecycle, including pod creation, health probes, service exposure, and GPU resource scheduling.

Using the NIM Operator simplifies NIM deployment on Kubernetes by handling common concerns such as image pull secrets, NGC authentication, persistent storage provisioning, and probe configuration.

For operator source code and issue tracking, refer to the NIM Operator GitHub repository.

Prerequisites#

Before deploying NIM with the NVIDIA NIM Operator, make sure you have the following:

  • A Kubernetes 1.26+ cluster with kubectl access

  • The NVIDIA GPU Operator installed in the cluster

  • A persistent storage provisioner in the cluster for model caching

  • An NGC API key for pulling NIM container images and downloading model artifacts

  • NIM Operator version 3.0.2+ for NIM LLM compatibility

Installation#

Refer to the NVIDIA NIM Operator installation guide to install the NIM Operator.

Deploy NIM#

Complete the following steps to deploy NIM with the NVIDIA NIM Operator.

Create a Namespace#

Create a dedicated Kubernetes namespace for NIM. A namespace is a logical partition within a cluster. It isolates resources such as pods, services, secrets, and so on from other workloads. Creating a dedicated namespace for NIM keeps resources organized and helps avoid naming conflicts with other applications in the cluster.

kubectl create ns nim-service

Create Secrets#

Create the following secrets for image pulls and model downloads.

  1. Create an image pull secret for NGC container registry access:

    export NGC_API_KEY="<YOUR_NGC_API_KEY>"
    
    kubectl create secret -n nim-service docker-registry ngc-secret \
      --docker-server=nvcr.io \
      --docker-username='$oauthtoken' \
      --docker-password=${NGC_API_KEY}
    
  2. Create a secret with the NGC API key for model downloads:

    kubectl create secret -n nim-service generic ngc-api-secret \
      --from-literal=NGC_API_KEY=${NGC_API_KEY}
    

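As a quick sanity check, you can confirm that both secrets exist before proceeding (the secret names match those created above):

```shell
# Verify that both secrets were created in the nim-service namespace
kubectl get secret -n nim-service ngc-secret ngc-api-secret
```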
Create a NIMCache#

Complete the following steps to create a NIMCache and wait for the model artifacts to be cached. This avoids re-downloading the model every time a pod starts:

  1. Create a NIMCache custom resource to pre-download model artifacts to persistent storage:

    kubectl create -n nim-service -f - <<'EOF'
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      source:
        ngc:
          modelPuller: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.1
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: "vllm"
            tensorParallelism: "1"
      storage:
        pvc:
          create: true
          size: "80Gi"
          volumeAccessMode: ReadWriteOnce
    EOF
    
  2. Monitor the caching progress:

    kubectl get nimcache -n nim-service -w
    
  3. Wait until the NIMCache status shows Ready before proceeding.
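Instead of watching interactively, you can poll the NIMCache status until it reports ready. The following sketch assumes the resource exposes a `.status.state` field that reaches the value `Ready`; if the loop never completes, inspect the actual status fields with `kubectl get nimcache -n nim-service -o yaml`:

```shell
# Poll the NIMCache state every 10 seconds, for up to ~30 minutes.
# Assumes the CRD reports readiness through .status.state == "Ready".
for i in $(seq 1 180); do
  state=$(kubectl get nimcache -n nim-service meta-llama3-8b-instruct \
    -o jsonpath='{.status.state}')
  [ "$state" = "Ready" ] && echo "NIMCache is ready" && break
  echo "Current state: ${state:-unknown}; waiting..."
  sleep 10
done
```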

Create a NIMService#

After the cache is ready, create a NIMService custom resource to deploy the NIM service.

kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
    tag: "2.0.1"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
EOF

Verify the Deployment#

Verify that the NIMService and pod are ready, and then test inference with port forwarding.

  1. Check the NIMService status:

    kubectl get nimservice -n nim-service
    
  2. Check that the pod is running:

    kubectl get pods -n nim-service
    
  3. After the NIMService is ready, test inference with port forwarding:

    kubectl port-forward -n nim-service svc/meta-llama3-8b-instruct 8000:8000
    
  4. Query the /v1/models endpoint to discover the model name to use in inference requests:

    curl -s http://localhost:8000/v1/models
    
  5. Send a chat completion request using the model name returned by /v1/models:

    curl -s http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
    

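The API is OpenAI-compatible, so it also supports token streaming. The following variant of the request above adds the standard `stream` field from the OpenAI chat completions schema; tokens arrive incrementally as server-sent events:

```shell
# Request a streamed response; -N disables curl output buffering so
# server-sent events are printed as they arrive
curl -s -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about Kubernetes."}],
    "max_tokens": 64,
    "stream": true
  }'
```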
Passing Configuration to NIM#

The NIM Operator sets environment variables on the NIM pod. To pass additional NIM or vLLM configuration, use the NIMService env field.

For the full list of supported environment variables, refer to Environment Variables. For available API endpoints (inference and management), refer to the API Reference.

The following snippets are independent options, not sequential steps. Each shows a standalone configuration that you can add to your NIMService spec; apply only the options that are relevant to your deployment.

Pass additional vLLM arguments (for example, prefix caching) through NIM_PASSTHROUGH_ARGS:

spec:
  env:
    - name: NIM_PASSTHROUGH_ARGS
      value: "--enable-prefix-caching --max-num-batched-tokens 8192"

Select a specific model profile by setting spec.storage.nimCache.profile:

spec:
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: 'vllm-fp8-tp1-pp1'

The operator sets NIM_JSONL_LOGGING=1 and NIM_LOG_LEVEL=INFO by default. Override the log level:

spec:
  env:
    - name: NIM_LOG_LEVEL
      value: "DEBUG"

For logging configuration details, refer to Logging and Observability.

Monitoring and Autoscaling#

Metrics#

NIM LLM exposes vLLM’s native Prometheus metrics at /v1/metrics. Use these metrics for autoscaling, alerting, and capacity planning in operator deployments.

For available metrics, scrape configuration, and dashboard setup, refer to Logging and Observability and the vLLM production metrics documentation.
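With the port-forward from the verification section still running, you can inspect the raw metrics directly. The metric names below are vLLM's standard request-queue gauges:

```shell
# Fetch the Prometheus metrics and filter for the request-queue gauges
# that are most useful for autoscaling decisions
curl -s http://localhost:8000/v1/metrics | grep -E 'vllm:num_requests_(running|waiting)'
```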

ServiceMonitor#

The NIMService supports creating a Prometheus ServiceMonitor for automatic metric scraping. The operator hardcodes the metrics endpoint path to /v1/metrics:

spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels: {}

Autoscaling#

The NIM Operator supports horizontal pod autoscaling (HPA) based on custom Prometheus metrics. Standard CPU and memory metrics are poor signals for GPU-bound inference workloads, so scale on inference-specific metrics such as request queue depth instead.

The following example shows HPA configuration in a NIMService spec:

spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: vllm:num_requests_waiting
            target:
              type: AverageValue
              averageValue: "10"

This requires the Prometheus Adapter to be installed and configured to expose vLLM metrics as Kubernetes custom metrics.
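As a sketch, a Prometheus Adapter rule that exposes the vLLM queue-depth gauge as a per-pod custom metric might look like the following. This is a values fragment for the prometheus-adapter Helm chart and assumes your Prometheus attaches `namespace` and `pod` labels to scraped series; adapt the series query to your label scheme:

```yaml
# prometheus-adapter Helm values: expose vllm:num_requests_waiting as a
# Pods custom metric that the HPA example above can target.
rules:
  custom:
    - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "vllm:num_requests_waiting"
        as: "vllm:num_requests_waiting"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```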

Troubleshooting#

Note

The examples below use the resource names from this guide (for example, meta-llama3-8b-instruct). Replace them with the names of your own NIMCache and NIMService resources.

NIMCache Stuck in Pending or NotReady State#

If NIMCache remains in Pending or NotReady, use the following checks to isolate the issue:

  • Verify the storage class is available and can provision volumes:

    kubectl get storageclass
    
  • Check the PVC status:

    kubectl get pvc -n nim-service
    
  • The NIMCache controller creates a Kubernetes Job named <nimcache-name>-job to download model artifacts. Check the job status and its pod logs:

    kubectl get jobs -n nim-service
    kubectl logs -n nim-service -l job-name=meta-llama3-8b-instruct-job
    

    If the job fails, it retries up to five times, which is the default backoffLimit. Check for authentication errors, network issues, or insufficient disk space in the job pod logs.

  • Ensure the NGC API key secret is correct and the image pull secret has access to the NIM container image.

  • Check the NIMCache status conditions for details:

    kubectl describe nimcache -n nim-service meta-llama3-8b-instruct
    

NIMService Pod Fails to Start#

If the NIMService pod fails to start, use the following checks to isolate the issue:

  • Check pod events:

    kubectl describe pod -n nim-service -l app=meta-llama3-8b-instruct
    
  • Check pod logs:

    kubectl logs -n nim-service -l app=meta-llama3-8b-instruct
    
  • Verify GPU resources are available:

    kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
    
  • Confirm the NIMCache is in Ready state before deploying the NIMService. The NIM pod mounts the cache PVC at /model-store and will fail if the model artifacts are not present.

  • If you use a specific model profile through spec.storage.nimCache.profile, verify that the profile name matches one available in the cached model manifest.

Health Probes Fail#

The nginx proxy in the NIM container serves the health endpoints. /v1/health/live responds immediately, even during model loading, while /v1/health/ready only returns 200 after the model is fully loaded. By default, both endpoints use the NIM service port (NIM_SERVER_PORT, which defaults to 8000). If NIM_HEALTH_PORT is set, the health endpoints move to that dedicated port instead.
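You can probe both endpoints manually through the same port-forward used earlier to see which one is failing:

```shell
# Liveness responds as soon as the container is up, even while the model loads
curl -si http://localhost:8000/v1/health/live | head -1

# Readiness returns 200 only after the model is fully loaded
curl -si http://localhost:8000/v1/health/ready | head -1
```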

The operator configures a startup probe on /v1/health/ready with a default failureThreshold of 120 and a periodSeconds of 10, giving the model up to 20 minutes to load before Kubernetes restarts the container. If readiness or startup probes are failing:

  • For large models that take longer to load, increase the startup probe failure threshold in the NIMService spec:

    spec:
      startupProbe:
        enabled: true
        probe:
          httpGet:
            path: /v1/health/ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 240
    
  • Check that spec.expose.service.port matches the port used by the NIM container (default 8000). The operator sets NIM_SERVER_PORT automatically from this value.

NIM Operator Pod Not Running#

If the NIMService or NIMCache resources are not being reconciled, verify the operator itself is healthy:

kubectl get pods -n nim-operator
kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator

Additional Resources#

Refer to the following resources for more information: