NIM Operator Deployment#
The NVIDIA NIM Operator is a Kubernetes operator that manages the lifecycle of NIMs in Kubernetes environments, including deployment, scaling, model caching, and health monitoring.
Deploy NIM LLM with the NVIDIA NIM Operator through these Kubernetes-native custom resources:
NIMCache — Manages model artifact caching on persistent storage, so models are downloaded during the initial cache population and reused across pod restarts and scaling events.
NIMService — Manages the NIM deployment lifecycle, including pod creation, health probes, service exposure, and GPU resource scheduling.
Using the NIM Operator simplifies NIM deployment on Kubernetes by handling common concerns such as image pull secrets, NGC authentication, persistent storage provisioning, and probe configuration.
For operator source code and issue tracking, refer to the NIM Operator GitHub repository.
Prerequisites#
Before deploying NIM with the NVIDIA NIM Operator, make sure you have the following:
A Kubernetes 1.26+ cluster with kubectl access
The NVIDIA GPU Operator installed in the cluster
A persistent storage provisioner in the cluster for model caching
An NGC API key for pulling NIM container images and downloading model artifacts
NIM Operator version 3.0.2+ for NIM LLM compatibility
Installation#
Refer to the NVIDIA NIM Operator installation guide to install the NIM Operator.
Deploy NIM#
Complete the following steps to deploy NIM with the NVIDIA NIM Operator.
Create a Namespace#
Create a dedicated Kubernetes namespace for NIM. A namespace is a logical partition within a cluster. It isolates resources such as pods, services, secrets, and so on from other workloads. Creating a dedicated namespace for NIM keeps resources organized and helps avoid naming conflicts with other applications in the cluster.
kubectl create ns nim-service
Create Secrets#
Create the following secrets for image pulls and model downloads.
Create an image pull secret for NGC container registry access:
export NGC_API_KEY=${YOUR_NGC_API_KEY}
kubectl create secret -n nim-service docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}
Create a secret with the NGC API key for model downloads:
kubectl create secret -n nim-service generic ngc-api-secret \
  --from-literal=NGC_API_KEY=${NGC_API_KEY}
Create a NIMCache#
Complete the following steps to create a NIMCache and wait for the model artifacts to be cached. This avoids re-downloading the model every time a pod starts:
Create a NIMCache custom resource to pre-download model artifacts to persistent storage:
kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.1
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: "vllm"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      size: "80Gi"
      volumeAccessMode: ReadWriteOnce
EOF
Monitor the caching progress:
kubectl get nimcache -n nim-service -w
Wait until the NIMCache status shows Ready before proceeding.
Create a NIMService#
After the cache is ready, create a NIMService custom resource to deploy the NIM service.
kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
    tag: "2.0.1"
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
EOF
Verify the Deployment#
Verify that the NIMService and pod are ready, and then test inference with port forwarding.
Check the NIMService status:
kubectl get nimservice -n nim-service
Check that the pod is running:
kubectl get pods -n nim-service
After the NIMService is ready, test inference with port forwarding:
kubectl port-forward -n nim-service svc/meta-llama3-8b-instruct 8000:8000
Query the /v1/models endpoint to discover the model name to use in inference requests:
curl -s http://localhost:8000/v1/models
Send a chat completion request using the model name returned by /v1/models:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
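The same request can be sent from Python. The sketch below builds an OpenAI-compatible chat-completion payload and posts it with the standard library; it assumes the port-forward from the previous step is active, and the model name shown is the example from this guide, so substitute the name returned by your /v1/models query.

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=64):
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the NIM chat-completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request("meta/llama-3.1-8b-instruct", "Hello!")
# Requires an active port-forward to the NIMService:
# print(send_chat_request(payload)["choices"][0]["message"]["content"])
```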
Passing Configuration to NIM#
The NIM Operator sets environment variables on the NIM pod. To pass additional NIM or vLLM configuration, use the NIMService env field.
For the full list of supported environment variables, refer to Environment Variables. For available API endpoints (inference and management), refer to the API Reference.
The following examples are independent options. Each snippet shows a standalone configuration that you can add to your NIMService spec. Apply only the options that are relevant to your deployment. They are not sequential steps.
Pass additional vLLM arguments (for example, prefix caching) through NIM_PASSTHROUGH_ARGS:
spec:
  env:
  - name: NIM_PASSTHROUGH_ARGS
    value: "--enable-prefix-caching --max-num-batched-tokens 8192"
Select a specific model profile by setting spec.storage.nimCache.profile:
spec:
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: 'vllm-fp8-tp1-pp1'
The operator sets NIM_JSONL_LOGGING=1 and NIM_LOG_LEVEL=INFO by default. Override the log level:
spec:
  env:
  - name: NIM_LOG_LEVEL
    value: "DEBUG"
For logging configuration details, refer to Logging and Observability.
Monitoring and Autoscaling#
Metrics#
NIM LLM exposes vLLM’s native Prometheus metrics at /v1/metrics. Use these metrics for autoscaling, alerting, and capacity planning in operator deployments.
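The metrics endpoint returns the standard Prometheus text exposition format. As a minimal sketch of what a scrape looks like and how to pull a single gauge out of it, the following parser ignores HELP/TYPE comments and label sets and keeps the last sample per metric name; the sample values are illustrative only.

```python
def parse_prom_metrics(text):
    """Parse Prometheus text-format exposition into {metric_name: value}.

    Minimal parser: skips comment lines, drops label sets, and keeps the
    last sample seen for each metric name.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        # Strip any {label="..."} suffix from the metric name.
        name = name_part.split("{", 1)[0]
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

# Abridged sample scrape output (illustrative values only).
sample = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta/llama-3.1-8b-instruct"} 3.0
"""
print(parse_prom_metrics(sample)["vllm:num_requests_waiting"])  # → 3.0
```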
For available metrics, scrape configuration, and dashboard setup, refer to Logging and Observability and the vLLM production metrics documentation.
ServiceMonitor#
The NIMService supports creating a Prometheus ServiceMonitor for automatic metric scraping. The operator hardcodes the metrics endpoint path to /v1/metrics:
spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels: {}
Autoscaling#
The NIM Operator supports horizontal pod autoscaling (HPA) based on custom Prometheus metrics. Standard CPU and memory metrics are not useful for scaling NIM. Use inference-specific metrics instead.
The following example shows HPA configuration in a NIMService spec:
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
      - type: Pods
        pods:
          metric:
            name: vllm:num_requests_waiting
          target:
            type: AverageValue
            averageValue: "10"
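To sanity-check a target value before deploying, you can apply the standard Kubernetes HPA scaling formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to the replica bounds. A small sketch using the example values above:

```python
import math

def hpa_desired_replicas(current_replicas, current_value, target_value,
                         min_replicas=1, max_replicas=4):
    """Standard HPA formula: ceil(current * metric / target), clamped to bounds."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# With an average of 25 waiting requests per pod against a target of 10,
# 2 replicas scale to ceil(2 * 25 / 10) = 5, clamped to maxReplicas = 4.
print(hpa_desired_replicas(2, 25, 10))  # → 4
```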
This requires the Prometheus Adapter to be installed and configured to expose vLLM metrics as Kubernetes custom metrics.
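As a sketch of the adapter side, a rule along the following lines can expose the queue-depth series through the custom metrics API. The field names follow the prometheus-adapter rule format, but the exact seriesQuery labels and aggregation depend on your Prometheus setup, so treat this as a starting point rather than a drop-in config:

```yaml
rules:
- seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```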
Troubleshooting#
Note
The examples below use the resource names from this guide (for example, meta-llama3-8b-instruct). Replace them with the names of your own NIMCache and NIMService resources.
NIMCache Stuck in Pending or NotReady State#
If NIMCache remains in Pending or NotReady, use the following checks to isolate the issue:
Verify the storage class is available and can provision volumes:
kubectl get storageclass
Check the PVC status:
kubectl get pvc -n nim-service
The NIMCache controller creates a Kubernetes Job named <nimcache-name>-job to download model artifacts. Check the job status and its pod logs:
kubectl get jobs -n nim-service
kubectl logs -n nim-service -l job-name=meta-llama3-8b-instruct-job
If the job fails, it retries up to five times, which is the default backoffLimit. Check for authentication errors, network issues, or insufficient disk space in the job pod logs.
Ensure the NGC API key secret is correct and the image pull secret has access to the NIM container image.
Check the NIMCache status conditions for details:
kubectl describe nimcache -n nim-service meta-llama3-8b-instruct
NIMService Pod Fails to Start#
If the NIMService pod fails to start, use the following checks to isolate the issue:
Check pod events:
kubectl describe pod -n nim-service -l app=meta-llama3-8b-instruct
Check pod logs:
kubectl logs -n nim-service -l app=meta-llama3-8b-instruct
Verify GPU resources are available:
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
Confirm the NIMCache is in Ready state before deploying the NIMService. The NIM pod mounts the cache PVC at /model-store and will fail if the model artifacts are not present.
If you use a specific model profile through spec.storage.nimCache.profile, verify that the profile name matches one available in the cached model manifest.
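The jq one-liner above prints per-node values; the same check can be done in Python by summing allocatable GPUs from `kubectl get nodes -o json` output. A sketch with illustrative sample data in place of a live cluster:

```python
import json

def total_allocatable_gpus(nodes_json):
    """Sum nvidia.com/gpu allocatable across all nodes.

    Expects the JSON string produced by `kubectl get nodes -o json`.
    Nodes without the GPU resource (CPU-only nodes) count as zero.
    """
    total = 0
    for node in json.loads(nodes_json).get("items", []):
        gpus = node.get("status", {}).get("allocatable", {}).get("nvidia.com/gpu", "0")
        total += int(gpus)
    return total

# Illustrative two-node cluster: one GPU node, one CPU-only node.
sample = json.dumps({"items": [
    {"status": {"allocatable": {"nvidia.com/gpu": "2"}}},
    {"status": {"allocatable": {}}},
]})
print(total_allocatable_gpus(sample))  # → 2
```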
Health Probes Fail#
The nginx proxy in the NIM container serves the health endpoints. /v1/health/live responds immediately, even during model loading, while /v1/health/ready only returns 200 after the model is fully loaded. By default, both endpoints use the NIM service port (NIM_SERVER_PORT, which defaults to 8000). If NIM_HEALTH_PORT is set, the health endpoints move to that dedicated port instead.
The operator configures a startup probe on /v1/health/ready with a default failureThreshold of 120 and a periodSeconds of 10, giving the model up to 20 minutes to load before the kubelet restarts the container. If readiness or startup probes are failing:
For large models that take longer to load, increase the startup probe failure threshold in the NIMService spec:
spec:
  startupProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 240
Check that spec.expose.service.port matches the port used by the NIM container (default 8000). The operator sets NIM_SERVER_PORT automatically from this value.
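The probe arithmetic above is worth spelling out when sizing failureThreshold for a large model. A tiny sketch of the worst-case time budget a pod gets before the kubelet restarts it:

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay=0):
    """Worst-case time a pod gets to become ready before the startup probe fails."""
    return initial_delay + failure_threshold * period_seconds

# Operator defaults: 120 failures x 10s period = 1200s (20 minutes).
print(startup_budget_seconds(120, 10))  # → 1200
# The example spec above (failureThreshold: 240, initialDelaySeconds: 30)
# roughly doubles that budget.
print(startup_budget_seconds(240, 10, initial_delay=30))  # → 2430
```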
NIM Operator Pod Not Running#
If the NIMService or NIMCache resources are not being reconciled, verify the operator itself is healthy:
kubectl get pods -n nim-operator
kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator
Additional Resources#
Refer to the following resources for more information:
NVIDIA NIM Operator Documentation — operator installation, NIMService CRD, and managing NIM on Kubernetes.
NVIDIA NIM Operator GitHub Repository — source code, issues, and contributing.
NVIDIA GPU Operator Documentation — GPU enablement and device configuration in the cluster.
Environment Variables — all NIM configuration options.
API Reference — inference and management endpoints.
Logging and Observability — structured logging, metrics, and tracing.
Advanced Configuration — server-level options, TLS, and other runtime settings.