NIM Operator Deployment#
The NVIDIA NIM Operator is a Kubernetes operator that manages the lifecycle of NVIDIA Inference Microservices (NIMs) in Kubernetes environments, including deployment, scaling, model caching, and health monitoring.
This guide walks through deploying NIM LLM using the NIM Operator.
Overview#
The NIM Operator provides Kubernetes-native custom resources for managing NIMs:
NIMCache — Manages model artifact caching on persistent storage, so models are downloaded once and reused across pod restarts and scaling events.
NIMService — Manages the NIM deployment lifecycle, including pod creation, health probes, service exposure, and GPU resource scheduling.
Using the NIM Operator simplifies NIM deployment on Kubernetes by handling common concerns such as image pull secrets, NGC authentication, persistent storage provisioning, and probe configuration.
For operator source code and issue tracking, refer to the NIM Operator GitHub repository.
Prerequisites#
A Kubernetes cluster with `kubectl` access (Kubernetes 1.26+)
NVIDIA GPU Operator installed
Persistent storage provisioner available in the cluster (for model caching)
An NGC API key for downloading NIM containers and model artifacts
NIM Operator version 3.0.2+ (required for NIM LLM 2.0 compatibility)
Installation#
Refer to the NVIDIA NIM Operator installation guide to install the NIM Operator.
Deploy a NIM#
Step 1: Create a Namespace#
A Kubernetes namespace is a logical partition within a cluster that isolates resources — pods, services, secrets, and so on — from other workloads. Creating a dedicated namespace for NIM keeps its resources organized and avoids naming conflicts with other applications in the cluster.
```shell
kubectl create ns nim-service
```
Step 2: Create Secrets#
Create an image pull secret for NGC container registry access:
```shell
export NGC_API_KEY=${YOUR_NGC_API_KEY}
kubectl create secret -n nim-service docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}
```
Create a secret with the NGC API key for model downloads:
```shell
kubectl create secret -n nim-service generic ngc-api-secret \
  --from-literal=NGC_API_KEY=${NGC_API_KEY}
```
Step 3: Create a NIMCache#
The NIMCache custom resource pre-downloads model artifacts to persistent storage. This avoids re-downloading the model every time a pod starts.
```shell
kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: "vllm"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      size: "80Gi"
      volumeAccessMode: ReadWriteOnce
EOF
```
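As a rough sanity check on the 80Gi cache size, you can estimate the on-disk weight size from the parameter count. This is an illustrative back-of-the-envelope sketch, not an official sizing formula; the cache also holds tokenizer assets and potentially multiple engine profiles, hence the headroom:

```python
def weight_gib(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GiB (fp16/bf16 stores 2 bytes per parameter)."""
    return n_params * bytes_per_param / 2**30

# An 8B-parameter model in fp16 is roughly 15 GiB of weights,
# so an 80Gi PVC leaves ample room for tokenizer files and extra profiles.
llama3_8b = weight_gib(8e9)
```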
Monitor the caching progress:
```shell
kubectl get nimcache -n nim-service -w
```
Wait until the NIMCache status shows Ready before proceeding.
Step 4: Create a NIMService#
After the cache is ready, deploy the NIM service:
```shell
kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
    tag: "2.0.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
EOF
```
Step 5: Verify the Deployment#
Check the NIMService status:
```shell
kubectl get nimservice -n nim-service
```
Check that the pod is running:
```shell
kubectl get pods -n nim-service
```
After the NIMService reports ready, the model is being served. Test it with a port-forward:

```shell
kubectl port-forward -n nim-service svc/meta-llama3-8b-instruct 8000:8000
```
Query the /v1/models endpoint to discover the model name to use in inference requests:
```shell
curl -s http://localhost:8000/v1/models
```
Send a chat completion request using the model name returned by /v1/models:
```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```
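The same request can be issued from Python using only the standard library. A minimal sketch, assuming the port-forward above is active and the model name matches what your `/v1/models` endpoint returns:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the kubectl port-forward is running


def chat_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-compatible chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(model: str, prompt: str) -> str:
    """POST a chat completion request and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("meta/llama-3.1-8b-instruct", "Hello!"))
```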
Passing Configuration to NIM#
The NIM Operator sets environment variables on the NIM pod. To pass additional NIM or vLLM configuration, use the NIMService env field.
For the full list of supported environment variables, refer to Environment Variables. For available API endpoints (inference and management), refer to the API Reference.
The following examples are independent options — each snippet shows a standalone configuration you can add to your NIMService spec. Apply only the ones relevant to your deployment; they are not sequential steps.
Pass additional vLLM arguments (for example, prefix caching) through NIM_PASSTHROUGH_ARGS:
```yaml
spec:
  env:
    - name: NIM_PASSTHROUGH_ARGS
      value: "--enable-prefix-caching --max-num-batched-tokens 8192"
```
Select a specific model profile by setting spec.storage.nimCache.profile:
```yaml
spec:
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: 'vllm-fp8-tp1-pp1'
```
The operator sets NIM_JSONL_LOGGING=1 and NIM_LOG_LEVEL=INFO by default. Override the log level:
```yaml
spec:
  env:
    - name: NIM_LOG_LEVEL
      value: "DEBUG"
```
For logging configuration details, refer to Logging and Observability.
Monitoring and Autoscaling#
Metrics#
NIM LLM 2.0 exposes vLLM’s native Prometheus metrics at /v1/metrics. Use these metrics for autoscaling, alerting, and capacity planning in operator deployments.
For available metrics, scrape configuration, and dashboard setup, refer to Logging and Observability and the vLLM production metrics documentation.
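Before wiring up Prometheus, you can spot-check a gauge straight from the endpoint's text output. The following stand-alone sketch parses the Prometheus exposition format (labels ignored for simplicity); the `SAMPLE` text is illustrative, not captured from a live server:

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus exposition-format text into {metric_name: value}.

    Keeps only the metric name (labels are dropped), so repeated series
    with different labels overwrite each other -- fine for a quick check.
    """
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        out[name] = float(value)
    return out


SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta/llama-3.1-8b-instruct"} 3.0
"""
```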
ServiceMonitor#
The NIMService supports creating a Prometheus ServiceMonitor for automatic metric scraping. The operator hardcodes the metrics endpoint path to /v1/metrics:
```yaml
spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels: {}
```
Autoscaling#
The NIM Operator supports horizontal pod autoscaling (HPA) based on custom Prometheus metrics. Standard CPU and memory metrics are not useful for scaling NIM — use inference-specific metrics instead.
Example HPA configuration in NIMService spec:
```yaml
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
        - type: Pods
          pods:
            metric:
              name: vllm:num_requests_waiting
            target:
              type: AverageValue
              averageValue: "10"
```
This requires the Prometheus Adapter to be installed and configured to expose vLLM metrics as Kubernetes custom metrics.
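For reference, a Prometheus Adapter rule that surfaces this metric as a per-pod custom metric might look like the following. This is an illustrative fragment only; the series labels and query shape depend on how your Prometheus instance scrapes the NIM pods:

```yaml
# Illustrative Prometheus Adapter rule (adjust labels to your scrape config)
rules:
  - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```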
Troubleshooting#
Note
The examples below use the resource names from this guide (for example, meta-llama3-8b-instruct). Replace them with the names of your own NIMCache and NIMService resources.
NIMCache Stuck in Pending or NotReady State#
Verify the storage class is available and can provision volumes:
```shell
kubectl get storageclass
```
Check the PVC status:
```shell
kubectl get pvc -n nim-service
```
The NIMCache controller creates a Kubernetes Job named `<nimcache-name>-job` to download model artifacts. Check the job status and its pod logs:

```shell
kubectl get jobs -n nim-service
kubectl logs -n nim-service -l job-name=meta-llama3-8b-instruct-job
```

If the job fails, it retries up to 5 times (the default `backoffLimit`). Check for authentication errors, network issues, or insufficient disk space in the job pod logs.
Ensure the NGC API key secret is correct and the image pull secret has access to the NIM container image.
Check the NIMCache status conditions for details:
```shell
kubectl describe nimcache -n nim-service meta-llama3-8b-instruct
```
NIMService Pod Fails to Start#
Check pod events:
```shell
kubectl describe pod -n nim-service -l app=meta-llama3-8b-instruct
```
Check pod logs:
```shell
kubectl logs -n nim-service -l app=meta-llama3-8b-instruct
```
Verify GPU resources are available:
```shell
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
```
Confirm the NIMCache is in `Ready` state before deploying the NIMService. The NIM pod mounts the cache PVC at `/model-store` and will fail if the model artifacts are not present.
If using a specific model profile via `spec.storage.nimCache.profile`, verify the profile name matches one available in the cached model manifest.
Health Probes Fail#
The NIM container serves health endpoints directly. /v1/health/live responds immediately (even during model loading), while /v1/health/ready only returns 200 after the model is fully loaded.
The operator configures a startup probe on /v1/health/ready with a default failureThreshold of 120 and a periodSeconds of 10, giving the model up to 20 minutes to load before the pod is killed. If readiness or startup probes are failing:
For large models that take longer to load, increase the startup probe failure threshold in the NIMService spec:
```yaml
spec:
  startupProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 240
```
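The relationship between these probe fields and the total load-time budget is simple arithmetic; a quick helper for planning (hypothetical, not part of the operator):

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    """Total time a pod may keep failing the startup probe before it is killed."""
    return failure_threshold * period_seconds


def threshold_for(budget_seconds: int, period_seconds: int) -> int:
    """failureThreshold needed to allow at least budget_seconds of load time."""
    return -(-budget_seconds // period_seconds)  # ceiling division


# Operator defaults: 120 failures x 10 s = 1200 s (20 minutes).
# Allowing a 40-minute load at the same period needs failureThreshold = 240.
```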
Check that `spec.expose.service.port` matches the port used by the NIM container (default 8000). The operator sets `NIM_SERVER_PORT` automatically from this value.
NIM Operator Pod Not Running#
If the NIMService or NIMCache resources are not being reconciled, verify the operator itself is healthy:
```shell
kubectl get pods -n nim-operator
kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator
```
Additional Resources#
NVIDIA NIM Operator Documentation — operator installation, NIMService CRD, and managing NIM on Kubernetes.
NVIDIA NIM Operator GitHub Repository — source code, issues, and contributing.
NVIDIA GPU Operator Documentation — GPU enablement and device configuration in the cluster.
Environment Variables — all NIM configuration options.
API Reference — inference and management endpoints.
Logging and Observability — structured logging, metrics, and tracing.
Advanced Configuration — server-level options, TLS, and other runtime settings.