NIM Operator Deployment#
The NVIDIA NIM Operator is a Kubernetes operator that manages the lifecycle of NIMs in Kubernetes environments, including deployment, scaling, model caching, and health monitoring.
Deploy NIM LLM with the NVIDIA NIM Operator through these Kubernetes-native custom resources:
NIMCache — Manages model artifact caching on persistent storage, so models are downloaded during the initial cache population and reused across pod restarts and scaling events.
NIMService — Manages the NIM deployment lifecycle, including pod creation, health probes, service exposure, and GPU resource scheduling.
Using the NIM Operator simplifies NIM deployment on Kubernetes by handling common concerns such as image pull secrets, NGC authentication, persistent storage provisioning, and probe configuration.
For operator source code and issue tracking, refer to the NIM Operator GitHub repository.
Prerequisites#
Before deploying NIM with the NVIDIA NIM Operator, make sure you have the following:
A Kubernetes 1.26+ cluster with kubectl access
The NVIDIA GPU Operator installed in the cluster
A persistent storage provisioner in the cluster for model caching
An NGC API key for pulling NIM container images and downloading model artifacts
NIM Operator version 3.0.2+ for NIM LLM compatibility
Installation#
Refer to the NVIDIA NIM Operator installation guide to install the NIM Operator.
Deploy NIM#
Complete the following steps to deploy NIM with the NVIDIA NIM Operator.
Create a Namespace#
Create a dedicated Kubernetes namespace for NIM. A namespace is a logical partition within a cluster. It isolates resources such as pods, services, secrets, and so on from other workloads. Creating a dedicated namespace for NIM keeps resources organized and helps avoid naming conflicts with other applications in the cluster.
kubectl create ns nim-service
Create Secrets#
Create the following secrets for image pulls and model downloads.
Create an image pull secret for NGC container registry access:
export NGC_API_KEY=${YOUR_NGC_API_KEY}
kubectl create secret -n nim-service docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY}
Create a secret with the NGC API key for model downloads:
kubectl create secret -n nim-service generic ngc-api-secret \
  --from-literal=NGC_API_KEY=${NGC_API_KEY}
Create a NIMCache#
Complete the following steps to create a NIMCache and wait for the model artifacts to be cached. This avoids re-downloading the model every time a pod starts:
Create a NIMCache custom resource to pre-download model artifacts to persistent storage:
kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: <NIM_LLM_MODEL_SPECIFIC_IMAGE>:2.0.1
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: "vllm"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      size: "80Gi"
      volumeAccessMode: ReadWriteOnce
EOF
Monitor the caching progress:
kubectl get nimcache -n nim-service -w
Wait until the NIMCache status shows Ready before proceeding.
Create a NIMService#
After the cache is ready, create a NIMService custom resource to deploy the NIM service.
kubectl create -n nim-service -f - <<'EOF'
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
    tag: "2.0.1"
    pullPolicy: IfNotPresent
    pullSecrets:
    - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
EOF
Verify the Deployment#
Verify that the NIMService and pod are ready, and then test inference with port forwarding.
Check the NIMService status:
kubectl get nimservice -n nim-service
Check that the pod is running:
kubectl get pods -n nim-service
After the NIMService is ready, test inference with port forwarding:
kubectl port-forward -n nim-service svc/meta-llama3-8b-instruct 8000:8000
Query the /v1/models endpoint to discover the model name to use in inference requests:
curl -s http://localhost:8000/v1/models
Send a chat completion request using the model name returned by /v1/models:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
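The same request can be sent from Python. The sketch below builds an OpenAI-compatible chat-completion payload and posts it with the standard library; it assumes the port-forward from the previous step is active, and the model name shown is the example from this guide, so substitute the name returned by your /v1/models query.

```python
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=64):
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to the NIM chat-completions endpoint."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_chat_request("meta/llama-3.1-8b-instruct", "Hello!")
# Requires an active port-forward to the NIMService:
# print(send_chat_request(payload)["choices"][0]["message"]["content"])
```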
Passing Configuration to NIM#
The NIM Operator sets environment variables on the NIM pod. To pass additional NIM or vLLM configuration, use the NIMService env field.
For the full list of supported environment variables, refer to Environment Variables. For available API endpoints (inference and management), refer to the API Reference.
The following examples are independent options. Each snippet shows a standalone configuration that you can add to your NIMService spec. Apply only the options that are relevant to your deployment. They are not sequential steps.
Pass additional vLLM arguments (for example, prefix caching) through NIM_PASSTHROUGH_ARGS:
spec:
  env:
  - name: NIM_PASSTHROUGH_ARGS
    value: "--enable-prefix-caching --max-num-batched-tokens 8192"
Select a specific model profile by setting spec.storage.nimCache.profile:
spec:
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: 'vllm-fp8-tp1-pp1'
The operator sets NIM_JSONL_LOGGING=1 and NIM_LOG_LEVEL=INFO by default. Override the log level:
spec:
  env:
  - name: NIM_LOG_LEVEL
    value: "DEBUG"
For logging configuration details, refer to Logging and Observability.
Monitoring and Autoscaling#
Metrics#
NIM LLM exposes vLLM’s native Prometheus metrics at /v1/metrics. Use these metrics for autoscaling, alerting, and capacity planning in operator deployments.
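The metrics endpoint returns the standard Prometheus text exposition format. As a minimal sketch of what a scrape looks like and how to pull a single gauge out of it, the following parser ignores HELP/TYPE comments and label sets and keeps the last sample per metric name; the sample values are illustrative only.

```python
def parse_prom_metrics(text):
    """Parse Prometheus text-format exposition into {metric_name: value}.

    Minimal parser: skips comment lines, drops label sets, and keeps the
    last sample seen for each metric name.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_part, _, value = line.rpartition(" ")
        # Strip any {label="..."} suffix from the metric name.
        name = name_part.split("{", 1)[0]
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics

# Abridged sample scrape output (illustrative values only).
sample = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta/llama-3.1-8b-instruct"} 3.0
"""
print(parse_prom_metrics(sample)["vllm:num_requests_waiting"])  # → 3.0
```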
For available metrics, scrape configuration, and dashboard setup, refer to Logging and Observability and the vLLM production metrics documentation.
ServiceMonitor#
The NIMService supports creating a Prometheus ServiceMonitor for automatic metric scraping. The operator hardcodes the metrics endpoint path to /v1/metrics:
spec:
  metrics:
    enabled: true
    serviceMonitor:
      additionalLabels: {}
Autoscaling#
The NIM Operator supports horizontal pod autoscaling (HPA) based on custom Prometheus metrics. Standard CPU and memory metrics are not useful for scaling NIM. Use inference-specific metrics instead.
The following example shows HPA configuration in a NIMService spec:
spec:
  scale:
    enabled: true
    hpa:
      minReplicas: 1
      maxReplicas: 4
      metrics:
      - type: Pods
        pods:
          metric:
            name: vllm:num_requests_waiting
          target:
            type: AverageValue
            averageValue: "10"
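To sanity-check a target value before deploying, you can apply the standard Kubernetes HPA scaling formula, desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue), clamped to the replica bounds. A small sketch using the example values above:

```python
import math

def hpa_desired_replicas(current_replicas, current_value, target_value,
                         min_replicas=1, max_replicas=4):
    """Standard HPA formula: ceil(current * metric / target), clamped to bounds."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))

# With an average of 25 waiting requests per pod against a target of 10,
# 2 replicas scale to ceil(2 * 25 / 10) = 5, clamped to maxReplicas = 4.
print(hpa_desired_replicas(2, 25, 10))  # → 4
```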
This requires the Prometheus Adapter to be installed and configured to expose vLLM metrics as Kubernetes custom metrics.
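As a sketch of the adapter side, a rule along the following lines can expose the queue-depth series through the custom metrics API. The field names follow the prometheus-adapter rule format, but the exact seriesQuery labels and aggregation depend on your Prometheus setup, so treat this as a starting point rather than a drop-in config:

```yaml
rules:
- seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```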
Troubleshooting#
Note
The examples below use the resource names from this guide (for example, meta-llama3-8b-instruct). Replace them with the names of your own NIMCache and NIMService resources.
NIMCache Stuck in Pending or NotReady State#
If NIMCache remains in Pending or NotReady, use the following checks to isolate the issue:
Verify the storage class is available and can provision volumes:
kubectl get storageclass
Check the PVC status:
kubectl get pvc -n nim-service
The NIMCache controller creates a Kubernetes Job named <nimcache-name>-job to download model artifacts. Check the job status and its pod logs:
kubectl get jobs -n nim-service
kubectl logs -n nim-service -l job-name=meta-llama3-8b-instruct-job
If the job fails, it retries up to five times, which is the default backoffLimit. Check for authentication errors, network issues, or insufficient disk space in the job pod logs.
Ensure the NGC API key secret is correct and the image pull secret has access to the NIM container image.
Check the NIMCache status conditions for details:
kubectl describe nimcache -n nim-service meta-llama3-8b-instruct
NIMService Pod Fails to Start#
If the NIMService pod fails to start, use the following checks to isolate the issue:
Check pod events:
kubectl describe pod -n nim-service -l app=meta-llama3-8b-instruct
Check pod logs:
kubectl logs -n nim-service -l app=meta-llama3-8b-instruct
Verify GPU resources are available:
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
Confirm the NIMCache is in Ready state before deploying the NIMService. The NIM pod mounts the cache PVC at /model-store and will fail if the model artifacts are not present.
If you use a specific model profile through spec.storage.nimCache.profile, verify that the profile name matches one available in the cached model manifest.
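The jq one-liner above prints per-node values; the same check can be done in Python by summing allocatable GPUs from `kubectl get nodes -o json` output. A sketch with illustrative sample data in place of a live cluster:

```python
import json

def total_allocatable_gpus(nodes_json):
    """Sum nvidia.com/gpu allocatable across all nodes.

    Expects the JSON string produced by `kubectl get nodes -o json`.
    Nodes without the GPU resource (CPU-only nodes) count as zero.
    """
    total = 0
    for node in json.loads(nodes_json).get("items", []):
        gpus = node.get("status", {}).get("allocatable", {}).get("nvidia.com/gpu", "0")
        total += int(gpus)
    return total

# Illustrative two-node cluster: one GPU node, one CPU-only node.
sample = json.dumps({"items": [
    {"status": {"allocatable": {"nvidia.com/gpu": "2"}}},
    {"status": {"allocatable": {}}},
]})
print(total_allocatable_gpus(sample))  # → 2
```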
Health Probes Fail#
The nginx proxy in the NIM container serves the health endpoints. /v1/health/live responds immediately, even during model loading, while /v1/health/ready only returns 200 after the model is fully loaded. By default, both endpoints use the NIM service port (NIM_SERVER_PORT, which defaults to 8000). If NIM_HEALTH_PORT is set, the health endpoints move to that dedicated port instead.
The operator configures a startup probe on /v1/health/ready with a default failureThreshold of 120 and a periodSeconds of 10, giving the model up to 20 minutes to load before the kubelet restarts the container. If readiness or startup probes are failing:
For large models that take longer to load, increase the startup probe failure threshold in the NIMService spec:
spec:
  startupProbe:
    enabled: true
    probe:
      httpGet:
        path: /v1/health/ready
        port: 8000
      initialDelaySeconds: 30
      periodSeconds: 10
      failureThreshold: 240
Check that spec.expose.service.port matches the port used by the NIM container (default 8000). The operator sets NIM_SERVER_PORT automatically from this value.
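The probe arithmetic above is worth spelling out when sizing failureThreshold for a large model. A tiny sketch of the worst-case time budget a pod gets before the kubelet restarts it:

```python
def startup_budget_seconds(failure_threshold, period_seconds, initial_delay=0):
    """Worst-case time a pod gets to become ready before the startup probe fails."""
    return initial_delay + failure_threshold * period_seconds

# Operator defaults: 120 failures x 10s period = 1200s (20 minutes).
print(startup_budget_seconds(120, 10))  # → 1200
# The example spec above (failureThreshold: 240, initialDelaySeconds: 30)
# roughly doubles that budget.
print(startup_budget_seconds(240, 10, initial_delay=30))  # → 2430
```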
NIM Operator Pod Not Running#
If the NIMService or NIMCache resources are not being reconciled, verify the operator itself is healthy:
kubectl get pods -n nim-operator
kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator
Additional Resources#
Refer to the following resources for more information:
NVIDIA NIM Operator Documentation — operator installation, NIMService CRD, and managing NIM on Kubernetes.
NVIDIA NIM Operator GitHub Repository — source code, issues, and contributing.
NVIDIA GPU Operator Documentation — GPU enablement and device configuration in the cluster.
Environment Variables — all NIM configuration options.
API Reference — inference and management endpoints.
Logging and Observability — structured logging, metrics, and tracing.
Advanced Configuration — server-level options, TLS, and other runtime settings.