Deploying AI workloads on Red Hat OpenShift#
Deploy NVIDIA LLM NIM with Helm#
Deploy a NIM LLM inference server on OpenShift using the official Helm chart. This is the simplest path for standalone inference — no NIM Operator or KServe required. Refer to the latest documentation for more options.
Add the NIM Helm repo#
helm repo add nim https://helm.ngc.nvidia.com/nim/nvidia/ \
  --username='$oauthtoken' \
  --password=<YOUR_NGC_API_KEY>

helm repo update
Create a namespace#
oc new-project nim-llm
Grant the nonroot-v2 SCC#
The NIM container runs as UID 1000 (non-root). Grant nonroot-v2 so OpenShift honors the image’s declared user instead of assigning a random namespace UID.
oc adm policy add-scc-to-user nonroot-v2 -z default -n nim-llm
Create secrets#
If you haven’t already done so, create the two NGC secrets: a registry pull secret and an API key secret. Replace <YOUR_NGC_API_KEY> with your key from ngc.nvidia.com.
export NGC_API_KEY=<YOUR_NGC_API_KEY>

# Docker registry pull secret
oc create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY} \
  -n nim-llm

# NGC API key secret (used by NIM at runtime)
oc create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}" \
  -n nim-llm
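Before creating the secrets, it can help to fail fast on a missing or placeholder key. A minimal sketch; the `check_ngc_key` helper is hypothetical, not part of any NIM or NGC tooling:

```shell
# Hypothetical helper: refuse to proceed if NGC_API_KEY is unset or
# still the <YOUR_NGC_API_KEY> placeholder, so the secrets are never
# created with a bad value.
check_ngc_key() {
  case "${NGC_API_KEY:-}" in
    ""|"<YOUR_NGC_API_KEY>")
      echo "NGC_API_KEY is not set" >&2
      return 1
      ;;
    *)
      echo "NGC_API_KEY looks set"
      ;;
  esac
}

export NGC_API_KEY=nvapi-example   # placeholder value for illustration
check_ngc_key                      # prints: NGC_API_KEY looks set
```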
(Optional) Create a model cache PVC#
Skip this step if you don’t want persistent model caching. Without it, the model is re-downloaded on every pod restart.
Adjust storageClassName for your cluster:
| Cluster type | storageClassName |
|---|---|
| OCP with ODF | ocs-storagecluster-cephfs |
| OCP with NFS | depends on your NFS provisioner |
| LVM (NVMe) | lvms-nvme |
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
  namespace: nim-llm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ocs-storagecluster-cephfs # change to your storage class
  resources:
    requests:
      storage: 200Gi
Save the manifest as nim-cache-pvc.yaml and create the PVC:
oc apply -f nim-cache-pvc.yaml
Create a values file#
Save the following as custom-values-openshift.yaml:
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "<NIM_LLM_MODEL_SPECIFIC_TAG>"
  pullSecrets:
    - name: ngc-secret
model:
  name: "nvidia/nemotron-3-nano" # update to match your image
  ngcAPISecret: ngc-api-secret # matches the secret created above
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    cpu: "2"
    memory: 8Gi
nodeSelector:
  nvidia.com/gpu.present: "true"
# Uncomment if you created the model cache PVC above
# persistence:
#   enabled: true
#   existingClaim: "nvidia-nim-cache-pvc"
Deploy with Helm#
helm upgrade --install my-nim nim/nim-llm \
  -f custom-values-openshift.yaml \
  -n nim-llm
Watch the pod come up (first run downloads the model — can take 10-30 min):
oc get pods -n nim-llm -l "app.kubernetes.io/name=nim-llm" -w
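Instead of watching manually, readiness can be polled in a loop. A sketch using a generic `retry` helper (hypothetical, not part of any CLI); the commented line shows how it might wrap the health check once the port-forward from the next step is running:

```shell
# Hypothetical helper: run a command until it succeeds, up to a fixed
# number of attempts, sleeping 1s between tries.
retry() {
  attempts="$1"; shift
  i=1
  until "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep 1
  done
}

# Example (assumes an active port-forward to the NIM service):
# retry 60 curl -sf http://localhost:8000/v1/health/ready
```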
Test the endpoint#
Port-forward and run a smoke test:
oc port-forward svc/my-nim-nim-llm 8000:8000 -n nim-llm &

# Check health
curl http://localhost:8000/v1/health/ready

# List models
curl http://localhost:8000/v1/models

# Chat completion. Update the model name if needed
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{ "model": "nvidia/nemotron-3-nano", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64 }'
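The same request body is used for the port-forward test above and the Route test below, so it can be factored into a tiny helper (`chat_payload` is hypothetical, for illustration only):

```shell
# Hypothetical helper: print an OpenAI-style chat-completions request
# body for a given model and prompt.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "max_tokens": 64}' "$1" "$2"
}

chat_payload "nvidia/nemotron-3-nano" "Hello"

# Usage with curl:
# curl -X POST http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   --data-raw "$(chat_payload nvidia/nemotron-3-nano Hello)"
```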
Expose externally via an OpenShift Route (optional):
oc expose svc/my-nim-nim-llm --port=8000 -n nim-llm
oc get route my-nim-nim-llm -n nim-llm
Cleanup#
helm uninstall my-nim -n nim-llm
oc delete pvc -n nim-llm -l app.kubernetes.io/name=nim-llm
oc delete project nim-llm
Deploy NVIDIA NIM with the NIM Operator#
The NVIDIA NIM Operator enables Kubernetes cluster administrators to operate the software components and services necessary to deploy NVIDIA NIMs and NVIDIA NeMo microservices in Kubernetes.
The Operator can manage the lifecycle of the following microservices and the models they use:
NVIDIA NIM models, such as:
Reasoning LLMs
Retrieval — embedding, reranking, and other functions
Speech
Biology
NeMo core microservices:
NeMo Customizer
NeMo Evaluator
NeMo Guardrails
NeMo platform component microservices:
NeMo Data Store
NeMo Entity Store
Deploying via the NIM Operator adds lifecycle features that streamline developer workflows:
Model Caching: Large Language Models (LLMs) often exceed 100GB, leading to long start-up times. The NIM Operator addresses this with a dedicated caching controller that pre-downloads and stores model weights on Persistent Volumes (PVs), so that when a service scales, pods pull from local storage rather than the remote registry.
Observability: Prometheus metrics track the NIM caches, services, and pipelines deployed to your cluster, alongside several common Kubernetes operator metrics.
Autoscaling: scale on DCGM GPU metrics or on NIM-specific metrics from the service handling requests to your cached models.
Dynamic Resource Allocation (DRA): The Operator communicates directly with the underlying NVIDIA GPU hardware to ensure optimal scheduling. It leverages DRA to match the specific requirements of a model (like GPU memory or compute capability) to the available hardware without manual environment variable tuning.
Government Ready: the NIM Operator carries NVIDIA’s “government ready” designation for software that meets applicable security requirements for FedRAMP High or equivalent sovereign deployments.
Prerequisites#
Prior to deploying the NIM Operator, ensure you have met the prerequisites, then follow the installation instructions for OpenShift to create the custom resource definitions.
Deploy with the NIMService Custom Resource#
The NIM Operator deploys standalone NIMs through the NIMService Custom Resource Definition (CRD). By using the NIMService CRD, platform engineers can treat generative AI models as standard Kubernetes objects, integrating them directly into existing production DevOps workflows. Multiple NIMService resources can be deployed together with a NIMPipeline.
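For reference, a NIMPipeline groups several NIMService specs under one object. A hedged sketch only — the service name and spec below are illustrative, and field names should be verified against the NIM Operator CRD version installed on your cluster:

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: demo-pipeline          # illustrative name
  namespace: nim-demo
spec:
  services:
    - name: llm                # each entry wraps a full NIMService spec
      enabled: true
      spec:
        image:
          repository: nvcr.io/nim/nvidia/nemotron-3-nano
          tag: "latest"
        replicas: 1
        expose:
          service:
            type: ClusterIP
            port: 8000
```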
Create a namespace
oc new-project nim-demo
Create the NIMCache
This pulls and caches the model weights onto a PVC. Adjust storageClass, size, and nodeSelector for your cluster. The ngc-api-secret and ngc-secret referenced below must exist in the nim-demo namespace; create them as shown in the Helm section if you haven’t already.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nvidia-nim-cache-pvc
  namespace: nim-demo
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true" # or pin to a specific node: kubernetes.io/hostname: <node>
  source:
    ngc:
      authSecret: ngc-api-secret
      pullSecret: ngc-secret
      modelPuller: nvcr.io/nim/nvidia/nemotron-3-nano:latest
      model:
        profiles:
          - all
  storage:
    pvc:
      create: true
      size: 300Gi
      storageClass: lvms-nvme # change to your storage class
      volumeAccessMode: ReadWriteOnce
Apply nimcache.yaml, then watch until STATUS is Ready (model download can take 10-30 min):
oc apply -f nimcache.yaml
oc get nimcache -n nim-demo -w
Create the NIMService
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nemotron-3-nano
  namespace: nim-demo
spec:
  authSecret: ngc-api-secret
  image:
    repository: nvcr.io/nim/nvidia/nemotron-3-nano
    tag: "latest"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  inferencePlatform: standalone
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
  nodeSelector:
    nvidia.com/gpu.present: "true"
  storage:
    nimCache:
      name: nvidia-nim-cache-pvc # must match the NIMCache name above
    pvc: {}
  expose:
    service:
      type: ClusterIP
      port: 8000
  env:
    - name: NIM_CACHE_PATH
      value: /model-store
    - name: NGC_API_KEY
      valueFrom:
        secretKeyRef:
          name: ngc-api-secret
          key: NGC_API_KEY
  livenessProbe: {}
  readinessProbe: {}
  startupProbe: {}
  scale:
    hpa:
      minReplicas: 1
      maxReplicas: 0 # 0 = HPA disabled
  metrics:
    serviceMonitor: {}
Apply nimservice.yaml:
oc apply -f nimservice.yaml

# Watch until `STATUS` is `Ready`:
oc get nimservice -n nim-demo -w
Test the Endpoint
oc port-forward svc/nemotron-3-nano 8000:8000 -n nim-demo &
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-nano",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
Expose externally via OpenShift Route
oc expose svc/nemotron-3-nano -n nim-demo
oc get route nemotron-3-nano -n nim-demo
Deploy with KServe#
As an alternative to NIMService, the NVIDIA NIM Operator also supports deploying NIMs through KServe, a purpose-built Kubernetes controller that automates the deployment, scaling, and management of model-serving workloads. While deploying with OpenShift AI (described later in this guide) also uses KServe, use this method if:
You’re running RHOAI or ODH and want deployed models to appear in the dashboard
You need the full KServe feature set: canary rollouts, traffic splitting, request batching, gRPC/v2 protocol support
You want autoscaling via KEDA or Knative (scale-to-zero on serverless deployments)
You need a standardized InferenceService API that works across multiple runtimes (NIM, vLLM, Triton, OpenVINO, etc.)
You want built-in token auth via Authorino / ODH model serving auth
Your team already manages other models through KServe and wants consistency
Create a namespace
oc new-project nim-demo
Create the PVC
Adjust storageClass, size, and nodeSelector for your cluster. The ngc-secret and ngc-api-secret referenced by the ServingRuntime below must exist in the nim-demo namespace.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nemotron-nano-pvc
  namespace: nim-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: lvms-nvme # change to your storage class
Create the ServingRuntime
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nemotron-3-nano
  namespace: nim-demo
  annotations:
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: NVIDIA NIM
    opendatahub.io/template-name: nvidia-nim-runtime
    openshift.io/display-name: nemotron-3-nano
    runtimes.opendatahub.io/nvidia-nim: "true"
spec:
  multiModel: false
  protocolVersions:
    - grpc-v2
    - v2
  supportedModelFormats:
    - name: nemotron-3-nano
      version: latest
      autoSelect: false
      priority: 1
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8000"
  imagePullSecrets:
    - name: ngc-secret
  containers:
    - name: kserve-container
      image: nvcr.io/nim/nvidia/nemotron-3-nano:latest
      ports:
        - containerPort: 8000
          protocol: TCP
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/models/cache
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ngc-api-secret
              key: NGC_API_KEY
        - name: NIM_SERVED_MODEL_NAME
          value: nemotron-3-nano
      volumeMounts:
        - name: nim-model-cache
          mountPath: /mnt/models/cache
        - name: nim-workspace
          mountPath: /opt/nim/workspace
        - name: nim-cache
          mountPath: /.cache
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: nim-model-cache
      persistentVolumeClaim:
        claimName: nemotron-nano-pvc
    - name: nim-workspace
      emptyDir: {}
    - name: nim-cache
      emptyDir: {}
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi
Apply:
oc apply -f serving-runtime.yaml
Create the InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: nemotron-3-nano
  namespace: nim-demo
  labels:
    networking.kserve.io/visibility: exposed
    opendatahub.io/dashboard: "true"
  annotations:
    openshift.io/display-name: Nemotron-3-Nano
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/stop: "false"
    haproxy.router.openshift.io/timeout: 300s
spec:
  predictor:
    automountServiceAccountToken: false
    minReplicas: 1
    maxReplicas: 1
    nodeSelector:
      nvidia.com/gpu.present: "true" # or use feature.node.kubernetes.io/rtx: "true" for RTX nodes
    model:
      runtime: nemotron-3-nano # must match the ServingRuntime name above
      modelFormat:
        name: nemotron-3-nano
      name: ""
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "1"
      env:
        - name: HOME
          value: /.cache
        - name: TRITON_CACHE_DIR
          value: /.cache/triton
Apply:
oc apply -f inference-service.yaml

# Watch until `READY` is `True` (first run downloads the model, can take 10-20 min):
oc get inferenceservice -n nim-demo -w
Test the external URL
export NIM_URL=$(oc get inferenceservice nemotron-3-nano -n nim-demo \
  -o jsonpath='{.status.url}')

curl ${NIM_URL}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-3-nano",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
Refer to the NIM Operator documentation for further details on deploying on Red Hat OpenShift.