Deploying AI workloads on Red Hat OpenShift#

Deploy NVIDIA LLM NIM with Helm#

Deploy a NIM LLM inference server on OpenShift using the official Helm chart. This is the simplest path for standalone inference — no NIM Operator or KServe required. Refer to the latest documentation for more options.

Add the NIM Helm repo#

helm repo add nim https://helm.ngc.nvidia.com/nim/nvidia/ \
  --username='$oauthtoken' \
  --password=<YOUR_NGC_API_KEY>

helm repo update

Create a namespace#

oc new-project nim-llm

Grant the nonroot-v2 SCC#

The NIM container runs as UID 1000 (non-root). Grant nonroot-v2 so OpenShift honors the image’s declared user instead of assigning a random namespace UID.

oc adm policy add-scc-to-user nonroot-v2 -z default -n nim-llm
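To confirm the grant took effect, you can check the SCC annotation that OpenShift stamps on admitted pods. A small helper sketch, assuming the Helm chart's default `app.kubernetes.io/name=nim-llm` label (run it after the deployment below):

```shell
# Print the SCC that OpenShift actually applied to the NIM pod; it is
# recorded in the pod's openshift.io/scc annotation at admission time.
nim_pod_scc() {
  oc get pod -n nim-llm -l app.kubernetes.io/name=nim-llm \
    -o jsonpath='{.items[0].metadata.annotations.openshift\.io/scc}'
}
```

Expect `nonroot-v2` in the output once the pod is running.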

Create secrets#

If you haven’t already done so, create the NGC pull secret. Replace <YOUR_NGC_API_KEY> with your key from ngc.nvidia.com.

export NGC_API_KEY=<YOUR_NGC_API_KEY>

# Docker registry pull secret
oc create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=${NGC_API_KEY} \
  -n nim-llm

# NGC API key secret (used by NIM at runtime)
oc create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}" \
  -n nim-llm

(Optional) Create a model cache PVC#

Skip this step if you don’t want persistent model caching. Without it, the model is re-downloaded on every pod restart.

Adjust storageClassName for your cluster:

Cluster type      storageClassName
OCP with ODF      ocs-storagecluster-cephfs
OCP with NFS      nfs-client
LVM (NVMe)        lvms-nvme

Listing 1 nim-cache-pvc.yaml#
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nvidia-nim-cache-pvc
  namespace: nim-llm
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ocs-storagecluster-cephfs # change to your storage class
  resources:
    requests:
      storage: 200Gi

Create the PVC:

oc apply -f nim-cache-pvc.yaml

Create a values file#

Listing 2 custom-values-openshift.yaml#
image:
  repository: <NIM_LLM_MODEL_SPECIFIC_IMAGE>
  tag: "<NIM_LLM_MODEL_SPECIFIC_TAG>"
  pullSecrets:
    - name: ngc-secret

model:
  name: "nvidia/nemotron-3-nano" # update to match your model image

ngcAPISecret: ngc-api-secret # matches the NGC API key secret created above
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    cpu: "2"
    memory: 8Gi
nodeSelector:
  nvidia.com/gpu.present: "true"
# Uncomment if you created the model cache PVC above
# persistence:
#   enabled: true
#   existingClaim: "nvidia-nim-cache-pvc"

Deploy with Helm#

helm upgrade --install my-nim nim/nim-llm \
  -f custom-values-openshift.yaml \
  -n nim-llm

Watch the pod come up (first run downloads the model — can take 10-30 min):

oc get pods -n nim-llm -l "app.kubernetes.io/name=nim-llm" -w
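Instead of watching pods by hand, you can poll the health endpoint until it answers. A minimal sketch, assuming the service is reachable on localhost:8000 (for example via `oc port-forward`, as in the next step):

```shell
# Poll the NIM readiness endpoint until it returns HTTP 2xx, then stop.
wait_for_nim() {
  local url=${1:-http://localhost:8000/v1/health/ready}
  local retries=${2:-60}
  while [ "$retries" -gt 0 ]; do
    if curl -sf "$url" > /dev/null; then
      echo "NIM is ready"
      return 0
    fi
    retries=$((retries - 1))
    sleep 30   # model download dominates startup; poll slowly
  done
  echo "timed out waiting for NIM" >&2
  return 1
}
```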

Test the endpoint#

Port-forward and run a smoke test:

oc port-forward svc/my-nim-nim-llm 8000:8000 -n nim-llm &

# Check health
curl http://localhost:8000/v1/health/ready

# List models
curl http://localhost:8000/v1/models

# Chat completion. Update the model name if needed
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  --data-raw '{ "model": "nvidia/nemotron-3-nano", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64 }'
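The chat call is handy to wrap in a small helper. A sketch, assuming the port-forward above is active and `jq` is installed; the prompt is interpolated directly into the JSON body, so keep it JSON-safe:

```shell
# Send a one-shot prompt and print only the assistant's reply text.
nim_chat() {
  local model=$1; shift
  curl -s -X POST http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    --data-raw "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"$*\"}], \"max_tokens\": 64}" \
    | jq -r '.choices[0].message.content'
}
```

For example: `nim_chat nvidia/nemotron-3-nano "Hello"`.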

Expose externally via an OpenShift Route (optional):

oc expose svc/my-nim-nim-llm --port=8000 -n nim-llm
oc get route my-nim-nim-llm -n nim-llm

Cleanup#

helm uninstall my-nim -n nim-llm
oc delete pvc -n nim-llm -l app.kubernetes.io/name=nim-llm
oc delete project nim-llm

Deploy NVIDIA NIM with the NIM Operator#

The NVIDIA NIM Operator enables Kubernetes cluster administrators to operate the software components and services necessary to deploy NVIDIA NIMs and NVIDIA NeMo microservices in Kubernetes.

The Operator can manage the lifecycle of the following microservices and the models they use:

  • NVIDIA NIM models, such as:

    • Reasoning LLMs

    • Retrieval — embedding, reranking, and other functions

    • Speech

    • Biology

  • NeMo core microservices:

    • NeMo Customizer

    • NeMo Evaluator

    • NeMo Guardrails

  • NeMo platform component microservices:

    • NeMo Data Store

    • NeMo Entity Store

Deploying via the NIM Operator adds several capabilities that streamline developer workflows:

  • Model Caching: Large Language Models (LLMs) often exceed 100GB, leading to long start-up times. The NIM Operator solves this with a dedicated caching controller: it pre-downloads and stores model weights on Persistent Volumes (PVs), so that when a service scales, pods pull from local storage rather than the remote registry.

  • Observability: Prometheus metrics track the NIM caches, services, and pipelines deployed to your cluster, along with several common Kubernetes operator metrics.

  • Autoscaling: scale based on DCGM GPU metrics or on NIM-specific metrics from the service handling requests to your cached models.

  • Dynamic Resource Allocation (DRA): The Operator communicates directly with the underlying NVIDIA GPU hardware to ensure optimal scheduling. It leverages DRA to match the specific requirements of a model (like GPU memory or compute capability) to the available hardware without manual environment variable tuning.

  • Government Ready: the NVIDIA NIM Operator carries NVIDIA’s government-ready designation for software that meets the applicable security requirements for deployment in FedRAMP High or equivalent sovereign use cases.

Prerequisites#

Prior to deploying the NIM Operator, ensure you have met the prerequisites, then follow the installation instructions for OpenShift to create the custom resource definitions.

Deploy with the NIMService Custom Resource#

The NIM Operator deploys standalone NIMs as NIMService custom resources, defined by a Custom Resource Definition (CRD). By using the NIMService CRD, platform engineers can treat generative AI models as standard Kubernetes objects, integrating them directly into existing production DevOps workflows. Multiple NIMService resources can be deployed together with a NIMPipeline.
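For grouping, a NIMPipeline wraps multiple NIMService specs in a single resource. A minimal sketch with illustrative names; treat the field shapes as an assumption to verify against the NIM Operator API reference (each entry’s `spec` takes a full NIMService spec like the one shown later in this section):

```yaml
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMPipeline
metadata:
  name: demo-pipeline        # hypothetical name
  namespace: nim-demo
spec:
  services:
    - name: nemotron-3-nano
      enabled: true          # toggle individual services on or off
      spec: {}               # a full NIMService spec goes here
```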

Create a namespace

oc new-project nim-demo

Create the NIMCache

This pulls and caches the model weights onto a PVC. Adjust storageClass, size, and nodeSelector for your cluster.

Listing 3 nimcache.yaml#
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nvidia-nim-cache-pvc
  namespace: nim-demo
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"   # or pin to a specific node: kubernetes.io/hostname: <node>
  source:
    ngc:
      authSecret: ngc-api-secret
      pullSecret: ngc-secret
      modelPuller: nvcr.io/nim/nvidia/nemotron-3-nano:latest
      model:
        profiles:
          - all
  storage:
    pvc:
      create: true
      size: 300Gi
      storageClass: lvms-nvme     # change to your storage class
      volumeAccessMode: ReadWriteOnce

Apply nimcache.yaml and watch until STATUS is Ready (model download can take 10-30 min):

oc apply -f nimcache.yaml

oc get nimcache -n nim-demo -w

Create the NIMService

Listing 4 nimservice.yaml#
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: nemotron-3-nano
  namespace: nim-demo
spec:
  authSecret: ngc-api-secret
  image:
    repository: nvcr.io/nim/nvidia/nemotron-3-nano
    tag: "latest"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  inferencePlatform: standalone
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
  nodeSelector:
    nvidia.com/gpu.present: "true"
  storage:
    nimCache:
      name: nvidia-nim-cache-pvc   # must match the NIMCache name above
    pvc: {}
  expose:
    service:
      type: ClusterIP
      port: 8000
  env:
    - name: NIM_CACHE_PATH
      value: /model-store
    - name: NGC_API_KEY
      valueFrom:
        secretKeyRef:
          name: ngc-api-secret
          key: NGC_API_KEY
  livenessProbe: {}
  readinessProbe: {}
  startupProbe: {}
  scale:
    hpa:
      minReplicas: 1
      maxReplicas: 0   # 0 = HPA disabled
  metrics:
    serviceMonitor: {}

Apply nimservice.yaml:

oc apply -f nimservice.yaml

# Watch until `STATUS` is `Ready`:
oc get nimservice -n nim-demo -w

Test the Endpoint

oc port-forward svc/nemotron-3-nano 8000:8000 -n nim-demo &
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/nemotron-3-nano",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'

Expose externally via OpenShift Route

oc expose svc/nemotron-3-nano -n nim-demo
oc get route nemotron-3-nano -n nim-demo

Deploy with KServe#

As an alternative to NIMService, the NVIDIA NIM Operator also supports deploying NIMs through KServe. This method deploys a NIM managed by a purpose-built Kubernetes controller that automates the deployment, scaling, and management of NIM microservices. While deploying with OpenShift AI (described later in this guide) also uses KServe, use this method if:

  • You’re running RHOAI or ODH and want deployed models to be visible in the dashboard

  • You need the full KServe feature set: canary rollouts, traffic splitting, request batching, gRPC/v2 protocol support

  • You want autoscaling via KEDA or Knative (scale-to-zero on serverless deployments)

  • You need a standardized InferenceService API that works across multiple runtimes (NIM, vLLM, Triton, OpenVINO, etc.)

  • You want built-in token auth via Authorino / ODH model serving auth

  • Your team already manages other models through KServe and wants consistency

Create a namespace

oc new-project nim-demo

Create the PVC

Adjust storageClassName and size for your cluster.

Listing 5 nim-pvc.yaml#
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nemotron-nano-pvc
  namespace: nim-demo
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: lvms-nvme   # change to your storage class

Create the ServingRuntime

Listing 6 serving-runtime.yaml#
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: nemotron-3-nano
  namespace: nim-demo
  annotations:
    opendatahub.io/apiProtocol: REST
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
    opendatahub.io/template-display-name: NVIDIA NIM
    opendatahub.io/template-name: nvidia-nim-runtime
    openshift.io/display-name: nemotron-3-nano
    runtimes.opendatahub.io/nvidia-nim: "true"
spec:
  multiModel: false
  protocolVersions:
    - grpc-v2
    - v2
  supportedModelFormats:
    - name: nemotron-3-nano
      version: latest
      autoSelect: false
      priority: 1
  annotations:
    prometheus.io/path: /metrics
    prometheus.io/port: "8000"
  imagePullSecrets:
    - name: ngc-secret
  containers:
    - name: kserve-container
      image: nvcr.io/nim/nvidia/nemotron-3-nano:latest
      ports:
        - containerPort: 8000
          protocol: TCP
      env:
        - name: NIM_CACHE_PATH
          value: /mnt/models/cache
        - name: NGC_API_KEY
          valueFrom:
            secretKeyRef:
              name: ngc-api-secret
              key: NGC_API_KEY
        - name: NIM_SERVED_MODEL_NAME
          value: nemotron-3-nano
      volumeMounts:
        - name: nim-model-cache
          mountPath: /mnt/models/cache
        - name: nim-workspace
          mountPath: /opt/nim/workspace
        - name: nim-cache
          mountPath: /.cache
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: nim-model-cache
      persistentVolumeClaim:
        claimName: nemotron-nano-pvc
    - name: nim-workspace
      emptyDir: {}
    - name: nim-cache
      emptyDir: {}
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 2Gi

Apply:

oc apply -f serving-runtime.yaml

Create the InferenceService

Listing 7 inference-service.yaml#
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: nemotron-3-nano
  namespace: nim-demo
  labels:
    networking.kserve.io/visibility: exposed
    opendatahub.io/dashboard: "true"
  annotations:
    openshift.io/display-name: Nemotron-3-Nano
    serving.kserve.io/deploymentMode: RawDeployment
    serving.kserve.io/stop: "false"
    haproxy.router.openshift.io/timeout: 300s
spec:
  predictor:
    automountServiceAccountToken: false
    minReplicas: 1
    maxReplicas: 1
    nodeSelector:
      nvidia.com/gpu.present: "true"   # or use feature.node.kubernetes.io/rtx: "true" for RTX nodes
    model:
      runtime: nemotron-3-nano         # must match the ServingRuntime name above
      modelFormat:
        name: nemotron-3-nano
      name: ""
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: "1"
        limits:
          cpu: "8"
          memory: 64Gi
          nvidia.com/gpu: "1"
      env:
        - name: HOME
          value: /.cache
        - name: TRITON_CACHE_DIR
          value: /.cache/triton

Apply:

oc apply -f inference-service.yaml

# Watch until `READY` is `True` (first run downloads the model; can take 10-20 min):
oc get inferenceservice -n nim-demo -w

Test the external URL

export NIM_URL=$(oc get inferenceservice nemotron-3-nano -n nim-demo \
  -o jsonpath='{.status.url}')

curl ${NIM_URL}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nemotron-3-nano",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'

Refer to the NIM Operator documentation for full details on deploying on Red Hat OpenShift.