KServe Support on NIM Operator#

About KServe Support#

The NIM Operator supports both raw deployment and serverless deployment of NIM through KServe on Kubernetes clusters, including Red Hat OpenShift Container Platform.

NIM Operator with KServe provides two additional benefits:

  1. Intelligent Caching with NIMCache to reduce initial inference time and autoscaling latency, resulting in faster and more responsive deployments.

  2. NeMo Microservices support to evaluate, guardrail, and enhance AI systems across key metrics such as latency, accuracy, cost, and compliance.

The Operator configures KServe deployments by creating an InferenceService for each NIMService, which manages deployment, upgrades, ingress, and autoscaling of the NIM.

Diagram 1. NIM Operator and KServe interaction
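
Two settings on the NIMService select this path. A minimal excerpt, not a complete manifest (full examples appear later on this page):

    spec:
      inferencePlatform: kserve                           # hand the deployment off to KServe
      annotations:
        serving.kserve.io/deploymentMode: 'RawDeployment' # omit to fall back to the cluster's default KServe deployment mode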

KServe Deployment Modes Comparison#

Autoscaling

  • Serverless (Knative): Scales pods automatically based on request load (speed and traffic) and can scale all the way to zero when unused. Configured by providing Knative annotations in the NIMService spec.annotations field.

  • Raw Deployment: Uses Horizontal/Vertical Pod Autoscalers with custom NIM metrics; cannot scale to zero by default. Configured by providing the spec.scale.hpa config in the NIMService.

Upgrades

  • Serverless (Knative): Every new model version creates a new revision. You can send only a portion of traffic to test new versions and roll back easily by shifting traffic. This is managed automatically by KServe, with no special parameters required in the NIMService API. Refer to the Knative documentation on gradual rollouts and traffic management for configuring canary rollouts.

  • Raw Deployment: Uses rolling updates; rollback is manual if something goes wrong. This is not configurable: only the RollingUpdate strategy of a native Kubernetes Deployment is supported.

Ingress

  • Serverless (Knative): Uses a Knative gateway (such as Istio), gives each model revision a stable domain/URL, provides built-in secure connections, and queues requests to handle overload. This is managed automatically by KServe, with no special parameters added; the domain name must be configured during KServe installation (default example.com).

  • Raw Deployment: Exposed using a Kubernetes Service plus an Ingress/LoadBalancer (NGINX, Istio, and so on). Security (mTLS, certificates) must be set up manually.

NIM Metrics

  • Serverless (Knative): Metrics and tracing are built in through Knative, KServe, and the service mesh; no extra configuration in the NIMService is required.

  • Raw Deployment: You must build your own stack, for example Prometheus, Grafana, and OpenTelemetry. Enable a ServiceMonitor by setting the NIMService spec.metrics.enabled field to true and providing the spec.metrics.serviceMonitor config.

GPU Resources

  • Both modes: Passed using the spec.resources field.

Multi-Node

  • Not supported in either mode.

Dynamic Resource Allocation (DRA)

  • Not supported in either mode.

Tolerations

  • Serverless (Knative): Passed through spec.tolerations in the NIMService; the corresponding feature gate must be enabled in the Knative config.

  • Raw Deployment: Passed through spec.tolerations in the NIMService.

RuntimeClassName

  • Serverless (Knative): Passed through spec.runtimeClassName in the NIMService; the corresponding feature gate must be enabled in the Knative config.

  • Raw Deployment: Passed through spec.runtimeClassName in the NIMService.

NodeSelectors

  • Serverless (Knative): Passed through spec.nodeSelectors in the NIMService; the corresponding feature gate must be enabled in the Knative config.

  • Raw Deployment: Passed through spec.nodeSelectors in the NIMService.

Custom Scheduler Name

  • Serverless (Knative): Passed through spec.schedulerName in the NIMService; the corresponding feature gate must be enabled in the Knative config.

  • Raw Deployment: Passed through spec.schedulerName in the NIMService.
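
To make the autoscaling comparison concrete, the following excerpts are taken from the full manifests later on this page:

    # Serverless: Knative autoscaling, configured as annotations on the NIMService.
    spec:
      annotations:
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: "10"

    # Raw deployment: a Horizontal Pod Autoscaler, configured through spec.scale.
    spec:
      scale:
        enabled: true
        hpa:
          minReplicas: 1
          maxReplicas: 3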

Select Your Deployment Environment and Type#

  • Raw Deployment on Standard Kubernetes

  • Raw Deployment on Red Hat OpenShift

  • Serverless Deployment on Standard Kubernetes

Note

This documentation uses kubectl for Kubernetes examples and oc for OpenShift examples. Both tools provide similar functionality for their respective platforms.

Raw Deployment Example on Standard Kubernetes#

Summary#

For raw deployment of NIM through KServe on a standard Kubernetes installation, follow these steps:

  1. Install KServe.

  2. Optional: Create NIM Cache.

  3. Deploy NIM through KServe.

1. Install KServe in Raw Deployment Mode#

Note

For security, consider downloading and reviewing the script before execution in production environments.

Run the following command to execute the KServe quick install script:

$ curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash -s - -r

For more information, refer to Getting Started with KServe.

The following components are deployed by the KServe quick install script:

KServe

  • Installs KServe CRDs, such as InferenceService.

  • Installs KServe controller in the kserve namespace.

  • Provides model serving, autoscaling, and inference management on Kubernetes.

Gateway API CRDs

  • Installs the Gateway API CustomResourceDefinitions: GatewayClass, Gateway, GRPCRoute, HTTPRoute, ReferenceGrant.

  • Provides modern networking primitives for routing traffic into services.

Istio (Service Mesh)

Deployed into the istio-system namespace with Helm charts:

  • istio-base: core Istio CRDs and cluster-scoped resources.

  • istiod: Istio control plane (pilot, configuration, service discovery).

  • istio-ingressgateway: data plane ingress gateway for external traffic.

Cert-Manager

  • Installed in the cert-manager namespace.

  • Handles TLS certificate provisioning and management.

  • Required for automatic HTTPS and secure communication.

Verify the installation by checking each component:

  1. KServe

    $ kubectl get pods -n kserve
    
    Example output
    NAME                                         READY   STATUS    RESTARTS   AGE
    kserve-controller-manager-85768d7b78-lrmss   2/2     Running   0          2m56s
    
  2. Istio System

    $ kubectl get pods -n istio-system
    
    Example output
    NAME                                    READY   STATUS    RESTARTS   AGE
    istio-ingressgateway-74949b4866-hk5nf   1/1     Running   0          3m50s
    istiod-77bbc7c8bb-qkwdt                 1/1     Running   0          4m1s
    
  3. Cert Manager

    $ kubectl get pods -n cert-manager
    
    Example output
    NAME                                       READY   STATUS    RESTARTS   AGE
    cert-manager-74b7f6cbbc-7mrrw              1/1     Running   0          4m18s
    cert-manager-cainjector-58c9d76cb8-t8bxb   1/1     Running   0          4m18s
    cert-manager-webhook-5875b545cf-dhhfb      1/1     Running   0          4m18s
    

Note

To uninstall KServe, follow the instructions in Uninstalling KServe.

2. Optional: Create NIM Cache#

Note

Refer to prerequisites for more information on using NIM Cache.

To cache a model from NVIDIA NGC:

  1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: "tensorrt"
            tensorParallelism: "1"
      storage:
        pvc:
          create: true
          storageClass:
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f nimcache.yaml
    
To cache a model from Hugging Face:

  1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: nim-cache-multi-llm
      namespace: nim-service
    spec:
      source:
        hf:
          endpoint: "https://huggingface.co"
          namespace: "nvidia"
          authSecret: hf-api-secret
          modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
          pullSecret: ngc-secret
          modelName: "Llama-3.1-Nemotron-Nano-8B-v1"
      storage:
        pvc:
          create: true
          storageClass: ''
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f nimcache.yaml
    
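With either source, wait for the cache to finish downloading before deploying a NIMService against it. A sketch of one way to block until it is ready, assuming the NIMCache status exposes a state field that reaches Ready (analogous to the NIMService status shown later on this page):

$ kubectl get nimcache -n nim-service
$ kubectl wait --for=jsonpath='{.status.state}'=Ready \
    nimcache/meta-llama-3-2-1b-instruct -n nim-service --timeout=30m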

3. Deploy NIM through KServe as a Raw Deployment#

To deploy the NIM cached from NGC, with HPA-based autoscaling:

  1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      inferencePlatform: kserve
      annotations:
        serving.kserve.io/deploymentMode: 'RawDeployment'
      scale:
        enabled: true
        hpa:
          minReplicas: 1
          maxReplicas: 3
          metrics:
          - type: "Resource"
            resource:
              name: "cpu"
              target:
                type: "Utilization"
                averageUtilization: 80
      image:
        repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
        tag: "1.12.0"
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama-3-2-1b-instruct
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest for raw deployment:

    $ kubectl create -f nimservice.yaml -n nim-service
    
  3. Verify that the inference service has been created:

    1. View inference service details:

      $ kubectl get inferenceservice -n nim-service -o yaml
      
      Example output
        apiVersion: v1
        items:
        - apiVersion: serving.kserve.io/v1beta1
          kind: InferenceService
          metadata:
            annotations:
              nvidia.com/last-applied-hash: 757475558b
              nvidia.com/parent-spec-hash: b978f49f7
              openshift.io/required-scc: nonroot
              serving.kserve.io/autoscalerClass: hpa
              serving.kserve.io/deploymentMode: RawDeployment
              serving.kserve.io/enable-metric-aggregation: "true"
              serving.kserve.io/enable-prometheus-scraping: "true"
              temp: temp-1
            creationTimestamp: "2025-07-25T14:58:06Z"
            finalizers:
            - inferenceservice.finalizers
            generation: 1
            labels:
              app.kubernetes.io/instance: meta-llama-3-2-1b-instruct
              app.kubernetes.io/managed-by: k8s-nim-operator
              app.kubernetes.io/name: meta-llama-3-2-1b-instruct
              app.kubernetes.io/operator-version: ""
              app.kubernetes.io/part-of: nim-service
              networking.kserve.io/visibility: cluster-local
              temp2: temp-2
            name: meta-llama-3-2-1b-instruct
            namespace: nim-service
            ownerReferences:
            - apiVersion: apps.nvidia.com/v1alpha1
              blockOwnerDeletion: true
              controller: true
              kind: NIMService
              name: meta-llama-3-2-1b-instruct
              uid: aaa7e95a-81e4-404a-a1de-6dce898f9937
            resourceVersion: "43983337"
            uid: c92a7aa9-4954-4959-86b3-ad8aa6c39ca8
          spec:
            predictor:
              annotations:
                openshift.io/required-scc: nonroot
                serving.kserve.io/deploymentMode: RawDeployment
                temp: temp-1
              containers:
              - env:
                - name: MY_ENV
                  value: my-value
                - name: NIM_CACHE_PATH
                  value: /model-store
                - name: NGC_API_KEY
                  valueFrom:
                    secretKeyRef:
                      key: NGC_API_KEY
                      name: ngc-api-secret
                - name: OUTLINES_CACHE_DIR
                  value: /tmp/outlines
                - name: NIM_SERVER_PORT
                  value: "8000"
                - name: NIM_HTTP_API_PORT
                  value: "8000"
                - name: NIM_JSONL_LOGGING
                  value: "1"
                - name: NIM_LOG_LEVEL
                  value: INFO
                image: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.8
                imagePullPolicy: IfNotPresent
                livenessProbe:
                  failureThreshold: 3
                  httpGet:
                    path: /v1/health/live
                    port: api
                  initialDelaySeconds: 15
                  periodSeconds: 10
                  successThreshold: 1
                  timeoutSeconds: 1
                name: kserve-container
                ports:
                - containerPort: 8000
                  name: api
                  protocol: TCP
                readinessProbe:
                  failureThreshold: 3
                  httpGet:
                    path: /v1/health/ready
                    port: api
                  initialDelaySeconds: 15
                  periodSeconds: 10
                  successThreshold: 1
                  timeoutSeconds: 1
                resources:
                  limits:
                    cpu: "12"
                    memory: 32Gi
                    nvidia.com/gpu: "1"
                  requests:
                    cpu: "12"
                    memory: 32Gi
                    nvidia.com/gpu: "1"
                startupProbe:
                  failureThreshold: 30
                  httpGet:
                    path: /v1/health/ready
                    port: api
                  initialDelaySeconds: 30
                  periodSeconds: 10
                  successThreshold: 1
                  timeoutSeconds: 1
                volumeMounts:
                - mountPath: /model-store
                  name: model-store
                - mountPath: /dev/shm
                  name: dshm
              deploymentStrategy:
                rollingUpdate:
                  maxSurge: 0
                  maxUnavailable: 25%
                type: RollingUpdate
              imagePullSecrets:
              - name: ngc-secret
              labels:
                app: meta-llama-3-2-1b-instruct
              maxReplicas: 3
              minReplicas: 1
              scaleMetric: cpu
              scaleMetricType: Utilization
              scaleTarget: 80
              securityContext:
                fsGroup: 2500
                runAsGroup: 2500
                runAsUser: 2500
              serviceAccountName: meta-llama-3-2-1b-instruct
              volumes:
              - emptyDir:
                  medium: Memory
                name: dshm
              - name: model-store
                persistentVolumeClaim:
                  claimName: meta-llama-3-2-1b-instruct-pvc
          status:
            address:
              url: http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local
            components:
              predictor:
                url: http://meta-llama-3-2-1b-instruct-predictor-nim-service.example.com
            conditions:
            - lastTransitionTime: "2025-07-25T14:58:06Z"
              status: "True"
              type: IngressReady
            - lastTransitionTime: "2025-07-25T14:58:06Z"
              status: "True"
              type: PredictorReady
            - lastTransitionTime: "2025-07-25T14:58:06Z"
              status: "True"
              type: Ready
            deploymentMode: RawDeployment
            modelStatus:
              copies:
                failedCopies: 0
                totalCopies: 1
              states:
                activeModelState: Loaded
                targetModelState: Loaded
              transitionStatus: UpToDate
            observedGeneration: 1
            url: http://meta-llama-3-2-1b-instruct-nim-service.example.com
        kind: List
        metadata:
          resourceVersion: ""
      
    2. View inference service status:

      $ kubectl get inferenceservice -n nim-service
      
      Example output
      NAME                         URL                                                          READY   AGE
      meta-llama-3-2-1b-instruct   http://meta-llama-3-2-1b-instruct-nim-service.example.com    True    100s
      
    3. View NIM Service status:

      $ kubectl get nimservice -n nim-service -o json | jq .items[0].status
      
      Example output
      {
        "conditions": [
          {
            "lastTransitionTime": "2025-07-25T15:02:09Z",
            "message": "",
            "reason": "Ready",
            "status": "True",
            "type": "Ready"
          },
          {
            "lastTransitionTime": "2025-07-25T14:58:06Z",
            "message": "",
            "reason": "Ready",
            "status": "False",
            "type": "Failed"
          }
        ],
        "model": {
          "clusterEndpoint": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local",
          "externalEndpoint": "http://meta-llama-3-2-1b-instruct-nim-service.example.com",
          "name": "meta/llama-3.2-1b-instruct"
        },
        "state": "Ready"
      }
      
  4. Verify that the HPA has been created:

    $ kubectl get hpa -n nim-service meta-llama-3-2-1b-instruct-predictor -o yaml
    
To deploy the multi-LLM NIM cached from Hugging Face:

  1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nim-service-multi-llm
      namespace: nim-service
    spec:
      inferencePlatform: kserve
      annotations:
        serving.kserve.io/deploymentMode: 'RawDeployment'
      image:
        repository: nvcr.io/nim/nvidia/llm-nim
        tag: "1.12"
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nim-cache-multi-llm
          profile: 'tensorrt_llm'
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest for raw deployment:

    $ kubectl create -f nimservice.yaml -n nim-service
    
  3. Verify that the inference service has been created:

    $ kubectl get inferenceservice -n nim-service nim-service-multi-llm -o yaml
    
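With either variant deployed and Ready, you can smoke-test inference from inside the cluster against the cluster-local predictor endpoint reported in the NIMService status (this mirrors the inference test in the OpenShift example below; substitute the endpoint and model name for the multi-LLM variant):

$ kubectl -n nim-service run curltest --rm -i --image=curlimages/curl --restart=Never -- \
  curl -s http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"meta/llama-3.2-1b-instruct","messages":[{"role":"user","content":"Hello!"}]}'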

Raw Deployment Example on Red Hat OpenShift#

Summary#

For raw deployment of NIM through KServe using Red Hat OpenShift, follow these steps:

  1. Install KServe using OpenShift interface.

  2. Create NIM Cache using OpenShift.

  3. Deploy NIM through KServe using OpenShift.

1. Install KServe in Raw Deployment Mode Using OpenShift#

  1. To install the OpenShift AI Operator, follow the instructions on the Red Hat website for installing the single-model serving platform.

    Figure 1. OpenShift web console

    Figure 2. Interface to install OpenShift AI Operator

    Follow the steps for standard deployment (OpenShift’s term for raw deployment mode). Select the following settings:

    • Update channel: stable

    • Version: 2.22.1

    • Operator recommended namespace: redhat-ods-operator

    Note

    There is also an advanced mode, which is equivalent to serverless; however, the NIM Operator does not support it in this release.

  2. Create an instance of DataScienceCluster (DSC):

    Figure 3. Interface to create instance of DSC

    Use the following YAML to create the DSC:

    apiVersion: datasciencecluster.opendatahub.io/v1
    kind: DataScienceCluster
    metadata:
      labels:
        app.kubernetes.io/created-by: rhods-operator
        app.kubernetes.io/instance: default-dsc
        app.kubernetes.io/managed-by: kustomize
        app.kubernetes.io/name: datasciencecluster
        app.kubernetes.io/part-of: rhods-operator
      name: default-dsc
    spec:
      components:
        codeflare:
          managementState: Managed
        kserve:
          defaultDeploymentMode: RawDeployment
          managementState: Managed
          nim:
            managementState: Managed
          rawDeploymentServiceConfig: Headed
          serving:
            ingressGateway:
              certificate:
                type: OpenshiftDefaultIngress
            managementState: Removed
            name: knative-serving
        modelregistry:
          registriesNamespace: rhoai-model-registries
        feastoperator: {}
        trustyai: {}
        ray: {}
        kueue: {}
        workbenches:
          workbenchNamespace: rhods-notebooks
        dashboard: {}
        modelmeshserving: {}
        llamastackoperator: {}
        datasciencepipelines: {}
        trainingoperator: {}
    
  3. Create an instance of DSCInitialization (DSCI):

    Figure 4. Interface to create instance of DSI

    Use the following YAML to create the DSCI:

    apiVersion: dscinitialization.opendatahub.io/v1
    kind: DSCInitialization
    metadata:
      name: default-dsci
    spec:
      applicationsNamespace: redhat-ods-applications
      serviceMesh:
        controlPlane:
          metricsCollection: Istio
          name: data-science-smcp
          namespace: istio-system
        managementState: Removed
    
  4. Verify that the KServe controller is running:

    In the OpenShift web console, click Workloads > Pods and select the redhat-ods-applications project.
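
    Alternatively, check from the CLI (the pod name suffix is cluster-specific; look for the kserve-controller-manager pod):

    $ oc get pods -n redhat-ods-applications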

2. Create NIM Cache Using OpenShift#

Note

Refer to prerequisites for more information on using NIM Cache.

  1. Create a file, such as nimcache.yaml, with contents like the following example:

    # NIM Cache for OpenShift
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-1b-instruct
      namespace: nim-service
    spec:
      source:
        ngc:
          authSecret: ngc-api-secret
          model:
            engine: tensorrt_llm
            tensorParallelism: '1'
          modelPuller: 'nvcr.io/nim/meta/llama-3.2-1b-instruct:1.8.3'
          pullSecret: ngc-secret
      storage:
        pvc:
          create: true
          size: 50Gi
          volumeAccessMode: ReadWriteOnce
    
  2. Apply the manifest:

    $ oc apply -n nim-service -f nimcache.yaml
    
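As with the standard Kubernetes example, wait for the NIM cache to report Ready before deploying:

$ oc get nimcache -n nim-service meta-llama3-1b-instruct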

3. Deploy NIM through KServe as a Raw Deployment Using OpenShift#

  1. Create a file, such as nimservice.yaml, with contents like the following example:

    # NIM Service Raw Deployment Using OpenShift
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      annotations:
        serving.kserve.io/deploymentMode: RawDeployment
      expose:
        service:
          port: 8000
          type: ClusterIP
      scale:
        enabled: true
        hpa:
          maxReplicas: 3
          metrics:
            - resource:
                name: cpu
                target:
                  averageUtilization: 80
                  type: Utilization
              type: Resource
          minReplicas: 1
      inferencePlatform: kserve
      authSecret: ngc-api-secret
      image:
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
        repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
        tag: 1.8.3
      storage:
        nimCache:
          name: meta-llama3-1b-instruct
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      replicas: 1
    
  2. Apply the manifest for raw deployment:

    $ oc create -f nimservice.yaml -n nim-service
    
  3. Verify that the inference service has been created:

    1. View inference service details:

      $ oc get inferenceservice -n nim-service -o json | jq .items[0].status
      
      Example output
      {
        "address": {
          "url": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local"
        },
        "components": {
          "predictor": {}
        },
        "conditions": [
          {
            "lastTransitionTime": "2025-08-15T05:59:00Z",
            "status": "True",
            "type": "IngressReady"
          },
          {
            "lastTransitionTime": "2025-08-15T05:59:00Z",
            "status": "True",
            "type": "PredictorReady"
          },
          {
            "lastTransitionTime": "2025-08-15T05:59:00Z",
            "status": "True",
            "type": "Ready"
          },
          {
            "lastTransitionTime": "2025-08-15T05:58:59Z",
            "severity": "Info",
            "status": "False",
            "type": "Stopped"
          }
        ],
        "deploymentMode": "RawDeployment",
        "modelStatus": {
          "copies": {
            "failedCopies": 0,
            "totalCopies": 1
          },
          "states": {
            "activeModelState": "Loaded",
            "targetModelState": "Loaded"
          },
          "transitionStatus": "UpToDate"
        },
        "observedGeneration": 1,
        "url": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local"
      }
      
    2. View NIM Service status:

      $ oc get nimservice -n nim-service meta-llama-3-2-1b-instruct -o json | jq .status
      
      Example output
      {
        "conditions": [
          {
            "lastTransitionTime": "2025-08-15T05:59:50Z",
            "message": "",
            "reason": "Ready",
            "status": "True",
            "type": "Ready"
          },
          {
            "lastTransitionTime": "2025-08-15T05:58:59Z",
            "message": "",
            "reason": "Ready",
            "status": "False",
            "type": "Failed"
          }
        ],
        "model": {
          "clusterEndpoint": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local",
          "externalEndpoint": "http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local",
          "name": "meta/llama-3.2-1b-instruct"
        },
        "state": "Ready"
      }
      
  4. Run inference:

    $ oc -n nim-service run curltest --rm -i --image=curlimages/curl --restart=Never -- \
      curl -s http://meta-llama-3-2-1b-instruct-predictor.nim-service.svc.cluster.local/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"model":"meta/llama-3.2-1b-instruct","messages":[{"role":"user","content":"Hello!"}]}'
    
    Example output
    {"id":"chat-2d3c821536514598b931783033a7e7e7","object":"chat.completion","created":1755239534,"model":"meta/llama-3.2-1b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! How can I help you today?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":12,"total_tokens":21,"completion_tokens":9},"prompt_logprobs":null}
    

Serverless Deployment Example on Standard Kubernetes#

Summary#

For serverless deployment of NIM through KServe on a standard Kubernetes installation, follow these steps:

  1. Install KServe in Serverless Mode.

  2. Enable PVC support.

  3. Optional: Create NIM Cache for Serverless Deployment.

  4. Deploy NIM through KServe as a Serverless Deployment.

1. Install KServe in Serverless Mode#

Note

For security, consider downloading and reviewing the script before execution in production environments.

Run the following command to execute the KServe quick install script:

$ curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash

For more information, refer to Getting Started with KServe.

The following components are deployed by the KServe quick install script:

KServe

  • Installs KServe CRDs, such as InferenceService.

  • Installs KServe controller in the kserve namespace.

  • Provides model serving, autoscaling, and inference management on Kubernetes.

Gateway API CRDs

  • Installs the Gateway API CustomResourceDefinitions: GatewayClass, Gateway, GRPCRoute, HTTPRoute, ReferenceGrant.

  • Provides modern networking primitives for routing traffic into services.

Istio (Service Mesh)

Deployed into the istio-system namespace with Helm charts:

  • istio-base: core Istio CRDs and cluster-scoped resources.

  • istiod: Istio control plane (pilot, configuration, service discovery).

  • istio-ingressgateway: data plane ingress gateway for external traffic.

Cert-Manager

  • Installed in the cert-manager namespace.

  • Handles TLS certificate provisioning and management.

  • Required for automatic HTTPS and secure communication.

Knative

  • Installs the Knative Operator in the knative-serving namespace.

  • Deploys a KnativeServing custom resource.

  • Provides serverless deployment and scaling primitives used by KServe.

Note

To uninstall KServe, follow the instructions in Uninstalling KServe.

2. Enable PVC Support#

Run the following command to enable persistent volume support:

$ kubectl patch --namespace knative-serving configmap/config-features \
  --type merge \
  --patch '{"data":{"kubernetes.podspec-persistent-volume-claim": "enabled", "kubernetes.podspec-persistent-volume-write": "enabled"}}'

These extensions enable persistent volume support:

  • kubernetes.podspec-persistent-volume-claim: Enables persistent volumes (PVs) with Knative Serving

  • kubernetes.podspec-persistent-volume-write: Provides write access to those PVs
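
You can confirm that the patch took effect by reading a key back (the backslashes escape the dots inside the key name):

$ kubectl get configmap config-features -n knative-serving \
  -o jsonpath='{.data.kubernetes\.podspec-persistent-volume-claim}'

Example output
enabled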

3. Optional: Create NIM Cache for Serverless Deployment#

Note

Refer to prerequisites for more information on using NIM Cache.

  1. Create a file, such as nimcache.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      labels:
        app.kubernetes.io/name: k8s-nim-operator
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: "tensorrt"
            tensorParallelism: "1"
      storage:
        pvc:
          create: true
          size: "50Gi"
          volumeAccessMode: ReadWriteOnce
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f nimcache.yaml
    
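As in the raw deployment example, wait for the NIM cache to report Ready before creating the NIMService:

$ kubectl get nimcache -n nim-service meta-llama-3-2-1b-instruct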

4. Deploy NIM through KServe as a Serverless Deployment#

Note

In serverless mode, autoscaling and ingress are handled by KServe and Knative rather than through the NIMService spec.scale configuration; Knative autoscaling behavior is tuned through annotations, as the following manifest shows.

  1. Create a file, such as nimservice.yaml, with contents like the following sample manifest:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama-3-2-1b-instruct
      namespace: nim-service
    spec:
      inferencePlatform: kserve
      annotations:
        # Knative concurrency-based autoscaling (default).
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        # Target 10 requests in-flight per pod.
        autoscaling.knative.dev/target: "10"
        # Disable scale to zero with a min scale of 1.
        autoscaling.knative.dev/min-scale: "1"
        # Limit scaling to 10 pods.
        autoscaling.knative.dev/max-scale: "10"
      image:
        repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
        tag: "1.12.0"
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama-3-2-1b-instruct
          profile: ''
      resources:
        limits:
          nvidia.com/gpu: 1
          cpu: "12"
          memory: 32Gi
        requests:
          nvidia.com/gpu: 1
          cpu: "4"
          memory: 6Gi
      replicas: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest for serverless deployment:

    $ kubectl create -f nimservice.yaml -n nim-service
    
  3. Verify that the inference service has been created:

    $ kubectl get inferenceservice -n nim-service meta-llama-3-2-1b-instruct -o yaml
    
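In serverless mode, KServe backs the InferenceService with a Knative Service and Revisions. A quick way to inspect the Knative objects and the resulting pods:

$ kubectl get ksvc,revision -n nim-service
$ kubectl get pods -n nim-service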

Uninstalling KServe#

If you installed KServe using the quick install script (https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh), run the following commands to uninstall it:

helm uninstall --ignore-not-found istio-ingressgateway -n istio-system
helm uninstall --ignore-not-found istiod -n istio-system
helm uninstall --ignore-not-found istio-base -n istio-system
echo "πŸ˜€ Successfully uninstalled Istio"

helm uninstall --ignore-not-found cert-manager -n cert-manager
echo "πŸ˜€ Successfully uninstalled Cert Manager"

helm uninstall --ignore-not-found keda -n keda
echo "πŸ˜€ Successfully uninstalled KEDA"

kubectl delete --ignore-not-found=true KnativeServing knative-serving -n knative-serving --wait=True --timeout=300s || true
helm uninstall --ignore-not-found knative-operator -n knative-serving
echo "πŸ˜€ Successfully uninstalled Knative"

helm uninstall --ignore-not-found kserve -n kserve
helm uninstall --ignore-not-found kserve-crd -n kserve
echo "πŸ˜€ Successfully uninstalled KServe"

kubectl delete --ignore-not-found=true namespace istio-system
kubectl delete --ignore-not-found=true namespace cert-manager
kubectl delete --ignore-not-found=true namespace kserve