Managing NIM Services

About NIM Services

A NIM service is a Kubernetes custom resource, nimservices.apps.nvidia.com. You create and delete NIM service resources to manage NVIDIA NIM microservices.

Refer to the following sample manifest:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
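
If the installed custom resource definition publishes a structural schema (an assumption about your Operator version), you can also browse the full field reference directly from the cluster:

    $ kubectl explain nimservices.spec
    $ kubectl explain nimservices.spec.expose.service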

Refer to the following descriptions of the commonly modified fields:

spec.annotations
    Adds user-supplied annotations to the pod.
    Default: None

spec.authSecret (required)
    Specifies the name of a generic secret that contains NGC_API_KEY.
    Default: None

spec.env
    Specifies environment variables to set in the NIM microservice container.
    Default: None

spec.expose.ingress.enabled
    When set to true, the Operator creates a Kubernetes Ingress resource for the NIM microservice. Specify the ingress specification in the spec.expose.ingress.spec field.

    If you have an ingress controller, values like the following sample configure an ingress for the v1/chat/completions endpoint:

        ingress:
          enabled: true
          spec:
            ingressClassName: nginx
            rules:
              - host: demo.nvidia.example.com
                http:
                  paths:
                  - backend:
                      service:
                        name: meta-llama3-8b-instruct
                        port:
                          number: 8000
                    path: /v1/chat/completions
                    pathType: Prefix

    Default: false

spec.expose.service.port (required)
    Specifies the network port number for the NIM microservice. The most frequently used value is 8000.
    Default: None

spec.expose.service.type
    Specifies the Kubernetes service type to create for the NIM microservice.
    Default: ClusterIP

spec.groupID
    Specifies the group ID for the pods. This value sets the runAsGroup and fsGroup fields in the pod security context.
    Default: 2000

spec.image
    Specifies the repository, tag, pull policy, and pull secrets for the container image.
    Default: None

spec.labels
    Adds user-supplied labels to the pod.
    Default: None

spec.metrics.enabled
    When set to true, the Operator configures a Prometheus service monitor for the service. Specify the service monitor specification in the spec.metrics.serviceMonitor field.
    Default: false

spec.resources
    Specifies the resource requirements for the pods.
    Default: None

spec.replicas
    Specifies the desired number of pods for the NIM microservice deployment.
    Default: 1

spec.scale.enabled
    When set to true, the Operator creates a Kubernetes horizontal pod autoscaler for the NIM microservice. Specify the HPA specification in the spec.scale.hpa field. The spec.scale.hpa field supports the minReplicas, maxReplicas, metrics, and behavior subfields, which correspond to the same fields in a horizontal pod autoscaler resource specification.
    Default: false

spec.storage.nimCache
    Specifies the NIM cache that holds the cached model profiles for the NIM microservice. Specify a value for the name subfield and, optionally, the profile subfield. This field takes precedence over the spec.storage.pvc field.
    Default: None

spec.storage.pvc
    If you did not create a NIM cache resource to download and cache your model, you can specify this field to download model profiles. This field has the create, name, size, storageClass, volumeAccessMode, and subPath subfields. To have the Operator create a PVC for the model profiles, specify pvc.create: true. Refer to Example: Create a PVC Instead of Using a NIM Cache.
    Default: None

spec.storage.readOnly
    When set to true, the Operator mounts the PVC from either the pvc or nimCache specification as read-only.
    Default: false

spec.tolerations
    Specifies the tolerations for the pods.
    Default: None

spec.userID
    Specifies the user ID for the pod. This value sets the runAsUser field in the pod security context.
    Default: 1000
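
Several of these optional fields are commonly combined. The following fragment is a sketch rather than a complete manifest; the environment variable, label, and toleration values are hypothetical placeholders to adapt:

    spec:
      env:
        # Hypothetical variable; use the variables that your NIM image documents.
        - name: NIM_LOG_LEVEL
          value: "INFO"
      labels:
        # Example user-supplied label added to the pod.
        app.kubernetes.io/part-of: rag-pipeline
      tolerations:
        # Example toleration for a common GPU node taint.
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule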

Prerequisites

  • Optional: Added a NIM cache resource for the NIM microservice. If you created a NIM cache resource, specify the name in the spec.storage.nimCache.name field.

    If you prefer to have the service download a model to storage, refer to Example: Create a PVC Instead of Using a NIM Cache for a sample manifest.

Procedure

  1. Create a file, such as service-all.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      image:
        repository: nvcr.io/nim/meta/llama3-8b-instruct
        tag: 1.0.3
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama3-8b-instruct
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-embedqa-e5-v5
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-embedqa-e5-v5
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-rerankqa-mistral-4b-v3
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-rerankqa-mistral-4b-v3
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f service-all.yaml
    
  3. Optional: View information about the NIM services:

    $ kubectl describe nimservices.apps.nvidia.com -n nim-service
    

    Partial Output

    ...
    Conditions:
     Last Transition Time:  2024-08-12T19:09:43Z
     Message:               Deployment is ready
     Reason:                Ready
     Status:                True
     Type:                  Ready
     Last Transition Time:  2024-08-12T19:09:43Z
     Message:
     Reason:                Ready
     Status:                False
     Type:                  Failed
    State:                  Ready
    
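    In scripts, you can instead block until the custom resource reports the Ready condition shown in the output:

    $ kubectl wait --for=condition=Ready -n nim-service \
        nimservices.apps.nvidia.com/meta-llama3-8b-instruct --timeout=10m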

Verification

  1. Start a pod that has access to the curl command. You can substitute any image that provides the command and meets your organization’s security requirements:

    $ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash
    

    After the pod starts, you are connected to the ash shell in the pod.

  2. Connect to the chat completions endpoint on the NIM for LLMs container:

    curl -X "POST" \
      'http://meta-llama3-8b-instruct.nim-service:8000/v1/chat/completions' \
      -H 'Accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "meta/llama3-8b-instruct",
            "messages": [
            {
              "content":"What should I do for a 4 day vacation at Cape Hatteras National Seashore?",
              "role": "user"
            }],
            "top_p": 1,
            "n": 1,
            "max_tokens": 1024,
            "stream": false,
            "frequency_penalty": 0.0,
            "stop": ["STOP"]
          }'

    The command connects to the service in the nim-service namespace, meta-llama3-8b-instruct.nim-service. The command specifies the model to use, meta/llama3-8b-instruct. Replace these values if you use a different service name, namespace, or model.

  3. Press Ctrl+D to exit and delete the pod.
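
You can verify the embedding and reranking NIM microservices from a pod in the same way. The following sketch assumes that the embedding service exposes an OpenAI-compatible v1/embeddings endpoint with the NIM-specific input_type parameter; refer to the documentation for your NIM microservice for the exact request body:

    curl -X "POST" \
      'http://nv-embedqa-e5-v5.nim-service:8000/v1/embeddings' \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "nvidia/nv-embedqa-e5-v5",
            "input": ["Sample query text"],
            "input_type": "query"
          }'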

Configuring Horizontal Pod Autoscaling

Prerequisites

  • Prometheus is installed. NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts with a command like the following example:

    $ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    $ helm repo update
    
    $ helm install prometheus prometheus-community/prometheus --namespace prometheus --create-namespace
    

    If you do not have a default storage class, add the following command-line arguments:

    --set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
    
  • Prometheus Adapter is installed. NVIDIA development and testing used the same Prometheus Community Kubernetes Helm Charts with a command like the following example:

    $ helm install prometheus-adapter prometheus-community/prometheus-adapter \
        --namespace prometheus-adapter \
        --create-namespace \
        --set prometheus.url=http://prometheus-server.prometheus.svc.cluster.local \
        --set prometheus.port="80"
    
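Before you configure autoscaling, you can confirm that the adapter serves the custom metrics API:

    $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .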

Autoscaling NIM for LLMs

NVIDIA NIM for LLMs provides several service metrics. Refer to Observability in the NVIDIA NIM for LLMs documentation for information about the metrics.

  1. Annotate the service resource related to NIM for LLMs:

    $ kubectl annotate -n nim-service svc meta-llama3-8b-instruct prometheus.io/scrape=true
    

    Prometheus might require several minutes to begin collecting metrics from the service.

  2. Optional: Confirm Prometheus collects the metrics.

    • If you have access to the Prometheus dashboard, search for a service metric such as gpu_cache_usage_perc.

    • You can query Prometheus Adapter:

      $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/services/*/gpu_cache_usage_perc" | jq .
      

      Example Output

      {
        "kind": "MetricValueList",
        "apiVersion": "custom.metrics.k8s.io/v1beta1",
        "metadata": {},
        "items": [
          {
            "describedObject": {
              "kind": "Service",
              "namespace": "nim-service",
              "name": "meta-llama3-8b-instruct",
              "apiVersion": "/v1"
            },
            "metricName": "gpu_cache_usage_perc",
            "timestamp": "2024-09-12T15:14:20Z",
            "value": "0",
            "selector": null
          }
        ]
      }
      
  3. Create a file, such as service-hpa.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      image:
        repository: nvcr.io/nim/meta/llama3-8b-instruct
        tag: 1.0.3
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama3-8b-instruct
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Object
            object:
              metric:
                name: gpu_cache_usage_perc
              describedObject:
                apiVersion: v1
                kind: Service
                name: meta-llama3-8b-instruct
              target:
                type: Value
                value: "0.5"
    
  4. Apply the manifest:

    $ kubectl apply -n nim-service -f service-hpa.yaml
    
  5. Optional: Confirm the horizontal pod autoscaler resource is created:

    $ kubectl get hpa -n nim-service
    

    Example Output

    NAME                      REFERENCE                            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    meta-llama3-8b-instruct   Deployment/meta-llama3-8b-instruct   0/500m    1         2         1          40s
    
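If the autoscaler does not behave as expected, its events record the observed metric values and scaling decisions:

    $ kubectl describe hpa -n nim-service meta-llama3-8b-instruct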

Autoscaling Embedding and Reranking Services

The embedding and reranking services do not expose service metrics. To scale these services, you can monitor the pod-level metrics provided by NVIDIA DCGM Exporter. Refer to the metrics-config.yaml in the GitHub repository for the default metric names.

  1. Optional: Confirm Prometheus collects the metrics.

    • If you have access to the Prometheus dashboard, search for a metric such as DCGM_FI_DEV_FB_USED.

    • You can query Prometheus Adapter:

      $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/pods/*/DCGM_FI_DEV_FB_USED" | jq .
      

      Partial Output

       {
         "describedObject": {
           "kind": "Pod",
           "namespace": "nim-service",
           "name": "nv-embedqa-e5-v5-78cb9874c4-ghpmc",
           "apiVersion": "/v1"
         },
         "metricName": "DCGM_FI_DEV_FB_USED",
         "timestamp": "2024-09-12T16:16:53Z",
         "value": "0",
         "selector": null
       }
      
  2. Create a file, such as service-hpa-dcgm.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-embedqa-e5-v5
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-embedqa-e5-v5
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Pods
            pods:
              metric:
                name: DCGM_FI_DEV_FB_USED
              target:
                type: AverageValue
                averageValue: "1000"
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-rerankqa-mistral-4b-v3
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-rerankqa-mistral-4b-v3
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Pods
            pods:
              metric:
                name: DCGM_FI_DEV_FB_USED
              target:
                type: AverageValue
                averageValue: "1000"
    
  3. Apply the manifest:

    $ kubectl apply -n nim-service -f service-hpa-dcgm.yaml
    
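DCGM_FI_DEV_FB_USED reports used framebuffer memory in MiB, so averageValue: "1000" targets roughly 1000 MiB of GPU memory per pod; adjust the threshold for your models and GPUs. You can watch the autoscalers respond as load changes:

    $ kubectl get hpa -n nim-service -w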

Sample Manifests

Example: Create a PVC Instead of Using a NIM Cache

As an alternative to creating NIM cache resources to download and cache NIM model profiles, you can have the Operator create a PVC so that the NIM service downloads and runs a NIM model profile itself.

Create and apply a manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: 10Gi
      volumeAccessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
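
With create: true, the NIM microservice downloads the model profiles into the new PVC on first start, so the initial startup takes longer than starting from a prewarmed NIM cache. You can confirm that the claim is created and bound:

    $ kubectl get pvc -n nim-service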

Example: Air-Gapped Environment

For air-gapped environments, you must download the model profiles for the NIM microservices from a host that has internet access. You must manually create a PVC and then transfer the model profile files into the PVC.

Typically, the Operator determines the PVC name by dereferencing it from the NIM cache resource. When there is no NIM cache resource, such as in an air-gapped environment, you must specify the PVC name.
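
The following is a minimal sketch of the manually created PVC. The claim name, storage class, access mode, and size are hypothetical placeholders to adapt:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # Hypothetical name; reference it in the spec.storage.pvc.name field.
  name: meta-llama3-8b-instruct-cache
  namespace: nim-service
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: <storage-class-name>
  resources:
    requests:
      storage: 50Gi

After the claim is bound, mount it in a temporary pod and copy the transferred model profile files into it, for example with kubectl cp.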

Then create and apply a NIM service manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      name: <existing-pvc-name>
      readOnly: true
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Deleting a NIM Service

To delete a NIM service, perform the following steps.

  1. View the NIM services custom resources:

    $ kubectl get nimservices.apps.nvidia.com -A
    

    Example Output

    NAMESPACE     NAME                        STATUS   AGE
    nim-service   meta-llama3-8b-instruct     Ready    2024-08-12T17:16:05Z
    
  2. Delete the custom resource:

    $ kubectl delete nimservice -n nim-service meta-llama3-8b-instruct
    

    If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service \
       -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
    

    Example Output

    meta-llama3-8b-instruct: true
    
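    If you created the PVC manually, deleting the NIM service leaves the claim in place. Remove it explicitly when you no longer need the cached model profiles:

    $ kubectl delete pvc -n nim-service <pvc-name>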

Next Steps