Managing NIM Services

About NIM Services

A NIM service is a Kubernetes custom resource, nimservices.apps.nvidia.com. You create and delete NIM service resources to manage NVIDIA NIM microservices.

Refer to the following sample manifest:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama3-8b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
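
If the installed custom resource definition publishes a structural schema (an assumption about your Operator version), you can also browse the full field reference directly from the cluster:

    $ kubectl explain nimservices.spec
    $ kubectl explain nimservices.spec.expose.service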

Refer to the following descriptions of the commonly modified fields:

spec.annotations
    Adds user-supplied annotations to the pod.
    Default: None

spec.authSecret (required)
    Specifies the name of a generic secret that contains NGC_API_KEY.
    Default: None

spec.env
    Specifies environment variables to set in the NIM microservice container.
    Default: None

spec.expose.ingress.enabled
    When set to true, the Operator creates a Kubernetes Ingress resource for the NIM microservice. Specify the ingress specification in the spec.expose.ingress.spec field.

    If you have an ingress controller, values like the following sample configure an ingress for the v1/chat/completions endpoint:

        ingress:
          enabled: true
          spec:
            ingressClassName: nginx
            rules:
              - host: demo.nvidia.example.com
                http:
                  paths:
                  - backend:
                      service:
                        name: meta-llama3-8b-instruct
                        port:
                          number: 8000
                    path: /v1/chat/completions
                    pathType: Prefix

    Default: false

spec.expose.service.port (required)
    Specifies the network port number for the NIM microservice. The most frequently used value is 8000.
    Default: None

spec.expose.service.type
    Specifies the Kubernetes service type to create for the NIM microservice.
    Default: ClusterIP

spec.groupID
    Specifies the group ID for the pods. This value sets the runAsGroup and fsGroup fields in the pod security context.
    Default: 2000

spec.image
    Specifies the repository, tag, pull policy, and pull secrets for the container image.
    Default: None

spec.labels
    Adds user-supplied labels to the pod.
    Default: None

spec.metrics.enabled
    When set to true, the Operator configures a Prometheus service monitor for the service. Specify the service monitor specification in the spec.metrics.serviceMonitor field.
    Default: false

spec.resources
    Specifies the resource requirements for the pods.
    Default: None

spec.replicas
    Specifies the desired number of pods for the NIM microservice deployment.
    Default: 1

spec.scale.enabled
    When set to true, the Operator creates a Kubernetes horizontal pod autoscaler for the NIM microservice. Specify the HPA specification in the spec.scale.hpa field. The spec.scale.hpa field supports the minReplicas, maxReplicas, metrics, and behavior subfields, which correspond to the same fields in a horizontal pod autoscaler resource specification.
    Default: false

spec.storage.nimCache
    Specifies the NIM cache that holds the cached model profiles for the NIM microservice. Specify a value for the name subfield and, optionally, the profile subfield. This field takes precedence over the spec.storage.pvc field.
    Default: None

spec.storage.pvc
    If you did not create a NIM cache resource to download and cache your model, you can specify this field to download model profiles. This field has the create, name, size, storageClass, volumeAccessMode, and subPath subfields. To have the Operator create a PVC for the model profiles, specify pvc.create: true. Refer to Example: Create a PVC Instead of Using a NIM Cache.
    Default: None

spec.storage.readOnly
    When set to true, the Operator mounts the PVC from either the pvc or nimCache specification as read-only.
    Default: false

spec.tolerations
    Specifies the tolerations for the pods.
    Default: None

spec.userID
    Specifies the user ID for the pod. This value sets the runAsUser field in the pod security context.
    Default: 1000
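
Several of these optional fields are commonly combined. The following fragment is a sketch rather than a complete manifest; the environment variable, label, and toleration values are hypothetical placeholders to adapt:

    spec:
      env:
        # Hypothetical variable; use the variables that your NIM image documents.
        - name: NIM_LOG_LEVEL
          value: "INFO"
      labels:
        # Example user-supplied label added to the pod.
        app.kubernetes.io/part-of: rag-pipeline
      tolerations:
        # Example toleration for a common GPU node taint.
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule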

Prerequisites

  • Optional: Added a NIM cache resource for the NIM microservice. If you created a NIM cache resource, specify the name in the spec.storage.nimCache.name field.

    If you prefer to have the service download a model to storage, refer to Example: Create a PVC Instead of Using a NIM Cache for a sample manifest.

Procedure

  1. Create a file, such as service-all.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      image:
        repository: nvcr.io/nim/meta/llama3-8b-instruct
        tag: 1.0.3
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama3-8b-instruct
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-embedqa-e5-v5
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-embedqa-e5-v5
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-rerankqa-mistral-4b-v3
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-rerankqa-mistral-4b-v3
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
    
  2. Apply the manifest:

    $ kubectl apply -n nim-service -f service-all.yaml
    
  3. Optional: View information about the NIM services:

    $ kubectl describe nimservices.apps.nvidia.com -n nim-service
    

    Partial Output

    ...
    Conditions:
     Last Transition Time:  2024-08-12T19:09:43Z
     Message:               Deployment is ready
     Reason:                Ready
     Status:                True
     Type:                  Ready
     Last Transition Time:  2024-08-12T19:09:43Z
     Message:
     Reason:                Ready
     Status:                False
     Type:                  Failed
    State:                  Ready
    
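    In scripts, you can instead block until the custom resource reports the Ready condition shown in the output:

    $ kubectl wait --for=condition=Ready -n nim-service \
        nimservices.apps.nvidia.com/meta-llama3-8b-instruct --timeout=10m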

Verification

  1. Start a pod that has access to the curl command. You can substitute any image that provides the command and meets your organization’s security requirements:

    $ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash
    

    After the pod starts, you are connected to the ash shell in the pod.

  2. Connect to the chat completions endpoint on the NIM for LLMs container:

    curl -X "POST" \
      'http://meta-llama3-8b-instruct.nim-service:8000/v1/chat/completions' \
      -H 'Accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "meta/llama3-8b-instruct",
            "messages": [
            {
              "content":"What should I do for a 4 day vacation at Cape Hatteras National Seashore?",
              "role": "user"
            }],
            "top_p": 1,
            "n": 1,
            "max_tokens": 1024,
            "stream": false,
            "frequency_penalty": 0.0,
            "stop": ["STOP"]
          }'

    The command connects to the service in the nim-service namespace, meta-llama3-8b-instruct.nim-service. The command specifies the model to use, meta/llama3-8b-instruct. Replace these values if you use a different service name, namespace, or model.

  3. Press Ctrl+D to exit and delete the pod.
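
You can verify the embedding and reranking NIM microservices from a pod in the same way. The following sketch assumes that the embedding service exposes an OpenAI-compatible v1/embeddings endpoint with the NIM-specific input_type parameter; refer to the documentation for your NIM microservice for the exact request body:

    curl -X "POST" \
      'http://nv-embedqa-e5-v5.nim-service:8000/v1/embeddings' \
      -H 'Content-Type: application/json' \
      -d '{
            "model": "nvidia/nv-embedqa-e5-v5",
            "input": ["Sample query text"],
            "input_type": "query"
          }'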

Configuring Horizontal Pod Autoscaling

Prerequisites

  • Prometheus is installed. NVIDIA development and testing used the Prometheus Community Kubernetes Helm Charts with a command like the following example:

    $ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    $ helm repo update
    
    $ helm install prometheus prometheus-community/prometheus --namespace prometheus --create-namespace
    

    If you do not have a default storage class, add the following command-line arguments:

    --set server.persistentVolume.storageClass=<storage-class> --set alertmanager.persistence.storageClass=<storage-class>
    
  • Prometheus Adapter is installed. NVIDIA development and testing used the same Prometheus Community Kubernetes Helm Charts with a command like the following example:

    $ helm install prometheus-adapter prometheus-community/prometheus-adapter \
        --namespace prometheus-adapter \
        --create-namespace \
        --set prometheus.url=http://prometheus-server.prometheus.svc.cluster.local \
        --set prometheus.port="80"
    
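Before you configure autoscaling, you can confirm that the adapter serves the custom metrics API:

    $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .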

Autoscaling NIM for LLMs

NVIDIA NIM for LLMs provides several service metrics. Refer to Observability in the NVIDIA NIM for LLMs documentation for information about the metrics.

  1. Annotate the service resource related to NIM for LLMs:

    $ kubectl annotate -n nim-service svc meta-llama3-8b-instruct prometheus.io/scrape=true
    

    Prometheus might require several minutes to begin collecting metrics from the service.

  2. Optional: Confirm Prometheus collects the metrics.

    • If you have access to the Prometheus dashboard, search for a service metric such as gpu_cache_usage_perc.

    • You can query Prometheus Adapter:

      $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/services/*/gpu_cache_usage_perc" | jq .
      

      Example Output

      {
        "kind": "MetricValueList",
        "apiVersion": "custom.metrics.k8s.io/v1beta1",
        "metadata": {},
        "items": [
          {
            "describedObject": {
              "kind": "Service",
              "namespace": "nim-service",
              "name": "meta-llama3-8b-instruct",
              "apiVersion": "/v1"
            },
            "metricName": "gpu_cache_usage_perc",
            "timestamp": "2024-09-12T15:14:20Z",
            "value": "0",
            "selector": null
          }
        ]
      }
      
  3. Create a file, such as service-hpa.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      image:
        repository: nvcr.io/nim/meta/llama3-8b-instruct
        tag: 1.0.3
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: meta-llama3-8b-instruct
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Object
            object:
              metric:
                name: gpu_cache_usage_perc
              describedObject:
                apiVersion: v1
                kind: Service
                name: meta-llama3-8b-instruct
              target:
                type: Value
                value: "0.5"
    
  4. Apply the manifest:

    $ kubectl apply -n nim-service -f service-hpa.yaml
    
  5. Optional: Confirm the horizontal pod autoscaler resource is created:

    $ kubectl get hpa -n nim-service
    

    Example Output

    NAME                      REFERENCE                            TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
    meta-llama3-8b-instruct   Deployment/meta-llama3-8b-instruct   0/500m    1         2         1          40s
    
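If the autoscaler does not behave as expected, its events record the observed metric values and scaling decisions:

    $ kubectl describe hpa -n nim-service meta-llama3-8b-instruct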

Autoscaling Embedding and Reranking Services

The embedding and reranking services do not expose service metrics. To scale these services, you can monitor the pod-level metrics provided by NVIDIA DCGM Exporter. Refer to the metrics-config.yaml in the GitHub repository for the default metric names.

  1. Optional: Confirm Prometheus collects the metrics.

    • If you have access to the Prometheus dashboard, search for a metric such as DCGM_FI_DEV_FB_USED.

    • You can query Prometheus Adapter:

      $ kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/nim-service/pods/*/DCGM_FI_DEV_FB_USED" | jq .
      

      Partial Output

       {
         "describedObject": {
           "kind": "Pod",
           "namespace": "nim-service",
           "name": "nv-embedqa-e5-v5-78cb9874c4-ghpmc",
           "apiVersion": "/v1"
         },
         "metricName": "DCGM_FI_DEV_FB_USED",
         "timestamp": "2024-09-12T16:16:53Z",
         "value": "0",
         "selector": null
       }
      
  2. Create a file, such as service-hpa-dcgm.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-embedqa-e5-v5
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-embedqa-e5-v5
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Pods
            pods:
              metric:
                name: DCGM_FI_DEV_FB_USED
              target:
                type: AverageValue
                averageValue: "1000"
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMService
    metadata:
      name: nv-rerankqa-mistral-4b-v3
    spec:
      image:
        repository: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3
        tag: 1.0.4
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      authSecret: ngc-api-secret
      storage:
        nimCache:
          name: nv-rerankqa-mistral-4b-v3
          profile: ''
      replicas: 1
      resources:
        limits:
          nvidia.com/gpu: 1
      expose:
        service:
          type: ClusterIP
          port: 8000
      scale:
        enabled: true
        hpa:
          maxReplicas: 2
          minReplicas: 1
          metrics:
          - type: Pods
            pods:
              metric:
                name: DCGM_FI_DEV_FB_USED
              target:
                type: AverageValue
                averageValue: "1000"
    
  3. Apply the manifest:

    $ kubectl apply -n nim-service -f service-hpa-dcgm.yaml
    
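DCGM_FI_DEV_FB_USED reports used framebuffer memory in MiB, so averageValue: "1000" targets roughly 1000 MiB of GPU memory per pod; adjust the threshold for your models and GPUs. You can watch the autoscalers respond as load changes:

    $ kubectl get hpa -n nim-service -w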

Sample Manifests

Example: Create a PVC Instead of Using a NIM Cache

As an alternative to creating NIM cache resources to download and cache NIM model profiles, you can have the Operator create a PVC so that the NIM service downloads and runs a NIM model profile itself.

Create and apply a manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: 10Gi
      volumeAccessMode: ReadWriteMany
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
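
With create: true, the NIM microservice downloads the model profiles into the new PVC on first start, so the initial startup takes longer than starting from a prewarmed NIM cache. You can confirm that the claim is created and bound:

    $ kubectl get pvc -n nim-service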

Example: Air-Gapped Environment

For air-gapped environments, you must download the model profiles for the NIM microservices from a host that has internet access. You must manually create a PVC and then transfer the model profile files into the PVC.

Typically, the Operator determines the PVC name by dereferencing it from the NIM cache resource. When there is no NIM cache resource, such as in an air-gapped environment, you must specify the PVC name.
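
The following is a minimal sketch of the manually created PVC. The claim name, storage class, access mode, and size are hypothetical placeholders to adapt:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # Hypothetical name; reference it in the spec.storage.pvc.name field.
  name: meta-llama3-8b-instruct-cache
  namespace: nim-service
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: <storage-class-name>
  resources:
    requests:
      storage: 50Gi

After the claim is bound, mount it in a temporary pod and copy the transferred model profile files into it, for example with kubectl cp.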

Then create and apply a NIM service manifest like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  image:
    repository: nvcr.io/nim/meta/llama3-8b-instruct
    tag: 1.0.3
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  replicas: 1
  storage:
    pvc:
      name: <existing-pvc-name>
      readOnly: true
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000

Deleting a NIM Service

To delete a NIM service, perform the following steps.

  1. View the NIM services custom resources:

    $ kubectl get nimservices.apps.nvidia.com -A
    

    Example Output

    NAMESPACE     NAME                        STATUS   AGE
    nim-service   meta-llama3-8b-instruct     Ready    2024-08-12T17:16:05Z
    
  2. Delete the custom resource:

    $ kubectl delete nimservice -n nim-service meta-llama3-8b-instruct
    

    If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service \
       -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
    

    Example Output

    meta-llama3-8b-instruct: true
    
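    If you created the PVC manually, deleting the NIM service leaves the claim in place. Remove it explicitly when you no longer need the cached model profiles:

    $ kubectl delete pvc -n nim-service <pvc-name>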

Next Steps