Caching Models

About Model Profiles and Caching

The models for NVIDIA NIM microservices use model engines that are tuned for a specific NVIDIA GPU model, GPU count, precision, and so on. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.

The NIM microservices support automatic profile selection by determining the GPU model and count on the node and attempting to match the optimal model profile. Alternatively, NIM microservices support running a specified model profile, but this requires that you review the profiles and know the profile ID. For more information, refer to Model Profiles in the NVIDIA NIM for LLMs documentation.

The Operator complements model selection by the NIM microservices in the following ways:

  • Support for caching and running a specified model profile ID, like the NIM microservices.

  • Support for caching all model profiles.

  • Support for influencing model selection by specifying the engine, precision, tensor parallelism, and so on, to match. The NIM cache custom resource provides the fields for specifying the model selection criteria.

NVIDIA recommends caching models for NVIDIA NIM microservices. Caching a model reduces the microservice startup time. For deployments that scale to more than one NIM microservice pod, a single cached model on a persistent volume that is backed by a network storage class can serve multiple pods.

For single-node clusters, the Local Path Provisioner from Rancher Labs is sufficient for research and development. For production, NVIDIA recommends installing a provisioner that provides a network storage class.
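
For example, a minimal sketch of installing the Local Path Provisioner for research and development is to apply the manifest from the project repository; the release tag below is an assumption, so check the project for the current version:

$ kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml

$ kubectl get storageclass local-path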

About the NIM Cache Custom Resource Definition

A NIM cache is a Kubernetes custom resource, nimcaches.apps.nvidia.com. You create and delete NIM cache resources to manage model caching.

When you create a NIM cache resource, the NIM Operator starts a pod that lists the available model profiles. The Operator creates a config map of the model profiles.

If you specify one or more model profile IDs to cache, the Operator starts a job that caches the model profiles that you specified.

If you do not specify model profile IDs, but do specify engine: tensorrt_llm or engine: tensorrt, the Operator attempts to match the model profiles with the GPUs on the nodes in the cluster. The Operator uses the value of the nvidia.com/gpu.product node label that is set by Node Feature Discovery.

You can let the Operator automatically detect the model profiles to cache, or you can constrain the selection by specifying values for spec.source.ngc.model, such as the engine, GPU model, and so on, that must match the model profile.

If you delete a NIM cache resource that was created with spec.storage.pvc.create: true, the NIM Operator deletes the persistent volume (PV) and persistent volume claim (PVC).

Refer to the following sample manifest:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: <storage-class-name>
      size: "50Gi"
      volumeAccessMode: ReadWriteMany
  resources: {}

Refer to the following descriptions of the commonly modified fields. The default value for each field is noted after its description.

spec.certConfig

Specifies custom CA certificates that might be required in environments with an HTTP proxy. Refer to Supporting Custom CA Certificates for more information.

Default: None

spec.gpuSelectors

Specifies node selector labels to schedule the caching job.

Default: None

spec.source.ngc.model.engine

Specifies a model caching constraint based on the engine. Common values are as follows:

  • tensorrt_llm – optimized engine for NIM for LLMs

  • vllm – community engine for NIM for LLMs

  • tensorrt – optimized engine for embedding and reranking

  • onnx – community engine

Each NIM microservice determines the supported engines. Refer to the microservice documentation for the latest information.

By default, the caching job matches model profiles for all engines.

Default: None

spec.source.ngc.model.gpus

Specifies a list of GPU constraints for model caching; a model profile must match the specified GPU model, by PCI ID or product name.

By default, the caching job detects all GPU models in the cluster and matches model profiles for all GPUs.

The following partial specification requests a model profile that is compatible with an NVIDIA L40S.

spec:
  source:
    ngc:
      model:
        ...
        gpus:
        - ids:
          - "26b5"

If GPU Operator or Node Feature Discovery is running on your cluster, you can determine the PCI IDs for the GPU models on your nodes by viewing the node labels that begin with feature.node.kubernetes.io/pci-10de-. Alternatively, if you know the device name, you can look up the PCI ID from the table in the NVIDIA Open Kernel Modules repository on GitHub.
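
For example, a command like the following lists those labels for a node; the node name is a placeholder:

$ kubectl get node <node-name> -o json | \
    jq '.metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io/pci-10de")))'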

The following partial specification requests a model profile that is compatible with an NVIDIA A100.

spec:
  source:
    ngc:
      model:
        ...
        gpus:
        - product: "a100"

The product name, such as h100, a100, or l40s, must match the model profile, as shown for the list-model-profiles command in the NVIDIA NIM for LLMs documentation. The value is not case-sensitive.

Default: None

spec.source.ngc.model.lora

When set to true, specifies to cache a model profile that is compatible with LoRA.

Refer to Using LoRA Models and Adapters for more information.

Default: false

spec.source.ngc.model.precision

Specifies the model profile quantization to match. Common values are fp16, bf16, and fp8. Like the GPU product name field, the value must match the model profile, such as shown in the list-model-profiles command.

Default: None

spec.source.ngc.model.profiles

Specifies an array of model profiles to cache.

When you specify this field, automatic profile selection is disabled and all other source.ngc.model fields are ignored.

The following partial specification requests a specific model profile.

spec:
  source:
    ngc:
      model:
        modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
        profiles:
        - 8835c31...

You can determine the model profiles by running the list-model-profiles command.

You can specify all to download all model profiles.

Default: None

spec.source.ngc.model.qosProfile

Specifies the model profile quality of service to match. Values are latency or throughput.

Default: None

spec.source.ngc.model.tensorParallelism

Specifies the model profile tensor parallelism to match. The node that runs the NIM microservice and serves the model must have at least the specified number of GPUs. Common values are "1", "2", and "4".

Default: None

spec.source.ngc.modelPuller

Specifies the container image that can cache model profiles.

Default: None

spec.storage.pvc.create

When set to true, the Operator creates the PVC. If you delete a NIM cache resource and this field was set to true, the Operator deletes the PVC and the cached models.

Default: false

spec.storage.pvc.name

Specifies the PVC name.

This field is required if you specify create: false.

Default: The NIM cache resource name with a -pvc suffix.

spec.storage.pvc.size

Specifies the size, in Gi, for the PVC to create.

This field is required if you specify create: true.

Default: None

spec.storage.pvc.storageClass

Specifies the storage class for the PVC to create.

Default: None

spec.storage.pvc.subPath

Specifies a subpath to create on the PVC. The model profiles are cached in that directory.

Default: None

spec.storage.pvc.volumeAccessMode

Specifies the access mode for the PVC to create.

Default: None

spec.tolerations

Specifies the tolerations for the caching job.

Default: None

Prerequisites

  • Installed the NVIDIA NIM Operator.

  • A persistent volume provisioner that uses network storage such as NFS, S3, vSAN, and so on. The models are downloaded and stored in persistent storage.

    You can create a PVC and specify its name when you create the NIM cache resource, or you can request that the Operator create the PVC. A sample storage specification that references a pre-created PVC follows this list.

  • The sample manifests show commonly used container images for the spec.source.ngc.modelPuller field. To cache other models, determine the container image name and tag, such as by browsing the NVIDIA NGC catalog.
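
The following partial NIM cache specification is a sketch of referencing a pre-created PVC instead of having the Operator create one; the claim name is a placeholder:

spec:
  storage:
    pvc:
      create: false
      name: <existing-pvc-name>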

Procedure

  1. Create the namespace:

    $ kubectl create namespace nim-service
    
  2. Add secrets that use your NGC API key.

    • Add a Docker registry secret for downloading the NIM container image from NVIDIA NGC:

      $ kubectl create secret -n nim-service docker-registry ngc-secret \
          --docker-server=nvcr.io \
          --docker-username='$oauthtoken' \
          --docker-password=<ngc-api-key>
      
    • Add a generic secret that the model puller init container uses to download the model from NVIDIA NGC:

      $ kubectl create secret -n nim-service generic ngc-api-secret \
          --from-literal=NGC_API_KEY=<ngc-api-key>
      
  3. Create a file, such as cache-all.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            engine: tensorrt_llm
            tensorParallelism: "1"
      storage:
        pvc:
          create: true
          storageClass: <storage-class-name>
          size: "50Gi"
          volumeAccessMode: ReadWriteMany
      resources: {}
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: nv-embedqa-e5-v5
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.0.4
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            profiles:
            - all
      storage:
        pvc:
          create: true
          storageClass: <storage-class-name>
          size: "50Gi"
          volumeAccessMode: ReadWriteMany
      resources: {}
    ---
    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: nv-rerankqa-mistral-4b-v3
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:1.0.4
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            profiles:
            - all
      storage:
        pvc:
          create: true
          storageClass: <storage-class-name>
          size: "50Gi"
          volumeAccessMode: ReadWriteMany
      resources: {}
    
  4. Apply the manifest:

    $ kubectl apply -n nim-service -f cache-all.yaml
    
  5. Optional: View information about the caching progress.

    • Confirm a persistent volume and claim are created:

      $ kubectl get -n nim-service pvc,pv
      

      Example Output

      NAME                                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
      persistentvolumeclaim/meta-llama3-8b-instruct-pvc   Bound    pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi       RWO            nfs-client     <unset>                 10m
      
      NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                     STORAGECLASS   VOLUMEATTRIBUTESCLASS   REASON   AGE
      persistentvolume/pvc-1d3f8d48-660f-4e44-9796-c80a4ed308ce   50Gi       RWO            Delete           Bound    nim-service/meta-llama3-8b-instruct-pvc   nfs-client     <unset>                          10m
      
    • View the status of the NIM cache resource:

      $ kubectl get nimcaches.apps.nvidia.com -n nim-service
      

      Example Output

      NAME                        STATUS   PVC                             AGE
      meta-llama3-8b-instruct     Ready    meta-llama3-8b-instruct-pvc     2024-09-19T13:20:53Z
      nv-embedqa-e5-v5            Ready    nv-embedqa-e5-v5-pvc            2024-09-18T21:11:37Z
      nv-rerankqa-mistral-4b-v3   Ready    nv-rerankqa-mistral-4b-v3-pvc   2024-09-18T21:11:37Z
      
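
Instead of polling, you can block until caching completes by waiting on the NIM_CACHE_JOB_COMPLETED condition that is shown in the Troubleshooting section; the timeout is an example value:

$ kubectl wait -n nim-service nimcache/meta-llama3-8b-instruct \
    --for=condition=NIM_CACHE_JOB_COMPLETED --timeout=30m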

Displaying the NIM Cache Status

  • Display the NIM cache status:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service
    

    Example Output

    NAME                        STATUS    PVC                             AGE
    meta-llama3-8b-instruct     Ready     meta-llama3-8b-instruct         2024-08-09T20:54:28Z
    nv-embedqa-e5-v5            Ready     nv-embedqa-e5-v5-pvc            2024-08-09T20:54:28Z
    nv-rerankqa-mistral-4b-v3   Ready     nv-rerankqa-mistral-4b-v3-pvc   2024-08-09T20:54:28Z
    

The NIM cache object can report the following statuses:

  • Failed – The job failed to download and cache the model profile.

  • InProgress – The job is downloading the model profile from NGC.

  • NotReady – The job is not ready. This status can be reported shortly after creating the NIM cache resource, while the image for the pod is downloaded from NGC. For more information, run kubectl describe nimcache -n nim-service <nim-cache-name>.

  • Pending – The job is created, but has not yet started and become active.

  • PVC-Created – The Operator creates a PVC for the model profile cache if you set spec.storage.pvc.create: true.

  • Ready – The job downloaded and cached the model profile.

  • Started – The Operator creates a job to download the model profile from NGC.

Displaying Cached Model Profiles

  • View the .status.profiles field of the custom resource:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service \
        meta-llama3-8b-instruct -o=jsonpath="{.status.profiles}" | jq .
    

    Example Output

    [
      {
        "config": {
          "feat_lora": "false",
          "llm_engine": "vllm",
          "precision": "fp16",
          "tp": "2"
        },
        "model": "meta/llama3-8b-instruct",
        "name": "19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f",
        "release": "1.0.3"
      }
    ]
    
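
To list only the profile IDs, for example to reuse them in the spec.source.ngc.model.profiles field, you can use a jsonpath expression:

$ kubectl get nimcaches.apps.nvidia.com -n nim-service \
    meta-llama3-8b-instruct -o=jsonpath="{.status.profiles[*].name}"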

Caching Models in Air-Gapped Environments

You can run a NIM microservice container on a host with network access, cache the model, build a container image that includes the model cache, and then run a job in the air-gapped cluster to copy the model cache to a PVC.

For more information about the steps in the following procedure that run the NIM container and download model profiles, refer to Serving Models from Local Assets in the NVIDIA NIM for LLMs documentation.

The following sections show one way to cache models and add them to a PVC.

Supporting Custom CA Certificates

If your cluster has an HTTP proxy that requires custom certificates, you can add them in a config map and mount them into the NIM cache job. You can use self-signed or custom CA certificates.

  1. Create a config map with the certificates:

    $ kubectl create configmap -n nim-service ca-certs --from-file=<path-to-cert-file-1> --from-file=<path-to-cert-file-2>
    
  2. When you create the NIM cache resource, specify the name of the config map and the path where the certificates are mounted in the container:

    spec:
      certConfig:
        name: ca-certs
        mountPath: /usr/local/share/ca-certificates/
    

Downloading the Models

  1. Export your NGC API key as an environment variable:

    $ export NGC_API_KEY=M2...
    
  2. Run the NIM container image locally, list the model profiles, and download the model profile.

    • Start the container:

      $ mkdir cache
      
      $ docker run --rm -it \
          -v ./cache:/opt/nim/.cache \
          -u $(id -u):$(id -g) \
          -e NGC_API_KEY \
          nvcr.io/nim/meta/llama3-8b-instruct:1.0.3 \
          bash
      

      Replace the container image and tag with the NIM microservice that you want to cache models for.

    • List the model profiles:

      $ list-model-profiles
      

      Partial Output

      ...
      MODEL PROFILES
      - Compatible with system and runnable:
       - 19031a45cf096b683c4d66fff2a072c0e164a24f19728a58771ebfc4c9ade44f (vllm-fp16-tp2)
      ...
      
    • Download and cache the model profiles:

      $ download-to-cache --profile 1903...
      
    • Exit the container:

      $ exit
      
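
Optionally, before building the container image in the next section, confirm that the profile files exist under the local cache directory; the exact layout depends on the NIM microservice version:

$ find cache -maxdepth 3 -type d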

Copying the Models to a Container

  1. Make a Dockerfile that copies the model profiles into the container:

    FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212
    
    COPY cache /cache
    
  2. Build and push the container with the model profiles to a container registry that the air-gapped cluster has access to:

    $ docker build -t <private-registry-name>/<model-name>:<tag> .
    
    $ docker push <private-registry-name>/<model-name>:<tag>
    

Copying the Models From the Container to the PVC

  1. Optional: Create a PVC if you do not already have one:

    • Create a manifest file, such as model-cache-pvc.yaml, with contents like the following example:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: model-cache-pvc
        namespace: nim-service
      spec:
        # An access mode is required; ReadWriteMany suits shared network storage.
        accessModes:
        - ReadWriteMany
        storageClassName: <storage-class>
        resources:
          requests:
            storage: 10Gi
      
    • Apply the manifest:

    $ kubectl apply -f model-cache-pvc.yaml
    
  2. Create a job that copies the model profiles to the PVC.

    • Create a manifest file, such as copy-cache-job.yaml, with contents like the following example:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: copy-cache
        namespace: nim-service
      spec:
        ttlSecondsAfterFinished: 3600
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: copy-cache
              command:
              - "/bin/sh"
              args:
              - "-c"
              - "cd /cache && cp -R . /model-store && find /model-store"
              image: <private-registry-name>/<model-name>:<tag>
              imagePullPolicy: IfNotPresent
              securityContext:
                allowPrivilegeEscalation: false
                runAsGroup: 2000
                runAsNonRoot: true
                runAsUser: 1000
              volumeMounts:
              - mountPath: /model-store
                name: nim-cache-volume
                readOnly: false
            imagePullSecrets:
            - name: my-secret
            securityContext:
              fsGroup: 2000
              runAsNonRoot: true
              runAsUser: 1000
            volumes:
            - name: nim-cache-volume
              persistentVolumeClaim:
                claimName: model-cache-pvc
        backoffLimit: 4
      
    • Apply the manifest:

      $ kubectl apply -f copy-cache-job.yaml
      
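
You can confirm that the copy completed by checking the job status and reviewing its log, which includes the output of the find /model-store command:

$ kubectl get job -n nim-service copy-cache

$ kubectl logs -n nim-service job/copy-cache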

Using LoRA Models and Adapters

NVIDIA NIM for LLMs supports LoRA parameter-efficient fine-tuning (PEFT) adapters trained by the NeMo Framework and Hugging Face Transformers libraries.

You must download the LoRA adapters manually and make them available to the NIM microservice. The following steps describe one way to meet the requirement.

  1. Specify lora: true in the NIM cache manifest that you apply:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            tensorParallelism: "1"
            lora: true
            gpus:
            - product: "a100"
      storage:
        pvc:
          create: true
          size: "50Gi"
          storageClass: <storage-class>
          volumeAccessMode: ReadWriteMany
      resources: {}
    

    Apply the manifest and wait until kubectl get nimcache -n nim-service meta-llama3-8b-instruct shows the NIM cache is Ready.

  2. Use the NGC CLI or Hugging Face CLI to download the LoRA adapters.

    • Build a container that includes the CLIs, such as the following example:

      FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212
      ARG NGC_CLI_VERSION=3.50.0
      
      RUN apt-get update && DEBIAN_FRONTEND=noninteractive && \
            apt-get install --no-install-recommends -y \
            wget \
            unzip \
            python3-pip
      
      RUN useradd -m -s /bin/bash -u 1000 ubuntu
      USER ubuntu
      
      RUN wget --content-disposition --no-check-certificate \
            https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip -O /tmp/ngccli_linux.zip && \
          unzip /tmp/ngccli_linux.zip -d ~ && \
          rm /tmp/ngccli_linux.zip
      
      ENV PATH=/home/ubuntu/ngc-cli:$PATH
      
      RUN pip install -U "huggingface_hub[cli]"
      

      Push the container to a registry that the nodes in your cluster can access.
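
      For example, commands like the following build and push the image; the registry and image names are placeholders that must match the pod manifest in the next step:

      $ docker build -t <private-registry>/<image-name>:<image-tag> .

      $ docker push <private-registry>/<image-name>:<image-tag>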

    • Apply a manifest like the following example that runs the container.

      When you create your manifest, note the following key considerations for the pod specification:

      • Mount the same PVC that the NIM microservice accesses for the model.

      • Specify the same user ID and group ID that the NIM cache container used. The following manifest shows the default values.

      • Specify the NGC_CLI_API_KEY and NGC_CLI_ORG environment variables. The value for the organization might be different.

      • Start the pod in the nim-service namespace so that the pod can access the ngc-api-secret secret.

      apiVersion: v1
      kind: Pod
      metadata:
        name: ngc-cli
      spec:
        containers:
        - env:
          - name: NGC_CLI_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-secret
          - name: NGC_CLI_ORG
            value: "nemo-microservices/ea-participants"
          - name: NIM_PEFT_SOURCE
            value: "/model-store/loras"
          image: <private-registry>/<image-name>:<image-tag>
          command: ["sleep"]
          args: ["inf"]
          name: ngc-cli
          securityContext:
            capabilities:
              drop:
              - ALL
            runAsNonRoot: true
          volumeMounts:
          - mountPath: /model-store
            name: model-store
        restartPolicy: Never
        securityContext:
          fsGroup: 2000
          runAsGroup: 2000
          runAsUser: 1000
          seLinuxOptions:
            level: s0:c28,c2
        volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: meta-llama3-8b-instruct-pvc
      
  3. Access the pod:

    $ kubectl exec -it -n nim-service ngc-cli -- bash
    

    The pod might report groups: cannot find name for group ID 2000. You can ignore the message.

  4. From the terminal in the pod, download the LoRA adapters.

    • Make a directory for the LoRA adapters:

      $ mkdir $NIM_PEFT_SOURCE
      $ cd $NIM_PEFT_SOURCE
      
    • Download the adapters:

      $ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
      $ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
      
    • Rename the directories to match the naming convention for the LoRA model directory structure:

      $ mv llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-math
      $ mv llama3-8b-instruct-lora_vnemo-squad-v1 llama3-8b-squad
      
    • Press Ctrl+D to exit the pod and then run kubectl delete pod -n nim-service ngc-cli.

When you create a NIM service instance, specify the NIM_PEFT_SOURCE environment variable:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  ...
  env:
  - name: NIM_PEFT_SOURCE
    value: "/model-store/loras"

After the NIM microservice is running, monitor the logs for records like the following:

{"level": "INFO", ..., "message": "LoRA models synchronizer successfully initialized!"}
{"level": "INFO", ..., "message": "Synchronizing LoRA models with local LoRA directory ..."}
{"level": "INFO", ..., "message": "Done synchronizing LoRA models with local LoRA directory"}

The preceding steps are sample commands for downloading the NeMo format LoRA adapters. Refer to Parameter-Efficient Fine-Tuning in the NVIDIA NIM for LLMs documentation for information about using the Hugging Face Transformers format, the model directory structure, adapters for other models, and so on.
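
As a quick check after the microservice starts, you can send a completion request that names one of the adapters as the model. This sketch assumes the service is reachable at a placeholder endpoint and that the adapter directory name is served as the model name, as described in the NIM for LLMs documentation:

$ curl -s http://<nim-service-endpoint>/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3-8b-math", "prompt": "What is 2 + 2?", "max_tokens": 16}'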

Troubleshooting

NIM Cache Reports No Status

If you run kubectl get nimcache -n nim-service and the output does not report a status, perform the following actions to get more information:

  • Determine the state of the caching jobs:

    $ kubectl get jobs -n nim-service
    

    Example Output

    NAME                            COMPLETIONS   DURATION   AGE
    meta-llama3-8b-instruct-job     1/1           8s         2m57s
    

    View the logs from the job with a command like kubectl logs -n nim-service job/meta-llama3-8b-instruct-job.

    If the caching job is no longer available, delete the NIM cache resource and reapply the manifest.

  • Describe the NIM cache resource and review the conditions:

    $ kubectl describe nimcache -n nim-service <nim-cache-name>
    

    Partial Output

    Status:
      Conditions:
        Last Transition Time:  2024-09-19T13:20:53Z
        Message:               The PVC has been created for caching NIM model
        Reason:                PVCCreated
        Status:                True
        Type:                  NIM_CACHE_PVC_CREATED
        Last Transition Time:  2024-09-19T13:26:24Z
        Message:
        Reason:                Reconciled
        Status:                False
        Type:                  NIM_CACHE_RECONCILE_FAILED
        Last Transition Time:  2024-09-19T13:24:36Z
        Message:               The Job to cache NIM has been created
        Reason:                JobCreated
        Status:                True
        Type:                  NIM_CACHE_JOB_CREATED
        Last Transition Time:  2024-09-19T13:25:50Z
        Message:               The Job to cache NIM is in pending state
        Reason:                JobPending
        Status:                True
        Type:                  NIM_CACHE_JOB_PENDING
        Last Transition Time:  2024-09-19T13:25:50Z
        Message:               The Job to cache NIM has successfully completed
        Reason:                JobCompleted
        Status:                True
        Type:                  NIM_CACHE_JOB_COMPLETED
    

    The preceding output shows a NIM cache resource that eventually succeeded in downloading and caching a model profile.

    The NIM_CACHE_RECONCILE_FAILED condition was reported during the interval after the Operator created the caching job but before the pod was running, while the image was still downloading from NGC. In the output, the status for that condition is set to False to indicate that the condition is no longer accurate.

  • View the Operator logs by running kubectl logs -n nim-operator -l app.kubernetes.io/name=k8s-nim-operator.

No Profiles Are Selected for Caching

If the NIM cache controller does not automatically select a model profile to cache, you can use two methods to view the model profiles that are available:

  • The cache controller copies the model profiles into a config map. You can review the config map with a command like the following example to identify a model profile that uses a community backend such as vLLM or ONNX.

    $ kubectl get cm -n nim-service meta-llama3-8b-instruct-manifest -o yaml | less
    
  • Run the container on a host that is configured for Docker, NVIDIA Container Toolkit, and an NVIDIA GPU. Refer to the list-model-profiles command in the NVIDIA NIM for LLMs documentation.

After you determine a model profile that is compatible with your GPU, specify the model profile ID in the spec.source.ngc.model.profiles field and reapply the manifest.

Deleting a NIM Cache

To delete a NIM cache, perform the following steps.

  1. View the NIM cache custom resources:

    $ kubectl get nimcaches.apps.nvidia.com -A
    

    Example Output

    NAMESPACE     NAME                      STATUS   PVC           AGE
    nim-service   meta-llama3-8b-instruct   ready    model-store   2024-08-08T13:14:30Z
    
  2. Delete the custom resource:

    $ kubectl delete nimcache -n nim-service meta-llama3-8b-instruct
    

    If the Operator created the PVC when you created the NIM cache, the Operator deletes the PVC and the cached model profiles. You can determine if the Operator created the PVC by running a command like the following example:

    $ kubectl get nimcaches.apps.nvidia.com -n nim-service \
       -o=jsonpath='{range .items[*]}{.metadata.name}: {.spec.storage.pvc.create}{"\n"}{end}'
    

    Example Output

    meta-llama3-8b-instruct: true
    

Next Steps

  • Deploy NIM microservices either by adding NIM service custom resources or by managing several services in a single NIM pipeline custom resource.