Caching LLM NIM#

Support for Multi-LLM NIM and LLM-Specific NIM#

NIM Cache supports two options for NVIDIA NIM for Large Language Models (LLMs):

  • LLM-Specific NIM: Each container is focused on individual models or model families, offering maximum performance.

  • Multi-LLM NIM: A single container that enables the deployment of a broad range of models, offering maximum flexibility.

Refer to Overview of NVIDIA NIM for Large Language Models (LLMs) for more information.

Refer to Configure Your NIM for LLMs for detailed configuration instructions.

Caching LLM-Specific NIM#

NVIDIA LLM NIM microservices use model engines that are tuned for specific NVIDIA GPUs, number of GPUs, precision, and other resources. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.

NIM LLM microservices support automatic profile selection by determining the GPU model and count on the node and attempting to match the optimal model profile. Alternatively, NIM supports running a specified model profile, but this requires that you review the available profiles and know the profile ID. For more information, refer to Model Profiles in the NVIDIA NIM for LLMs documentation.

The Operator complements model selection by NIM microservices in the following ways:

  • Support for caching and running a specified model profile ID, like the NIM microservices.

  • Support for caching all model profiles.

  • Support for influencing model selection by specifying the engine, precision, tensor parallelism, and other parameters to match.

The NIM Cache custom resource manages model caching and provides the way to specify the model selection criteria.

LLM-Specific NIM Cache Sources#

You can pull LLM-Specific NIM model profiles from the NVIDIA NGC Catalog or from an NGC mirrored local model registry.

When you create a NIM Cache resource with the NGC Catalog as the source, the NIM Operator starts a pod that lists the available model profiles. The Operator creates a config map of the model profiles.
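
After the NIM Cache resource is created, you can inspect the generated config map. A minimal check, assuming the cache is created in the nim-service namespace as in the samples below:

$ kubectl get configmaps -n nim-service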

LLM-Specific NIM provides NVIDIA-validated, optimized model profiles for popular data center GPU models, varying GPU counts, and specific numeric precisions.

To pull models from the NGC Catalog, you must create Kubernetes secrets that hold your NGC Catalog API key and registry credentials, and pass the secret names as source.ngc.pullSecret and source.ngc.authSecret. Refer to Image Pull Secrets for more details on creating these secrets.
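
For reference, the following is a minimal sketch of creating these secrets with kubectl. The secret names ngc-secret and ngc-api-secret, the nim-service namespace, and the NGC_API_KEY key name match the sample manifests in this section; replace the placeholder with your NGC Catalog API key.

$ # Image pull secret for nvcr.io
$ kubectl create secret docker-registry ngc-secret -n nim-service \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="<ngc-api-key>"
$ # Secret that holds the NGC Catalog API key
$ kubectl create secret generic ngc-api-secret -n nim-service \
    --from-literal=NGC_API_KEY="<ngc-api-key>"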

Refer to the following sample manifest that uses the NGC Catalog as a cache source:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        engine: tensorrt_llm
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ""
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
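
If you save the preceding manifest to a file, you can apply it and then watch the NIM cache status. The file name here is illustrative; as described later in this section, wait until the cache reports Ready:

$ kubectl apply -f nimcache-llama-3-2-1b-instruct.yaml
$ kubectl get nimcache -n nim-service meta-llama-3-2-1b-instruct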

Use the source.ngc.model object to describe the LLM-Specific model that you want to pull from the NGC Catalog.

When using source.ngc.model, if you specify one or more model profile IDs to cache, the Operator starts a job that caches the model profiles that you specified.

If you do not specify model profile IDs, but do specify engine: tensorrt_llm or engine: tensorrt, the Operator attempts to match the model profiles with the GPUs on the nodes in the cluster. The Operator uses the value of the nvidia.com/gpu.product node label that is set by Node Feature Discovery.

You can let the Operator automatically detect the model profiles to cache or you can constrain the model profiles by specifying values for spec.source.ngc.model, such as the engine or GPU model, that must match the model profile.
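
To see the GPU product label that the Operator matches against, you can list it for the nodes in your cluster. This assumes that Node Feature Discovery or GPU Operator has already labeled the nodes:

$ kubectl get nodes -L nvidia.com/gpu.product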

Note

NVIDIA recommends that you use profile filtering when caching models with source.ngc.model. Models can have several profiles, and without filtering by one or more parameters, you can download more profiles than intended, which can increase your storage requirements. For more information about NIM profiles and their model storage requirements, refer to the NIM Models documentation.

The following fields configure the NVIDIA NGC Catalog as a NIM Cache source. Each field is listed with its description and default value.

spec.source.ngc.model

  Specifies an object of filtering information for the LLM-Specific model and profile you want to cache. If you want to cache a Multi-LLM model, use spec.source.ngc.modelEndpoint instead.

  Default: None

spec.source.ngc.model.buildable

  When set to true, specifies to cache a model profile that can be optimized with an NVIDIA engine for any GPU. This can be used in conjunction with NIM Build to cache built models.

  Default: None

spec.source.ngc.model.engine

  Specifies a model caching constraint based on the engine. Common values are as follows:

  • tensorrt_llm – optimized engine for NIM for LLMs

  • vllm – community engine for NIM for LLMs

  • tensorrt – optimized engine for embedding and reranking

  • onnx – community engine

  Each NIM microservice determines the supported engines. Refer to the microservice documentation for the latest information.

  By default, the caching job matches model profiles for all engines.

  Default: None

spec.source.ngc.model.gpus

  Specifies a list of model caching constraints to use the specified GPU model, by PCI ID or product name.

  By default, the caching job detects all GPU models in the cluster and matches model profiles for all GPUs.

  The following partial specification requests a model profile that is compatible with an NVIDIA L40S.

  spec:
    source:
      ngc:
        model:
          ...
          gpus:
          - ids:
            - "26b5"

  If GPU Operator or Node Feature Discovery is running on your cluster, you can determine the PCI IDs for the GPU models on your nodes by viewing the node labels that begin with feature.node.kubernetes.io/pci-10de-. Alternatively, if you know the device name, you can look up the PCI ID from the table in the NVIDIA Open Kernel Modules repository on GitHub.

  The following partial specification requests a model profile that is compatible with an NVIDIA A100.

  spec:
    source:
      ngc:
        model:
          ...
          gpus:
          - product: "a100"

  The product name, such as h100, a100, or l40s, must match the model profile, as shown for the list-model-profiles command in the NVIDIA NIM for LLMs documentation. The value is not case-sensitive.

  Default: None

spec.source.ngc.model.lora

  When set to true, specifies to cache a model profile that is compatible with LoRA.

  Refer to LoRA Models and Adapters for more information.

  Default: false

spec.source.ngc.model.precision

  Specifies the model profile quantization to match. Common values are fp16, bf16, and fp8. Like the GPU product name field, the value must match the model profile, as shown in the list-model-profiles command.

  Default: None

spec.source.ngc.model.profiles

  Specifies an array of model profiles to cache.

  When you specify this field, automatic profile selection is disabled and all other source.ngc.model fields are ignored.

  The following partial specification requests a specific model profile.

  spec:
    source:
      ngc:
        modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
        model:
          profiles:
          - 8835c31...

  You can determine the model profiles by running the list-model-profiles command.

  You can specify all to download all model profiles. Use the all value with care: some models have many profiles, which can take several minutes to download and require a large amount of storage space to cache.

  Default: None

spec.source.ngc.model.qosProfile

  Specifies the model profile quality of service to match. Values are latency or throughput.

  Default: None

spec.source.ngc.model.tensorParallelism

  Specifies the model profile tensor parallelism to match. The node that runs the NIM microservice and serves the model must have at least the specified number of GPUs. Common values are "1", "2", and "4".

  Default: None

spec.source.ngc.modelPuller

  Specifies the container image that can cache model profiles.

  Default: None

spec.source.ngc.modelEndpoint

  Specifies the endpoint of the Multi-LLM model you want to cache. You cannot specify both a model endpoint and a model in the same NIM cache. If you want to cache an LLM-Specific model, use spec.source.ngc.model instead.

  Default: None
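
For the spec.source.ngc.model.gpus ids field described above, one way to list the NVIDIA PCI device labels on a node is to filter the node description for the pci-10de prefix. The node name is a placeholder:

$ kubectl describe node <node-name> | grep feature.node.kubernetes.io/pci-10de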

You can then create a NIM service that uses the NIM cache. The following sample manifest shows a NIM Service custom resource that references the preceding NIM Cache:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
    tag: "1.12.0"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: meta-llama-3-2-1b-instruct
      profile: ''
  replicas: 1
  resources:
    limits:
      nvidia.com/gpu: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
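
After the NIM service is running, you can send a test request. The following sketch assumes that the Service created by the Operator shares the NIMService name and that the served model is listed as meta/llama-3.2-1b-instruct; verify the actual name with the /v1/models endpoint first:

$ kubectl port-forward -n nim-service service/meta-llama-3-2-1b-instruct 8000:8000
$ # In a second terminal
$ curl -s http://localhost:8000/v1/models
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama-3.2-1b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'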

To use an NGC Mirrored Local Model Registry as a NIM Cache source, set the NIM_REPOSITORY_OVERRIDE environment variable for the NIM.

Refer to Repository Override for NVIDIA NIM for LLMs for more detailed instructions.

Refer to NIM for LLMs Environment Variables for more information on the NIM_REPOSITORY_OVERRIDE environment variable.

The following is a sample manifest:

# LLM NIM Cache with Mirrored Local Model Registry
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-1b-instruct
  namespace: nim-service
spec:
  env:
    - name: NIM_REPOSITORY_OVERRIDE
      value: "https://<server-name>:<port>/"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: https-api-secret
      model:
        engine: "tensorrt"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

The NIM Cache fields relevant to Mirrored Local Model Registries are the same as for NVIDIA NGC Catalog as a NIM Cache Source.

For more sample manifests, refer to the config/samples/nim/caching/ngc-mirror/ directory.

Caching Locally Built LLM NIM Engines#

The NIM Operator can cache dynamically built TensorRT-LLM engines by using NIM Build. This allows you to generate TensorRT-LLM engines for the specific GPUs in your cluster. Refer to Caching Locally Built LLM NIM Engines for more information.

LoRA Models and Adapters#

NVIDIA NIM for LLMs supports LoRA parameter-efficient fine-tuning (PEFT) adapters trained by the NeMo Framework and Hugging Face Transformers libraries.

You must download the LoRA adapters manually and make them available to the NIM microservice. The following steps describe one way to meet the requirement.

  1. Specify lora: true in the NIM cache manifest that you apply:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NIMCache
    metadata:
      name: meta-llama3-8b-instruct
    spec:
      source:
        ngc:
          modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
          pullSecret: ngc-secret
          authSecret: ngc-api-secret
          model:
            tensorParallelism: "1"
            lora: true
            gpus:
            - product: "a100"
      storage:
        pvc:
          create: true
          size: "50Gi"
          storageClass: <storage-class>
          volumeAccessMode: ReadWriteMany
      resources: {}
    

    Apply the manifest and wait until kubectl get nimcache -n nim-service meta-llama3-8b-instruct shows the NIM cache is Ready.

  2. Use the NGC CLI or Hugging Face CLI to download the LoRA adapters.

    • Build a container that includes the CLIs, such as the following example:

      FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212
      ARG NGC_CLI_VERSION=3.50.0
      
      RUN apt-get update && \
            DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
            wget \
            unzip \
            python3-pip
      
      RUN useradd -m -s /bin/bash -u 1000 ubuntu
      USER ubuntu
      
      RUN wget --content-disposition --no-check-certificate \
            https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip -O /tmp/ngccli_linux.zip && \
          unzip /tmp/ngccli_linux.zip -d ~ && \
          rm /tmp/ngccli_linux.zip
      
      ENV PATH=/home/ubuntu/ngc-cli:$PATH
      
      RUN pip install -U "huggingface_hub[cli]"
      

      Push the container to a registry that the nodes in your cluster can access.

    • Apply a manifest like the following example that runs the container.

      When you create your manifest, note the following key considerations for the pod specification:

      • Mount the same PVC that the NIM microservice accesses for the model.

      • Specify the same user ID and group ID that the NIM cache container used. The following manifest shows the default values.

      • Specify the NGC_CLI_API_KEY and NGC_CLI_ORG environment variables. The value for the organization might be different.

      • Start the pod in the nim-service namespace so that the pod can access the ngc-api-secret secret.

      apiVersion: v1
      kind: Pod
      metadata:
        name: ngc-cli
      spec:
        containers:
        - env:
          - name: NGC_CLI_API_KEY
            valueFrom:
              secretKeyRef:
                key: NGC_API_KEY
                name: ngc-api-secret
          - name: NGC_CLI_ORG
            value: "nemo-microservices/ea-participants"
          - name: NIM_PEFT_SOURCE
            value: "/model-store/loras"
          image: <private-registry>/<image-name>:<image-tag>
          command: ["sleep"]
          args: ["inf"]
          name: ngc-cli
          securityContext:
            capabilities:
              drop:
              - ALL
            runAsNonRoot: true
          volumeMounts:
          - mountPath: /model-store
            name: model-store
        restartPolicy: Never
        securityContext:
          fsGroup: 2000
          runAsGroup: 2000
          runAsUser: 1000
          seLinuxOptions:
            level: s0:c28,c2
        volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: meta-llama3-8b-instruct-pvc
      
  3. Access the pod:

    $ kubectl exec -it -n nim-service ngc-cli -- bash
    

    The pod might report groups: cannot find name for group ID 2000. You can ignore the message.

  4. From the terminal in the pod, download the LoRA adapters.

    • Make a directory for the LoRA adapters:

      $ mkdir $NIM_PEFT_SOURCE
      $ cd $NIM_PEFT_SOURCE
      
    • Download the adapters:

      $ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
      $ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
      
    • Rename the directories to match the naming convention for the LoRA model directory structure:

      $ mv llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-math
      $ mv llama3-8b-instruct-lora_vnemo-squad-v1 llama3-8b-squad
      
    • Press Ctrl+D to exit the pod and then run kubectl delete pod -n nim-service ngc-cli.

When you create a NIM service instance, specify the NIM_PEFT_SOURCE environment variable:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama3-8b-instruct
spec:
  ...
  env:
  - name: NIM_PEFT_SOURCE
    value: "/model-store/loras"

After the NIM microservice is running, monitor the logs for records like the following:

{"level": "INFO", ..., "message": "LoRA models synchronizer successfully initialized!"}
{"level": "INFO", ..., "message": "Synchronizing LoRA models with local LoRA directory ..."}
{"level": "INFO", ..., "message": "Done synchronizing LoRA models with local LoRA directory"}

The preceding steps are sample commands for downloading the NeMo format LoRA adapters. Refer to Parameter-Efficient Fine-Tuning in the NVIDIA NIM for LLMs documentation for information about using the Hugging Face Transformers format, the model directory structure, and adapters for other models.

Caching Multi-LLM Compatible NIM#

To get the container for the Multi-LLM NIM, refer to LLM NIM Overview in the NVIDIA NGC Catalog.

Multi-LLM NIM Cache Sources#

You can pull Multi-LLM compatible models and datasets from Hugging Face Hub.

When you create a NIM cache resource with the Hugging Face Hub as the source, the NIM Operator generates a Hugging Face CLI command to pull the requested model or dataset from Hugging Face, using the inputs from the spec.source.hf object.

To cache models or datasets from Hugging Face Hub, you must have a Hugging Face Hub account and a user access token, and the models or datasets that you want to cache must be accessible to that account.
Create a Kubernetes secret with your Hugging Face Hub user access token to use as the spec.source.hf.authSecret.
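
For example, you can create the token secret with a command like the following. The secret name hf-api-secret matches the sample manifest below, and HF_TOKEN as the key name is based on the field descriptions later in this section; replace the placeholder with your own token:

$ kubectl create secret generic hf-api-secret -n nim-service \
    --from-literal=HF_TOKEN="<huggingface-user-access-token>"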

Note

When downloading models from Hugging Face for Multi-LLM NIM, log files can show permission denied warnings that can be safely ignored.

The following NIM Cache custom resource shows a sample manifest of using the Hugging Face Hub as a cache source:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: nim-cache-multi-llm
  namespace: nim-service
spec:
  source:
    hf:
      endpoint: "https://huggingface.co"
      namespace: "meta-llama"
      authSecret: hf-api-secret
      modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
      pullSecret: ngc-secret
      modelName: "Llama-3.2-1B-Instruct"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

If you want to cache a dataset, use spec.source.hf.datasetName: "dataset-name" instead of spec.source.hf.modelName. A cache can only be configured to pull a model or a dataset, not both.
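
For example, the following partial specification is a sketch of caching a dataset instead of a model. The namespace and dataset name are placeholders:

spec:
  source:
    hf:
      endpoint: "https://huggingface.co"
      namespace: "<hf-namespace>"
      authSecret: hf-api-secret
      modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
      pullSecret: ngc-secret
      datasetName: "<dataset-name>"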

The following fields configure Hugging Face Hub as a NIM Cache source. Each field is listed with its description and default value.

spec.source.hf.authSecret (required)

  Specifies the name of the secret that contains your HF_TOKEN user access token. Required if you are using Hugging Face Hub as a source of your model or dataset.

  Default: None

spec.source.hf.datasetName (required)

  Specifies the name of the dataset from Hugging Face Hub. Required if you are using Hugging Face Hub as a source of your dataset.

  Default: None

spec.source.hf.endpoint (required)

  Specifies the Hugging Face Hub endpoint. Required if you are using Hugging Face Hub as a source of your model or dataset.

  Default: None

spec.source.hf.modelName

  Specifies the name of the model that you want to use from Hugging Face Hub. Required if you are using Hugging Face Hub as a source of your model.

  Default: None

spec.source.hf.modelPuller

  Specifies the containerized huggingface-cli image that pulls the data. Required if you are using Hugging Face Hub as a source of your model or dataset.

  Default: None

spec.source.hf.namespace

  Specifies the namespace in the Hugging Face Hub. Required if you are using Hugging Face Hub as a source of your model or dataset.

  Default: None

spec.source.hf.pullSecret

  Specifies the name of the image pull secret for the modelPuller image. Required if you are using Hugging Face Hub as a source of your model or dataset.

  Default: None

You can then create a NIM service that uses the NIM cache. The following sample manifest shows a NIM Service custom resource that references the preceding NIM Cache:

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
  name: meta-llama-3-2-1b-instruct
  namespace: nim-service
spec:
  image:
    repository: nvcr.io/nim/nvidia/llm-nim
    tag: "1.12"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  authSecret: ngc-api-secret
  storage:
    nimCache:
      name: nim-cache-multi-llm
      profile: 'tensorrt_llm'
  resources:
    limits:
      nvidia.com/gpu: 1
      cpu: "12"
      memory: 32Gi
    requests:
      nvidia.com/gpu: 1
      cpu: "4"
      memory: 6Gi
  replicas: 1
  expose:
    service:
      type: ClusterIP
      port: 8000
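
After you apply the manifest, you can confirm that the NIM service becomes ready and that its pods are running:

$ kubectl get nimservice -n nim-service meta-llama-3-2-1b-instruct
$ kubectl get pods -n nim-service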