Caching Non-LLM NIM

About NVIDIA NIM Microservices

NVIDIA NIM microservices are a set of easy-to-use microservices for accelerating the deployment of foundation models on any cloud or data center. These microservices help keep your data secure. NIM microservices have production-grade runtimes and support a wide variety of domains, such as retrieval, vision, speech, biology, and safety and moderation.

For more information, refer to NVIDIA NIM.

Non-LLM NIM Cache Sources

You can pull from the following sources and protocols: the NVIDIA NGC Catalog, or an NGC mirrored local model registry accessed over S3, HTTPS, or JFrog.

When you create a NIM Cache resource with the NGC Catalog as the source, the NIM Operator starts a pod that lists the available model profiles and records them in a config map.

To pull models from the NGC Catalog, you must first create Kubernetes secrets that hold your NGC Catalog API key and pass the secret names as source.ngc.pullSecret and source.ngc.authSecret. Refer to Image Pull Secrets for more details on creating these secrets.
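As a minimal sketch, you can create both secrets with kubectl. This assumes the secret names ngc-secret and ngc-api-secret used in the examples below, and that your API key is in the NGC_API_KEY environment variable:

```shell
# Image pull secret for nvcr.io (referenced as source.ngc.pullSecret).
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}" \
  -n nim-service

# Generic secret holding the API key (referenced as source.ngc.authSecret).
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}" \
  -n nim-service
```

Create the secrets in the same namespace as the NIM Cache resource.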

The following example uses the NGC Catalog as a cache source.

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: rerankqa-mistral-4b-v3
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:1.0.2
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:   # Include the model object to describe the model that you want to pull from NGC
        engine: tensorrt
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

Note

NVIDIA recommends that you use profile filtering when caching models with the source.ngc.model fields. Models can have several profiles, and without filtering by one or more parameters, you can download more model profiles than intended, which can increase your storage requirements. For more information about NIM profiles and their model storage requirements, refer to the NIM Models documentation.
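After you create the manifest, you can apply it and monitor the caching job. The following is a sketch that assumes the example manifest above is saved as nimcache.yaml:

```shell
# Create the NIM cache resource.
kubectl apply -f nimcache.yaml

# Check the cache status until the caching job completes.
kubectl get nimcache rerankqa-mistral-4b-v3 -n nim-service

# Inspect events and conditions if the cache does not become ready.
kubectl describe nimcache rerankqa-mistral-4b-v3 -n nim-service
```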

Refer to the following fields when you use the NVIDIA NGC Catalog as a NIM Cache source:

spec.source.ngc.model

  Specifies an object of filtering information for the LLM-specific model and profile that you want to cache. To cache a multi-LLM model, use spec.source.ngc.modelEndpoint instead.

  Default: None

spec.source.ngc.model.engine

  Specifies a model caching constraint based on the engine. Common values are as follows:

    • tensorrt – optimized engine for embedding and reranking
    • onnx – community engine

  Each NIM microservice determines the supported engines; refer to the microservice documentation for the latest information. By default, the caching job matches model profiles for all engines.

  Default: None

spec.source.ngc.model.profiles

  Specifies an array of model profiles to cache. When you specify this field, automatic profile selection is disabled and all other source.ngc.model fields are ignored.

  The following partial specification requests a specific model profile:

  spec:
    source:
      ngc:
        modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
        model:
          profiles:
          - 8835c31...

  You can determine the available model profiles by running the list-model-profiles command. You can also specify all to download every model profile. Use all with care: some models have many profiles that can take several minutes to download and require a large amount of storage space to cache.

  Default: None

spec.source.ngc.modelPuller

  Specifies the container image that caches the model profiles.

  Default: None
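To discover the profile IDs to use in the profiles field, you can run the NIM container's list-model-profiles command directly. The following is a sketch using Docker; it assumes GPU access on the host and that NGC_API_KEY is set in your environment:

```shell
# Print the model profiles that this NIM image supports.
docker run --rm --gpus=all \
  -e NGC_API_KEY \
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.3 \
  list-model-profiles
```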

To use an NGC mirrored local model registry as a NIM Cache source, set the NIM_REPOSITORY_OVERRIDE environment variable for the NIM microservice.

Refer to Repository Override for NVIDIA NIM for LLMs for detailed instructions, and refer to NIM for LLMs Environment Variables for more information about the NIM_REPOSITORY_OVERRIDE environment variable.

Note

The NIM Cache fields relevant to Mirrored Local Model Registries are the same as for NVIDIA NGC Catalog as a NIM Cache Source.

The following sample manifests are available in the config/samples/nim/caching/ngc-mirror directory.

S3

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-1b-instruct
  namespace: nim-service
spec:
  env:
    - name: NIM_REPOSITORY_OVERRIDE
      value: "s3://nim_bucket/"
    - name: AWS_PROFILE
      value: "default"
    - name: AWS_REGION
      value: "us-east-1"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: aws-api-secret
      model:
        engine: "tensorrt"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

Note

You must specify your AWS credentials in the aws-api-secret using the following environment variable names as keys:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_SESSION_TOKEN (only when you use temporary credentials)

For more information, refer to Configure AWS Credentials.
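For example, you can create the aws-api-secret with kubectl. This is a sketch; the credential values are placeholders:

```shell
# Secret referenced as source.ngc.authSecret in the S3 example above.
kubectl create secret generic aws-api-secret \
  --from-literal=AWS_ACCESS_KEY_ID="<access-key-id>" \
  --from-literal=AWS_SECRET_ACCESS_KEY="<secret-access-key>" \
  -n nim-service
# Add --from-literal=AWS_SESSION_TOKEN="<session-token>" if you use temporary credentials.
```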

HTTPS

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-1b-instruct
  namespace: nim-service
spec:
  env:
    - name: NIM_REPOSITORY_OVERRIDE
      value: "https://<server-name>:<port>/"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: https-api-secret
      model:
        engine: "tensorrt"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce

JFrog

apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-1b-instruct
  namespace: nim-service
spec:
  env:
    - name: NIM_REPOSITORY_OVERRIDE
      value: "jfrog://<server-name>:<port>/"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: jfrog-api-secret
      model:
        engine: "tensorrt"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce