Caching Non-LLM NIM#
About NVIDIA NIM Microservices#
NVIDIA NIM microservices are a set of easy-to-use microservices for accelerating the deployment of foundation models on any cloud or data center. These microservices help keep your data secure. NIM microservices have production-grade runtimes and support a wide variety of domains, such as retrieval, vision, speech, biology, and safety and moderation.
For more information, refer to NVIDIA NIM.
Non-LLM NIM Cache Sources#
You can pull models from the following sources and protocols: the NVIDIA NGC Catalog, and NGC mirrored local model registries that are reachable over S3, HTTPS, or JFrog Artifactory.
When you create a NIM Cache resource with the NGC Catalog as the source, the NIM Operator starts a pod that lists the available model profiles. The Operator creates a config map of the model profiles.
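As a sketch of how to inspect that config map, assuming the cache lives in the `nim-service` namespace as in the samples below (the config map name is assigned by the Operator, so the name shown is hypothetical):

```shell
# List config maps created alongside the NIM Cache resource.
kubectl get configmaps -n nim-service

# Inspect the recorded model profiles (substitute the actual config map name).
kubectl describe configmap <nimcache-profiles-configmap> -n nim-service
```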
To pull models from the NGC Catalog, you must have created Kubernetes secrets to hold your NGC Catalog API key, and you pass the secret names as `source.ngc.model.pullSecret` and `source.ngc.model.authSecret`. Refer to Image Pull Secrets for more details on creating these secrets.
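The samples in this topic reference secrets named `ngc-secret` and `ngc-api-secret`. As a sketch, assuming your NGC Catalog API key is exported as `NGC_API_KEY` and the cache runs in the `nim-service` namespace, the two secrets can be created like this:

```shell
# Docker registry secret used as the pull secret for nvcr.io images.
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="${NGC_API_KEY}" \
  -n nim-service

# Generic secret used as the auth secret for downloading model profiles.
kubectl create secret generic ngc-api-secret \
  --from-literal=NGC_API_KEY="${NGC_API_KEY}" \
  -n nim-service
```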
The following shows an example of using the NGC catalog as a cache source.
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: rerankqa-mistral-4b-v3
  namespace: nim-service
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:1.0.2
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model: # Include the model object to describe the model you want to pull from NGC.
        engine: tensorrt
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
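After writing the manifest to a file (the filename below is hypothetical), you can create the cache and watch its status until the model profiles finish downloading:

```shell
# Create the NIM Cache resource from the manifest above.
kubectl apply -f nimcache-rerankqa.yaml

# Check the cache status reported by the Operator.
kubectl get nimcache rerankqa-mistral-4b-v3 -n nim-service -o wide
```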
Note
NVIDIA recommends that you use profile filtering when caching models using `source.ngc.model`. Models can have several profiles, and without filtering by one or more parameters, you can download more models than intended, which can increase your storage requirements. For more information on NIM profiles and their model storage requirements, refer to the NIM Models documentation.
Refer to the following table for information about the fields for NVIDIA NGC Catalog as a NIM Cache source:

| Field | Description | Default Value |
|---|---|---|
| `model` | Specifies an object of filtering information for the model and profile you want to cache. If you want to cache a Multi-LLM model, use … | None |
| `model.engine` | Specifies a model caching constraint based on the engine. Each NIM microservice determines the supported engines; refer to the microservice documentation for the latest information. By default, the caching job matches model profiles for all engines. | None |
| `model.profiles` | Specifies an array of model profiles to cache. When you specify this field, automatic profile selection is disabled and all other model parameters are ignored. You can determine the model profiles by running the `list-model-profiles` command. | None |
| `modelPuller` | Specifies the container image that can cache model profiles. | None |

The following partial specification requests a specific model profile:

spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      model:
        profiles:
        - 8835c31...
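As a sketch, one way to run the `list-model-profiles` command is directly against the `modelPuller` image with your NGC API key. The flags follow the common NIM container pattern and may differ for a given microservice:

```shell
# List the model profiles that the NIM container can serve.
docker run --rm --gpus=all \
  -e NGC_API_KEY \
  nvcr.io/nim/nvidia/nv-rerankqa-mistral-4b-v3:1.0.2 \
  list-model-profiles
```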
To use an NGC Mirrored Local Model Registry as a NIM Cache source, set `NIM_REPOSITORY_OVERRIDE` as an environment variable for the NIM. Refer to Repository Override for NVIDIA NIM for LLMs for more detailed instructions, and to NIM for LLMs Environment Variables for more information on the `NIM_REPOSITORY_OVERRIDE` environment variable.
Note
The NIM Cache fields relevant to Mirrored Local Model Registries are the same as for NVIDIA NGC Catalog as a NIM Cache Source.
The following sample manifests are available in the `config/samples/nim/caching/ngc-mirror` directory.
S3
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-1b-instruct
  namespace: nim-service
spec:
  env:
  - name: NIM_REPOSITORY_OVERRIDE
    value: "s3://nim_bucket/"
  - name: AWS_PROFILE
    value: "default"
  - name: AWS_REGION
    value: "us-east-1"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: aws-api-secret
      model:
        engine: "tensorrt"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
Note
You must specify your AWS credentials in the `aws-api-secret` using the following environment variables:

- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_SESSION_TOKEN (if using temporary credentials)
For more information, refer to Configure AWS Credentials.
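As a sketch, the `aws-api-secret` referenced by the sample above can be created from exported AWS credentials; omit the `AWS_SESSION_TOKEN` line if you are not using temporary credentials:

```shell
# Create the secret holding the AWS credential environment variables.
kubectl create secret generic aws-api-secret \
  --from-literal=AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID}" \
  --from-literal=AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY}" \
  --from-literal=AWS_SESSION_TOKEN="${AWS_SESSION_TOKEN}" \
  -n nim-service
```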
HTTPS
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-1b-instruct
  namespace: nim-service
spec:
  env:
  - name: NIM_REPOSITORY_OVERRIDE
    value: "https://<server-name>:<port>/"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: https-api-secret
      model:
        engine: "tensorrt"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
JFrog
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-2-1b-instruct
  namespace: nim-service
spec:
  env:
  - name: NIM_REPOSITORY_OVERRIDE
    value: "jfrog://<server-name>:<port>/"
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
      pullSecret: ngc-secret
      authSecret: jfrog-api-secret
      model:
        engine: "tensorrt"
        tensorParallelism: "1"
  storage:
    pvc:
      create: true
      storageClass: ''
      size: "50Gi"
      volumeAccessMode: ReadWriteOnce
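Whichever mirror protocol you use, you can verify the resulting cache the same way. A sketch, assuming the sample resource name above:

```shell
# Check the reported cache status.
kubectl get nimcache meta-llama3-2-1b-instruct -n nim-service

# Inspect events and conditions if caching stalls.
kubectl describe nimcache meta-llama3-2-1b-instruct -n nim-service
```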