Caching LLM NIM#
Support for NVIDIA NIM for Multi-LLM and LLM-Specific NIM#
NIM Cache supports two options for NVIDIA NIM for Large Language Models (LLMs):
LLM-Specific NIM: Each container is focused on individual models or model families, offering maximum performance.
Multi-LLM NIM: A single container that enables the deployment of a broad range of models, offering maximum flexibility.
Refer to Overview of NVIDIA NIM for Large Language Models (LLMs) for more information.
Refer to Configure Your NIM for LLMs for detailed configuration instructions.
Caching LLM-Specific NIM#
NVIDIA LLM NIM microservices use model engines that are tuned for specific NVIDIA GPU models, GPU counts, precisions, and other resources. NVIDIA produces model engines for several popular combinations, and these are referred to as model profiles.
NIM LLM microservices support automatic profile selection by determining the GPU model and count on the node and attempting to match the optimal model profile. Alternatively, NIM supports running a specified model profile, but this requires that you review the profiles and know the profile ID. For more information, refer to Model Profiles in the NVIDIA NIM for LLMs documentation.
The Operator complements model selection by NIM microservices in the following ways:
Support for caching and running a specified model profile ID, like the NIM microservices.
Support for caching all model profiles.
Support for influencing model selection by specifying the engine, precision, tensor parallelism, and other parameters to match.
The NIM Cache custom resource manages model caching and provides the way to specify the model selection criteria.
LLM-Specific NIM Cache Sources#
You can pull LLM-Specific NIM models from the NVIDIA NGC Catalog or from an NGC mirrored local model registry.
When you create a NIM Cache resource with the NGC Catalog as the source, the NIM Operator starts a pod that lists the available model profiles. The Operator creates a config map of the model profiles.
LLM-Specific NIM provides NVIDIA-validated, optimized model profiles for popular data center GPU models, varying GPU counts, and specific numeric precisions.
To pull models from the NGC Catalog, you must create Kubernetes secrets that hold your NGC Catalog API key and pass the secret names as source.ngc.pullSecret and source.ngc.authSecret. Refer to Image Pull Secrets for more details on creating these secrets.
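For example, assuming your NGC API key is exported in the NGC_API_KEY environment variable, you can create the two secrets with commands like the following. The secret and key names here are only an illustration and must match the names that you reference in your NIM Cache manifest:
$ kubectl create secret docker-registry ngc-secret -n nim-service \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' \
    --docker-password="${NGC_API_KEY}"
$ kubectl create secret generic ngc-api-secret -n nim-service \
    --from-literal=NGC_API_KEY="${NGC_API_KEY}"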
Refer to the following sample manifest using the NGC catalog as a cache source:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
name: meta-llama-3-2-1b-instruct
namespace: nim-service
spec:
source:
ngc:
modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
pullSecret: ngc-secret
authSecret: ngc-api-secret
model:
engine: tensorrt_llm
tensorParallelism: "1"
storage:
pvc:
create: true
storageClass: ""
size: "50Gi"
volumeAccessMode: ReadWriteOnce
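After you apply the manifest, you can watch the cache status until it reports Ready. The file name below is a placeholder for wherever you saved the manifest:
$ kubectl apply -f nimcache-llama-3-2-1b-instruct.yaml
$ kubectl get nimcache -n nim-service meta-llama-3-2-1b-instruct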
Use the source.ngc.model object to describe the LLM-Specific model that you want to pull from the NGC Catalog. If you specify one or more model profile IDs to cache, the Operator starts a job that caches the model profiles that you specified.
If you do not specify model profile IDs but do specify engine: tensorrt_llm or engine: tensorrt, the Operator attempts to match the model profiles with the GPUs on the nodes in the cluster. The Operator uses the value of the nvidia.com/gpu.product node label that is set by Node Feature Discovery. You can let the Operator automatically detect the model profiles to cache, or you can constrain the model profiles by specifying values in spec.source.ngc.model, such as the engine or GPU model, that must match the model profile.
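For example, you can check which GPU product label each node advertises with the following command:
$ kubectl get nodes -L nvidia.com/gpu.product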
Note
NVIDIA recommends that you use profile filtering when caching models with source.ngc.model. Models can have several profiles, and without filtering by one or more parameters, you can download more profiles than intended, which can increase your storage requirements. For more information on NIM profiles and their model storage requirements, refer to the NIM Models documentation.
Refer to the following fields for configuring the NVIDIA NGC Catalog as a NIM Cache source. None of the fields has a default value.

spec.source.ngc.model
Specifies an object of filtering information for the LLM-Specific model and profile that you want to cache. If you want to cache a Multi-LLM model, use spec.source.ngc.modelEndpoint instead.

spec.source.ngc.model.buildable
When set to true, the caching job matches buildable model profiles, which are built locally rather than downloaded as prebuilt engines. Refer to Caching Locally Built LLM NIM Engines for more information.

spec.source.ngc.model.engine
Specifies a model caching constraint based on the engine. Common values are tensorrt_llm and tensorrt. Each NIM microservice determines the supported engines. Refer to the microservice documentation for the latest information. By default, the caching job matches model profiles for all engines.

spec.source.ngc.model.gpus
Specifies a list of model caching constraints to use the specified GPU model, by PCI ID or product name. By default, the caching job detects all GPU models in the cluster and matches model profiles for all GPUs.
The following partial specification requests a model profile that is compatible with an NVIDIA L40S:
spec:
  source:
    ngc:
      model:
        ...
        gpus:
        - ids:
          - "26b5"
If GPU Operator or Node Feature Discovery is running on your cluster, you can determine the PCI IDs for the GPU models on your nodes by viewing the node labels that begin with feature.node.kubernetes.io/pci-10de.
The following partial specification requests a model profile that is compatible with an NVIDIA A100:
spec:
  source:
    ngc:
      model:
        ...
        gpus:
        - product: "a100"
The product name, such as a100, is matched against the nvidia.com/gpu.product node label.

spec.source.ngc.model.lora
When set to true, the caching job matches model profiles that support LoRA adapters. Refer to LoRA Models and Adapters for more information.

spec.source.ngc.model.precision
Specifies the model profile quantization to match. Common values include fp16 and fp8.

spec.source.ngc.model.profiles
Specifies an array of model profiles to cache. When you specify this field, automatic profile selection is disabled and the other spec.source.ngc.model fields are ignored.
The following partial specification requests a specific model profile:
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      model:
        profiles:
        - 8835c31...
You can determine the model profiles by running the list-model-profiles command. You can specify all to cache all model profiles.

spec.source.ngc.model.qosProfile
Specifies the model profile quality of service to match. Values are throughput and latency.

spec.source.ngc.model.tensorParallelism
Specifies the model profile tensor parallelism to match. The node that runs the NIM microservice and serves the model must have at least the specified number of GPUs. Common values are "1", "2", and "4".

spec.source.ngc.modelPuller
Specifies the container image that can cache model profiles.

spec.source.ngc.modelEndpoint
Specifies the endpoint of the Multi-LLM model that you want to cache. You cannot specify both a model endpoint and a model in a NIM cache. If you want to cache an LLM-Specific model, use spec.source.ngc.model instead.

You can then create a NIM Service using the NIM Cache. The following shows a sample manifest of a NIM Service custom resource using the above NIM Cache:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
name: meta-llama-3-2-1b-instruct
namespace: nim-service
spec:
image:
repository: nvcr.io/nim/meta/llama-3.2-1b-instruct
tag: "1.12.0"
pullPolicy: IfNotPresent
pullSecrets:
- ngc-secret
authSecret: ngc-api-secret
storage:
nimCache:
name: meta-llama-3-2-1b-instruct
profile: ''
replicas: 1
resources:
limits:
nvidia.com/gpu: 1
expose:
service:
type: ClusterIP
port: 8000
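After the NIM Service is running, you can send a test request to the OpenAI-compatible API. The following sketch assumes that the Operator exposes a Kubernetes service with the same name as the NIMService resource and that the served model name is meta/llama-3.2-1b-instruct; you can confirm the name by querying the /v1/models endpoint:
$ kubectl port-forward -n nim-service service/meta-llama-3-2-1b-instruct 8000:8000
$ curl -s http://localhost:8000/v1/models
$ curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama-3.2-1b-instruct", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'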
To use an NGC Mirrored Local Model Registry as a NIM Cache source, set NIM_REPOSITORY_OVERRIDE as an environment variable for the NIM. Refer to Repository Override for NVIDIA NIM for LLMs for detailed instructions, and to NIM for LLMs Environment Variables for more information about the NIM_REPOSITORY_OVERRIDE environment variable.
The following is a sample manifest:
# LLM NIM Cache with Mirrored Local Model Registry
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
name: meta-llama3-2-1b-instruct
namespace: nim-service
spec:
env:
- name: NIM_REPOSITORY_OVERRIDE
value: "https://<server-name>:<port>/"
source:
ngc:
modelPuller: nvcr.io/nim/meta/llama-3.2-1b-instruct:1.12.0
pullSecret: ngc-secret
authSecret: https-api-secret
model:
engine: "tensorrt"
tensorParallelism: "1"
storage:
pvc:
create: true
storageClass: ''
size: "50Gi"
volumeAccessMode: ReadWriteOnce
The NIM Cache fields relevant to Mirrored Local Model Registries are the same as for NVIDIA NGC Catalog as a NIM Cache Source.
For more sample manifests, refer to the config/samples/nim/caching/ngc-mirror/ directory.
Caching Locally Built LLM NIM Engines#
NIM Operator can cache dynamically built TensorRT-LLM engines using NIM Build. This allows you to generate TensorRT-LLM engines for the specific GPUs in your cluster. Refer to Caching Locally Built LLM NIM Engines for more information.
LoRA Models and Adapters#
NVIDIA NIM for LLMs supports LoRA parameter-efficient fine-tuning (PEFT) adapters trained by the NeMo Framework and Hugging Face Transformers libraries.
You must download the LoRA adapters manually and make them available to the NIM microservice. The following steps describe one way to meet the requirement.
Specify lora: true in the NIM cache manifest that you apply:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  source:
    ngc:
      modelPuller: nvcr.io/nim/meta/llama3-8b-instruct:1.0.3
      pullSecret: ngc-secret
      authSecret: ngc-api-secret
      model:
        tensorParallelism: "1"
        lora: true
        gpus:
        - product: "a100"
  storage:
    pvc:
      create: true
      size: "50Gi"
      storageClass: <storage-class>
      volumeAccessMode: ReadWriteMany
  resources: {}
Apply the manifest and wait until kubectl get nimcache -n nim-service meta-llama3-8b-instruct shows that the NIM cache is Ready.
Use the NGC CLI or Hugging Face CLI to download the LoRA adapters.
Build a container that includes the CLIs, such as the following example:
FROM nvcr.io/nvidia/base/ubuntu:22.04_20240212

ARG NGC_CLI_VERSION=3.50.0

RUN apt-get update && DEBIAN_FRONTEND=noninteractive && \
    apt-get install --no-install-recommends -y \
      wget \
      unzip \
      python3-pip

RUN useradd -m -s /bin/bash -u 1000 ubuntu
USER ubuntu

RUN wget --content-disposition --no-check-certificate \
      https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/${NGC_CLI_VERSION}/files/ngccli_linux.zip -O /tmp/ngccli_linux.zip && \
    unzip /tmp/ngccli_linux.zip -d ~ && \
    rm /tmp/ngccli_linux.zip

ENV PATH=/home/ubuntu/ngc-cli:$PATH

RUN pip install -U "huggingface_hub[cli]"
Push the container to a registry that the nodes in your cluster can access.
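For example, assuming the Dockerfile from the previous step is in the current directory, you can build and push the image with commands like the following. The registry, image name, and tag are placeholders:
$ docker build -t <private-registry>/<image-name>:<image-tag> .
$ docker push <private-registry>/<image-name>:<image-tag>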
Apply a manifest, like the following example, that runs the container. When you create your manifest, keep the following key considerations for the pod specification in mind:
Mount the same PVC that the NIM microservice accesses for the model.
Specify the same user ID and group ID that the NIM cache container used. The following manifest shows the default values.
Specify the NGC_CLI_API_KEY and NGC_CLI_ORG environment variables. The value for the organization might be different.
Start the pod in the nim-service namespace so that the pod can access the ngc-secret secret.
apiVersion: v1
kind: Pod
metadata:
  name: ngc-cli
spec:
  containers:
  - env:
    - name: NGC_CLI_API_KEY
      valueFrom:
        secretKeyRef:
          key: NGC_API_KEY
          name: ngc-api-secret
    - name: NGC_CLI_ORG
      value: "nemo-microservices/ea-participants"
    - name: NIM_PEFT_SOURCE
      value: "/model-store/loras"
    image: <private-registry>/<image-name>:<image-tag>
    command: ["sleep"]
    args: ["inf"]
    name: ngc-cli
    securityContext:
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
    volumeMounts:
    - mountPath: /model-store
      name: model-store
  restartPolicy: Never
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsUser: 1000
    seLinuxOptions:
      level: s0:c28,c2
  volumes:
  - name: model-store
    persistentVolumeClaim:
      claimName: meta-llama3-8b-instruct-pvc
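For example, assuming you saved the manifest as ngc-cli-pod.yaml:
$ kubectl apply -n nim-service -f ngc-cli-pod.yaml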
Access the pod:
$ kubectl exec -it -n nim-service ngc-cli -- bash
The pod might report groups: cannot find name for group ID 2000. You can ignore the message.
From the terminal in the pod, download the LoRA adapters.
Make a directory for the LoRA adapters:
$ mkdir $NIM_PEFT_SOURCE
$ cd $NIM_PEFT_SOURCE
Download the adapters:
$ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-math-v1"
$ ngc registry model download-version "nim/meta/llama3-8b-instruct-lora:nemo-squad-v1"
Rename the directories to match the naming convention for the LoRA model directory structure:
$ mv llama3-8b-instruct-lora_vnemo-math-v1 llama3-8b-math
$ mv llama3-8b-instruct-lora_vnemo-squad-v1 llama3-8b-squad
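After the rename, the directory layout under NIM_PEFT_SOURCE should look similar to the following:
/model-store/loras
├── llama3-8b-math
└── llama3-8b-squad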
Press Ctrl+D to exit the pod and then run kubectl delete pod -n nim-service ngc-cli.
When you create a NIM service instance, specify the NIM_PEFT_SOURCE
environment variable:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
name: meta-llama3-8b-instruct
spec:
...
env:
- name: NIM_PEFT_SOURCE
value: "/model-store/loras"
After the NIM microservice is running, monitor the logs for records like the following:
{"level": "INFO", ..., "message": "LoRA models synchronizer successfully initialized!"}
{"level": "INFO", ..., "message": "Synchronizing LoRA models with local LoRA directory ..."}
{"level": "INFO", ..., "message": "Done synchronizing LoRA models with local LoRA directory"}
The preceding steps are sample commands for downloading the NeMo format LoRA adapters. Refer to Parameter-Efficient Fine-Tuning in the NVIDIA NIM for LLMs documentation for information about using the Hugging Face Transformers format, the model directory structure, and adapters for other models.
Caching Multi-LLM Compatible NIM#
To get the container for the Multi-LLM NIM, refer to LLM NIM Overview in the NVIDIA NGC Catalog.
Multi-LLM NIM Cache Sources#
You can pull Multi-LLM compatible models and datasets from sources such as the Hugging Face Hub.
When you create a NIM cache resource with the Hugging Face Hub as the source, the NIM Operator generates a Hugging Face CLI command to pull the requested model or dataset from Hugging Face, using the inputs from the spec.source.hf object.
To cache models or datasets from the Hugging Face Hub, you must have a Hugging Face Hub account and a user access token, and the models or datasets that you want to cache must be accessible from that account.
Create a Kubernetes secret with your Hugging Face Hub user access token to use as the spec.source.hf.authSecret.
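For example, assuming your user access token is exported in the HF_TOKEN environment variable, you can create the secret with a command like the following; the secret name must match the value of spec.source.hf.authSecret in your manifest:
$ kubectl create secret generic hf-api-secret -n nim-service --from-literal=HF_TOKEN="${HF_TOKEN}"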
Note
When downloading models from Hugging Face for Multi-LLM NIM, log files can show permission denied warnings that can be safely ignored.
The following NIM Cache custom resource shows a sample manifest of using the Hugging Face Hub as a cache source:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMCache
metadata:
name: nim-cache-multi-llm
namespace: nim-service
spec:
source:
hf:
endpoint: "https://huggingface.co"
namespace: "meta-llama"
authSecret: hf-api-secret
modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
pullSecret: ngc-secret
modelName: "Llama-3.2-1B-Instruct"
storage:
pvc:
create: true
storageClass: ''
size: "50Gi"
volumeAccessMode: ReadWriteOnce
If you want to cache a dataset, use spec.source.hf.datasetName: "dataset-name" instead of spec.source.hf.modelName. A cache can only be configured to pull a model or a dataset, not both.
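For example, the following partial specification, with a placeholder namespace and dataset name, caches a dataset instead of a model:
spec:
  source:
    hf:
      endpoint: "https://huggingface.co"
      namespace: "<hf-namespace>"
      authSecret: hf-api-secret
      modelPuller: nvcr.io/nim/nvidia/llm-nim:1.12
      pullSecret: ngc-secret
      datasetName: "<dataset-name>"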
Refer to the following fields for configuring the Hugging Face Hub as a NIM Cache source. None of the fields has a default value.

spec.source.hf.authSecret
Specifies the name of the secret containing your HF_TOKEN token. Required if you are using the Hugging Face Hub as the source of your model or dataset.

spec.source.hf.datasetName
Specifies the name of the dataset that you want to pull from the Hugging Face Hub. Required if you are using the Hugging Face Hub as the source of your dataset.

spec.source.hf.endpoint
Specifies the Hugging Face Hub endpoint. Required if you are using the Hugging Face Hub as the source of your model or dataset.

spec.source.hf.modelName
Specifies the name of the model that you want to pull from the Hugging Face Hub. Required if you are using the Hugging Face Hub as the source of your model.

spec.source.hf.modelPuller
Specifies the containerized huggingface-cli image that pulls the data. Required if you are using the Hugging Face Hub as the source of your model or dataset.

spec.source.hf.namespace
Specifies the namespace in the Hugging Face Hub. Required if you are using the Hugging Face Hub as the source of your model or dataset.

spec.source.hf.pullSecret
Specifies the name of the image pull secret for the modelPuller image. Required if you are using the Hugging Face Hub as the source of your model or dataset.
You can then create a NIM Service using the NIM Cache. The following shows a sample manifest of a NIM Service custom resource using the above NIM Cache:
apiVersion: apps.nvidia.com/v1alpha1
kind: NIMService
metadata:
name: meta-llama-3-2-1b-instruct
namespace: nim-service
spec:
image:
repository: nvcr.io/nim/nvidia/llm-nim
tag: "1.12"
pullPolicy: IfNotPresent
pullSecrets:
- ngc-secret
authSecret: ngc-api-secret
storage:
nimCache:
name: nim-cache-multi-llm
profile: 'tensorrt_llm'
resources:
limits:
nvidia.com/gpu: 1
cpu: "12"
memory: 32Gi
requests:
nvidia.com/gpu: 1
cpu: "4"
memory: 6Gi
replicas: 1
expose:
service:
type: ClusterIP
port: 8000
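After you apply the NIM Service manifest, you can confirm that the service reaches the Ready state. The file name below is a placeholder for wherever you saved the manifest:
$ kubectl apply -n nim-service -f nimservice-multi-llm.yaml
$ kubectl get nimservice -n nim-service meta-llama-3-2-1b-instruct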