Deploy with Helm for NVIDIA NIM for LLMs#

Date: 2026-03-03

NIMs are intended to run on a system with NVIDIA GPUs, with the type and number of GPUs depending on the model. To use Helm, you must have a Kubernetes cluster with appropriate GPU nodes and the GPU Operator installed.

Prerequisites#

If you have not set up your NGC API key and do not know exactly which NIM you want to download and deploy, refer to Get Started with NIM.

After you have set your NGC API key, go to the NGC Catalog and select the nim-llm Helm chart to pick a version. In most cases, you should select the latest version.

Use the following command to download the Helm chart:

    helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version_number>.tgz --username='$oauthtoken' --password=$NGC_API_KEY

This downloads the chart as a file to your local machine.
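The download step can be wrapped in a small helper that fails early when the API key is missing. This is a minimal sketch, not part of the chart: the function names nim_chart_url and fetch_nim_chart are hypothetical, and the version argument is a placeholder for whatever version you pick in the NGC Catalog.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

nim_chart_url() {
    # $1: chart version picked from the NGC Catalog (placeholder value)
    printf 'https://helm.ngc.nvidia.com/nim/charts/nim-llm-%s.tgz\n' "$1"
}

fetch_nim_chart() {
    # Fail early with a clear message if the API key is missing.
    if [ -z "${NGC_API_KEY:-}" ]; then
        echo "NGC_API_KEY is not set" >&2
        return 1
    fi
    # '$oauthtoken' is a literal user name, so it stays in single quotes.
    helm fetch "$(nim_chart_url "$1")" \
        --username='$oauthtoken' \
        --password="$NGC_API_KEY"
}

# Usage (requires helm and a valid key):
#   fetch_nim_chart <version_number>
```

Quoting "$NGC_API_KEY" keeps the key intact if it contains characters that the shell would otherwise split or expand.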

Configure Helm#

The following Helm options are the most important to configure when deploying a NIM on Kubernetes:

image.repository – The container/NIM to deploy

image.tag – The version of that container/NIM

Storage options, based on the environment and cluster in use

model.ngcAPISecret and imagePullSecrets to communicate with NGC

resources – Use this option when a model requires more than the default of one GPU. Refer to Supported Models for NVIDIA NIM for LLMs for details about the GPUs to request to meet the GPU memory requirements of the model on the available hardware.

env – An array of environment variables presented to the container, if advanced configuration is needed

Note: Do not override the following environment variables using the env value. Instead, use the indicated Helm values:

Environment Variable    Helm Value
NIM_MODEL_NAME          model.name
NIM_CACHE_PATH          model.nimCache
NGC_API_KEY             model.ngcAPISecret
NIM_SERVER_PORT         model.openaiPort
NIM_JSONL_LOGGING       model.jsonLogging
NIM_LOG_LEVEL           model.logLevel
HF_TOKEN                model.hfTokenSecret

In these cases, set the Helm values directly instead of relying on the environment variable values. You can add other environment variables to the env section of a values file.

Note: NIM_MODEL_NAME and HF_TOKEN are required for the multi-LLM compatible NIM container to download models referenced by Hugging Face URIs.
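A quick check before installing can catch a values file that sets one of these reserved variables through env. The sketch below is a plain-text heuristic, not a YAML parser: it flags any list entry of the form "- name: <VARIABLE>" anywhere in the file. The function name check_reserved_env is hypothetical.

```shell
#!/bin/sh
# Variables that should be set through Helm values rather than env.
RESERVED_ENV="NIM_MODEL_NAME NIM_CACHE_PATH NGC_API_KEY NIM_SERVER_PORT NIM_JSONL_LOGGING NIM_LOG_LEVEL HF_TOKEN"

check_reserved_env() {
    # $1: path to a values file. Returns 1 if a reserved name is found.
    status=0
    for name in $RESERVED_ENV; do
        # Match list entries such as "- name: NIM_LOG_LEVEL" (optionally quoted).
        if grep -Eq "^[[:space:]]*-[[:space:]]*name:[[:space:]]*[\"']?${name}[\"']?[[:space:]]*(#.*)?\$" "$1"; then
            echo "env overrides ${name}; set the matching Helm value instead" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_reserved_env path/to/your/custom-values.yaml
```

Because the match is textual, it can also flag a "- name:" entry outside the env section if it happens to use one of these names.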

To adapt the chart’s deployment behavior to your cluster’s needs, refer to the Helm chart README, which lists and describes the configuration options. The README is available from the Helm command line, but the output is raw Markdown. Write it to a file and open it with a Markdown renderer, or use a command-line tool such as glow to render it in the terminal.

The following command displays the chart README and renders it in the terminal using glow:

    helm show readme nim-llm-<version_number>.tgz | glow -p -

To examine all default values, run the following command:

    helm show values nim-llm-<version_number>.tgz

Minimal Example#

This example requires that certain Kubernetes secrets already exist in the deployment namespace. The rest of this document assumes the default namespace.

To download the NIM container image, you must set an image pull secret, which is ngc-secret in the following example. To download model engines or weights from NGC, the chart requires a generic secret that stores an NGC API key under a key named NGC_API_KEY. The following commands create these two secrets:

    kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
    kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY

Create the file custom-values.yaml with the following entries. These values work in most clusters after the secrets above are created.

Note: To deploy an arbitrary supported model, you can use the NIM LLM container by setting image.repository to nvcr.io/nim/nvidia/llm-nim and explicitly setting the desired model name.

    image:
      repository: "nvcr.io/nim/meta/llama3-8b-instruct" # container location
      tag: 1.0.3 # NIM version you want to deploy
    model:
      ngcAPISecret: ngc-api # name of a cluster secret with a key named NGC_API_KEY whose value is an NGC API key
    persistence:
      enabled: true
    imagePullSecrets:
      - name: ngc-secret # name of a secret used to pull nvcr.io images, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

You can adapt the previous configuration to deploy any model, such as llama3-70b-instruct, by adjusting it to the model’s requirements and size. For example:

    image:
      repository: "nvcr.io/nim/meta/llama3-70b-instruct" # container location -- changed for the different model
      tag: 1.0.3
    model:
      ngcAPISecret: ngc-api
    persistence:
      enabled: true
      size: 220Gi # the model files will be quite large
    resources:
      limits:
        nvidia.com/gpu: 4 # much more GPU memory is required
    imagePullSecrets:
      - name: ngc-secret

Refer to Supported Models for NVIDIA NIM for LLMs to determine whether your hardware is sufficient to run this NIM.

Note: For configuration details for multi-node models, refer to Multi-Node Deployment for NVIDIA NIM for LLMs.
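Because the chart depends on both secrets, it can help to confirm that they exist before installing. The following is a minimal sketch; the function name require_secrets is hypothetical, and the secret names and namespace in the usage line are the ones assumed by the example above.

```shell
#!/bin/sh
require_secrets() {
    # $1: namespace; remaining arguments: secret names that must exist
    ns=$1
    shift
    missing=0
    for secret in "$@"; do
        # kubectl exits non-zero when the secret is not found.
        if ! kubectl get secret "$secret" --namespace "$ns" >/dev/null 2>&1; then
            echo "missing secret: $secret (namespace $ns)" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   require_secrets default ngc-secret ngc-api
```

The check only confirms that the secrets exist. It does not validate that the stored API key is correct.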

Storage#

Running out of storage space is always a concern when setting up NIMs, and downloading models can delay scaling in a cluster. Models can be large, and a cluster operator can quickly fill disk space when downloading them. Mount persistent storage for the model cache on your pod. The default is an emptyDir volume; to store objects elsewhere, choose one of the following mutually exclusive options:

Persistent Volume Claims (enabled with persistence.enabled): Used when persistence.accessMode is set to ReadWriteMany, so that several pods can share one PVC. If statefulSet.enabled is set to false (the default is true), the chart creates a PVC with a Deployment. If the access mode is not ReadWriteMany (which an NFS provisioner, for example, can supply), scaling beyond one pod will likely fail.

Persistent Volume Claim templates (enabled with persistence.enabled and leaving statefulSet.enabled at its default): Useful for fast scale-up. Scale the StatefulSet up to the maximum number of replicas you want so that the model is downloaded to each PVC that is created, and then scale down again. The PVCs remain in place, so later scale-ups start quickly.

Direct NFS (enabled with nfs.enabled): Kubernetes does not allow setting mount options on direct NFS volumes, so some special cluster setup may be required.

hostPath (enabled with hostPath.enabled): Be aware of the security implications of using hostPath, and understand that it also ties pods to one node.
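The scale-up and scale-down strategy for Persistent Volume Claim templates can be sketched as follows. This is an illustration under assumptions: the function name prewarm_pvcs is hypothetical, the two-hour timeout is arbitrary, and the StatefulSet name must be looked up in your cluster (for example with kubectl get statefulset).

```shell
#!/bin/sh
prewarm_pvcs() {
    # $1: StatefulSet name, $2: maximum replicas, $3: replicas to keep afterwards
    sts=$1
    max=$2
    keep=$3
    # Scale up so that one PVC is created, and one model copy downloaded, per replica.
    kubectl scale "statefulset/$sts" --replicas="$max" || return 1
    # Block until every replica reports Ready (timeout value is an assumption).
    kubectl rollout status "statefulset/$sts" --timeout=2h || return 1
    # Scale back down; the PVCs created from the claim templates stay in place.
    kubectl scale "statefulset/$sts" --replicas="$keep"
}

# Usage (StatefulSet name is a placeholder):
#   prewarm_pvcs <statefulset_name> 4 1
```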

Note: If you need to deploy behind a proxy, refer to Deploy Behind a TLS Proxy for NVIDIA NIM for LLMs for more information.

Enable OpenTelemetry Tracing and Metrics#

To enable trace and metrics collection when OpenTelemetry collectors are installed in Kubernetes, NVIDIA recommends that you set the following environment variables in your custom values.yaml file:

    env:
      - name: NIM_ENABLE_OTEL
        value: "1"
      - name: NIM_OTEL_SERVICE_NAME
        value: <name of the service>
      - name: OTEL_TRACES_EXPORTER
        value: otlp
      - name: OTEL_METRICS_EXPORTER
        value: otlp
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: "http://$(HOST_IP):4318"

This configuration requires that the collector is installed as a DaemonSet and uses host ports. If you use a different configuration when installing the collector, set the OTEL_EXPORTER_OTLP_ENDPOINT variable to the correct ingestion URL. Refer to Environment Variables for detailed explanations of environment variables.
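Kubernetes expands a $(VAR) reference only from variables defined earlier in the same env list, so HOST_IP needs to appear before OTEL_EXPORTER_OTLP_ENDPOINT. The following sketch checks that ordering with a simple line-number comparison. The function name check_env_order is hypothetical, and the check is textual rather than a real YAML parse.

```shell
#!/bin/sh
check_env_order() {
    # $1: path to a values file
    # Line number of the first entry that defines each variable.
    host_line=$(grep -n 'name:[[:space:]]*HOST_IP' "$1" | head -n 1 | cut -d: -f1)
    endpoint_line=$(grep -n 'name:[[:space:]]*OTEL_EXPORTER_OTLP_ENDPOINT' "$1" | head -n 1 | cut -d: -f1)
    # Succeed only if both exist and HOST_IP comes first.
    [ -n "$host_line" ] && [ -n "$endpoint_line" ] && [ "$host_line" -lt "$endpoint_line" ]
}

# Usage:
#   check_env_order path/to/your/custom-values.yaml || echo "HOST_IP must be defined before the endpoint" >&2
```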

Launch NIM in Kubernetes#

You are now ready to launch the chart:

    helm install my-nim nim-llm-<version_number>.tgz -f path/to/your/custom-values.yaml

Wait for the pod to reach “Ready” status.
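Waiting for readiness can be scripted instead of polling by hand. The sketch below assumes that the chart labels its pods with the common Helm label app.kubernetes.io/instance=<release name>. That label is an assumption, not something this page confirms, so verify it with kubectl get pods --show-labels first. The function name wait_for_release is hypothetical.

```shell
#!/bin/sh
wait_for_release() {
    # $1: Helm release name, $2: timeout such as 30m
    # Assumes pods carry the label app.kubernetes.io/instance=<release>.
    kubectl wait pod \
        --selector "app.kubernetes.io/instance=$1" \
        --for=condition=Ready \
        --timeout="$2"
}

# Usage:
#   wait_for_release my-nim 30m
```

Large models can take a long time to download, so choose a generous timeout.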

Run Inference#

In the previous example, the OpenAI-compatible API endpoint is exposed on port 8000 through a Kubernetes service of the default type with no ingress, because the NIM itself does not handle authentication. The following commands assume that the Llama 3 8B Instruct model was deployed. Adjust the "model" value in the request JSON body to use a different model.

Use the following command to port-forward the service to your local machine to test inference:

    kubectl port-forward service/my-nim-nim-llm 8000:http-openai

Then try a request:

    curl -X 'POST' \
      'http://localhost:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "messages": [
          {
            "content": "You are a polite and respectful chatbot helping people plan a vacation.",
            "role": "system"
          },
          {
            "content": "What should I do for a 4 day vacation in Spain?",
            "role": "user"
          }
        ],
        "model": "meta/llama3-8b-instruct",
        "max_tokens": 16,
        "top_p": 1,
        "n": 1,
        "stream": false,
        "stop": "\n",
        "frequency_penalty": 0.0
      }'

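If the request is sent right after the port-forward starts, the service may not be accepting traffic yet. The following sketch polls a readiness endpoint before sending requests. The path /v1/health/ready is an assumption that this page does not document, and the function name wait_until_ready is hypothetical. Substitute whatever health endpoint your NIM version exposes.

```shell
#!/bin/sh
wait_until_ready() {
    # $1: base URL, $2: maximum attempts, $3: seconds between attempts
    attempt=1
    # --fail makes curl exit non-zero on HTTP errors such as 503.
    until curl --silent --fail --output /dev/null "$1/v1/health/ready"; do
        if [ "$attempt" -ge "$2" ]; then
            echo "service not ready after $2 attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$3"
    done
}

# Usage (with the port-forward from above running):
#   wait_until_ready http://localhost:8000 60 5
```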
Troubleshooting FAQ#

Q: What should I do if my pod is stuck in a “Pending” state?

A: Run kubectl describe pod <pod name> and check the Events section to see what the scheduler is waiting for. Common causes are node taints that need to be tolerated, insufficient GPUs, and storage mount issues.

Q: I tried to scale or upgrade a deployment using statefulSet.enabled: false and persistence.enabled: true. Why are pods never starting?

A: To scale or upgrade without StatefulSet PVC templates (which are not very efficient in either time or storage), you must use storage that can be mounted on separate nodes: a ReadWriteMany storage class, manually cloned ReadOnlyMany volumes, or something like direct NFS storage. Without persistence, every starting pod must download its model to an emptyDir volume. A ReadWriteMany storage class, such as an NFS PVC provisioner or a CephFS provisioner, is ideal.

Q: One of the last log messages was “Preparing model workspace. This step might download additional files to run the model.” Why did it fail during that step?

A: It is likely that the model weights had not finished downloading when Kubernetes hit the failure threshold for startup probes. Try increasing startupProbe.failureThreshold. This is especially likely with large models or very slow network connections.
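The first and last answers can be turned into two small helpers. This is a sketch with hypothetical function names. The threshold value in the usage line is an arbitrary example, not a recommended setting.

```shell
#!/bin/sh
show_pending_reasons() {
    # $1: pod name. Lists the events recorded for that pod, oldest first.
    kubectl get events \
        --field-selector "involvedObject.name=$1" \
        --sort-by=.lastTimestamp
}

raise_startup_threshold() {
    # $1: release, $2: chart archive, $3: values file, $4: new failure threshold
    # --set overrides the value from the file for this upgrade only.
    helm upgrade "$1" "$2" -f "$3" --set "startupProbe.failureThreshold=$4"
}

# Usage:
#   show_pending_reasons <pod name>
#   raise_startup_threshold my-nim nim-llm-<version_number>.tgz path/to/your/custom-values.yaml 360
```

To make the higher threshold permanent, add it to the values file instead of passing --set on every upgrade.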

Additional Information#

The Helm chart’s internal README includes the following parameters. NVIDIA recommends that you use the README included in the downloaded chart, because it has the most accurate and up-to-date description of these parameters for that chart version.