Deploy with Helm for NVIDIA NIM for LLMs

NIMs are intended to be run on a system with NVIDIA GPUs, with the type and number of GPUs depending on the model. To use helm, you must have a Kubernetes cluster with appropriate GPU nodes and the GPU Operator installed.

Prerequisites

If you haven’t set up your NGC API key or don’t know exactly which NIM you want to download and deploy, see the information in Get Started with NIM.

After you have set your NGC API key, go to the NGC Catalog and select the nim-llm helm chart to pick a version. In most cases, you should select the latest version.

Use the following command to download the helm chart:

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version_number>.tgz --username='$oauthtoken' --password=$NGC_API_KEY

This downloads the chart as a file to your local machine.

Configuring Helm

The following helm options are the most important to configure when deploying a NIM on Kubernetes:

image.repository – The container/NIM to deploy
image.tag – The version of that container/NIM
Storage options, based on the environment and cluster in use
model.ngcAPISecret and imagePullSecrets to communicate with NGC
resources – Use this option when a model requires more than the default of one GPU. See Supported Models for NVIDIA NIM for LLMs for details about the GPUs to request to meet the GPU memory requirements of the model on the available hardware.

env – An array of environment variables presented to the container, if advanced configuration is needed

Note: Do not override the following environment variables using the env value. Instead, use the indicated helm options:

Environment Variable	Helm Value
`NIM_MODEL_NAME`	`my/modelname`
`NIM_CACHE_PATH`	`model.nimCache`
`NGC_API_KEY`	`model.ngcAPISecret`
`NIM_SERVER_PORT`	`model.openaiPort`
`NIM_JSONL_LOGGING`	`model.jsonLogging`
`NIM_LOG_LEVEL`	`model.logLevel`
`HF_TOKEN`	`model.huggingfaceToken`

In these cases, set the helm values directly instead of relying on the environment variable values. You can add other environment variables to the env section of a values file.
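As a sketch of this split, the following values fragment sets the log level and JSON logging through their dedicated helm values and adds one extra variable through env. The variable name EXAMPLE_FEATURE_FLAG is hypothetical and only stands in for whatever additional variable your deployment needs.

```yaml
model:
  logLevel: DEBUG      # use this instead of setting NIM_LOG_LEVEL in env
  jsonLogging: false   # use this instead of setting NIM_JSONL_LOGGING in env
env:
  - name: EXAMPLE_FEATURE_FLAG  # hypothetical variable name, for illustration only
    value: "1"
```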

Note

NIM_MODEL_NAME and HF_TOKEN are required for LLM-agnostic containers to download HuggingFace model URIs.

To adapt the chart’s deployment behavior to your cluster’s needs, refer to the helm chart’s README, which lists and describes the configuration options. This README is available on the helm command line, but the output is bare markdown. Output it to a file and open with a markdown renderer or use a command line tool such as glow to render in the terminal.

The following helm command displays the chart README and renders it in the terminal using glow:

helm show readme nim-llm-<version_number>.tgz | glow -p -

To examine all default values, run the following command:

helm show values nim-llm-<version_number>.tgz

Minimal Example

This example requires that certain Kubernetes secrets already exist in the deployment namespace. The rest of this document assumes the default namespace.

To download the NIM container image, you must set an image pull secret, which is ngc-secret in the following example. To download model engines or weights from NGC, the chart requires a generic secret that stores an NGC API key under a key named NGC_API_KEY. The following commands create these two secrets:

kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY

kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY

Create the file custom-values.yaml with the following entries. These values work in most clusters once the secrets above have been created.

Note

When deploying NIMs with an arbitrary supported model, you can use the NIM LLM container by setting image.repository to nvcr.io/nim/nvidia/llm-nim and explicitly setting the desired model name.

image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct" # container location
  tag: 1.0.3 # NIM version you want to deploy
model:
  ngcAPISecret: ngc-api  # name of a secret in the cluster that includes a key named NGC_API_KEY and is an NGC API key
persistence:
  enabled: true
imagePullSecrets:
  - name: ngc-secret # name of a secret used to pull nvcr.io images, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

You can adapt the previous configuration to deploy any model, such as llama3-70b-instruct, by adjusting to the model’s requirements and size. For example:

image:
  repository: "nvcr.io/nim/meta/llama3-70b-instruct" # container location -- changed for the different model
  tag: 1.0.3
model:
  ngcAPISecret: ngc-api
persistence:
  enabled: true
  size: 220Gi # the model files will be quite large
resources:
  limits:
    nvidia.com/gpu: 4  # much more GPU memory is required
imagePullSecrets:
  - name: ngc-secret

Refer to the Supported Models for NVIDIA NIM for LLMs section to determine whether your hardware is sufficient to run this NIM.

Note

For configuration details concerning multi-node models, see the Multi-Node Deployment for NVIDIA NIM for LLMs section.

Storage

Running out of storage space is always a concern when setting up NIMs, and downloading models can delay scaling in a cluster. Models can be quite large, and a cluster operator can quickly fill disk space when downloading them. Be sure to mount some type of persistent storage for the model cache on your pod. You have the following mutually exclusive options for storing objects somewhere other than the default emptyDir:

Persistent Volume Claims (enabled with persistence.enabled)
- Used when persistence.accessMode is set to “ReadWriteMany” where several pods can share one PVC.
- If statefulSet.enabled is set to false (default is true), this creates a PVC with a Deployment. Unless the access mode is ReadWriteMany (for example, through an NFS provisioner), scaling beyond one pod will likely fail.
Persistent Volume Claim templates (enabled with persistence.enabled and leaving statefulSet.enabled as default)
- Useful for fast scaling: scale the StatefulSet up to the maximum number of replicas you expect so that the model is downloaded to each PVC that is created, then scale down again. The PVCs stay in place, so later scale-ups start quickly.
Direct NFS (enabled with nfs.enabled)
- Kubernetes does not allow setting of mount options on direct NFS, so some special cluster setup may be required.
hostPath (enabled with hostPath.enabled)
- Know the security implications of using hostPath and understand that this will also tie pods to one node.
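For illustration, a values fragment for the first option, one shared PVC mounted by every pod, might look like the following. The storage class name nfs-client is an assumption; substitute a ReadWriteMany-capable class that exists in your cluster.

```yaml
statefulSet:
  enabled: false             # use a Deployment with a single shared PVC
persistence:
  enabled: true
  accessMode: ReadWriteMany  # lets pods on different nodes mount the same volume
  storageClass: nfs-client   # hypothetical class name; must support ReadWriteMany
  size: 50Gi                 # chart default; increase for larger models
```

To use PVC templates instead, leave statefulSet.enabled at its default of true and set only persistence.enabled.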

Note

If you need to deploy behind a proxy, see Deploy Behind a TLS Proxy for NVIDIA NIM for LLMs for more information.

Enabling OpenTelemetry Tracing and Metrics

env:
  - name: NIM_ENABLE_OTEL
    value: "1"
  - name: NIM_OTEL_SERVICE_NAME
    value: <name of the service>
  - name: NIM_OTEL_TRACES_EXPORTER
    value: otlp
  - name: NIM_OTEL_METRICS_EXPORTER
    value: otlp 
  - name: HOST_IP
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: NIM_OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://$(HOST_IP):4318" 

NVIDIA recommends that during the installation of OpenTelemetry collectors in Kubernetes, you set these environment variables in a custom values.yaml file to enable trace and metrics collection through OpenTelemetry. This version requires that you configure the collector to run using the host ports and install it as a DaemonSet. If you use a different configuration when installing the collector, set the NIM_OTEL_EXPORTER_OTLP_ENDPOINT variable to the correct ingestion URL.

Refer to Environment Variables for detailed explanations of environment variables.

Launching NIM in Kubernetes

You are now ready to launch the chart.

helm install my-nim nim-llm-<version_number>.tgz -f path/to/your/custom-values.yaml

Wait for the pod to reach “Ready” status.

Running Inference

In the previous example, the OpenAI-compatible API endpoint is exposed on port 8000 through a Kubernetes service of the default type. No ingress is configured, because the NIM itself does not handle authentication. The following commands assume the Llama 3 8B Instruct model was deployed. Adjust the “model” value in the request JSON body to use a different model.

Use the following command to port-forward the service to your local machine to test inference.

kubectl port-forward service/my-nim-nim-llm 8000:http-openai

Then try a request:

curl -X 'POST' \
  'http://localhost:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a polite and respectful chatbot helping people plan a vacation.",
      "role": "system"
    },
    {
      "content": "What should I do for a 4 day vacation in Spain?",
      "role": "user"
    }
  ],
  "model": "meta/llama3-8b-instruct",
  "max_tokens": 16,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": "\n",
  "frequency_penalty": 0.0
}'

Troubleshooting FAQ

Q: What should I do if my pod is stuck in a “Pending” state?

A: Try running kubectl describe pod <pod name>, and check the Events section to see what the scheduler is waiting for. Common reasons include node taints that need to be tolerated, insufficient GPUs, and storage mount issues.
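If the events point to an untolerated taint or to scheduling on non-GPU nodes, the chart's tolerations and nodeSelector values can address it. The following is a minimal sketch. The taint key nvidia.com/gpu with effect NoSchedule is an assumption about how your GPU nodes are tainted, so check the output of kubectl describe node for the taints actually in use.

```yaml
nodeSelector:
  nvidia.com/gpu.present: "true"  # schedule only on nodes labeled as having GPUs
tolerations:
  - key: nvidia.com/gpu           # assumed taint key; match the taint on your nodes
    operator: Exists
    effect: NoSchedule
```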

Q: I tried to scale or upgrade a deployment using statefulSet.enabled: false and persistence.enabled: true. Why are pods never starting?

A: To scale or upgrade without using StatefulSet PVC templates (which are not very efficient in either time or storage), you must use a ReadWriteMany storage class so that the volume can be mounted on separate nodes, manually cloned ReadOnlyMany volumes, or something like direct NFS storage. Without persistence, every starting pod must download its model to an emptyDir volume. A ReadWriteMany storage class, such as an NFS PVC provisioner or a CephFS provisioner, is ideal.

Q: One of the last log messages was “Preparing model workspace. This step might download additional files to run the model.” Why did it fail during that step?

A: It is likely that the model weights had not finished downloading when Kubernetes hit the failure threshold for startup probes. Try increasing startupProbe.failureThreshold. This is especially likely with large models or very slow network connections.
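As a rough guide, Kubernetes allows a container about initialDelaySeconds plus failureThreshold × periodSeconds to pass its startup probe. With the chart defaults listed under Probe parameters, that is 40 + 180 × 10 = 1840 seconds, or about 30 minutes. The following sketch doubles the threshold to allow about an hour; the value 360 is only an example, so size it to your model and network.

```yaml
startupProbe:
  enabled: true
  periodSeconds: 10      # chart default
  failureThreshold: 360  # 360 checks x 10 s = 3600 s, about one hour, before the pod is restarted
```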

Additional Information

The helm chart’s internal README includes the following parameters. NVIDIA recommends that you use the README within the downloaded chart, because it has the most accurate and up-to-date list of these parameters for that chart version.

Parameters

Deployment parameters

Name	Description	Value
`affinity`	[default: {}] Affinity settings for deployment.	`{}`
`containerSecurityContext`	Sets privilege and access control settings for container (Only affects the main container, not pod-level).	`{}`
`customCommand`	Overrides command line options sent to the NIM with the array listed here.	`[]`
`customArgs`	Overrides command line arguments of the NIM container with the array listed here.	`[]`
`env`	Adds arbitrary environment variables to the main container.	`[]`
`extraVolumes`	Adds arbitrary additional volumes to the deployment set definition.	`{}`
`extraVolumeMounts`	Adds volume mounts to the main container from `extraVolumes`.	`{}`
`image.repository`	Specifies the NIM-LLM Image to deploy.	`""`
`image.tag`	Specifies the image tag or version.	`""`
`image.pullPolicy`	Sets the image pull policy.	`""`
`imagePullSecrets`	Specifies a list of secret names that are needed for the main container and any init containers.
`initContainers`	Specifies model init containers, if needed.
`initContainers.extraInit`	Fully specify any additional init containers your use case requires.	`[]`
`nodeSelector`	Sets node selectors for the NIM – for example `nvidia.com/gpu.present: "true"`.	`{}`
`podAnnotations`	Sets additional annotations on the main deployment pods.	`{}`
`podSecurityContext`	Specifies security context settings for pod.
`podSecurityContext.runAsUser`	Specify user UID for pod.	`1000`
`podSecurityContext.runAsGroup`	Specify group ID for pod.	`1000`
`podSecurityContext.fsGroup`	Specify file system owner group id.	`1000`
`proxyCA`	Specify a certificate for a custom proxy. When `proxyCA` is set native TLS is used for downloading from NGC.
`proxyCA.enabled`	Specify true if NIM is run behind a proxy.	`false`
`proxyCA.secretName`	Specify a name of the Kubernetes secret containing the certificate. The secret is created before the deployment. Must be used together with `proxyCA.keyName`	`""`
`proxyCA.keyName`	Specify a name of the key inside the secret which contains the certificate. Must be used together with `proxyCA.secretName`	`""`
`replicaCount`	Specify static replica count for deployment.	`1`
`resources`	Specify resources limits and requests for the running service.
`resources.limits.nvidia.com/gpu`	Specify number of GPUs to present to the running service.	`1`
`serviceAccount`	Options to specify service account for the deployment.
`serviceAccount.create`	Specifies whether a service account should be created.	`false`
`serviceAccount.annotations`	Sets annotations to be added to the service account.	`{}`
`serviceAccount.name`	Specifies the name of the service account to use. If it is not set and create is `true`, a name is generated using a `fullname` template.	`""`
`statefulSet.enabled`	Enables `statefulset` deployment. Enabling `statefulSet` allows PVC templates for scaling. If using central PVC with RWX `accessMode`, this isn’t needed.	`true`
`tolerations`	Specify tolerations for pod assignment. Allows the scheduler to schedule pods with matching taints.

Autoscaling parameters

Values used for creating a Horizontal Pod Autoscaler. If autoscaling is not enabled, the rest are ignored. NVIDIA recommends usage of the custom metrics API, commonly implemented with the Prometheus Adapter. Standard metrics of CPU and memory are of limited use in scaling NIM.

Name	Description	Value
`autoscaling.enabled`	Enables horizontal pod autoscaler.	`false`
`autoscaling.minReplicas`	Specify minimum replicas for autoscaling.	`1`
`autoscaling.maxReplicas`	Specify maximum replicas for autoscaling.	`10`
`autoscaling.metrics`	Array of metrics for autoscaling.	`[]`
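As a hedged sketch, the following enables the autoscaler with a custom per-pod metric. It assumes that entries under autoscaling.metrics are passed through as standard Kubernetes autoscaling/v2 metric specifications, and that a metric named gpu_cache_usage_perc is exposed through the custom metrics API by your Prometheus Adapter configuration. Neither assumption is confirmed on this page, so check the chart README and your adapter rules before relying on it.

```yaml
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods                      # per-pod custom metric (autoscaling/v2 format, assumed)
      pods:
        metric:
          name: gpu_cache_usage_perc  # assumed metric name; must exist in the custom metrics API
        target:
          type: AverageValue
          averageValue: 500m          # scale out when the per-pod average exceeds 0.5
```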

Ingress parameters

Name	Description	Value
`ingress.enabled`	Enables ingress.	`false`
`ingress.className`	Specify class name for Ingress.	`""`
`ingress.annotations`	Specify additional annotations for ingress.	`{}`
`ingress.hosts`	Specify list of hosts each containing lists of paths.
`ingress.hosts[0].host`	Specify name of host.	`chart-example.local`
`ingress.hosts[0].paths[0].path`	Specify ingress path.	`/`
`ingress.hosts[0].paths[0].pathType`	Specify path type.	`ImplementationSpecific`
`ingress.hosts[0].paths[0].serviceType`	Specify service type. It can be `nemo` or `openai` – make sure your model serves the appropriate port(s).	`openai`
`ingress.tls`	Specify list of pairs of TLS `secretName` and hosts.	`[]`
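The ingress values above can be combined as in the following sketch. The class name nginx, the hostname nim.example.com, and the secret name nim-example-tls are all hypothetical; replace them with values that exist in your cluster.

```yaml
ingress:
  enabled: true
  className: nginx                  # hypothetical ingress class; use one installed in your cluster
  hosts:
    - host: nim.example.com         # hypothetical hostname
      paths:
        - path: /
          pathType: ImplementationSpecific
          serviceType: openai       # route to the OpenAI-compatible port
  tls:
    - secretName: nim-example-tls   # hypothetical TLS secret that already exists in the namespace
      hosts:
        - nim.example.com
```

Because the NIM itself does not handle authentication, plan for the ingress layer to enforce it before exposing the endpoint.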

Probe parameters

Name	Description	Value
`livenessProbe.enabled`	Enables `livenessProbe`.	`true`
`livenessProbe.method`	`LivenessProbe` `http` or `script`, but no script is currently provided.	`http`
`livenessProbe.command`	`LivenessProbe` script command to use (unsupported at this time).	`["myscript.sh"]`
`livenessProbe.path`	`LivenessProbe` endpoint path.	`/v1/health/live`
`livenessProbe.initialDelaySeconds`	Initial delay seconds for `livenessProbe`.	`15`
`livenessProbe.timeoutSeconds`	Timeout seconds for `livenessProbe`.	`1`
`livenessProbe.periodSeconds`	Period seconds for `livenessProbe`.	`10`
`livenessProbe.successThreshold`	Success threshold for `livenessProbe`.	`1`
`livenessProbe.failureThreshold`	Failure threshold for `livenessProbe`.	`3`
`readinessProbe.enabled`	Enables `readinessProbe`.	`true`
`readinessProbe.path`	Readiness Endpoint Path.	`/v1/health/ready`
`readinessProbe.initialDelaySeconds`	Initial delay seconds for `readinessProbe`.	`15`
`readinessProbe.timeoutSeconds`	Timeout seconds for `readinessProbe`.	`1`
`readinessProbe.periodSeconds`	Period seconds for `readinessProbe`.	`10`
`readinessProbe.successThreshold`	Success threshold for `readinessProbe`.	`1`
`readinessProbe.failureThreshold`	Failure threshold for `readinessProbe`.	`3`
`startupProbe.enabled`	Enables `startupProbe`.	`true`
`startupProbe.path`	`StartupProbe` Endpoint Path.	`/v1/health/ready`
`startupProbe.initialDelaySeconds`	Initial delay seconds for `startupProbe`.	`40`
`startupProbe.timeoutSeconds`	Timeout seconds for `startupProbe`.	`1`
`startupProbe.periodSeconds`	Period seconds for `startupProbe`.	`10`
`startupProbe.successThreshold`	Success threshold for `startupProbe`.	`1`
`startupProbe.failureThreshold`	Failure threshold for `startupProbe`.	`180`

Metrics parameters

Name	Description	Value
`metrics`
`metrics.serviceMonitor`	Options for `serviceMonitor` to use the Prometheus Operator and the primary service object.
`metrics.serviceMonitor.enabled`	Enables `serviceMonitor` creation.	`false`
`metrics.serviceMonitor.additionalLabels`	Specify additional labels for `serviceMonitor`.	`{}`
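A minimal sketch of enabling the ServiceMonitor follows. It assumes the Prometheus Operator, which provides the ServiceMonitor resource type, is already installed. The release: prometheus label is hypothetical; use whatever labels your Prometheus instance selects ServiceMonitors by.

```yaml
metrics:
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: prometheus  # hypothetical label; match your Prometheus serviceMonitorSelector
```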

Models parameters

Name	Description	Value
`model.nimCache`	Path to mount writeable storage or pre-filled model cache for the NIM.	`""`
`model.name`	Specifies the name of the model in the API (usually, the name of the NIM). This is mostly used for helm tests and is usually otherwise optional. This must match the name from /v1/models to allow `helm test <release-name>` to work.	`meta/llama3-8b-instruct`
`model.ngcAPISecret`	Name of pre-existing secret with a key named `NGC_API_KEY` that contains an API key for NGC model downloads.	`""`
`model.ngcAPIKey`	NGC API key literal to use as the API secret and image pull secret when set.	`""`
`model.openaiPort`	Specifies the Open AI API Port.	`8000`
`model.labels`	Specifies extra labels to be added on deployed pods.	`{}`
`model.jsonLogging`	Turn JSON lines logging on or off. Defaults to true.	`true`
`model.logLevel`	Log level of NIM service. Possible values of the variable are TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL.	`INFO`

Storage parameters

Name	Description	Value
`persistence`	Specify settings to modify the behavior and use of persistent volumes for model weights.
`persistence.enabled`	Enables the use of persistent volumes.	`false`
`persistence.existingClaim`	Specifies an existing persistent volume claim. If using `existingClaim`, run only one replica or use a `ReadWriteMany` storage setup.	`""`
`persistence.storageClass`	Specifies the persistent volume storage class. If set to `"-"`, this disables dynamic provisioning. If left undefined or set to null, the cluster default storage provisioner is used.	`""`
`persistence.accessMode`	Specify `accessMode`. If using an NFS or similar setup, you can use `ReadWriteMany`.	`ReadWriteOnce`
`persistence.stsPersistentVolumeClaimRetentionPolicy.whenDeleted`	Specifies persistent volume claim retention policy when deleted. Only used with Stateful Set volume templates.	`Retain`
`persistence.stsPersistentVolumeClaimRetentionPolicy.whenScaled`	Specifies persistent volume claim retention policy when scaled. Only used with Stateful Set volume templates.	`Retain`
`persistence.size`	Specifies the size of the persistent volume claim (for example 40Gi).	`50Gi`
`persistence.annotations`	Adds annotations to the persistent volume claim.	`{}`
`hostPath`	Configures model cache on local disk on the nodes using `hostPath` – for special cases. You should understand the security implications before using this option.
`hostPath.enabled`	Enable `hostPath`.	`false`
`hostPath.path`	Specifies path on the node used as a `hostPath` volume.	`/model-store`
`nfs`	Configures the model cache to sit on shared direct-mounted NFS. NOTE: you cannot set mount options using direct NFS mount to pods without a node-installed nfsmount.conf. An NFS-based `PersistentVolumeClaim` is likely better in most cases.
`nfs.enabled`	Enables direct pod NFS mount.	`false`
`nfs.path`	Specify path on NFS server to mount.	`/exports`
`nfs.server`	Specify NFS server address.	`nfs-server.example.com`
`nfs.readOnly`	Set to true to mount as read-only.	`false`
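For the direct NFS option, a values fragment might look like the following sketch. The server and path shown are the chart's placeholder defaults from the table above, not working values. Because the storage options are mutually exclusive, persistence and hostPath are left disabled.

```yaml
nfs:
  enabled: true
  server: nfs-server.example.com  # placeholder default; replace with your NFS server
  path: /exports                  # placeholder default; the export used for the model cache
  readOnly: false                 # a read-only mount cannot receive newly downloaded models
```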

Service parameters

Name	Description	Value
`service.type`	Specifies the service type for the deployment.	`ClusterIP`
`service.name`	Overrides the default service name	`""`
`service.openaiPort`	Specifies Open AI Port for the service.	`8000`
`service.annotations`	Specify additional annotations to be added to service.	`{}`
`service.labels`	Specifies additional labels to be added to service.	`{}`

Multi-node parameters

Large models that must span multiple nodes do not work on plain Kubernetes with the GPU Operator alone at this time. Optimized TensorRT profiles, when selected automatically or by environment variable, require one of the following:

(Recommended) LeaderWorkerSets—We recommend that you use LeaderWorkerSets if your cluster version allows it.
(Not recommended) MPI Operator—Since MPIJob is a batch-type resource that is not designed with service stability and reliability in mind, this option is not recommended. Only optimized profiles are supported for multi-node deployment at this time.

Name	Description	Value
`multiNode.enabled`	Specify true for multi-node deployments.	`false`
`multiNode.clusterStartTimeout`	Set the number of seconds to wait for worker nodes to start before failing.	`300`
`multiNode.gpusPerNode`	Number of GPUs for each pod. In most cases, this should match `resources.limits.nvidia.com/gpu`.	`1`
`multiNode.workers`	Specifies how many worker pods per multi-node replica to launch.	`1`
`multiNode.workerCustomCommand`	Sets a custom command array for the worker nodes in a LeaderWorkerSet only.	`[]`
`multiNode.leaderWorkerSet.enabled`	True to use `LeaderWorkerSets` for multi-node deployments (recommended). False to use `MPIJob` from mpi-operator.	`true`
`multiNode.existingSSHSecret`	Sets the SSH private key for MPI to an existing secret. Otherwise, the Helm chart generates a key randomly during installation.	`""`
`multiNode.mpiJob.workerAnnotations`	Annotations only applied to workers for `MPIJob`, if used. This may be necessary to ensure the workers connect to `CNI`s offered by `multus` and the network operator, if used.	`{}`
`multiNode.mpiJob.launcherResources`	Resources section to apply only to the launcher pods in `MPIJob`, if used. Launchers do not get the chart resources restrictions. Only workers do, since they require GPUs.	`{}`
`multiNode.optimized.enabled`	True to enable optimized multi-node deployments. Currently, true is the only option.	`true`
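The following sketch shows how these values might fit together for a LeaderWorkerSet-based deployment. The worker and GPU counts are illustrative assumptions, not requirements of any particular model, and the sketch presumes that LeaderWorkerSet support is already installed in the cluster.

```yaml
multiNode:
  enabled: true
  leaderWorkerSet:
    enabled: true      # recommended over MPIJob
  workers: 2           # assumed number of worker pods per multi-node replica
  gpusPerNode: 8       # assumed number of GPUs for each pod
resources:
  limits:
    nvidia.com/gpu: 8  # kept equal to multiNode.gpusPerNode
```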