Large Language Models (1.1.0)

Deploying with Helm

NIMs are intended to be run on a system with NVIDIA GPUs, with the type and number of GPUs depending on the model. To use helm, you must have a Kubernetes cluster with appropriate GPU nodes and the GPU Operator installed.

For requirements, including the type and number of GPUs, see Support Matrix.

If you haven’t set up your NGC API key or do not know exactly which NIM you want to download and deploy, see the information in Getting Started.

Once you have set your NGC API key, go to the nim-llm helm chart page to pick a version. In most cases, you should select the latest version.

Use the following command to download the helm chart:

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version_number>.tgz --username=\$oauthtoken --password=$NGC_API_KEY
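If you fetch charts from a script, a small wrapper can build the URL from the version and stop early when the API key is missing. The following is a minimal sketch: the function names chart_url and fetch_chart are hypothetical, and the URL pattern is simply the one from the command above.

```shell
# Sketch only: chart_url and fetch_chart are hypothetical helper names.

chart_url() {
  # $1: chart version you picked on NGC
  printf 'https://helm.ngc.nvidia.com/nim/charts/nim-llm-%s.tgz\n' "$1"
}

fetch_chart() {
  # Abort with a clear message if the API key is not set in the environment.
  : "${NGC_API_KEY:?NGC_API_KEY must be set}"
  # The username is the literal string $oauthtoken, so it is single-quoted.
  helm fetch "$(chart_url "$1")" --username='$oauthtoken' --password="$NGC_API_KEY"
}

# Usage (substitute the version you picked):
# fetch_chart "<version_number>"
```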

This downloads the chart as a file to your local machine.

The following helm options are the most important options to configure to deploy a NIM using Kubernetes:

  • image.repository – The container/NIM to deploy

  • image.tag – The version of that container/NIM

  • Storage options, based on the environment and cluster in use

  • model.ngcAPISecret and imagePullSecrets to communicate with NGC

  • resources – Use this option when a model requires more than the default of one GPU. See the Support Matrix for details about the GPUs to request to meet the GPU memory requirements of the model on the available hardware.

  • env – An array of environment variables presented to the container, if advanced configuration is needed

    • Note: Do not set the following environment variables using the env value. Instead, use the indicated helm options:

      | Environment Variable | Helm Value |
      | --- | --- |
      | NIM_CACHE_PATH | model.nimCache |
      | NGC_API_KEY | model.ngcAPISecret |
      | NIM_SERVER_PORT | model.openaiPort |
      | NIM_JSONL_LOGGING | model.jsonLogging |
      | NIM_LOG_LEVEL | model.logLevel |

      In these cases, set the helm values directly instead of relying on the environment variable values. You can add other environment variables to the env section of a values file.
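A quick text check can catch a values file that sets one of these variables under env by mistake. This is a rough sketch, not part of the chart: check_reserved_env is a hypothetical helper, and it only greps for list entries of the form "- name: VARIABLE" rather than parsing the YAML.

```shell
# Sketch only: check_reserved_env is a hypothetical helper name.
# It returns non-zero if the file names any variable that the chart
# manages through its own helm values.
check_reserved_env() {
  values_file=$1
  status=0
  for pair in \
    NIM_CACHE_PATH:model.nimCache \
    NGC_API_KEY:model.ngcAPISecret \
    NIM_SERVER_PORT:model.openaiPort \
    NIM_JSONL_LOGGING:model.jsonLogging \
    NIM_LOG_LEVEL:model.logLevel
  do
    var=${pair%%:*}
    helm_value=${pair#*:}
    # Match lines such as "  - name: NIM_LOG_LEVEL" (optionally quoted).
    if grep -Eq "^[[:space:]]*-?[[:space:]]*name:[[:space:]]*\"?${var}\"?[[:space:]]*(#.*)?$" "$values_file"; then
      echo "$var is set under env; use $helm_value instead" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
# check_reserved_env custom-values.yaml
```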

To adapt the chart’s deployment behavior to your cluster’s needs, refer to the helm chart’s README, which lists and describes the configuration options. This README is available on the helm command line, but the output is bare markdown. Output it to a file and open with a markdown renderer or use a command line tool such as glow to render in the terminal.

The following helm command displays the chart README and renders it in the terminal using glow:

helm show readme nim-llm-<version_number>.tgz | glow -p -
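If glow is not installed, the other route mentioned above is to write the README to a file and open it in any markdown renderer. A minimal sketch, assuming a hypothetical helper name and default output file name:

```shell
# Sketch only: save_chart_readme and README-nim-llm.md are hypothetical names.
save_chart_readme() {
  # $1: path to the downloaded chart archive
  # $2: output file (optional)
  helm show readme "$1" > "${2:-README-nim-llm.md}"
}

# Usage:
# save_chart_readme "nim-llm-<version_number>.tgz"
```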

To examine all default values, run the following command:

helm show values nim-llm-<version_number>.tgz

Minimal example

This example requires that certain Kubernetes secrets already exist in the deployment namespace. The rest of this document assumes the default namespace.

To download the NIM container image, you must set an image pull secret, which is ngc-secret in the following example. To download model engines or weights from NGC, the chart requires a generic secret that has an NGC API key as a value stored in a key named NGC_API_KEY. The following example creates these two secrets:

kubectl create secret docker-registry ngc-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_API_KEY
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY
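Before installing the chart, it can help to confirm that both secrets exist in the current namespace. The following sketch uses the secret names from the example above; check_nim_secrets is a hypothetical helper name.

```shell
# Sketch only: check_nim_secrets is a hypothetical helper name.
check_nim_secrets() {
  for name in ngc-secret ngc-api; do
    # kubectl get exits non-zero when the secret is not found.
    if ! kubectl get secret "$name" >/dev/null 2>&1; then
      echo "missing secret: $name" >&2
      return 1
    fi
  done
  echo "both secrets found"
}

# Usage:
# check_nim_secrets
```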

Create the file custom-values.yaml with the following entries. These values work in most clusters after you create the secrets above.

image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct" # container location
  tag: 1.0.0 # NIM version you want to deploy
model:
  ngcAPISecret: ngc-api # name of a secret in the cluster that includes a key named NGC_API_KEY and is an NGC API key
persistence:
  enabled: true
imagePullSecrets:
  - name: ngc-secret # name of a secret used to pull nvcr.io images, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

You can adapt the previous configuration to deploy any model, such as llama3-70b-instruct, by adjusting to the model’s requirements and size. For example:

image:
  repository: "nvcr.io/nim/meta/llama3-70b-instruct" # container location -- changed for the different model
  tag: 1.0.0
model:
  ngcAPISecret: ngc-api
persistence:
  enabled: true
  size: 220Gi # the model files will be quite large
resources:
  limits:
    nvidia.com/gpu: 4 # much more GPU memory is required
imagePullSecrets:
  - name: ngc-secret

Refer to the Support Matrix section to determine whether your hardware is sufficient to run this NIM.

Running out of storage space is always a concern when setting up NIMs, and downloading models can delay scaling in a cluster. Models can be quite large, and a cluster operator can quickly fill disk space when downloading them. Be sure to mount some type of persistent storage for the model cache on your pod. You have the following mutually-exclusive options when storing objects outside of the default of an emptyDir:

  • Persistent Volume Claims (enabled with persistence.enabled)

    • Used when persistence.accessMode is set to “ReadWriteMany” where several pods can share one PVC.

    • If statefulSet.enabled is set to false (default is true), this will create a PVC with a deployment, but if the access mode is not ReadWriteMany, such as with an NFS provisioner, scaling beyond one pod will likely fail.

  • Persistent Volume Claim templates (enabled with persistence.enabled and leaving statefulSet.enabled as default)

    • Useful for scaling: scale the StatefulSet up to the maximum number of replicas you want so that the model is downloaded to each PVC, and then scale down again. The PVCs stay in place, which allows fast scaling up later.

  • Direct NFS (enabled with nfs.enabled)

    • Kubernetes does not allow setting of mount options on direct NFS, so some special cluster setup may be required.

  • hostPath (enabled with hostPath.enabled)

    • Know the security implications of using hostPath and understand that this will also tie pods to one node.
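As an illustration of the first option, the sketch below writes a small values file for a single shared PVC with ReadWriteMany access and a Deployment instead of a StatefulSet. The keys are the chart values named in the list above; the helper name, the file name, and the storage class "nfs" are assumptions that you would replace with whatever your cluster provides.

```shell
# Sketch only: write_shared_pvc_values is a hypothetical helper name.
write_shared_pvc_values() {
  # $1: output path
  # $2: storage class name (cluster-specific; "nfs" is only an example)
  cat > "$1" <<EOF
persistence:
  enabled: true
  accessMode: ReadWriteMany
  storageClass: $2
  size: 50Gi
statefulSet:
  enabled: false
EOF
}

# Usage: write the fragment, then layer it on top of your main values file.
# write_shared_pvc_values shared-pvc-values.yaml nfs
# helm install my-nim nim-llm-<version_number>.tgz -f custom-values.yaml -f shared-pvc-values.yaml
```

Helm merges multiple -f files in order, so keeping storage settings in their own file lets you swap storage strategies without touching the model settings.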

Note

Requires NIM version 1.1.0+

Two options exist for deploying multi-node NIMs on Kubernetes: LeaderWorkerSets and MPI Jobs using the MPI Operator.

LeaderWorkerSet

Note

Requires Kubernetes version >1.26

LeaderWorkerSet (LWS) deployments are the recommended method for deploying Multi-Node models with NIM. To enable LWS deployments, see the installation instructions in the LWS documentation. The helm chart defaults to LWS for multi-node deployment.

With LWS deployments, you will see Leader and Worker pods that coordinate together to run your multi-node models.

LWS deployments support manual scaling and autoscaling, where the entire set of pods is treated as a single replica. However, there are some limitations to scaling when using LWS deployments. If you scale manually (autoscaling is not enabled), you cannot scale above the initial number of replicas set in the helm chart.

Use the following example values file to deploy the Llama 3.1 405B model using this method. Refer to the Support Matrix section to determine whether your hardware is sufficient to run this model.

image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.1.2
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes will need to mount the storage. In this example, we assume a storage class exists named "nfs".
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  workers: 2
  gpusPerNode: 8
# Downloading the model will take quite a long time. Give it as much time as ends up being needed.
startupProbe:
  failureThreshold: 1500

MPI Job

MPI Jobs using the MPI Operator are an alternative deployment option for clusters that don’t support LeaderWorkerSet (Kubernetes version less than v1.27). To enable MPI Jobs, install the MPI operator. This is a custom-values.yaml file example that disables LeaderWorkerSets and launches an MPI Job:

image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama-3.1-405b-instruct
  tag: 1.1.2
imagePullSecrets:
  - name: ngc-secret
model:
  name: meta/llama-3_1-405b-instruct
  ngcAPISecret: ngc-api
# NVIDIA recommends using an NFS-style read-write-many storage class.
# All nodes will need to mount the storage. In this example, we assume a storage class exists named "nfs".
persistence:
  enabled: true
  size: 1000Gi
  accessMode: ReadWriteMany
  storageClass: nfs
  annotations:
    helm.sh/resource-policy: "keep"
# This should match `multiNode.gpusPerNode`
resources:
  limits:
    nvidia.com/gpu: 8
multiNode:
  enabled: true
  leaderWorkerSet:
    enabled: false
  workers: 2
  gpusPerNode: 8
# Downloading the model will take quite a long time. Give it as much time as ends up being needed.
startupProbe:
  failureThreshold: 1500

For MPI Jobs, you will see a launcher pod and one or more worker pods deployed for your model. The launcher pod does not require any GPUs, and deployment logs will be available through the launcher pod.

When deploying with MPI Jobs, you can set a number of replicas, but dynamic scaling is not supported without redeploying the helm chart. MPI Jobs also do not restart automatically, so if any pod in the multi-node set fails, you must manually uninstall and reinstall the job to start it again.

You are now ready to launch the chart.

helm install my-nim nim-llm-<version_number>.tgz -f path/to/your/custom-values.yaml

Wait for the pod to reach “Ready” status.
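One way to block until the pod is ready is kubectl wait. This is a sketch under assumptions: wait_for_nim_pod is a hypothetical helper, the 30m default timeout is arbitrary, and you need to look up the actual pod name with kubectl get pods first, because the name depends on your release and on whether a StatefulSet or Deployment is used.

```shell
# Sketch only: wait_for_nim_pod is a hypothetical helper name.
wait_for_nim_pod() {
  # $1: pod name as shown by `kubectl get pods`
  # $2: timeout (optional; 30m is an arbitrary default for large downloads)
  kubectl wait --for=condition=Ready "pod/$1" --timeout="${2:-30m}"
}

# Usage:
# kubectl get pods
# wait_for_nim_pod "<pod name>"
```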

In the previous example, the OpenAI-compatible API endpoint is exposed on port 8000 through the Kubernetes service of the default type with no ingress, since authentication is not handled by the NIM itself. The following commands assume the Llama 3 8B Instruct model was deployed. Adjust the “model” value in the request JSON body to use a different model.

Use the following command to port-forward the service to your local machine to test inference.

kubectl port-forward service/my-nim-nim-llm 8000:http-openai

Then try a request:

curl -X 'POST' \
  'http://localhost:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a polite and respectful chatbot helping people plan a vacation.",
      "role": "system"
    },
    {
      "content": "What should I do for a 4 day vacation in Spain?",
      "role": "user"
    }
  ],
  "model": "meta/llama3-8b-instruct",
  "max_tokens": 16,
  "top_p": 1,
  "n": 1,
  "stream": false,
  "stop": "\n",
  "frequency_penalty": 0.0
}'

Q: What should I do if my pod is stuck in a “Pending” state?
A: Try running kubectl describe pod <pod name>, and check the Events section to see what the scheduler is waiting for. Node taints that may need to be tolerated, insufficient GPUs, and storage mount issues are all common reasons.

Q: I tried to scale or upgrade a deployment using statefulSet.enabled: false and persistence.enabled: true. Why are pods never starting?
A: To scale or upgrade without using StatefulSet PVC templates, which are not very efficient in either time or storage, you must use a ReadWriteMany storage class so that the volume can be mounted on separate nodes, manually cloned ReadOnlyMany volumes, or something like direct NFS storage. Without persistence, every starting pod must download its model to an emptyDir volume. A ReadWriteMany storage class such as an NFS PVC provisioner or CephFS provisioner is ideal.

Q: One of the last log messages was about, “Preparing model workspace. This step might download additional files to run the model.” Why did it fail during that?
A: It is likely that the model weights had not finished downloading, but Kubernetes hit a threshold of failures for startup probes. Try increasing startupProbe.failureThreshold. This is especially likely with large models or very slow network connections.
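To pick a value for startupProbe.failureThreshold, it helps to estimate how long the startup probe tolerates failures. As a rough approximation (it ignores probe timeouts), the budget is initialDelaySeconds plus periodSeconds times failureThreshold. The sketch below applies that arithmetic to the chart defaults and to the failureThreshold of 1500 used in the multi-node examples; startup_budget_seconds is a hypothetical helper name.

```shell
# Sketch only: startup_budget_seconds is a hypothetical helper name.
startup_budget_seconds() {
  # $1: startupProbe.initialDelaySeconds
  # $2: startupProbe.periodSeconds
  # $3: startupProbe.failureThreshold
  echo $(( $1 + $2 * $3 ))
}

startup_budget_seconds 40 10 180    # chart defaults: prints 1840
startup_budget_seconds 40 10 1500   # multi-node example: prints 15040
```

With the defaults, the pod has roughly 30 minutes to finish downloading and loading the model. A failureThreshold of 1500 extends that to a little over 4 hours.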

The helm chart’s internal README includes the following parameters. NVIDIA recommends that you use the README within the downloaded chart, because it has the most accurate and up-to-date version of these parameters for that chart version.

Deployment parameters

| Name | Description | Value |
| --- | --- | --- |
| affinity | Affinity settings for deployment. | {} |
| containerSecurityContext | Sets privilege and access control settings for container (Only affects the main container, not pod-level) | {} |
| customCommand | Overrides command line options sent to the NIM with the array listed here. | [] |
| customArgs | Overrides command line arguments of the NIM container with the array listed here. | [] |
| env | Adds arbitrary environment variables to the main container | [] |
| extraVolumes | Adds arbitrary additional volumes to the deployment set definition | {} |
| extraVolumeMounts | Specify volume mounts to the main container from extraVolumes | {} |
| image.repository | NIM-LLM Image Repository | "" |
| image.tag | Image tag or version | "" |
| image.pullPolicy | Image pull policy | "" |
| imagePullSecrets | Specify list of secret names that are needed for the main container and any init containers. | |
| initContainers | Specify model init containers, if needed. | |
| initContainers.ngcInit | Legacy containers only. Instantiate and configure an NGC init container. It should either have NGC CLI pre-installed or wget + unzip pre-installed – must not be musl-based (alpine). | {} |
| initContainers.extraInit | Fully specify any additional init containers your use case requires. | [] |
| healthPort | Specifies health check port – for use with models.legacyCompat only since current NIMs have no separate port | 8000 |
| nodeSelector | Sets node selectors for the NIM – for example nvidia.com/gpu.present: "true" | {} |
| podAnnotations | Sets additional annotations on the main deployment pods | {} |
| podSecurityContext | Specify privilege and access control settings for pod | |
| podSecurityContext.runAsUser | Specify user UID for pod. | 1000 |
| podSecurityContext.runAsGroup | Specify group ID for pod. | 1000 |
| podSecurityContext.fsGroup | Specify file system owner group id. | 1000 |
| replicaCount | Specify static replica count for deployment. | 1 |
| resources | Specify resources limits and requests for the running service. | |
| resources.limits.nvidia.com/gpu | Specify number of GPUs to present to the running service. | 1 |
| serviceAccount.create | Specifies whether a service account should be created. | false |
| serviceAccount.annotations | Sets annotations to be added to the service account. | {} |
| serviceAccount.name | Specifies the name of the service account to use. If it is not set and create is true, a name is generated using a fullname template. | "" |
| statefulSet.enabled | Enables statefulset deployment. Enabling statefulSet allows PVC templates for scaling. If using central PVC with RWX accessMode, this isn’t needed. | true |
| tolerations | Specify tolerations for pod assignment. Allows the scheduler to schedule pods with matching taints. | |

Autoscaling parameters

Values used for creating a Horizontal Pod Autoscaler. If autoscaling is not enabled, the rest are ignored. NVIDIA recommends usage of the custom metrics API, commonly implemented with the prometheus-adapter. Standard metrics of CPU and memory are of limited use in scaling NIM.

| Name | Description | Value |
| --- | --- | --- |
| autoscaling.enabled | Enables horizontal pod autoscaler. | false |
| autoscaling.minReplicas | Specify minimum replicas for autoscaling. | 1 |
| autoscaling.maxReplicas | Specify maximum replicas for autoscaling. | 10 |
| autoscaling.metrics | Array of metrics for autoscaling. | [] |

Ingress parameters

| Name | Description | Value |
| --- | --- | --- |
| ingress.enabled | Enables ingress. | false |
| ingress.className | Specify class name for Ingress. | "" |
| ingress.annotations | Specify additional annotations for ingress. | {} |
| ingress.hosts | Specify list of hosts each containing lists of paths. | |
| ingress.hosts[0].host | Specify name of host. | chart-example.local |
| ingress.hosts[0].paths[0].path | Specify ingress path. | / |
| ingress.hosts[0].paths[0].pathType | Specify path type. | ImplementationSpecific |
| ingress.hosts[0].paths[0].serviceType | Specify service type. It can be nemo or openai – make sure your model serves the appropriate port(s). | openai |
| ingress.tls | Specify list of pairs of TLS secretName and hosts. | [] |

Probe parameters

| Name | Description | Value |
| --- | --- | --- |
| livenessProbe.enabled | Enables livenessProbe | true |
| livenessProbe.method | LivenessProbe http or script, but no script is currently provided | http |
| livenessProbe.command | LivenessProbe script command to use (unsupported at this time) | ["myscript.sh"] |
| livenessProbe.path | LivenessProbe endpoint path | /v1/health/live |
| livenessProbe.initialDelaySeconds | Initial delay seconds for livenessProbe | 15 |
| livenessProbe.timeoutSeconds | Timeout seconds for livenessProbe | 1 |
| livenessProbe.periodSeconds | Period seconds for livenessProbe | 10 |
| livenessProbe.successThreshold | Success threshold for livenessProbe | 1 |
| livenessProbe.failureThreshold | Failure threshold for livenessProbe | 3 |
| readinessProbe.enabled | Enables readinessProbe | true |
| readinessProbe.path | Readiness Endpoint Path | /v1/health/ready |
| readinessProbe.initialDelaySeconds | Initial delay seconds for readinessProbe | 15 |
| readinessProbe.timeoutSeconds | Timeout seconds for readinessProbe | 1 |
| readinessProbe.periodSeconds | Period seconds for readinessProbe | 10 |
| readinessProbe.successThreshold | Success threshold for readinessProbe | 1 |
| readinessProbe.failureThreshold | Failure threshold for readinessProbe | 3 |
| startupProbe.enabled | Enables startupProbe | true |
| startupProbe.path | StartupProbe Endpoint Path | /v1/health/ready |
| startupProbe.initialDelaySeconds | Initial delay seconds for startupProbe | 40 |
| startupProbe.timeoutSeconds | Timeout seconds for startupProbe | 1 |
| startupProbe.periodSeconds | Period seconds for startupProbe | 10 |
| startupProbe.successThreshold | Success threshold for startupProbe | 1 |
| startupProbe.failureThreshold | Failure threshold for startupProbe | 180 |

Metrics parameters

| Name | Description | Value |
| --- | --- | --- |
| metrics | Opens the metrics port for the triton inference server on port 8002. | |
| metrics.enabled | Enables metrics endpoint – for legacyCompat only since current NIMs serve metrics on the OpenAI API port. | true |
| serviceMonitor | Options for serviceMonitor to use the Prometheus Operator and the primary service object. | |
| metrics.serviceMonitor.enabled | Enables serviceMonitor creation. | false |
| metrics.serviceMonitor.additionalLabels | Specify additional labels for ServiceMonitor. | {} |

Models parameters

| Name | Description | Value |
| --- | --- | --- |
| model.nimCache | Path to mount writeable storage or pre-filled model cache for the NIM | "" |
| model.name | Specify name of the model in the API (name of the NIM). This is mostly used for tests and is usually otherwise optional. This must match the name from /v1/models to allow helm test <release-name> to work. In legacyCompat, this is required and sets the name of the model in /v1/models | meta/llama3-8b-instruct |
| model.ngcAPISecret | Name of pre-existing secret with a key named NGC_API_KEY that contains an API key for NGC model downloads | "" |
| model.ngcAPIKey | NGC API key literal to use as the API secret and image pull secret when set | "" |
| model.openaiPort | Specify Open AI Port. | 8000 |
| model.labels | Specify extra labels to be added on deployed pods. | {} |
| model.jsonLogging | Turn JSON lines logging on or off. Defaults to true. | true |
| model.logLevel | Log level of NIM service. Possible values of the variable are TRACE, DEBUG, INFO, WARNING, ERROR, CRITICAL. | INFO |

Deprecated and Legacy Model parameters

| Name | Description | Value |
| --- | --- | --- |
| model.legacyCompat | Set true to enable compatibility with pre-release NIM versions prior to 1.0.0. | false |
| model.numGpus | (deprecated) Specify GPU requirements for the model. | 1 |
| model.subPath | (deprecated) Specify path within the model volume to mount if not the root – default works with ngcInit and persistent volume. (legacyCompat only) | model-store |
| model.modelStorePath | (deprecated) Specify location of unpacked model. | "" |

Storage parameters

| Name | Description | Value |
| --- | --- | --- |
| persistence | Specify settings to modify the path /model-store if model.legacyCompat is enabled else /.cache volume where the model is served from. | |
| persistence.enabled | Enables the use of persistent volumes. | false |
| persistence.existingClaim | Specifies an existing persistent volume claim. If using existingClaim, run only one replica or use a ReadWriteMany storage setup. | "" |
| persistence.storageClass | Specifies the persistent volume storage class. If set to "-", this disables dynamic provisioning. If left undefined or set to null, the cluster default storage provisioner is used. | "" |
| persistence.accessMode | Specify accessMode. If using an NFS or similar setup, you can use ReadWriteMany. | ReadWriteOnce |
| persistence.stsPersistentVolumeClaimRetentionPolicy.whenDeleted | Specifies persistent volume claim retention policy when deleted. Only used with Stateful Set volume templates. | Retain |
| persistence.stsPersistentVolumeClaimRetentionPolicy.whenScaled | Specifies persistent volume claim retention policy when scaled. Only used with Stateful Set volume templates. | Retain |
| persistence.size | Specifies the size of the persistent volume claim (for example 40Gi). | 50Gi |
| persistence.annotations | Adds annotations to the persistent volume claim. | {} |
| hostPath | Configures model cache on local disk on the nodes using hostPath – for special cases. You should understand the security implications before using this option. | |
| hostPath.enabled | Enable hostPath. | false |
| hostPath.path | Specifies path on the node used as a hostPath volume. | /model-store |
| nfs | Configures the model cache to sit on shared direct-mounted NFS. NOTE: you cannot set mount options using direct NFS mount to pods without a node-installed nfsmount.conf. An NFS-based PersistentVolumeClaim is likely better in most cases. | |
| nfs.enabled | Enable direct pod NFS mount | false |
| nfs.path | Specify path on NFS server to mount | /exports |
| nfs.server | Specify NFS server address | nfs-server.example.com |
| nfs.readOnly | Set to true to mount as read-only | false |

Service parameters

| Name | Description | Value |
| --- | --- | --- |
| service.type | Specifies the service type for the deployment. | ClusterIP |
| service.name | Overrides the default service name | "" |
| service.openaiPort | Specifies Open AI Port for the service. | 8000 |
| service.annotations | Specify additional annotations to be added to service. | {} |
| service.labels | Specifies additional labels to be added to service. | {} |

Multi-node parameters

Large models that must span multiple nodes do not work on plain Kubernetes with the GPU Operator alone at this time. Optimized TensorRT profiles, when selected automatically or by environment variable, require either LeaderWorkerSets or MPIJobs from the MPI Operator (https://github.com/kubeflow/mpi-operator) to be installed. Since MPIJob is a batch-type resource that is not designed with service stability and reliability in mind, you should use LeaderWorkerSets if your cluster version allows it. Only optimized profiles are supported for multi-node deployment at this time.

| Name | Description | Value |
| --- | --- | --- |
| multiNode.enabled | Enables multi-node deployments | false |
| multiNode.clusterStartTimeout | Sets a number of seconds to wait for worker nodes to come up before failing | 300 |
| multiNode.gpusPerNode | Number of GPUs that will be presented to each pod. In most cases, this should match resources.limits.nvidia.com/gpu | 1 |
| multiNode.workers | Specifies how many worker pods per multi-node replica to launch | 1 |
| multiNode.leaderWorkerSet.enabled | NVIDIA recommends you use LeaderWorkerSets to deploy. If disabled, defaults to using MPIJob from mpi-operator | true |
| multiNode.mpiJob.workerAnnotations | Annotations only applied to workers for MPIJob, if used. This may be necessary to ensure the workers connect to CNIs offered by multus and the network operator, if used. | {} |
| multiNode.mpiJob.launcherResources | Resources section to apply only to the launcher pods in MPIJob, if used. Launchers do not get the chart resources restrictions. Only workers do, since they require GPUs. | {} |
| multiNode.optimized.enabled | Enables optimized multi-node deployments (currently the only option supported) | true |
© Copyright 2024, NVIDIA Corporation. Last updated on Sep 9, 2024.