Large Language Models (Latest)

Deploying with Helm

NIMs are intended to be run on a system with NVIDIA GPUs, with the type and number of GPUs depending on the model. To use helm, you must have a Kubernetes cluster with appropriate GPU nodes and the GPU Operator installed.

For requirements, including the type and number of GPUs, see Support Matrix.
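Before installing the chart, you can optionally confirm that the cluster exposes GPUs to Kubernetes. The gpu-operator namespace below is the usual default for a GPU Operator installation; adjust it if your cluster uses a different namespace:

# Confirm the GPU Operator pods are running (namespace is the common default; adjust as needed).
kubectl get pods -n gpu-operator

# Confirm that nodes advertise the nvidia.com/gpu resource.
kubectl describe nodes | grep nvidia.com/gpu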

If you haven’t set up your NGC API key and do not know exactly which NIM you want to download and deploy, see the information in Getting Started.
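The commands in this section assume that your NGC API key is exported in the NGC_CLI_API_KEY environment variable, for example:

export NGC_CLI_API_KEY="<your-ngc-api-key>"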

Once you have set your NGC API key, go to the NGC Catalog and select the nim-llm helm chart to pick a version. In most cases, you will want the latest version.

Download the helm chart with:


helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-<version_number>.tgz --username=\$oauthtoken --password=$NGC_CLI_API_KEY

This will download the chart as a file to your local machine.

The following helm options are the most important options to configure to deploy a NIM using Kubernetes:

  • image.repository – The container/NIM to deploy

  • image.tag – The version of that container/NIM

  • Storage options, based on the environment and cluster in use

  • model.ngcAPISecret and imagePullSecrets to communicate with NGC

  • resources – Use this option when a model requires more than the default of one GPU. See support matrix for details about the GPUs to request to meet the GPU memory requirements of the model on the available hardware.

  • env – An array of environment variables presented to the container, if advanced configuration is needed

    • Note: Do not set the following environment variables using the env value. Instead, use the indicated helm options:

      Environment Variable    Helm Value
      NIM_CACHE_PATH          model.nimCache
      NGC_API_KEY             model.ngcAPISecret
      NIM_SERVER_PORT         model.openaiPort
      NIM_JSONL_LOGGING       model.jsonLogging
      NIM_LOG_LEVEL           model.logLevel
      In these cases, set the helm values directly instead of relying on the environment variable values. You can add other environment variables to the env section of a values file.
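For example, a minimal sketch of adding an extra environment variable in a values file; the variable name below is a placeholder, not a real NIM setting, so substitute one documented in Configuring a NIM:

env:
  - name: EXAMPLE_VARIABLE    # placeholder name; use a variable documented in Configuring a NIM
    value: "example-value"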

To adapt the chart’s deployment behavior to your cluster’s needs, refer to the helm chart’s README, which lists and describes the configuration options. This README is available on the helm command line, but the output is bare markdown. Output it to a file and open it with a markdown renderer, or use a command-line tool such as glow to render it in the terminal.

The following helm command displays the chart README and renders it in the terminal using glow:


helm show readme nim-llm-<version_number>.tgz | glow -p -

To examine all default values, run the following command:


helm show values nim-llm-<version_number>.tgz
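If you want a local copy of the defaults to refer to while editing your own values file, you can redirect the output to a file (the filename is arbitrary):

helm show values nim-llm-<version_number>.tgz > nim-llm-defaults.yaml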

Minimal example

This example requires that you have already created certain Kubernetes secrets in the deployment namespace before proceeding. The rest of this document assumes the default namespace.

Downloading the NIM container image requires an image pull secret (in this example named registry-secret). Downloading model engines or weights from NGC requires a generic secret with an NGC API key stored in a key named NGC_CLI_API_KEY. For example purposes, you can create these by hand with:


kubectl create secret docker-registry registry-secret --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password=$NGC_CLI_API_KEY

kubectl create secret generic ngc-api --from-literal=NGC_CLI_API_KEY=$NGC_CLI_API_KEY

Create the file custom-values.yaml with the following entries. These values will work in most clusters after you create the secrets shown above.


image:
  repository: "nvcr.io/nim/meta/llama3-8b-instruct" # container location
  tag: 1.0.0 # NIM version you want to deploy
model:
  ngcAPISecret: ngc-api # name of a secret in the cluster that includes a key named NGC_CLI_API_KEY and is an NGC API key
persistence:
  enabled: true
imagePullSecrets:
  - name: registry-secret # name of a secret used to pull nvcr.io images, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/

You can adapt the previous configuration to deploy any model, such as llama3-70b-instruct, by adjusting the values to the model’s requirements and size. For example:


image:
  repository: "nvcr.io/nim/meta/llama3-70b-instruct" # container location -- changed for the different model
  tag: 1.0.0
model:
  ngcAPISecret: ngc-api
persistence:
  enabled: true
  size: 220Gi # the model files will be quite large
resources:
  limits:
    nvidia.com/gpu: 4 # much more GPU memory is required
imagePullSecrets:
  - name: registry-secret

Running out of storage space is always a concern when setting up NIMs, and downloading models can delay scaling in a cluster. Models can be quite large, and a cluster operator can quickly fill disk space when downloading them. Be sure to mount some type of persistent storage for the model cache on your pod. Other than the default of an emptyDir, you have the following mutually exclusive options for storing the model cache (a configuration sketch follows the list):

  • Persistent Volume Claims (enabled with persistence.enabled)

    • Used when persistence.accessMode is set to ReadWriteMany, so that several pods can share one PVC.

    • If statefulSet.enabled is set to false (the default is true), this creates a PVC with a Deployment; unless the access mode is ReadWriteMany (such as with an NFS provisioner), scaling beyond one pod will likely fail.

  • Persistent Volume Claim templates (enabled with persistence.enabled and leaving statefulSet.enabled as default)

    • Useful for scaling with a strategy of scaling the StatefulSet up to the maximum number of replicas you expect, so that the model is downloaded to the PVC created for each replica, and then scaling down again. The PVCs are left in place, which allows fast scaling up later.

  • Direct NFS (enabled with nfs.enabled)

    • Kubernetes does not allow setting of mount options on direct NFS, so some special cluster setup may be required.

  • hostPath (enabled with hostPath.enabled)

    • Know the security implications of using hostPath and understand that this will also tie pods to one node.
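As an illustration of the first option, the following sketch of the persistence block requests one shared PVC; the storageClass key and its value are assumptions, so confirm the exact key name with helm show values and use a ReadWriteMany-capable storage class that exists in your cluster:

persistence:
  enabled: true
  accessMode: ReadWriteMany   # several pods can share one PVC
  storageClass: "nfs-client"  # assumption: replace with a ReadWriteMany-capable class in your cluster
  size: 50Gi                  # size the cache for the model you deploy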

Multi-node models

Note

Requires NIM version 1.1.0+

Two options exist for deploying multi-node NIMs on Kubernetes: LeaderWorkerSets and MPI Jobs using the MPI Operator.

LeaderWorkerSet

Note

Requires Kubernetes version >1.26

LeaderWorkerSet (LWS) deployments are the recommended method for deploying Multi-Node models with NIM. To enable LWS deployments, see the installation instructions in the LWS documentation. The helm chart defaults to LWS for multi-node deployment.

With LWS deployments, you will see Leader and Worker pods that coordinate together to run your multi-node models.

LWS deployments support manual scaling and autoscaling, where the entire set of pods is treated as a single replica. However, there are some limitations to scaling when using LWS deployments. If you scale manually (autoscaling is not enabled), you cannot scale above the initial number of replicas set in the helm chart.
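As an illustrative sketch only, a multi-node values file might contain something like the following; apart from multiNode and leaderWorkerSet.enabled, which appear in the chart, every key name and value here is an assumption, so confirm the exact names against the chart’s README and default values:

multiNode:
  enabled: true     # assumption: confirm the exact key names with `helm show values`
  workers: 2        # assumption: pods per model replica
  gpusPerNode: 8    # assumption: GPUs requested on each pod
  leaderWorkerSet:
    enabled: true   # LWS is the chart default for multi-node deployment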

MPI Job

MPI Jobs via the MPI Operator are an alternative deployment option for clusters that don’t support LeaderWorkerSet. To enable MPI Jobs, install the MPI operator. Then, add the following to your values.yaml file:


multiNode:
  leaderWorkerSet:
    enabled: false

For MPI Jobs, you will see a launcher pod and one or more worker pods deployed for your model. The launcher pod does not require any GPUs, and deployment logs will be available through the launcher pod.

When deploying with MPI Jobs, you can set a number of replicas; however, dynamic scaling is not supported without redeploying the helm chart. MPI Jobs also do not automatically restart, so if any pod in the multi-node set fails, the job must be manually restarted.
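To inspect the deployment logs of an MPI Job, read them from the launcher pod; the pod name below is a placeholder:

# List the pods for the release and identify the launcher pod.
kubectl get pods

# Follow the launcher pod's logs (pod name is a placeholder).
kubectl logs -f <launcher-pod-name>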

You are now ready to launch the chart.


helm install my-nim nim-llm-<version_number>.tgz -f path/to/your/custom-values.yaml

Wait for the pod to reach “Ready” status.
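For example, you can watch the pods until the READY column shows they are up:

kubectl get pods --watch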

In the previous example, the OpenAI-compatible API endpoint is exposed on port 8000 through a Kubernetes service of the default type, with no ingress, since authentication is not handled by the NIM itself. The following commands assume the Llama 3 8B Instruct model was deployed. Adjust the “model” value in the request JSON body to use a different model.

Use the following command to port-forward the service to your local machine to test inference.


kubectl port-forward service/my-nim-nim-llm 8000:http-openai

Then try a request:


curl -X 'POST' \
  'http://localhost:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {
        "content": "You are a polite and respectful chatbot helping people plan a vacation.",
        "role": "system"
      },
      {
        "content": "What should I do for a 4 day vacation in Spain?",
        "role": "user"
      }
    ],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 16,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "stop": "\n",
    "frequency_penalty": 0.0
  }'
