Deploy NVIDIA NIM for large language models (LLMs)#
NeMo provides the capability to deploy NVIDIA NIM for large language models (LLMs) to your Kubernetes cluster. You can also configure and deploy NIMs for external endpoints, such as OpenAI ChatGPT and NVIDIA Integrate, from within your Kubernetes cluster.
Prerequisites#
The NeMo Deployment Management Helm chart is installed on your Kubernetes cluster, and the NeMo Deployment Management microservice is running and reachable through the NeMo platform host base URL. The rest of this page assumes that the platform host base URL is stored in an environment variable named NEMO_BASE_URL. For more information about the Helm chart installation, see the NeMo Deployment Management Setup Guide.
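For example, assuming a hypothetical platform hostname:
# Replace with the actual base URL of your NeMo platform host.
export NEMO_BASE_URL="https://nemo.example.com"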
Deploy a NIM#
You can deploy NIMs by submitting a POST request to the v1/deployment/model-deployments API.
The following example shows how to deploy the Meta Llama 3 8B Instruct NIM from the NVIDIA NGC catalog.
curl --location "${NEMO_BASE_URL}/v1/deployment/model-deployments" \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "llama3-8b-instruct",
    "namespace": "meta",
    "config": {
      "model": "meta/llama3-8b-instruct",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/meta/llama3-8b-instruct",
        "image_tag": "1.0.3",
        "pvc_size": "25Gi",
        "gpu": 1,
        "additional_envs": {
          "<ENV_VAR>": "<value>"
        }
      }
    }
  }'
This makes the model accessible under the name you specified in the config.model field; in this example, meta/llama3-8b-instruct. After the deployment succeeds, you can proceed to Run Inference and start prompting the model.
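As a minimal sketch, you can prompt the deployed NIM through an OpenAI-compatible chat completions endpoint. The inference host URL (shown here as a hypothetical NIM_PROXY_BASE_URL environment variable) depends on your cluster setup and is described in Run Inference:
curl -X POST "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64
  }'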
You can deploy any NIM for LLMs available in the NGC catalog by specifying the appropriate image name and tag.
Get Model Deployment Information#
To retrieve the metadata of the deployed NIM, send a GET request to the endpoint:
curl -X GET "${NEMO_BASE_URL}/v1/deployment/model-deployments/meta/llama3-8b-instruct"
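A deployment can take several minutes while the container image and model weights download. The following polling sketch uses this GET endpoint; the status field name (status_details.status) and the "ready" value are assumptions, so check the actual response schema for your release:
# Poll until the deployment reports a ready status (field names are assumptions).
until curl -s "${NEMO_BASE_URL}/v1/deployment/model-deployments/meta/llama3-8b-instruct" \
  | jq -e '.status_details.status == "ready"' > /dev/null; do
  echo "Deployment not ready yet; checking again in 30 seconds..."
  sleep 30
done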
Delete Deployed Models#
To delete a deployed NIM, send a DELETE request to the endpoint:
curl -X DELETE "${NEMO_BASE_URL}/v1/deployment/model-deployments/meta/llama3-8b-instruct"
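To confirm the deletion, repeat the GET request from the previous section. Once the deployment is gone, the endpoint should return an HTTP 404; the exact error payload can vary:
# Print only the HTTP status code; 404 indicates the deployment no longer exists.
curl -s -o /dev/null -w "%{http_code}\n" \
  "${NEMO_BASE_URL}/v1/deployment/model-deployments/meta/llama3-8b-instruct"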
Deployment Configuration Management#
You can manage NIM deployment configurations separately by using the v1/deployment/configs API. With this API, you can create model deployment configurations and store them for later reuse. Such predefined deployment configurations are useful in the following cases:
When you want to deploy the same NIM while managing different configurations.
When you want to deploy multiple NIMs with the same configuration.
Create Deployment Configurations for Models from NGC Catalog#
The following command is an example of setting up the deployment configuration for the Meta Llama 3.1 8B Instruct NIM. It specifies the NIM container image to pull for the model, the persistent volume size, and the number of GPUs. If needed, you can add key-value pairs for custom environment variables.
export NEMO_BASE_URL=<nemo-base-url>
curl --location "${NEMO_BASE_URL}/v1/deployment/configs" \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "llama-3.1-8b-instruct",
    "namespace": "meta",
    "model": "meta/llama-3.1-8b-instruct",
    "nim_deployment": {
      "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
      "image_tag": "1.0.3",
      "pvc_size": "25Gi",
      "gpu": 1,
      "additional_envs": {
        "<ENV_VAR>": "<value>"
      }
    }
  }'
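To review the configurations you have created, you can query the same API. This sketch assumes the v1/deployment/configs collection also supports GET; consult the API reference for the exact listing and filtering options:
curl -X GET "${NEMO_BASE_URL}/v1/deployment/configs"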
Create Deployment Configurations for Models from External Endpoints#
To use external endpoints such as OpenAI ChatGPT and NVIDIA Integrate, create a deployment configuration as shown in the following examples.
Example for OpenAI ChatGPT:
curl -X POST "${NEMO_BASE_URL}/v1/deployment/configs" \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "chatgpt",
    "namespace": "openai",
    "external_endpoint": {
      "host_url": "https://api.openai.com",
      "api_key": "'"${OPENAI_API_KEY}"'",
      "enabled_models": ["gpt-3.5-turbo"]
    }
  }'
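The request body briefly closes its single quotes around ${OPENAI_API_KEY} so that the shell can expand the variable. Export the key before running the command; the variable name is simply the one used in this example:
export OPENAI_API_KEY=<your-openai-api-key>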
Example for NVIDIA Integrate:
curl -X POST "${NEMO_BASE_URL}/v1/deployment/configs" \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "integrate",
    "namespace": "nvidia",
    "external_endpoint": {
      "host_url": "https://integrate.api.nvidia.com",
      "api_key": "'"${NVIDIA_INTEGRATE_API_KEY}"'",
      "enabled_models": ["meta/llama-3.1-405b-instruct"]
    }
  }'
By default, you can discover all models available through the external endpoint. To restrict access and expose only specific models, add the enabled_models property to specify the list of models you want to make discoverable.
The API key is only used internally to discover models. You also need to provide the key when making inference requests to the external endpoints.
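For illustration only, the following sketch calls the NVIDIA Integrate endpoint directly, passing the key as a Bearer token in the Authorization header. How the key is supplied on your inference path depends on how you run inference, so treat this as an assumption rather than the platform's mechanism:
curl -X POST "https://integrate.api.nvidia.com/v1/chat/completions" \
  --header "Authorization: Bearer ${NVIDIA_INTEGRATE_API_KEY}" \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "meta/llama-3.1-405b-instruct",
    "messages": [{"role": "user", "content": "Hello"}]
  }'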
Deploy NIM with Deployment Configuration#
After you create the configuration, deploy the NIM by referencing the configuration in the config field, as shown in the following command.
curl --location "${NEMO_BASE_URL}/v1/deployment/model-deployments" \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "llama-3.1-8b-instruct",
    "namespace": "meta",
    "config": "meta/llama-3.1-8b-instruct"
  }'
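As with a direct deployment, you can confirm that the deployment was created by retrieving its metadata:
curl -X GET "${NEMO_BASE_URL}/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"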