Deploy NVIDIA NIM#

Deploying a NIM microservice with NeMo Deployment Management#

Use the NeMo Deployment Management microservice to deploy NVIDIA NIM microservices to your Kubernetes cluster. You can also configure external endpoints, such as OpenAI (ChatGPT) and build.nvidia.com, for use from within your cluster.

Prerequisites#

  • Your cluster administrator has installed the NeMo Deployment Management microservice on your Kubernetes cluster following the installation guide at NeMo Deployment Management Configuration.

  • You have stored the NeMo Deployment Management host base URL in the DEPLOYMENT_BASE_URL environment variable, as shown in the example after this list. The value of DEPLOYMENT_BASE_URL depends on the ingress configuration of your cluster. If you are using the minikube demo installation, it is http://nemo.test; otherwise, consult your cluster administrator.

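For example, with the minikube demo installation you can set the variable as follows. Replace the URL with the ingress value for your own cluster.

export DEPLOYMENT_BASE_URL="http://nemo.test"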

Deployment Methods#

You can deploy a NIM in the following ways:

  1. Direct Deployment Using the NeMo Deployment Management API

  2. Deployment with Pre-defined Configurations

Direct Deployment Using the NeMo Deployment Management API#

To deploy a NIM, submit a POST request to the v1/deployment/model-deployments API as shown in the following example.

Deploy Meta Llama 3.1 8B Instruct NIM from NGC
curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "name": "llama-3.1-8b-instruct",
      "namespace": "meta",
      "config": {
         "model": "meta/llama-3.1-8b-instruct",
         "nim_deployment": {
            "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
            "image_tag": "1.8",
            "pvc_size":   "25Gi",
            "gpu":       1,
            "additional_envs": {
               "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
            }
         }
      }
   }'
Deploy Llama Nemotron Nano from Hugging Face with the NVIDIA Multi-LLM NIM

The model parameter should match the Hugging Face model name. The NIM_MODEL_NAME environment variable must be set with the same Hugging Face model name, prefixed with hf://. Your Hugging Face API token must be passed in the hf_token field. The token will be stored as a Kubernetes secret.

curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments" \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
  -d '{
    "name": "Llama-3.1-Nemotron-Nano-4B-v1.1",
    "namespace": "nvidia",
    "hf_token": "<your-hf-token>",
    "config": {
      "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
      "nim_deployment": {
        "image_name": "nvcr.io/nim/nvidia/llm-nim",
        "image_tag": "1.13.1",
        "gpu": 1,
        "additional_envs": {
          "NIM_MODEL_NAME": "hf://nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
          "NIM_GUIDED_DECODING_BACKEND": "outlines"
        }
      }
    }
  }'

After deployment, you can access the model under the name specified in config.model (for example, meta/llama-3.1-8b-instruct).

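To confirm that a deployment is ready, you can read it back by namespace and deployment name. The following is a minimal sketch that assumes the API supports GET requests on the same model-deployments path:

curl "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" \
   -H 'accept: application/json'
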
Deployment with Pre-defined Configurations#

You can create and manage NIM deployment configurations separately by using the v1/deployment/configs API, and then reference a configuration by name in the request body of the v1/deployment/model-deployments API.

This method is useful for:

  • Deploying the same NIM with different configurations.

  • Deploying multiple NIMs with the same configuration.

The following procedure shows how to set up a pre-defined configuration and deploy it.

  1. Create the deployment configurations. The first two examples define NIM deployment configurations; the last two define external endpoint configurations for OpenAI and build.nvidia.com (integrate.api.nvidia.com).

    curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/configs" \
       --header 'Content-Type: application/json' \
       --data '{
          "name": "llama-3.1-8b-instruct",
          "namespace": "meta",
          "config": {
             "model": "meta/llama-3.1-8b-instruct",
             "nim_deployment": {
                "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                "image_tag": "1.8",
                "pvc_size":   "25Gi",
                "gpu":       1,
                "additional_envs": {
                   "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
                }
             }
          }
       }'
    
    curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/configs" \
       --header 'Content-Type: application/json' \
       --data '{
          "name": "Llama-3.1-Nemotron-Nano-4B-v1.1",
          "namespace": "nvidia",
          "config": {
             "model": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
             "nim_deployment": {
                "image_name": "nvcr.io/nim/nvidia/llm-nim",
                "image_tag": "1.13.1",
                "gpu": 1,
                "additional_envs": {
                   "NIM_MODEL_NAME": "hf://nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
                   "NIM_GUIDED_DECODING_BACKEND": "outlines"
                }
             }
          }
       }'
    
    curl -X POST "${DEPLOYMENT_BASE_URL}/v1/deployment/configs" \
       --header 'Content-Type: application/json' \
       --data '{
             "name": "chatgpt",
             "namespace": "openai",
             "external_endpoint":{
                "host_url": "https://api.openai.com",
                "api_key": "${OPENAI_API_KEY}",
                "enabled_models" : ["gpt-3.5-turbo"]
             }
          }'
    
    curl -X PUT "${DEPLOYMENT_BASE_URL}/v1/deployment/configs" \
       --header 'Content-Type: application/json' \
       --data '{
             "name": "integrate",
             "namespace": "nvidia",
             "external_endpoint":{
                "host_url": "https://integrate.api.nvidia.com",
                "api_key": "${NVIDIA_INTEGRATE_API_KEY}",
                "enabled_models" : ["meta/llama-3.1-405b-instruct"]
             }
          }'
    
  2. Deploy. After creating a configuration, deploy the NIM or external endpoint by referencing the configuration as <namespace>/<config-name> in the config field:

    curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments" \
       --header 'Content-Type: application/json' \
       --data '{
          "name": "llama-3.1-8b-instruct",
          "namespace": "meta",
          "config": "meta/llama-3.1-8b-instruct"
    }'
    
    curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments" \
       --header 'Content-Type: application/json' \
       --data '{
          "name": "Llama-3.1-Nemotron-Nano-4B-v1.1",
          "namespace": "nvidia",
          "hf_token": "<your-hf-token>",
          "config": "nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1"
    }'
    
    curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments" \
       --header 'Content-Type: application/json' \
       --data '{
          "name": "chatgpt",
          "namespace": "openai",
          "config": "openai/chatgpt"
    }'
    
    curl --location "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments" \
       --header 'Content-Type: application/json' \
       --data '{
          "name": "integrate",
          "namespace": "nvidia",
          "config": "nvidia/integrate"
    }'
    

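To remove a deployment that you no longer need, you can delete it by namespace and deployment name. The following is a minimal sketch that assumes the API supports DELETE requests on the same model-deployments path:

curl -X DELETE "${DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" \
   -H 'accept: application/json'
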
Deploying a NIM microservice with Helm#

Use the procedure below to deploy NIM microservices with the Helm chart. Choose this option if you don’t use the NeMo Deployment Management microservice.

Prerequisites#

  • You have the required RBAC permission to deploy resources with helm install on your cluster.

  • You have an NGC API key exported to an NGC_API_KEY environment variable. You use it to fetch the Helm chart and, if needed, to create the image pull secret shown after this list.

  • You have helm installed on your machine.

  • You know the internal Kubernetes DNS entry for the NeMo Entity Store installation (the default is http://nemo-entity-store:8000).

  • You know the namespace where the NeMo microservice platform is installed.

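The sample values.yaml in the deployment steps references an image pull secret named nvcrimagepullsecret. If your platform installation has not already created it, the following is a minimal sketch for creating it from your NGC API key; replace [the nmp namespace] with the namespace where the NeMo microservices platform is installed.

kubectl create secret docker-registry nvcrimagepullsecret \
   --docker-server=nvcr.io \
   --docker-username='$oauthtoken' \
   --docker-password="${NGC_API_KEY}" \
   -n [the nmp namespace]
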
Deployment Steps#

  1. Fetch the NIM for LLMs Helm chart:

    helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.10.0-rc4.tgz --username='$oauthtoken' --password=${NGC_API_KEY}
    
  2. Define the values.yaml for the NIM Helm chart. The following is a sample file to deploy Llama-3_3-Nemotron-Super-49B-v1.

    image:
        # These values set what NIM image to deploy
        repository: nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1
        tag: 1.8.6
    imagePullSecrets:
        - name: nvcrimagepullsecret
    
    # This section is required for the NIM to be usable with NIM Proxy
    service:
        labels:
            app.nvidia.com/nim-type: "inference"
    
    # This section is required if the NIM needs more than 1 GPU
    resources:
      limits:
        nvidia.com/gpu: 4
      requests:
        nvidia.com/gpu: 4
    
    env:
        - name: NIM_SERVED_MODEL_NAME
          value: "llama-3.3-nemotron-super-49b-v1"
        - name: NIM_MODEL_NAME
          value: llama-3.3-nemotron-super-49b-v1
        - name: NIM_GUIDED_DECODING_BACKEND
          value: fast_outlines
    

    For many NIM microservices, you can include NIM_PEFT_SOURCE and NIM_PEFT_REFRESH_INTERVAL in the env section so that the NIM automatically pulls LoRA adapters from NeMo Data Store. However, Llama-3_3-Nemotron-Super-49B-v1 and Llama-3.1-Nemotron-Nano-8B-v1 do not support LoRA adapters in release 1.8.6, so these variables are omitted from the sample above. For a model that supports LoRA adapters, the env section would look like the following:

        - name: NIM_SERVED_MODEL_NAME
          value: "nvidia/llama-3.3-nemotron-super-49b-v1"
        - name: NIM_MODEL_NAME
          value: "nvidia/llama-3.3-nemotron-super-49b-v1"
        - name: NIM_GUIDED_DECODING_BACKEND
          value: fast_outlines
        - name: NIM_PEFT_SOURCE
          value: http://nemo-entity-store:8000 # Replace with your Entity Store's internal DNS entry and port.
        - name: NIM_PEFT_REFRESH_INTERVAL
          value: "30" # Kubernetes environment variable values must be strings.
    
  3. Run the following command to install the NIM microservice:

    helm install -n [the nmp namespace] my-nim ./nim-llm-1.10.0-rc4.tgz -f values.yaml
    
  4. To uninstall the NIM microservice, run:

     helm uninstall -n [the nmp namespace] my-nim      
    

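After step 3 completes and the NIM pod is running, you can verify the deployment by port-forwarding the release service and sending an OpenAI-compatible chat completions request. This is a minimal sketch that assumes the chart exposes the NIM on port 8000 under the service name my-nim and that the model is served under the NIM_SERVED_MODEL_NAME value from the sample values.yaml:

kubectl port-forward -n [the nmp namespace] svc/my-nim 8000:8000 &

curl http://localhost:8000/v1/chat/completions \
   -H 'Content-Type: application/json' \
   -d '{
      "model": "llama-3.3-nemotron-super-49b-v1",
      "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
      "max_tokens": 64
   }'
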
For more information, refer to Deploy NIM Microservice with Helm in the NVIDIA NIM for LLMs documentation.