Using External Endpoints#

You may want to use an external endpoint instead of deploying the corresponding resources locally. Follow the steps below and update the Docker Compose configuration before deployment.

Remote LLM Endpoint#

The default Docker Compose deployment launches the Llama 3.1 70B NIM as the LLM. Depending on your needs, you can switch to a different LLM by adjusting the configuration.

  1. Open the config.yaml file

    Update the model and base_url fields as needed.

  2. Change the LLMs in config.yaml

    By default, the configuration looks like the following:

    summarization:
       enable: true
       method: "batch"
       llm:
          model: "meta/llama-3.1-70b-instruct"
          base_url: "https://integrate.api.nvidia.com/v1"
          max_tokens: 2048
          temperature: 0.2
          top_p: 0.7
    
       embedding:
          model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
          base_url: "https://integrate.api.nvidia.com/v1"
    
       params:
          batch_size: 5
          batch_max_concurrency: 20
    
       prompts:
          caption: <caption_value>
          caption_summarization: <caption_summarization_value>
          summary_aggregation: <summary_aggregation_value>
    
    chat:
       rag: graph-rag # graph-rag or vector-rag
       params:
          batch_size: 1
          top_k: 5
    
       llm:
          model: "meta/llama-3.1-70b-instruct"
          base_url: "https://integrate.api.nvidia.com/v1"
          max_tokens: 2048
          temperature: 0.2
          top_p: 0.7
    
       embedding:
          model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
          base_url: "https://integrate.api.nvidia.com/v1"
    
       reranker:
          model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
          base_url: "https://integrate.api.nvidia.com/v1"
    
       notification:
          enable: true
          endpoint: "http://127.0.0.1:60000/via-alert-callback"
          llm:
             model: "meta/llama-3.1-70b-instruct"
             base_url: "https://integrate.api.nvidia.com/v1"
             max_tokens: 2048
             temperature: 0.2
             top_p: 0.7
    

    Change the model and base_url to point to the new LLM. A quick way to sanity-check the chosen endpoint is shown after the examples below.

    • Examples:

      • Using GPT-4o model

        summarization:
           llm:
              model: "gpt-4o"
              base_url: "https://api.openai.com/v1"

           ...

        chat:
           llm:
              model: "gpt-4o"
              base_url: "https://api.openai.com/v1"

           ...

           notification:
              llm:
                 model: "gpt-4o"
                 base_url: "https://api.openai.com/v1"

        Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file, as shown below.

        models:
           - type: main
             engine: openai
             model: gpt-4o
             parameters:
                base_url: https://api.openai.com/v1
        
      • Using deepseek-r1 model

        summarization:
           llm:
              model: "deepseek-ai/deepseek-r1"
              base_url: "https://integrate.api.nvidia.com/v1"

           ...

        chat:
           llm:
              model: "deepseek-ai/deepseek-r1"
              base_url: "https://integrate.api.nvidia.com/v1"

           ...

           notification:
              llm:
                 model: "deepseek-ai/deepseek-r1"
                 base_url: "https://integrate.api.nvidia.com/v1"

        Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file, as shown below.

        models:
           - type: main
             engine: nim
             model: deepseek-ai/deepseek-r1
             parameters:
                base_url: https://integrate.api.nvidia.com/v1
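
    Before restarting the deployment, you can optionally sanity-check the new endpoint. The request below is a minimal sketch that assumes the endpoint is OpenAI-compatible (true for both integrate.api.nvidia.com and api.openai.com) and that the matching API key is already exported in your shell; substitute the model and base_url you configured above:

    curl -s https://integrate.api.nvidia.com/v1/chat/completions \
      -H "Authorization: Bearer $NVIDIA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model": "deepseek-ai/deepseek-r1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'

    A JSON response containing a choices field indicates that the endpoint and API key are working.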
        
  3. Set NVIDIA_API_KEY

    When using endpoints from build.nvidia.com, set the NVIDIA_API_KEY environment variable in the .env file. Refer to Using NIMs from build.nvidia.com for obtaining the API key.
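
    For example, the .env file would contain a line similar to the following (the value is a placeholder; replace it with your actual key):

    NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxx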

Remote Embedding and Reranker Endpoint#

A remote embedding and reranker endpoint can be used by updating the config.yaml file as shown below.

embedding:
   model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
   base_url: "https://integrate.api.nvidia.com/v1"

...

reranker:
   model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
   base_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"

Similarly, change the engine, model, and base_url to the remote embedding endpoint in the guardrails/config.yml file, as shown below.

models:
   - type: embeddings
     engine: nim
     model: nvidia/llama-3.2-nv-embedqa-1b-v2
     parameters:
         base_url: "https://integrate.api.nvidia.com/v1"
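
To confirm that the remote embedding endpoint is reachable before deployment, you can query it directly. The request below is a minimal sketch against the build.nvidia.com endpoint and assumes NVIDIA_API_KEY is exported in the shell; the input_type field ("query" or "passage") is specific to the NVIDIA retrieval embedding NIMs:

curl -s https://integrate.api.nvidia.com/v1/embeddings \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/llama-3.2-nv-embedqa-1b-v2", "input": ["sanity check"], "input_type": "query"}'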

Remote RIVA ASR Endpoint#

To use a remote Riva ASR endpoint, set the following environment variables in the .env file:

export RIVA_ASR_SERVER_URI="grpc.nvcf.nvidia.com"
export RIVA_ASR_GRPC_PORT=443
export RIVA_ASR_SERVER_IS_NIM=true
export RIVA_ASR_SERVER_USE_SSL=true
export RIVA_ASR_SERVER_API_KEY=nvapi-***
export RIVA_ASR_SERVER_FUNC_ID="d8dd4e9b-fbf5-4fb0-9dba-8cf436c8d965"

Set the RIVA_ASR_SERVER_API_KEY environment variable in the .env file as shown in Using Riva ASR NIM from build.nvidia.com.

For more details about these environment variables, refer to the VSS Deployment-Time Configuration Glossary section.