Using External Endpoints#

You might want to use an external endpoint so that you do not need to deploy the corresponding resources locally. Follow the steps below and update the Docker Compose configuration before deployment.

Remote LLM Endpoint#

The default Docker Compose deployment launches the Llama 3.1 70B NIM as the LLM; depending on your needs, you can switch to a different LLM by adjusting the configuration as described below.

  1. Open the config.yaml file

    Update the model and base_url parameters as needed.

  2. Change LLMs in config.yaml

    By default, the file looks like the following:

    tools:
       graph_db:
          type: neo4j
          params:
             host: !ENV ${GRAPH_DB_HOST}
             port: !ENV ${GRAPH_DB_BOLT_PORT}
             username: !ENV ${GRAPH_DB_USERNAME}
             password: !ENV ${GRAPH_DB_PASSWORD}
          tools:
             embedding: nvidia_embedding
    
       vector_db:
          type: milvus
          params:
             host: !ENV ${MILVUS_DB_HOST}
             port: !ENV ${MILVUS_DB_GRPC_PORT}
          tools:
             embedding: nvidia_embedding
    
       chat_llm:
          type: llm
          params:
             model: "nvdev/meta/llama-3.1-70b-instruct"
             base_url: "https://integrate.api.nvidia.com/v1"
             max_tokens: 2048
             temperature: 0.2
             top_p: 0.7
             api_key: !ENV ${NVIDIA_API_KEY}
    
       summarization_llm:
          type: llm
          params:
             model: "nvdev/meta/llama-3.1-70b-instruct"
             base_url: "https://integrate.api.nvidia.com/v1"
             max_tokens: 2048
             temperature: 0.2
             top_p: 0.7
             api_key: !ENV ${NVIDIA_API_KEY}
    
       notification_llm:
          type: llm
          params:
             model: "nvdev/meta/llama-3.1-70b-instruct"
             base_url: "https://integrate.api.nvidia.com/v1"
             max_tokens: 2048
             temperature: 0.2
             top_p: 0.7
             api_key: !ENV ${NVIDIA_API_KEY}
    
       nvidia_embedding:
          type: embedding
          params:
             model: "nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2"
             base_url: "https://integrate.api.nvidia.com/v1"
             api_key: !ENV ${NVIDIA_API_KEY}
    
       nvidia_reranker:
          type: reranker
          params:
             model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
             base_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
             api_key: !ENV ${NVIDIA_API_KEY}
    
       notification_tool:
          type: alert_sse_notifier
          params:
             endpoint: "http://127.0.0.1:60000/via-alert-callback"
    
    functions:
       summarization:
          type: batch_summarization
          params:
             batch_size: 6 # Use even batch size if speech recognition enabled.
             batch_max_concurrency: 20
             prompts:
                caption: "Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, forklift stuck, etc. Start and end each sentence with a time stamp."
                caption_summarization: "You should summarize the following events of a warehouse in the format start_time:end_time:caption. For start_time and end_time use . to seperate seconds, minutes, hours. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don't return anything else except the bullet points."
                summary_aggregation: "You are a warehouse monitoring system. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. If the event_description is the same as another event_description, aggregate the captions in the format start_time1:end_time1,...,start_timek:end_timek:event_description. If any two adjacent end_time1 and start_time2 is within a few tenths of a second, merge the captions in the format start_time1:end_time2. The output should only contain bullet points.  Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage and Unauthorized Personnel"
          tools:
             llm: summarization_llm
             db: graph_db
    
       ingestion_function:
          type: graph_ingestion
          params:
             batch_size: 1
             image: false
             cot: false
             top_k: 5
          tools:
             llm: chat_llm
             db: graph_db
    
       retriever_function:
          type: graph_retrieval
          params:
             batch_size: 1
             image: false
             cot: false
             top_k: 5
          tools:
             llm: chat_llm
             db: graph_db
    
       notification:
          type: notification
          params:
             events: []
          tools:
             llm: chat_llm
             notification_tool: notification_tool
    
    context_manager:
       functions:
          - summarization
          - ingestion_function
          - retriever_function
          - notification
    

    Change the model and base_url to the new LLM in the respective LLM tool section of the config.yaml file:

    • Examples:

      • Using GPT-4o model for chat_llm

        tools:
           chat_llm:
              params:
                 model: "gpt-4o"
                 base_url: "https://api.openai.com/v1"
                 api_key: !ENV ${OPENAI_API_KEY}
        

        Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file:

        models:
           - type: main
             engine: openai
             model: gpt-4o
             parameters:
                base_url: https://api.openai.com/v1
        
      • Using deepseek-r1 model for chat_llm

        tools:
           chat_llm:
              params:
                 model: "deepseek-ai/deepseek-r1"
                 base_url: "https://integrate.api.nvidia.com/v1"
                 api_key: !ENV ${NVIDIA_API_KEY}
        

        Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file:

        models:
           - type: main
             engine: nim
             model: deepseek-ai/deepseek-r1
             parameters:
                base_url: https://integrate.api.nvidia.com/v1
        
  3. Set NVIDIA_API_KEY

    When using endpoints from build.nvidia.com, you need to set the NVIDIA_API_KEY environment variable in the .env file. Refer to Using NIMs from build.nvidia.com for obtaining the API key.
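
    The NVIDIA_API_KEY variable name matches the !ENV references in config.yaml. As a minimal sketch (the key value, model name, and curl request below are illustrative assumptions rather than part of the VSS deployment), the .env entry and an optional check against the OpenAI-compatible chat completions endpoint could look like this:

    # .env (placeholder value; use your actual key from build.nvidia.com)
    NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxx
    # If you switched chat_llm to the GPT-4o example above, also set OPENAI_API_KEY here.

    # Optional sanity check, assuming the base_url from config.yaml is OpenAI-compatible.
    # Replace the model name with the one configured in config.yaml.
    curl -s https://integrate.api.nvidia.com/v1/chat/completions \
      -H "Authorization: Bearer $NVIDIA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model": "meta/llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'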

Remote Embedding and Reranker Endpoint#

A remote embedding and reranker endpoint can be used by updating the config.yaml file:

tools:
   nvidia_embedding:
      type: embedding
      params:
         model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
         base_url: "https://integrate.api.nvidia.com/v1"
         api_key: !ENV ${NVIDIA_API_KEY}

   nvidia_reranker:
      type: reranker
      params:
         model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
         base_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
         api_key: !ENV ${NVIDIA_API_KEY}

Similarly, change the engine, model, and base_url to the remote embedding endpoint in the guardrails/config.yml file:

models:
   - type: embeddings
     engine: nim
     model: nvidia/llama-3.2-nv-embedqa-1b-v2
     parameters:
         base_url: "https://integrate.api.nvidia.com/v1"

Remote RIVA ASR Endpoint#

To use a remote RIVA ASR endpoint, set the following environment variables in the .env file:

export RIVA_ASR_SERVER_URI="grpc.nvcf.nvidia.com"
export RIVA_ASR_GRPC_PORT=443
export RIVA_ASR_SERVER_IS_NIM=true
export RIVA_ASR_SERVER_USE_SSL=true
export RIVA_ASR_SERVER_API_KEY=nvapi-***
export RIVA_ASR_SERVER_FUNC_ID="d8dd4e9b-fbf5-4fb0-9dba-8cf436c8d965"

Set the RIVA_ASR_SERVER_API_KEY environment variable in the .env file as shown in Using Riva ASR NIM from build.nvidia.com.

For more details about these environment variables, refer to the VSS Deployment-Time Configuration Glossary section.