Using External Endpoints#

You might want to use an external endpoint so that you do not need to deploy the corresponding resources locally. Follow the steps below and update the Docker Compose configuration before deployment.

Remote LLM Endpoint#

The default Docker Compose deployment launches the Llama 3.1 70B NIM as the LLM; depending on your needs, you can switch to a different LLM by adjusting the configuration as described below.

  1. Open the config.yaml file

    Update the model and base_url parameters as needed.

  2. Change LLMs in config.yaml

    By default, the file looks like the following:

    tools:
       graph_db:
          type: neo4j
          params:
             host: !ENV ${GRAPH_DB_HOST}
             port: !ENV ${GRAPH_DB_BOLT_PORT}
             username: !ENV ${GRAPH_DB_USERNAME}
             password: !ENV ${GRAPH_DB_PASSWORD}
          tools:
             embedding: nvidia_embedding
    
       vector_db:
          type: milvus
          params:
             host: !ENV ${MILVUS_DB_HOST}
             port: !ENV ${MILVUS_DB_GRPC_PORT}
          tools:
             embedding: nvidia_embedding
    
       chat_llm:
          type: llm
          params:
             model: "nvdev/meta/llama-3.1-70b-instruct"
             base_url: "https://integrate.api.nvidia.com/v1"
             max_tokens: 2048
             temperature: 0.2
             top_p: 0.7
             api_key: !ENV ${NVIDIA_API_KEY}
    
       summarization_llm:
          type: llm
          params:
             model: "nvdev/meta/llama-3.1-70b-instruct"
             base_url: "https://integrate.api.nvidia.com/v1"
             max_tokens: 2048
             temperature: 0.2
             top_p: 0.7
             api_key: !ENV ${NVIDIA_API_KEY}
    
       notification_llm:
          type: llm
          params:
             model: "nvdev/meta/llama-3.1-70b-instruct"
             base_url: "https://integrate.api.nvidia.com/v1"
             max_tokens: 2048
             temperature: 0.2
             top_p: 0.7
             api_key: !ENV ${NVIDIA_API_KEY}
    
       nvidia_embedding:
          type: embedding
          params:
             model: "nvdev/nvidia/llama-3.2-nv-embedqa-1b-v2"
             base_url: "https://integrate.api.nvidia.com/v1"
             api_key: !ENV ${NVIDIA_API_KEY}
    
       nvidia_reranker:
          type: reranker
          params:
             model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
             base_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
             api_key: !ENV ${NVIDIA_API_KEY}
    
       notification_tool:
          type: alert_sse_notifier
          params:
             endpoint: "http://127.0.0.1:60000/via-alert-callback"
    
    functions:
       summarization:
          type: batch_summarization
          params:
             batch_size: 6 # Use even batch size if speech recognition enabled.
             batch_max_concurrency: 20
             prompts:
                caption: "Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, forklift stuck, etc. Start and end each sentence with a time stamp."
                caption_summarization: "You should summarize the following events of a warehouse in the format start_time:end_time:caption. For start_time and end_time use . to seperate seconds, minutes, hours. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don't return anything else except the bullet points."
                summary_aggregation: "You are a warehouse monitoring system. Given the caption in the form start_time:end_time: caption, Aggregate the following captions in the format start_time:end_time:event_description. If the event_description is the same as another event_description, aggregate the captions in the format start_time1:end_time1,...,start_timek:end_timek:event_description. If any two adjacent end_time1 and start_time2 is within a few tenths of a second, merge the captions in the format start_time1:end_time2. The output should only contain bullet points.  Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage and Unauthorized Personnel"
          tools:
             llm: summarization_llm
             db: graph_db
    
       ingestion_function:
          type: graph_ingestion
          params:
             batch_size: 1
             image: false
             cot: false
             top_k: 5
          tools:
             llm: chat_llm
             db: graph_db
    
       retriever_function:
          type: graph_retrieval
          params:
             batch_size: 1
             image: false
             cot: false
             top_k: 5
          tools:
             llm: chat_llm
             db: graph_db
    
       notification:
          type: notification
          params:
             events: []
          tools:
             llm: chat_llm
             notification_tool: notification_tool
    
    context_manager:
       functions:
          - summarization
          - ingestion_function
          - retriever_function
          - notification
    

    Change the model and base_url to the new LLM in the respective LLM tool section of the config.yaml file:

    • Examples:

      • Using GPT-4o model for chat_llm

        tools:
           chat_llm:
              params:
                 model: "gpt-4o"
                 base_url: "https://api.openai.com/v1"
                 api_key: !ENV ${OPENAI_API_KEY}
        

        Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file:

        models:
           - type: main
             engine: openai
             model: gpt-4o
             parameters:
                base_url: https://api.openai.com/v1
        
      • Using deepseek-r1 model for chat_llm

        tools:
           chat_llm:
              params:
                 model: "deepseek-ai/deepseek-r1"
                 base_url: "https://integrate.api.nvidia.com/v1"
                 api_key: !ENV ${NVIDIA_API_KEY}
        

        Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file:

        models:
           - type: main
             engine: nim
             model: deepseek-ai/deepseek-r1
             parameters:
                base_url: https://integrate.api.nvidia.com/v1
        
  3. Set NVIDIA_API_KEY

    When using endpoints from build.nvidia.com, you need to set the NVIDIA_API_KEY environment variable in the .env file. Refer to Using NIMs from build.nvidia.com for obtaining the API key.
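
    The NVIDIA_API_KEY variable name matches the !ENV references in config.yaml. As a minimal sketch (the key value, model name, and curl request below are illustrative assumptions rather than part of the VSS deployment), the .env entry and an optional check against the OpenAI-compatible chat completions endpoint could look like this:

    # .env (placeholder value; use your actual key from build.nvidia.com)
    NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxxxxxxxxx
    # If you switched chat_llm to the GPT-4o example above, also set OPENAI_API_KEY here.

    # Optional sanity check, assuming the base_url from config.yaml is OpenAI-compatible.
    # Replace the model name with the one configured in config.yaml.
    curl -s https://integrate.api.nvidia.com/v1/chat/completions \
      -H "Authorization: Bearer $NVIDIA_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model": "meta/llama-3.1-70b-instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 16}'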

Remote Embedding and Reranker Endpoint#

A remote embedding and reranker endpoint can be used by updating the config.yaml file:

tools:
   nvidia_embedding:
      type: embedding
      params:
         model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
         base_url: "https://integrate.api.nvidia.com/v1"
         api_key: !ENV ${NVIDIA_API_KEY}

   nvidia_reranker:
      type: reranker
      params:
         model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
         base_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
         api_key: !ENV ${NVIDIA_API_KEY}

Similarly, change the engine, model, and base_url to the remote embedding endpoint in the guardrails/config.yml file:

models:
   - type: embeddings
     engine: nim
     model: nvidia/llama-3.2-nv-embedqa-1b-v2
     parameters:
         base_url: "https://integrate.api.nvidia.com/v1"

Remote RIVA ASR Endpoint#

To use a remote RIVA ASR endpoint, set the following environment variables in the .env file:

export RIVA_ASR_SERVER_URI="grpc.nvcf.nvidia.com"
export RIVA_ASR_GRPC_PORT=443
export RIVA_ASR_SERVER_IS_NIM=true
export RIVA_ASR_SERVER_USE_SSL=true
export RIVA_ASR_SERVER_API_KEY=nvapi-***
export RIVA_ASR_SERVER_FUNC_ID="d8dd4e9b-fbf5-4fb0-9dba-8cf436c8d965"

Set the RIVA_ASR_SERVER_API_KEY environment variable in the .env file as shown in Using Riva ASR NIM from build.nvidia.com.

For more details about these environment variables, refer to the VSS Deployment-Time Configuration Glossary section.