Using External Endpoints#
You may want to use an external endpoint instead of deploying the corresponding resources locally. Follow the steps below and update the Docker Compose configuration before deployment.
Remote LLM Endpoint#
The default Docker Compose deployment launches the Llama 3.1 70B NIM as the LLM. You can switch to a different LLM, depending on your needs, by adjusting the configuration.
To change the LLMs in config.yaml:

1. Open the config.yaml file.
2. Update model and base_url for each component as needed.

By default, the configuration looks like the following:
```yaml
summarization:
  enable: true
  method: "batch"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"
  params:
    batch_size: 5
    batch_max_concurrency: 20
  prompts:
    caption: <caption_value>
    caption_summarization: <caption_summarization_value>
    summary_aggregation: <summary_aggregation_value>

chat:
  rag: graph-rag # graph-rag or vector-rag
  params:
    batch_size: 1
    top_k: 5
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"
  reranker:
    model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"

notification:
  enable: true
  endpoint: "http://127.0.0.1:60000/via-alert-callback"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
```
Change model and base_url to point to the new LLM.
Examples:
Using the GPT-4o model:
```yaml
summarization:
  llm:
    model: "gpt-4o"
    base_url: "https://api.openai.com/v1"
    ...
chat:
  llm:
    model: "gpt-4o"
    base_url: "https://api.openai.com/v1"
    ...
notification:
  llm:
    model: "gpt-4o"
    base_url: "https://api.openai.com/v1"
```
Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file as shown below:
```yaml
models:
  - type: main
    engine: openai
    model: gpt-4o
    parameters:
      base_url: https://api.openai.com/v1
```
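When using an OpenAI endpoint, the containers also need an OpenAI API key. This section does not specify the variable name, so the sketch below assumes the standard OPENAI_API_KEY variable read by the OpenAI SDK; confirm the exact variable your deployment expects.

```sh
# Assumption: the deployment picks up the standard OpenAI SDK variable.
# Add to the .env file; the sk-... value is a placeholder for your key.
OPENAI_API_KEY=sk-...
```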
Using the deepseek-r1 model:
```yaml
summarization:
  llm:
    model: "deepseek-ai/deepseek-r1"
    base_url: "https://integrate.api.nvidia.com/v1"
    ...
chat:
  llm:
    model: "deepseek-ai/deepseek-r1"
    base_url: "https://integrate.api.nvidia.com/v1"
    ...
notification:
  llm:
    model: "deepseek-ai/deepseek-r1"
    base_url: "https://integrate.api.nvidia.com/v1"
```
Similarly, change the engine, model, and base_url to the new LLM in the guardrails/config.yml file as shown below:
```yaml
models:
  - type: main
    engine: nim
    model: deepseek-ai/deepseek-r1
    parameters:
      base_url: https://integrate.api.nvidia.com/v1
```
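Before redeploying, it can help to verify that the chosen model is reachable at the configured base_url. The following is a minimal sketch assuming the OpenAI-compatible /chat/completions route exposed by the endpoints above; swap in your own model, URL, and API key.

```sh
# Reachability check (sketch): request one short completion from the
# configured model. Assumes NVIDIA_API_KEY is set (see the next step).
curl -s "https://integrate.api.nvidia.com/v1/chat/completions" \
  -H "Authorization: Bearer $NVIDIA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/deepseek-r1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 16
      }'
```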
Set NVIDIA_API_KEY

When using endpoints from build.nvidia.com, you must set the NVIDIA_API_KEY environment variable in the .env file. Refer to Using NIMs from build.nvidia.com for details on obtaining the API key.
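For example, the entry in the .env file might look like the following (nvapi-*** is a placeholder for your actual key):

```sh
# .env — replace the placeholder with your key from build.nvidia.com
NVIDIA_API_KEY=nvapi-***
```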
Remote Embedding and Reranker Endpoint#
A remote embedding and reranker endpoint can be used by updating the config.yaml file as shown below:
```yaml
embedding:
  model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
  base_url: "https://integrate.api.nvidia.com/v1"
...
reranker:
  model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
  base_url: "https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-3_2-nv-rerankqa-1b-v2/reranking"
```
Similarly, change the engine, model, and base_url to the remote embedding endpoint in the guardrails/config.yml file as shown below:
```yaml
models:
  - type: embeddings
    engine: nim
    model: nvidia/llama-3.2-nv-embedqa-1b-v2
    parameters:
      base_url: "https://integrate.api.nvidia.com/v1"
```
Remote Riva ASR Endpoint#
To use a remote Riva ASR endpoint, set the following environment variables in the .env file:
```sh
export RIVA_ASR_SERVER_URI="grpc.nvcf.nvidia.com"
export RIVA_ASR_GRPC_PORT=443
export RIVA_ASR_SERVER_IS_NIM=true
export RIVA_ASR_SERVER_USE_SSL=true
export RIVA_ASR_SERVER_API_KEY=nvapi-***
export RIVA_ASR_SERVER_FUNC_ID="d8dd4e9b-fbf5-4fb0-9dba-8cf436c8d965"
```
Set the RIVA_ASR_SERVER_API_KEY environment variable in the .env file as shown in Using Riva ASR NIM from build.nvidia.com.
For more details about these environment variables, refer to the VSS Deployment-Time Configuration Glossary section.