Configure the VLM#
This page describes how to configure the VLM for the VSS agent workflow, both local (a NIM running on the same machine as the agent) and remote (a VLM served elsewhere). You can run the LLM and VLM locally or remotely, in any combination that suits your setup.
For remote VLMs, you can point the agent at any OpenAI-compatible API for video analysis during report generation, video understanding, long-video summarization, and more. Remote endpoints that are already up and running also enable rapid startup, since no local model loading is required.
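"OpenAI compatible" here means the endpoint serves the `/v1/chat/completions` API with image inputs. As a rough sketch (the endpoint, key, and model name below are placeholders, and the exact payload the agent sends may differ), a VLM request carrying a video frame looks like:

```shell
# Placeholder endpoint, key, and model name. Video frames are sent as
# image_url content parts alongside the text prompt, following the
# OpenAI chat-completions schema.
curl "$VLM_ENDPOINT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL_NAME>",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what happens in this frame."},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<BASE64_FRAME>"}}
      ]
    }]
  }'
```

Any endpoint that accepts requests of this shape can be used with the remote-VLM deployment options below.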
Local VLM customization#
When deploying with a local VLM, the agent workflow uses a default NIM endpoint. You can customize it as follows:
- **Default model:** `nvidia/cosmos-reason2-8b` (configured as a local NIM endpoint).
- **Custom NIM env file:** At deploy time, pass `--vlm-env-file` with the path (absolute or relative to the current directory) to an env file that overrides NIM settings. See NIM configuration settings in Prerequisites.
- **Different local model:** Pass `--vlm` with one of the supported model names:
  - `nvidia/cosmos-reason1-7b`
  - `Qwen/Qwen3-VL-8B-Instruct`
- **Verified models:** Only the default `nvidia/cosmos-reason2-8b` is verified for local deployment. For other models, ensure GPU memory meets requirements and refer to the VLM NIM documentation.
- **Custom VLM weights:** Set `VLM_CUSTOM_WEIGHTS` to the path of a directory containing custom VLM weights. See VLM Custom Weights in Prerequisites for download instructions.
- **GPU assignment:** Use `--vlm-device-id` to pin the local VLM to a specific GPU (for example, when sharing the machine with a local LLM). See the quickstart deploy tab-set for examples.
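The flags above can be combined in a single deploy command. A sketch, using the `scripts/dev-profile.sh` invocation shown elsewhere on this page (the device IDs and the `./my-vlm-nim.env` file are illustrative, not part of the blueprint):

```shell
# Local LLM on GPU 0, local VLM on GPU 1, with a non-default model and a
# hypothetical custom NIM env file; see NIM configuration settings in
# Prerequisites for the variables such a file may override.
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --vlm Qwen/Qwen3-VL-8B-Instruct \
    --vlm-device-id 1 \
    --vlm-env-file ./my-vlm-nim.env
```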
Setup assumptions for remote VLMs#
Before you start, make sure you have completed the prerequisites section. It is also recommended to review the parameters described in the quickstart guide.
Register an account at NVIDIA NGC. Get an NGC API key.
Similarly, log in to the NVIDIA NIM API catalog. Get a NIM catalog API key.
Ensure you are logged in to the NGC registry for docker to pull container images:
```shell
docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_NGC_API_KEY_HERE>
```
Ensure you have `NGC_CLI_API_KEY` set in your environment for the agent workflow to pull NIM containers:

```shell
export NGC_CLI_API_KEY=<PASTE_NGC_API_KEY_HERE>
```
Managed remote VLMs#
Using frontier OpenAI models#
Register an account at the OpenAI API platform. Get an API key.
Choose a compatible model from the list returned by `curl https://api.openai.com/v1/models -H "Authorization: Bearer <PASTE_API_KEY_HERE>"`.

Deploy the agent workflow and point it to the chosen model name. Ensure you use `--vlm-model-type openai`:

```shell
export VLM_ENDPOINT_URL=https://api.openai.com
export OPENAI_API_KEY=<PASTE_API_KEY_HERE>
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm \
    --vlm-model-type openai \
    --vlm gpt-5.2
```
Examples of compatible OpenAI models:
Using other cloud-hosted models#
Reka#
Register an account at the Reka Platform. Get an API key.
Choose a compatible model from the list returned by `curl "https://api.reka.ai/v1/models" -H "X-Api-Key: <PASTE_API_KEY_HERE>"`.

Modify the `met-blueprints/deployments/developer-workflow/dev-profile-base/.env` file to set `VSS_AGENT_CONFIG_FILE` to point to the custom config:

```shell
export VSS_AGENT_CONFIG_FILE=./deployments/developer-workflow/dev-profile-base/vss-agent/configs/config_reka.yml
```
Deploy the blueprint and point it to the chosen model name. Ensure you use `--vlm-model-type reka`:

```shell
export VLM_ENDPOINT_URL=https://api.reka.ai
export OPENAI_API_KEY=<PASTE_API_KEY_HERE> # Uses openai-compatible API client
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm \
    --vlm-model-type reka \
    --vlm reka-flash
```
Self-hosted remote VLMs#
Using downloadable NIM Containers#
Check GPU requirements to ensure the machine that will host the VLM NIM container is supported.
Set `NGC_API_KEY` in your environment to be passed to the containers so they can pull model weights:

```shell
export NGC_API_KEY=<PASTE_NGC_API_KEY_HERE>
```
On the compatible remote machine within your own infrastructure, start the VLM NIM container and ensure it is reachable by the VSS agent workflow machine:
```shell
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod -R 777 "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus 'device=0' \
    --ipc host \
    --shm-size=16gb \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -p 30082:8000 \
    nvcr.io/nim/nvidia/cosmos-reason1-7b:latest
```
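Before pointing the agent at the container, you can confirm from the agent machine that the NIM has finished loading the model. NIM containers expose health endpoints on the served port; the host and port below assume the `docker run` port mapping above (replace `<remote-host>` with your machine's address):

```shell
# Prints the HTTP status of the NIM readiness endpoint; 200 indicates the
# model is loaded and ready to serve requests.
curl -s -o /dev/null -w "%{http_code}\n" "http://<remote-host>:30082/v1/health/ready"
```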
Note
If you want to run several models (LLMs, VLMs) on the same machine, make sure to assign each one a different GPU using the `--gpus 'device=1'` flag. You can assign multiple GPUs with `--gpus 'device=1,2,3'`, and so on.
Start the VSS agent workflow and point it to the remote VLM NIM container:
```shell
export VLM_ENDPOINT_URL=http://<remote-host>:30082
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm
```
Examples of downloadable NIM containers:
Using vLLM Container#
You can deploy an open-source VLM from Hugging Face or any other source using the NVIDIA vLLM container. Check the NVIDIA vLLM container documentation for an introduction to vLLM and the vLLM container.
Check GPU requirements to ensure the machine that will host the VLM is supported.
On the compatible remote machine within your own infrastructure, start the NVIDIA vLLM container with a Hugging Face model and ensure it is reachable by the VSS agent workflow machine:
```shell
docker run --gpus 'device=0' -it --rm -p 30082:8000 nvcr.io/nvidia/vllm:26.01-py3 \
    python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-8B-Instruct \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 \
    --port 8000 \
    --max-model-len 65536
```
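Once the server is up, you can check from the agent machine that it is reachable and advertises the model. The vLLM OpenAI-compatible server exposes the standard `/v1/models` listing endpoint (replace `<remote-host>` with your machine's address):

```shell
# Lists the served models as JSON; the output should include an entry
# whose id is Qwen/Qwen3-VL-8B-Instruct.
curl -s "http://<remote-host>:30082/v1/models"
```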
Note
If you want to run several models (LLMs, VLMs) on the same machine, make sure to assign each one a different GPU using the `--gpus 'device=1'` flag. You can assign multiple GPUs with `--gpus 'device=1,2,3'`, and so on.
Note
If you see errors such as `ValueError: To serve at least one request with the model's max seq len (262144), ...`, you may need to use the `--max-model-len` flag to limit the maximum context window (input + output tokens per request) so the model fits on smaller GPUs. While some models support up to 256K tokens, VSS agent requests typically use under 10K-20K tokens. Setting a lower value (e.g. 65536) reduces the GPU memory reserved for the KV cache.

Start the VSS agent workflow and point it to the remote vLLM model:
```shell
export VLM_ENDPOINT_URL=http://<remote-host>:30082
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm
```
Known Issues#
The alert verification workflow is optimized for the Cosmos Reason2 8B model and might not work with other models.