Configure the VLM#
This page describes how to configure the VLM for the VSS agent workflow, both local (a NIM running on the same machine as the agent) and remote (a VLM served elsewhere). You can run the LLM and VLM locally or remotely, in any combination that suits your setup.
For remote VLMs, you can point the agent at any OpenAI-compatible API for video analysis during report generation, video understanding, long-video summarization, and more. Remote endpoints that are already up and running also enable rapid startup, since no local model loading is required.
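"OpenAI compatible" here means the endpoint serves the `/v1/chat/completions` API with image inputs. As a rough sketch (the endpoint, key, and model name below are placeholders, and the exact payload the agent sends may differ), a VLM request carrying a video frame looks like:

```shell
# Placeholder endpoint, key, and model name. Video frames are sent as
# image_url content parts alongside the text prompt, following the
# OpenAI chat-completions schema.
curl "$VLM_ENDPOINT_URL/v1/chat/completions" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<MODEL_NAME>",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Describe what happens in this frame."},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<BASE64_FRAME>"}}
      ]
    }]
  }'
```

Any endpoint that accepts requests of this shape can be used with the remote-VLM deployment options below.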
Local VLM customization#
When deploying with a local VLM, the agent workflow uses a default NIM endpoint. You can customize it as follows:
- **Default model:** `nvidia/cosmos-reason2-8b` (configured as a local NIM endpoint).
- **Custom NIM env file:** At deploy time, pass `--vlm-env-file` with the path (absolute or relative to the current directory) to an env file that overrides NIM settings. See NIM configuration settings in Prerequisites.
- **Different local model:** Pass `--vlm` with one of the supported model names:
  - `nvidia/cosmos-reason1-7b`
  - `Qwen/Qwen3-VL-8B-Instruct`
- **Verified models:** Only the default `nvidia/cosmos-reason2-8b` is verified for local deployment. For other models, ensure GPU memory meets requirements and refer to the VLM NIM documentation.
- **Custom VLM weights:** Set `VLM_CUSTOM_WEIGHTS` to the path of a directory containing custom VLM weights. See VLM Custom Weights in Prerequisites for download instructions.
- **GPU assignment:** Use `--vlm-device-id` to pin the local VLM to a specific GPU (for example, when sharing the machine with a local LLM). See the quickstart deploy tab-set for examples.
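The flags above can be combined in a single deploy command. A sketch, using the `scripts/dev-profile.sh` invocation shown elsewhere on this page (the device IDs and the `./my-vlm-nim.env` file are illustrative, not part of the blueprint):

```shell
# Local LLM on GPU 0, local VLM on GPU 1, with a non-default model and a
# hypothetical custom NIM env file; see NIM configuration settings in
# Prerequisites for the variables such a file may override.
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --vlm Qwen/Qwen3-VL-8B-Instruct \
    --vlm-device-id 1 \
    --vlm-env-file ./my-vlm-nim.env
```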
Setup assumptions for remote VLMs#
Before you start, make sure you have completed the prerequisites section. It is also recommended to review the parameters described in the quickstart guide.
Register an account at NVIDIA NGC. Get an NGC API key.
Similarly, log in to the NVIDIA NIM API catalog. Get a NIM catalog API key.
Ensure you are logged in to the NGC registry for docker to pull container images:
```shell
docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_NGC_API_KEY_HERE>
```
Ensure you have `NGC_CLI_API_KEY` set in your environment for the agent workflow to pull NIM containers:

```shell
export NGC_CLI_API_KEY=<PASTE_NGC_API_KEY_HERE>
```
Managed remote VLMs#
Using frontier OpenAI models#
Register an account at the OpenAI API platform. Get an API key.
Choose a compatible model from the list returned by `curl https://api.openai.com/v1/models -H "Authorization: Bearer <PASTE_API_KEY_HERE>"`.

Deploy the agent workflow and point it to the chosen model name. Ensure you use `--vlm-model-type openai`:

```shell
export VLM_ENDPOINT_URL=https://api.openai.com
export OPENAI_API_KEY=<PASTE_API_KEY_HERE>
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm \
    --vlm-model-type openai \
    --vlm gpt-5.2
```
Examples of compatible OpenAI models:
Using other cloud-hosted models#
Reka#
Register an account at the Reka Platform. Get an API key.
Choose a compatible model from the list returned by `curl "https://api.reka.ai/v1/models" -H "X-Api-Key: <PASTE_API_KEY_HERE>"`.

Modify the `met-blueprints/deployments/developer-workflow/dev-profile-base/.env` file to set `VSS_AGENT_CONFIG_FILE` to point to the custom config:

```shell
export VSS_AGENT_CONFIG_FILE=./deployments/developer-workflow/dev-profile-base/vss-agent/configs/config_reka.yml
```
Deploy the blueprint and point it to the chosen model name. Ensure you use `--vlm-model-type reka`:

```shell
export VLM_ENDPOINT_URL=https://api.reka.ai
export OPENAI_API_KEY=<PASTE_API_KEY_HERE> # Uses openai-compatible API client
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm \
    --vlm-model-type reka \
    --vlm reka-flash
```
Self-hosted remote VLMs#
Using downloadable NIM Containers#
Check GPU requirements to ensure the machine that will host the VLM NIM container is supported.
Set `NGC_API_KEY` in your environment to be passed to the containers so they can pull model weights:

```shell
export NGC_API_KEY=<PASTE_NGC_API_KEY_HERE>
```
On the compatible remote machine within your own infrastructure, start the VLM NIM container and ensure it is reachable by the VSS agent workflow machine:
```shell
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
chmod -R 777 "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus 'device=0' \
    --ipc host \
    --shm-size=16gb \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -p 30082:8000 \
    nvcr.io/nim/nvidia/cosmos-reason1-7b:latest
```
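Before pointing the agent at the container, you can confirm from the agent machine that the NIM has finished loading the model. NIM containers expose health endpoints on the served port; the host and port below assume the `docker run` port mapping above (replace `<remote-host>` with your machine's address):

```shell
# Prints the HTTP status of the NIM readiness endpoint; 200 indicates the
# model is loaded and ready to serve requests.
curl -s -o /dev/null -w "%{http_code}\n" "http://<remote-host>:30082/v1/health/ready"
```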
Note
If you want to run several models (LLMs, VLMs) on the same machine, make sure to assign each one a different GPU using the `--gpus 'device=1'` flag. You can assign multiple GPUs with `--gpus 'device=1,2,3'`, and so on.
Start the VSS agent workflow and point it to the remote VLM NIM container:
```shell
export VLM_ENDPOINT_URL=http://<remote-host>:30082
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm
```
Examples of downloadable NIM containers:
Using vLLM Container#
You can deploy an open-source VLM from Hugging Face or any other source using the NVIDIA vLLM container. Check the NVIDIA vLLM container documentation for an introduction to vLLM and the vLLM container.
Check GPU requirements to ensure the machine that will host the VLM is supported.
On the compatible remote machine within your own infrastructure, start the NVIDIA vLLM container with a Hugging Face model and ensure it is reachable by the VSS agent workflow machine:
```shell
docker run --gpus 'device=0' -it --rm -p 30082:8000 nvcr.io/nvidia/vllm:26.01-py3 \
    python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-VL-8B-Instruct \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.85 \
    --port 8000 \
    --max-model-len 65536
```
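Once the server is up, you can check from the agent machine that it is reachable and advertises the model. The vLLM OpenAI-compatible server exposes the standard `/v1/models` listing endpoint (replace `<remote-host>` with your machine's address):

```shell
# Lists the served models as JSON; the output should include an entry
# whose id is Qwen/Qwen3-VL-8B-Instruct.
curl -s "http://<remote-host>:30082/v1/models"
```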
Note
If you want to run several models (LLMs, VLMs) on the same machine, make sure to assign each one a different GPU using the `--gpus 'device=1'` flag. You can assign multiple GPUs with `--gpus 'device=1,2,3'`, and so on.
Note
If you see errors such as `ValueError: To serve at least one request with the model's max seq len (262144), ...`, you may need to use the `--max-model-len` flag to limit the maximum context window (input + output tokens per request) so the model fits on smaller GPUs. While some models support up to 256K tokens, VSS agent requests typically use under 10K-20K tokens. Setting a lower value (e.g. 65536) reduces the GPU memory reserved for the KV cache.

Start the VSS agent workflow and point it to the remote vLLM model:
```shell
export VLM_ENDPOINT_URL=http://<remote-host>:30082
export NVIDIA_API_KEY=<PASTE_NIM_CATALOG_API_KEY_HERE>
scripts/dev-profile.sh up -p base \
    --llm-device-id 0 \
    --use-remote-vlm
```
Known Issues#
The alert verification workflow is optimized for the Cosmos Reason2 8B model and might not work with other models.