Configure the VLM/LLM#
The VSS Blueprint supports other VLMs/LLMs through remote endpoints. As long as the VLM/LLM API is OpenAI-compatible, you can point the VSS Blueprint at it for report generation, video understanding, long-video summarization, and more.
Note
The VLM API must support video URL input.
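To make the requirement concrete, the sketch below builds an OpenAI-compatible chat request that passes a video by URL. The `video_url` content type, the model name, and the example URL are assumptions for illustration; check your VLM backend's documentation for the exact schema it accepts.

```python
import json

# Sketch of an OpenAI-compatible chat request with a video URL input.
# The "video_url" content part, the model name, and the URL below are
# assumptions; the exact schema depends on the VLM backend you deploy.
payload = {
    "model": "nvidia/cosmos-reason1-7b",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the key events in this video."},
                {"type": "video_url", "video_url": {"url": "http://example.com/clip.mp4"}},
            ],
        }
    ],
    "max_tokens": 256,
}
print(json.dumps(payload, indent=2))
```

If the backend rejects the request, it most likely does not support video URL input and cannot be used as the VSS Blueprint VLM.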
Before you start, make sure you have completed the prerequisites section.
Using NIM Containers#
You can deploy the VLM/LLM using a downloadable NIM container from build.nvidia.com.
VLM Example: Cosmos Reason1 7B#
Check GPU requirements
Get an NGC API key and log in to the NGC registry: https://build.nvidia.com/nvidia/cosmos-reason1-7b/deploy
docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Deploy VLM on your own infrastructure:
Note
The VLM needs to be deployed within the same local network as the VSS Blueprint.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
  --gpus 'device=0' \
  --ipc host \
  --shm-size=32GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/nvidia/cosmos-reason1-7b:latest
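Once the container is up, you can smoke-test the endpoint with a minimal OpenAI-compatible chat request. This is a stdlib-only sketch; the model name passed in the usage comment is an assumption, so query the endpoint's /v1/models route first if you are unsure of the exact model id served.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible POST request to /v1/chat/completions."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# Usage against the container started above (uncomment once it is running;
# the model id is an assumption -- check GET /v1/models for the served id):
# req = build_chat_request("http://localhost:8000", "nvidia/cosmos-reason1-7b", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["choices"][0]["message"]["content"])
```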
LLM Example: Nvidia Nemotron Super 49B#
Check GPU requirements
Get an NGC API key and log in to the NGC registry: https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5
Note
You can skip this step if you already logged in when deploying the VLM.
docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Deploy LLM on your own infrastructure:
Note
To run the LLM and VLM on the same machine, assign them different GPUs using the --gpus 'device=1' flag. You may assign multiple GPUs using --gpus 'device=1,2,3' and so on.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
  --gpus 'device=1' \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8010:8000 \
  nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:latest
Start VSS Blueprint with NIM Endpoints#
deployments/dev-profile.sh up -p base --llm-mode remote --vlm-mode remote \
--llm-base-url http://<remote-host>:8010 \
--vlm-base-url http://<remote-host>:8000
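Before pointing the Blueprint at remote endpoints, it helps to confirm that both servers respond. Below is a minimal stdlib sketch that lists the models each endpoint serves; both NIM and vLLM expose the standard OpenAI /v1/models route, and the host in the usage comment is a placeholder for your deployment.

```python
import json
import urllib.request

def extract_model_ids(models_json: dict) -> list:
    """Pull model ids out of an OpenAI-compatible /v1/models response body."""
    return [m["id"] for m in models_json.get("data", [])]

def list_models(base_url: str) -> list:
    """Fetch and parse /v1/models from an OpenAI-compatible endpoint."""
    with urllib.request.urlopen(f"{base_url}/v1/models") as resp:
        return extract_model_ids(json.loads(resp.read()))

# Usage once both containers are running (host is a placeholder):
# print(list_models("http://<remote-host>:8000"))  # VLM endpoint
# print(list_models("http://<remote-host>:8010"))  # LLM endpoint
```

If either call fails, fix the endpoint before starting the Blueprint; VSS cannot fall back to a local model in remote mode.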
Using vLLM Container#
You can deploy an open-source VLM/LLM from Hugging Face or any other source using the NVIDIA vLLM container. Check the NVIDIA vLLM container documentation for an introduction to vLLM and the vLLM container.
See the vLLM documentation for a list of supported VLMs and LLMs.
VLM Example: Qwen3-VL-8B-Instruct#
docker run --gpus 'device=0' -it --rm -p 8000:8000 nvcr.io/nvidia/vllm:26.01-py3

Then, inside the container, start the OpenAI-compatible server:

python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-VL-8B-Instruct --trust-remote-code --tensor-parallel-size 1 --gpu-memory-utilization 0.85 --port 8000 --max-model-len 65536
LLM Example: GPT-OSS-20B#
Note
To run the LLM and VLM on the same machine, assign them different GPUs using the --gpus 'device=1' flag. You may assign multiple GPUs using --gpus 'device=1,2,3' and so on.
docker run --gpus 'device=1' -it --rm -p 8010:8000 nvcr.io/nvidia/vllm:26.01-py3

Then, inside the container, start the OpenAI-compatible server:

python3 -m vllm.entrypoints.openai.api_server --model openai/gpt-oss-20b --trust-remote-code --tensor-parallel-size 1 --gpu-memory-utilization 0.85 --port 8000
Start VSS Blueprint with vLLM Endpoints#
deployments/dev-profile.sh up -p base --llm-mode remote --vlm-mode remote \
--llm-base-url http://<remote-host>:8010 \
--vlm-base-url http://<remote-host>:8000
Known Issues#
The alert verification workflow is optimized for the Cosmos Reason2 8B model and might not work well with other models.