Configure the VLM/LLM#

The VSS Blueprint supports other VLMs/LLMs through remote endpoints. As long as the VLM/LLM API is OpenAI-compatible, you can point the VSS Blueprint at it for report generation, video understanding, long-video summarization, and more.

Note

The VLM API needs to support video URL input.
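To illustrate what "video URL input" means in practice, the sketch below builds an OpenAI-style chat completion request that carries a video URL as a content part. The model name, video URL, and the exact `video_url` content type are illustrative assumptions; match them to whatever your VLM endpoint actually serves.

```shell
# Write an OpenAI-style chat completion request carrying a video URL.
# The model id and URL below are placeholders, not values from this guide.
cat > vlm_request.json <<'EOF'
{
  "model": "nvidia/cosmos-reason1-7b",
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "Summarize the events in this video."},
        {"type": "video_url", "video_url": {"url": "http://example.com/sample.mp4"}}
      ]
    }
  ],
  "max_tokens": 256
}
EOF
# Once a VLM endpoint is running, a request can be sent with:
# curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d @vlm_request.json
```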

Before you start, make sure you have completed the prerequisites section.

Using NIM Containers#

You can deploy the VLM/LLM using a downloadable NIM container from build.nvidia.com.

VLM Example: Cosmos Reason1 7B#

  1. Check GPU requirements

  2. Get an NGC API key and log in to the NGC registry: https://build.nvidia.com/nvidia/cosmos-reason1-7b/deploy

    docker login nvcr.io
    Username: $oauthtoken
    Password: <PASTE_API_KEY_HERE>
    
  3. Deploy VLM on your own infrastructure:

    Note

    The VLM needs to be deployed within the same local network as the VSS Blueprint.

    export NGC_API_KEY=<PASTE_API_KEY_HERE>
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"
    docker run -it --rm \
        --gpus 'device=0' \
        --ipc host \
        --shm-size=32GB \
        -e NGC_API_KEY \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -u $(id -u) \
        -p 8000:8000 \
        nvcr.io/nim/nvidia/cosmos-reason1-7b:latest
    
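The NIM container can take several minutes to download and load the model before it starts serving. A small polling helper like the one below can be used to wait for the endpoint; the `/v1/health/ready` route is the standard NIM readiness endpoint, and `localhost:8000` assumes the `docker run` command above.

```shell
# Poll a health endpoint until the model server reports ready, or time out
# after ~10 minutes. Intended for the NIM containers started above.
wait_ready() {
    local url="$1"
    for _ in $(seq 1 60); do
        if curl -sf "$url" > /dev/null; then
            echo "endpoint ready: $url"
            return 0
        fi
        sleep 10
    done
    echo "timed out waiting for $url" >&2
    return 1
}
# Example (run on the host once the container is starting):
# wait_ready http://localhost:8000/v1/health/ready
```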

LLM Example: NVIDIA Nemotron Super 49B#

  1. Check GPU requirements

  2. Get an NGC API key and log in to the NGC registry: https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5

    Note

    You can skip this step if you have already logged in while deploying the VLM.

    docker login nvcr.io
    Username: $oauthtoken
    Password: <PASTE_API_KEY_HERE>
    
  3. Deploy LLM on your own infrastructure:

    Note

    If you want to run the LLM and VLM on the same machine, assign a different GPU to each using the --gpus 'device=1' flag. You may also assign multiple GPUs, for example --gpus 'device=1,2,3'.

    export NGC_API_KEY=<PASTE_API_KEY_HERE>
    export LOCAL_NIM_CACHE=~/.cache/nim
    mkdir -p "$LOCAL_NIM_CACHE"
    docker run -it --rm \
        --gpus 'device=1' \
        --shm-size=16GB \
        -e NGC_API_KEY \
        -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
        -u $(id -u) \
        -p 8010:8000 \
        nvcr.io/nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:latest
    
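A quick way to smoke-test the LLM endpoint is a minimal text-only chat completion against port 8010. The model id below is an assumption based on the NGC model name; check the endpoint's `/v1/models` response for the exact id the container serves.

```shell
# Minimal text-only chat request for smoke-testing the deployed LLM.
# The model id is assumed from the NGC model name; verify it via /v1/models.
cat > llm_request.json <<'EOF'
{
  "model": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
  "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
  "max_tokens": 16
}
EOF
# Send it once the container is serving (port 8010 per the docker run above):
# curl -s http://localhost:8010/v1/chat/completions \
#     -H "Content-Type: application/json" -d @llm_request.json
```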

Start VSS Blueprint with NIM Endpoints#

deployments/dev-profile.sh up -p base --llm-mode remote --vlm-mode remote \
    --llm-base-url http://<remote-host>:8010 \
    --vlm-base-url http://<remote-host>:8000

Using vLLM Container#

You can deploy an open-source VLM/LLM from Hugging Face or any other source using the NVIDIA vLLM container. Check the NVIDIA vLLM container documentation for an introduction to vLLM and the vLLM container.

See the vLLM documentation for a list of supported VLMs and LLMs.

VLM Example: Qwen3-VL-8B-Instruct#

docker run --gpus 'device=0' -it --rm -p 8000:8000 nvcr.io/nvidia/vllm:26.01-py3

# Run inside the container:
python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3-VL-8B-Instruct --trust-remote-code --tensor-parallel-size 1 --gpu-memory-utilization 0.85 --port 8000 --max-model-len 65536
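Before pointing VSS at the endpoint, it can help to confirm which model id vLLM is serving, since requests must use that exact id. The response below is an illustrative sketch of the OpenAI-compatible `/v1/models` schema, not captured output.

```shell
# List the models the vLLM server reports (run on the host once serving):
# curl -s http://localhost:8000/v1/models
# A healthy server answers with something shaped like the example below;
# the "id" field is the model name to use in requests.
cat > models_example.json <<'EOF'
{"object": "list", "data": [{"id": "Qwen/Qwen3-VL-8B-Instruct", "object": "model"}]}
EOF
```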

LLM Example: GPT-OSS-20B#

Note

If you want to run the LLM and VLM on the same machine, assign a different GPU to each using the --gpus 'device=1' flag. You may also assign multiple GPUs, for example --gpus 'device=1,2,3'.

docker run --gpus 'device=1' -it --rm -p 8010:8000 nvcr.io/nvidia/vllm:26.01-py3

# Run inside the container:
python3 -m vllm.entrypoints.openai.api_server --model openai/gpt-oss-20b --trust-remote-code --tensor-parallel-size 1 --gpu-memory-utilization 0.85 --port 8000

Start VSS Blueprint with vLLM Endpoints#

deployments/dev-profile.sh up -p base --llm-mode remote --vlm-mode remote \
    --llm-base-url http://<remote-host>:8010 \
    --vlm-base-url http://<remote-host>:8000

Known Issues#

  • The alert verification workflow is optimized for the Cosmos Reason2 8B model and might not work with other models.