Configure the VLM#
VSS is designed to be configurable with many VLMs, including 3rd-party VLM endpoints, community models, and custom models.
VSS supports integrating custom VLM models. Depending on the model to be integrated, some configurations must be updated or interface code must be implemented. The model can ONLY be selected at initialization time.
The following sections explain these approaches in detail.
3rd-Party VLM Endpoints#
We provide the option to utilize externally hosted 3rd-party VLMs that follow the OpenAI API standard. Access to these endpoints is provided through the 3rd party.
Supported Model | Developer
---|---
GPT-4o | OpenAI
OpenAI (GPT-4o)#
To use GPT-4o as the VLM model in VSS, see Configuration Options and modify the config VLM_MODEL_TO_USE.
Pre-Requisite: API key from https://platform.openai.com/api-keys
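For example, you can export the key in your current shell so that the secret-creation command in the steps below can read it (the variable name OPENAI_API_KEY matches the command used later in this section; the placeholder value is yours to fill in):

# Make the API key available to the secret-creation command below
export OPENAI_API_KEY=<your-openai-api-key>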
Steps:
Fetch the Helm Chart following Deploy Using Helm.
Create a new overrides.yaml file. Copy the example overrides file from Configuration Options.
Edit the overrides.yaml file: change VLM_MODEL_TO_USE to value: openai-compat and add the environment variable for the OPENAI_API_KEY as shown below.
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
            - name: VLM_MODEL_TO_USE
              value: openai-compat
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: openai-api-key-secret
                  key: OPENAI_API_KEY
Create the OpenAI API Key secret:
sudo microk8s kubectl create secret generic openai-api-key-secret --from-literal=OPENAI_API_KEY=$OPENAI_API_KEY
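Optionally, verify that the secret exists (this uses the same microk8s kubectl wrapper as the command above):

sudo microk8s kubectl get secret openai-api-key-secret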
Install the Helm Chart:
sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.1.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml
Follow the steps at Launch VSS UI to launch the VSS UI.
Community Models#
We support multiple community models that are open source, developed through research, or offered by 3rd parties. If the VLM model provides an OpenAI compatible REST API, refer to Configuration Options. The models listed below have been tested with VSS; the steps for each follow.
Supported Model | Developer | Size (Parameters)
---|---|---
NVILA | NVIDIA | 15b
NEVA | NVIDIA | 22b
Fuyu | Adept | 8b
Auto-download NGC models (VILA & NVILA)#
Add the following environment variables to the overrides.yaml file before launching the VSS Blueprint:
VILA 34b
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
            - name: VLM_MODEL_TO_USE
              value: vila-1.5
            - name: MODEL_PATH
              value: "ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
...
NVILA HighRes 15b
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
            - name: VLM_MODEL_TO_USE
              value: nvila
            - name: MODEL_PATH
              value: "ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita"
...
Local NGC Models (VILA & NVILA)#
Follow the steps below to use VLM weights that have been downloaded to a local filepath. This is an alternative way to deploy the VILA 34b model and must be used for the NVILA HighRes 15b model.
Download the NGC CLI, which can download the models to a specified location.
# Download NGC CLI
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.64.2/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
chmod u+x ngc-cli/ngc
export PATH="$PATH:$(pwd)/ngc-cli"
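As a quick sanity check, confirm the CLI is on your PATH before continuing; the --version flag should print the installed CLI version:

ngc --version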
Download the model weights you wish to store locally.
VILA 34b
# Download the VILA weights
export NGC_API_KEY=<your-legacy-api-key>
export NGC_CLI_ORG=nim
export NGC_CLI_TEAM=nvidia
ngc registry model download-version "nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
chmod a+w vila-1.5-40b_vvila-yi-34b-siglip-stage3_1003_video_v8
NVILA HighRes 15b
# Download the NVILA weights
ngc registry model download-version "nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita"
chmod a+w nvila-highres_vnvila-lite-15b-highres-lita
The NVILA weights, for example, will be downloaded to <current-directory>/nvila-highres_vnvila-lite-15b-highres-lita. Use this path to mount the weights as shown in the next step.
Specify the following in your overrides file (see Configuration Options):
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
            - name: VLM_MODEL_TO_USE
              value: vila-1.5 # or nvila
            - name: MODEL_PATH
              value: "/tmp/vila"
  extraPodVolumes:
    - name: local-vila-checkpoint
      hostPath:
        path: </path/to/local/vila-checkpoint>
  extraPodVolumeMounts:
    - name: local-vila-checkpoint
      mountPath: /tmp/vila
The contents of the VILA checkpoint directory </path/to/local/vila-checkpoint> should be similar to:
$ ls </path/to/local/vila-checkpoint>
config.json llm mm_projector trainer_state.json vision_tower
Install the Helm Chart
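The install command is the same as in the GPT-4o section above, shown again here for convenience; adjust the chart filename if your version differs:

sudo microk8s helm install vss-blueprint nvidia-blueprint-vss-2.3.1.tgz --set global.ngcImagePullSecretName=ngc-docker-reg-secret -f overrides.yaml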
OpenAI Compatible REST API#
If the VLM model provides an OpenAI compatible REST API, refer to Configuration Options.
vLLM served OpenAI API Compatible VLM#
VSS supports dropping in VLMs that are OpenAI API compatible.
The example below shows how to drop in a VLM served through vLLM, a popular high-throughput and memory-efficient inference and serving engine. Many community models on Hugging Face can be served through vLLM.
Download the model, run vllm serve, and test the local endpoint.
Example steps to download and serve Qwen/Qwen2.5-VL-7B-Instruct:
Instructions to install vLLM can be found at QwenLM/Qwen2.5-VL; more details are available at https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct.
Serve the model using vLLM:
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 38011 --host 0.0.0.0 --dtype bfloat16 --limit-mm-per-prompt image=10,video=1 --served-model-name Qwen2.5VL-7B-instruct
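Once the server is up, test the local endpoint by listing the served models (the name returned should match the --served-model-name value) and sending a simple text-only chat completion request:

curl http://localhost:38011/v1/models
curl http://localhost:38011/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Qwen2.5VL-7B-instruct", "messages": [{"role": "user", "content": "Describe what you can do."}]}'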
Note
vllm serve errors were observed for the Qwen model with the latest transformers version (July 2025). Please follow the issue and workaround at vllm-project/vllm-ascend#1470. Workaround: pip install "transformers<4.53.0"
Specify the following in your overrides file (see Configuration Options):
vss:
applicationSpecs:
vss-deployment:
containers:
vss:
env:
- name: VLM_MODEL_TO_USE
value: openai-compat
- name: OPENAI_API_KEY
value: "empty" #random value; unused
- name: VIA_VLM_ENDPOINT
value: "http://<host-IP>:38011/v1" #match vllm --port and the host-IP
- name: VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME
value: "Qwen2.5VL-7B-instruct" #match vllm --served-model-name
Install the Helm Chart
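After installing (using the same helm install command as in the sections above), you can confirm that the VSS pods come up in the namespace you deployed to, for example:

sudo microk8s kubectl get pods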
Other Custom Models#
VSS allows you to drop your own models into the model directory by providing the pre-trained weights of the model, or a model with a REST API endpoint, and implementing an interface to bridge to the VSS pipeline.
The interface includes an inference.py file and a manifest.yaml file.
The manifest.yaml file describes the configuration of the model. An example is shown below:
input:
  number_of_frames: 1 # Number of frames to sample from each chunk.
  jpeg_encoded: false # Whether to encode the frames in JPEG format or pass as raw frame torch tensors.
The inference.py file defines the interface for the model. An example is shown below:
from typing import Dict

import numpy as np
import torch


class Inference:
    def __init__(self):
        # Load and initialize the model.
        pass

    def get_embeddings(self, tensor: torch.Tensor) -> torch.Tensor:
        # Generate video embeddings for the chunk / file.
        # Do not implement if explicit video embeddings are not supported by the model.
        return tensor

    def generate(self, prompt: str, input: torch.Tensor | list[np.ndarray], configs: Dict) -> str:
        # Generate a summary string from the input prompt and frame/embedding input.
        # configs contains VLM generation parameters like
        # max_new_tokens, seed, top_p, top_k, temperature.
        summary = ""  # Run the model here and fill in the generated summary.
        return summary
Based on the chunk size selected during the summarize API call, number_of_frames equidistant frames will be sampled from each chunk and passed to the generate method. The generate method will be called for each chunk. It will be passed the frames sampled for that chunk along with the text prompt and the generation parameters, i.e. seed, top_k, top_p and temperature, if set by the user as part of the configs dictionary.
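As a purely illustrative example (this is not VSS's internal sampling code), equidistant sampling of 8 frames from a 60-second chunk at 30 FPS could pick frame indices as follows:

# Hypothetical illustration of equidistant frame sampling over one chunk.
chunk_frames = 60 * 30      # 60 s chunk at 30 FPS
number_of_frames = 8        # number_of_frames from manifest.yaml (8 here for illustration)
indices = [round(i * (chunk_frames - 1) / (number_of_frames - 1))
           for i in range(number_of_frames)]
print(indices)  # [0, 257, 514, 771, 1028, 1285, 1542, 1799]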
When the jpeg_encoded parameter in manifest.yaml is set to true, the frames will be passed as a list of numpy arrays containing encoded JPEG bytes. When this parameter is false or unset, the frames will be passed as a list of torch tensors in RGB HWC format.
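For illustration, here is a minimal sketch of a helper that a custom generate implementation could use to normalize both formats into RGB HWC torch tensors. frames_to_tensors is a hypothetical name (not part of the VSS interface), and the sketch assumes Pillow is available for JPEG decoding:

import io
from typing import List

import numpy as np
import torch
from PIL import Image


def frames_to_tensors(frames, jpeg_encoded: bool) -> List[torch.Tensor]:
    # Convert the frames passed to generate() into RGB HWC uint8 torch tensors.
    # jpeg_encoded mirrors the setting in manifest.yaml.
    tensors = []
    for frame in frames:
        if jpeg_encoded:
            # Each frame is a numpy array of encoded JPEG bytes.
            image = Image.open(io.BytesIO(np.asarray(frame).tobytes())).convert("RGB")
            tensors.append(torch.from_numpy(np.array(image)))
        else:
            # Each frame is already an RGB HWC torch tensor.
            tensors.append(torch.as_tensor(frame))
    return tensors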
The optional get_embeddings method is used to generate embeddings for a given set of frames wrapped in a TCHW tensor and must be removed if the model doesn't support the feature.
The generate method can be used to implement inference using models that are executed locally on the system, or to call remote models with REST APIs.
Some examples are available at NVIDIA-AI-Blueprints/video-search-and-summarization. These include the fuyu8b and neva models.
The VSS container image or the Blueprint Helm Chart may need to be modified to use custom VLMs. Configuration Options mentions how to use a custom VSS container image and how to specify the model path for custom models. If mounting of custom paths is required, the VSS subchart in the Blueprint Helm Chart can be modified to mount the custom paths.
Example:
For fuyu8b, model weights need to be downloaded; refer to the Fuyu8b README for more details.
Once model weights are downloaded using the Fuyu8b README, the directory structure should look like:
ls /path/to/fuyu8b
inference.py fuyu8b model-00002-of-00002.safetensors skateboard.png
architecture.png generation_config.json model.safetensors.index.json special_tokens_map.json
bus.png added_tokens.json preprocessor_config.json tokenizer_config.json
chart.png manifest.yaml __pycache__ tokenizer.json
config.json model-00001-of-00002.safetensors README.md tokenizer.model
Directory structure for neva looks like:
ls /path/to/neva
inference.py manifest.yaml
Next, you can add the following to the Helm overrides file using the instructions in Configuration Options.
neva requires an NVIDIA Personal Key to access the API from build.nvidia.com. For generating the NVIDIA_API_KEY and creating the secret, refer to Using NIMs from build.nvidia.com. The key is automatically used by the blueprint if the Kubernetes secret nvidia-api-key-secret is created.
vss:
  applicationSpecs:
    vss-deployment:
      containers:
        vss:
          env:
            - name: VLM_MODEL_TO_USE
              value: custom
            - name: MODEL_PATH
              value: "/tmp/custom-model"
  extraPodVolumes:
    - name: custom-model
      hostPath:
        path: /path/to/fuyu8b # contains inference.py and manifest.yaml
  extraPodVolumeMounts:
    - name: custom-model
      mountPath: /tmp/custom-model
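For reference, creating the nvidia-api-key-secret can mirror the OpenAI key secret shown earlier; the key name stored inside the secret (NVIDIA_API_KEY) is an assumption here, so follow Using NIMs from build.nvidia.com for the authoritative steps:

sudo microk8s kubectl create secret generic nvidia-api-key-secret --from-literal=NVIDIA_API_KEY=$NVIDIA_API_KEY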
Note
Custom VLM models may not work well with GPU-sharing topology.