Model Details#
VSS uses the following models:
VLM Models
ASR Models (if enabled)
CA-RAG Models
CV Pipeline Models (if enabled)
VSS VLM Models#
NVIDIA Nemotron Nano v2 12B VL:
The NVIDIA Nemotron Nano v2 12B VL model enables multi-image reasoning and video understanding, along with strong document intelligence, visual Q&A, and summarization capabilities.
This model can be interfaced with VSS using the OpenAI-compatible REST API.
Refer to OpenAI Compatible REST API for Helm deployment and OpenAI Compatible REST API for Docker Compose deployment.
Example values of the VSS configuration environment variables for the Nemotron Nano v2 12B VL model are shown below:
export VLM_MODEL_TO_USE=openai-compat
export OPENAI_API_KEY="empty" #random value; unused
export VIA_VLM_ENDPOINT="http://host.docker.internal:38110/v1" #match the NIM's host IP and port
export VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME="nemotron-nano-12b-v2-vl" #match NIM_SERVED_MODEL_NAME
To deploy the NVIDIA Nemotron Nano v2 12B VL NIM, follow the detailed deployment steps at NVIDIA Nemotron Nano v2 12B VL Deploy.
When running the NIM docker container, use the -e NIM_SERVED_MODEL_NAME flag as shown in the example below:
docker run -d -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 38110:8000 \
-e NIM_SERVED_MODEL_NAME=nemotron-nano-12b-v2-vl \
nvcr.io/nim/nvidia/nemotron-nano-12b-v2-vl:latest
Important
Ensure that the NIM_SERVED_MODEL_NAME value (see the NIM deployment configuration) matches the VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME environment variable in the VSS configuration shown above.
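As a quick sanity check, you can query the NIM's OpenAI-compatible models endpoint and confirm that the served model name matches. This assumes the container is running and mapped to port 38110 on the local host, as in the example above:
curl -s http://localhost:38110/v1/models   # the response should list "nemotron-nano-12b-v2-vl"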
Cosmos-Reason1: Cosmos-Reason1 is a Video Language Model (VLM) developed by NVIDIA. This model is deployed locally as part of the blueprint.
Cosmos-Reason1 can reason over both vision and language and has temporal understanding.
This is the default model used in VSS deployments (see the configuration sketch below).
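For reference, when deploying with Docker Compose, selecting Cosmos-Reason1 typically only requires pointing the VLM selection variable at it. The value below is an assumption and may vary by release; check your deployment's configuration for the exact name:
export VLM_MODEL_TO_USE=cosmos-reason1   # assumed value; verify against your VSS release's configuration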
VILA 1.5: VILA 1.5 is a Video Language Model (VLM) developed by NVIDIA. This model is deployed locally as part of the blueprint.
Note
Support for the VILA model and VILA-related configurations is scheduled to be deprecated in the future.
NVILA Model:
NVILA is a visual language model (VLM) developed by NVIDIA, pretrained at scale with interleaved image-text data to enable multi-image understanding. This model is deployed locally as part of the blueprint when configured. There are currently two NVILA 15B variants supported in VSS.
The Lite version processes input images at 448x448 resolution and does not include a temporal decoder.
The HighRes version uses dynamic resolution tiling, with input resolution ranging between 1.3K and 1.8K depending on the aspect ratio of the input video. This allows the model to observe finer details in the video compared to the Lite version. Additionally, it includes a temporal decoder, which enhances the model's ability to output precise timestamps of events in the input video.
Local models like Cosmos-Reason1, NVILA, and VILA 1.5 provide users with the following benefits over proprietary models:
Data Privacy: Deploy on-prem where your data is protected, as it’s not shared for inference or training.
Flexible deployment: Deploy anywhere and maintain control and scalability of your model.
Lower Latency: Deploy near the source of data for faster inference.
Lower Cost: Reduced cost of inference when compared to proprietary AI services.
GPT-4o: VSS supports using OpenAI models such as GPT-4o as the VLM. GPT-4o is used as a remote endpoint.
To use the GPT-4o model in VSS, refer to OpenAI (GPT-4o); a minimal configuration sketch is shown below.
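Reusing the OpenAI-compatible environment variables shown earlier, the configuration might look like the following. The endpoint and model name values here are illustrative assumptions; refer to OpenAI (GPT-4o) for the exact settings:
export VLM_MODEL_TO_USE=openai-compat
export OPENAI_API_KEY="<your OpenAI API key>"
export VIA_VLM_ENDPOINT="https://api.openai.com/v1" # assumed: OpenAI's public API endpoint
export VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME="gpt-4o" # assumed model name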
Custom VLM Models: VSS supports integrating custom VLM models. Refer to Other Custom Models. Depending on the implementation, the model can be deployed locally or used as a remote endpoint.
VSS ASR Models#
Parakeet-CTC-XL-0.6B: Parakeet-CTC-XL-0.6B is an Automatic Speech Recognition (ASR) model developed by NVIDIA. The model is trained on ASRSet, with over 35,000 hours of English (en-US) speech, and transcribes speech into the lowercase English alphabet along with spaces and apostrophes.
VSS CA-RAG Models#
LLaMA 3.1 70B Instruct: The LLaMA 3.1 70B Instruct NIM is used for Guardrails and by CA-RAG for summarization. This model is deployed locally as part of the blueprint.
NVIDIA Retrieval QA Llama3.2 1B v2 Embedding: The NVIDIA Retrieval QA Llama3.2 1B v2 Embedding NIM is used as a text embedding model for text captions and queries. This model is deployed locally as part of the blueprint.
NVIDIA Retrieval QA Llama3.2 1B v2 Reranking: The NVIDIA Retrieval QA Llama3.2 1B Reranking NIM is used as a reranking model for Q&A. This model is deployed locally as part of the blueprint.
GPT-4o: The GPT-4o API is used for tool calling as part of GraphRAG for Q&A. GPT-4o is used as a remote endpoint.
Note
Only the NVIDIA Retrieval QA Llama3.2 1B Embedding NIM and NVIDIA Retrieval QA Llama3.2 1B Reranking NIM models are supported for embedding and reranking, respectively.
VSS CV Pipeline Models#
SAM2: SAM2 is an open-source model for instance segmentation. The model is downloaded, converted from PyTorch to ONNX, and accelerated with TensorRT using FP16 precision and batched inference. Currently, the image encoder and mask decoder are used for per-frame segmentation; the memory bank is not yet supported. SAM2 is used with a multi-object tracker in the CV pipeline to generate and track object masks. This model is deployed locally as part of the blueprint. Review the license terms of this open-source project before use.
NVIDIA ReIdentificationNet: ReIdentificationNet generates embeddings for identifying objects captured in different scenes. It is a high-accuracy ResNet-50 model with a feature length of 256 and is used by the tracker for highly accurate object tracking. This model is deployed locally as part of the blueprint.
Grounding DINO: Grounding DINO is a model for object detection and localization. It is used in the CV pipeline to detect objects based on the prompt provided by the user. This model is deployed locally as part of the blueprint.