Model Details#
VSS uses the following models:
VLM Models
ASR Models (if enabled)
CA-RAG Models
CV Pipeline Models (if enabled)
VSS VLM Models#
VILA 1.5: VILA 1.5 is a video language model (VLM) developed by NVIDIA. This model is deployed locally as part of the blueprint.
VILA 1.5 provides users with the following benefits over proprietary models:
Data Privacy: Deploy on-prem where your data is protected, as it’s not shared for inference or training.
Flexible deployment: Deploy anywhere and maintain control and scalability of your model.
Lower Latency: Deploy near the source of data for faster inference.
Lower Cost: Reduced cost of inference when compared to proprietary AI services.
This is the default model used in VSS deployment.
Note
You need an NGC account to access this model.
Fine-tuning VILA 1.5 is no longer supported. Use NVILA 15B HighRes instead. For details, see Fine-tuning NVILA model (LoRA).
NVILA Model:
NVILA is a visual language model (VLM) developed by NVIDIA, pretrained at scale on interleaved image-text data to enable multi-image understanding. This model is deployed locally as part of the blueprint when configured. Two NVILA 15B variants are currently supported in VSS.
The Lite version processes input images at 448x448 and does not include a temporal decoder.
The HighRes version uses dynamic resolution tiling, ranging from 1.3K to 1.8K input resolution based on the aspect ratio of the input video. This allows the model to see finer details in the video compared to the Lite model. Additionally, it includes a temporal decoder, which enhances the model's ability to output precise timestamps of events in the input video.
The HighRes version is also fine-tunable for improved accuracy on your specific use cases. Refer to Fine-tuning NVILA model (LoRA).
GPT-4o: VSS supports using OpenAI models such as GPT-4o as the VLM. GPT-4o is accessed as a remote endpoint.
To use the GPT-4o model in VSS, see Configuring for GPT-4o.
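As a rough illustration of what a remote VLM request looks like, the sketch below builds an OpenAI-style Chat Completions payload that pairs a text prompt with a base64-encoded video frame. The helper function and the placeholder image string are hypothetical; the exact request VSS sends may differ.

```python
# Hedged sketch: shape of a Chat Completions request to a remote VLM such
# as GPT-4o. Field names follow the OpenAI Chat Completions API; the
# base64 placeholder below stands in for a real encoded JPEG frame.
import json

def build_vlm_request(prompt: str, image_b64: str, model: str = "gpt-4o") -> dict:
    """Build a payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }

payload = build_vlm_request("Describe the events in this frame.", "<BASE64_JPEG>")
print(json.dumps(payload, indent=2))
```

The same multi-part message structure works for any OpenAI-compatible endpoint, which is what makes remote and custom VLM backends interchangeable from the client's point of view.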
Custom VLM Models: VSS supports integrating custom VLM models; refer to OpenAI Compatible REST API. Depending on the implementation, the model can be deployed locally or used as a remote endpoint.
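To illustrate the contract such an integration implies, here is a minimal stub (not the VSS implementation) of an OpenAI-compatible /v1/chat/completions endpoint. A real custom VLM server would run inference where the canned caption is produced; the response shape is what matters.

```python
# Minimal sketch of an OpenAI-compatible chat-completions server.
# The stub reply stands in for real VLM inference.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ChatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # A real server would invoke the VLM here; we return a fixed caption.
        reply = {
            "object": "chat.completion",
            "model": body.get("model", "custom-vlm"),
            "choices": [{
                "index": 0,
                "message": {"role": "assistant",
                            "content": "A person walks across the scene."},
                "finish_reason": "stop",
            }],
        }
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # silence per-request logging

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), ChatHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/v1/chat/completions",
    data=json.dumps({"model": "custom-vlm",
                     "messages": [{"role": "user",
                                   "content": "Describe the video."}]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    answer = json.loads(resp.read())
server.shutdown()
print(answer["choices"][0]["message"]["content"])
```

Because the response mirrors the standard chat-completion schema, the same client code can talk to GPT-4o, a NIM, or this custom server without changes.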
VSS ASR Models#
Parakeet-CTC-XL-0.6B: Parakeet-CTC-XL-0.6B is an Automatic Speech Recognition (ASR) model developed by NVIDIA. It is trained on ASRSet, with over 35,000 hours of English (en-US) speech. The model transcribes speech into lowercase English characters, spaces, and apostrophes.
VSS CA-RAG Models#
LLaMA 3.1 70b Instruct: The LLaMA 3.1 70b Instruct NIM is used for Guardrails and by CA-RAG for summarization. This model is deployed locally as part of the blueprint.
NVIDIA Retrieval QA Llama3.2 1b v2 Embedding: The NVIDIA Retrieval QA Llama3.2 1b Embedding NIM is used as the text embedding model for text captions and queries. This model is deployed locally as part of the blueprint.
NVIDIA Retrieval QA Llama3.2 1b v2 Reranking: The NVIDIA Retrieval QA Llama3.2 1b Reranking NIM is used as a reranking model for Q&A. This model is deployed locally as part of the blueprint.
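To show how the embedding and reranking models cooperate in retrieval for Q&A, the toy sketch below runs the usual two-stage flow: embed the query and captions, recall the closest captions by cosine similarity, then rescore the query/caption pairs. The character-frequency "embedding" and word-overlap "reranker" are deliberately simplistic stand-ins for the Llama3.2 1b Embedding and Reranking NIMs, which return learned dense vectors and cross-encoder scores.

```python
# Illustrative two-stage retrieval: embedding recall, then reranking.
# The scoring functions are toys, not the actual NIM models.
import math

def embed(text: str) -> list:
    # Stand-in embedding: normalized character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

captions = [
    "A forklift moves pallets in the warehouse.",
    "A worker drops a box near the loading dock.",
    "The parking lot is empty at night.",
]
query = "Where was a box dropped?"

# Stage 1: embedding recall — keep the two captions closest to the query.
q = embed(query)
recalled = sorted(captions, key=lambda c: cosine(q, embed(c)), reverse=True)[:2]

# Stage 2: reranking — rescore each query/caption pair directly
# (a toy word-overlap score in place of the reranking NIM).
def rerank_score(query: str, caption: str) -> float:
    qw, cw = set(query.lower().split()), set(caption.lower().split())
    return len(qw & cw) / len(qw)

best = max(recalled, key=lambda c: rerank_score(query, c))
print(best)  # → A worker drops a box near the loading dock.
```

The embedding stage keeps recall cheap over many captions; the reranking stage spends more compute on only the recalled candidates, which is why the two models are used together.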
GPT-4o: GPT-4o API is used for tool calling as part of GraphRAG for Q&A. GPT-4o is used as a remote endpoint.
Note
Only NVIDIA Retrieval QA Llama3.2 1b Embedding NIM and NVIDIA Retrieval QA Llama3.2 1b Reranking NIM models are supported for embedding and reranking respectively.
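The tool-calling use of GPT-4o described above can be sketched as an OpenAI-style request that advertises a retrieval tool for the model to call. The tool name and its parameters below are hypothetical, for illustration only; they are not VSS's actual GraphRAG tool definitions.

```python
# Hedged sketch of an OpenAI-style tool-calling payload, as used in
# GraphRAG-style Q&A. The "search_graph" tool is hypothetical.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "search_graph",  # hypothetical retrieval tool
        "description": "Search the knowledge graph built from video captions.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string",
                          "description": "Entity or event to look up."},
            },
            "required": ["query"],
        },
    },
}]

payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user",
                  "content": "When did the forklift enter the dock?"}],
    "tools": tools,
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
print(json.dumps(payload, indent=2))
```

With "tool_choice" set to "auto", the model either answers directly or emits a tool call whose arguments the application executes against the graph before the final answer is generated.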
VSS CV Pipeline Models#
SAM2: SAM2 is an open-source model for instance segmentation. The model is downloaded, converted from PyTorch to ONNX, and accelerated with TensorRT for FP16 precision and batched inference. Currently, the image encoder and mask decoder are used for per-frame segmentation; the memory bank is not yet supported. SAM2 is used with a multi-object tracker in the CV pipeline to generate and track object masks. This model is deployed locally as part of the blueprint. Review the license terms of this open-source project before use.
NVIDIA ReIdentificationNet: ReIdentificationNet generates embeddings for identifying objects captured across different scenes. It is a high-accuracy ResNet-50 model with a feature length of 256, used by the tracker for highly accurate object tracking. This model is deployed locally as part of the blueprint.
Grounding DINO: Grounding DINO is a model for object detection and localization. It detects objects in the CV pipeline using the prompt provided by the user. This model is deployed locally as part of the blueprint.