Custom Deployment#
The Video Search and Summarization Agent blueprint can be deployed through docker compose to allow for more customization and flexibility.
Deployment Components#
To fully launch the VSS blueprint, the following components are required:
LLM
Embedding Model
Reranking Model
VLM
VSS Engine
The Helm chart deployment automatically launches all of these components. With docker compose, however, only the VSS Engine is launched, which does not include the LLM, Embedding, or Reranker models. These models must be launched separately or configured to use remote endpoints.
For the VLM, VSS provides three options:
Use the built-in VLM
For the lowest latency experience, VSS supports a built-in VLM. When using this VLM, the model is tightly integrated with the video decoding pipeline, leading to higher throughput and lower summarization latency. If configured, VSS will automatically pull the VLM and build an optimized TensorRT engine to run in the video ingestion pipeline.
Use an OpenAI Compatible VLM
Any OpenAI compatible VLM can be used with VSS. This could be a proprietary VLM running in the cloud or a local VLM launched with a third-party framework like vLLM or SGLang.
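For example, an OpenAI compatible endpoint could be hosted locally with vLLM. The snippet below is only an illustrative sketch; the model name and port are placeholders and not part of the blueprint, so substitute the VLM and settings you actually intend to serve.
pip install vllm
# Serve an example vision-language model behind an OpenAI compatible API on port 8001
vllm serve Qwen/Qwen2-VL-7B-Instruct --port 8001
VSS can then be pointed at this endpoint, for example through the VIA_VLM_ENDPOINT setting described in the deployment examples below.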
Use a Custom VLM
To use a VLM that does not provide an OpenAI compatible REST API interface, you can follow the Custom Models section on how to add support for a new VLM.
Deployment Examples#
Three example docker compose setups are provided on GitHub to quickly deploy VSS on a range of hardware platforms. They demonstrate how VSS can be configured to use both local and remote endpoints for the different components.
| Deployment Sample | VLM (VILA-1.5 35B) | LLM (Llama 3.1 70B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | Minimum GPU Requirement |
|---|---|---|---|---|---|
| remote_vlm_deployment | Remote | Remote | Remote | Remote | 8GB VRAM GPU |
| remote_llm_deployment | Local | Remote | Remote | Remote | 1xH100, 1xA100, 2xL40S |
| local_deployment | Local | Local | Local | Local | 4xH100, 8xA100, 8xL40S |
In addition to the minimum GPU requirements, the following prerequisites must be met:
Ubuntu 22.04
NVIDIA driver 535.161.08 (Recommended minimum version)
CUDA 12.2+ (CUDA driver installed with NVIDIA driver)
NVIDIA Container Toolkit 1.13.5+
Docker 27.5.1+
Docker Compose 2.32.4+
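Before proceeding, you can sanity check these prerequisites with a few standard commands (output varies by system):
nvidia-smi               # NVIDIA driver and CUDA version
nvidia-ctk --version     # NVIDIA Container Toolkit version
docker --version         # Docker version
docker compose version   # Docker Compose version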
To get started, first clone the GitHub repository to get the docker compose samples:
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
cd video-search-and-summarization/docker
Then based on your available hardware, follow one of the sections below to launch VSS with docker compose.
Remote VLM Deployment#
The remote_vlm_deployment folder contains an example of how to launch VSS using remote endpoints for the VLM, LLM, Embedding, and Reranker models. This allows VSS to run with minimal hardware requirements; a modern GPU with at least 8GB of VRAM is recommended.
To run this deployment, first get an NVIDIA API key from build.nvidia.com and an OpenAI API Key to use GPT-4o as the remote VLM. Any OpenAI compatible VLM can also be used for this.
cd remote_vlm_deployment
If you look in the config.yaml file, you will see that the LLM, Reranker and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_vlm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY and OPENAI_API_KEY fields. Optionally, VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME can be adjusted to use a different model, and VIA_VLM_ENDPOINT can be adjusted to point to any OpenAI compatible VLM.
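For reference, a populated .env might look like the following. All values are placeholders; the endpoint and deployment name shown assume the default GPT-4o setup and should be changed if you use a different OpenAI compatible VLM.
NVIDIA_API_KEY=<API key from build.nvidia.com>
OPENAI_API_KEY=<OpenAI API key>
# Optional overrides (illustrative values)
VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=gpt-4o
VIA_VLM_ENDPOINT=https://api.openai.com/v1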
After setting your API keys in the .env file, you can launch VSS with docker compose.
docker compose up
Once VSS has loaded, you can access the UI at port 9100.
Remote LLM Deployment#
The remote_llm_deployment folder contains an example of how to launch VSS using the local built-in VLM and remote endpoints for the LLM, Embedding, and Reranker models.
This requires at least 80GB of VRAM to load the built-in VILA-1.5 35B model. It can be run on systems with 2xL40S, 1xA100 80GB, or 1xH100 GPUs.
cd remote_llm_deployment
If you look in the config.yaml file, you will see that the LLM, Reranker and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_llm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY and NGC_API_KEY fields. In this file, the VLM is set to VILA-1.5.
After setting your API keys in the .env file, you can launch VSS with docker compose.
docker compose up
Once VSS has loaded, you can access the UI at port 9100.
Local Deployment#
The local_deployment folder contains an example of how to launch VSS using the local built-in VLM and local endpoints for the LLM, Embedding, and Reranker models.
This requires at least 4xH100, 8xA100, or 8xL40S GPUs. The example configuration assumes a 4xH100 deployment.
To run this example, you must first launch local instances of the following components:
LLM
Embedding
Reranker
Step 1: Deploy LLM#
VSS can be configured to point to any OpenAI compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. To deploy an LLM, the Llama 3.1 70B NIM is recommended. The LLM could also be served by a third-party framework such as vLLM or SGLang.
For summarization-only use cases, models as small as Llama 3.1 8B can work well (see the sketch at the end of this step). However, for interactive Q&A, larger models are recommended to properly interface with the Graph Database.
Go to build.nvidia.com to explore available LLM NIMs and the LLM NIM Documentation for more details on deployment and GPU requirements.
To launch the recommended Llama 3.1 70B NIM, first log in to nvcr.io.
docker login nvcr.io
Then supply your API key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Once logged in, modify the snippet below to add your API key and run the command to launch the Llama 3.1 70B NIM on GPUs 0 and 1.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus '"device=0,1"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:latest
The first time this command is run, it will take some time to download and deploy the model. Ensure the LLM works by running a sample curl command.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.1-70b-instruct",
"messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
"max_tokens": 64
}'
You should now have a live LLM endpoint at port 8000 for VSS.
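If your use case is summarization only, the smaller Llama 3.1 8B model mentioned earlier can be swapped in. The sketch below assumes the 8B NIM follows the same image naming pattern and fits on a single GPU; verify the exact image tag and GPU requirements in the LLM NIM documentation, and, if needed, update the LLM model name in config.yaml to match.
docker run -it --rm \
--gpus '"device=0"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:latest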
Step 2: Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API key and run the command to launch the embedding NIM on GPU 2.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus '"device=2"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9234:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:latest
The first time this command is run, it will take some time to download and deploy the model. Ensure the Embedding NIM works by running a sample curl command.
curl -X "POST" \
"http://localhost:9234/v1/embeddings" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"input": ["Hello world"],
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input_type": "query"
}'
You should now have a live embedding endpoint at port 9234 for VSS.
Step 3: Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API key and run the command to launch the reranker NIM on GPU 2. (Both the embedding and reranker NIMs can fit on the same H100.)
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus '"device=2"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9235:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:latest
The first time this command is run, it will take some time to download and deploy the model. Ensure the reranker NIM works by running a sample curl command.
curl -X "POST" \
"http://localhost:9235/v1/ranking" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
"query": {"text": "which way did the traveler go?"},
"passages": [
{"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
{"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
{"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
{"text": "i shall be telling this with a sigh somewhere ages and ages hense: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
],
"truncate": "END"
}'
You should now have a live reranking endpoint at port 9235 for VSS.
Step 4: Deploy VSS#
Once all of the above NIMs have been deployed, you can run the local_deployment example.
cd local_deployment
If you look in the config.yaml file, you will see that the LLM, Reranker and Embedding model configurations have been set to use local endpoints to the NIMs launched in steps 1-3. If you launched these NIMs across different systems, you can modify the base_url parameter to point to another system running the NIM.
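Before starting VSS, it can help to confirm that all three endpoints are reachable from the machine that will run VSS (replace localhost with the appropriate host if a NIM runs on another system). The /v1/models route is part of the OpenAI compatible API served by the LLM NIM; the readiness route shown for the retriever NIMs is commonly exposed by NIM microservices, and if it is unavailable you can simply re-run the sample curl commands from steps 2 and 3.
curl -s http://localhost:8000/v1/models          # LLM NIM (step 1)
curl -s http://localhost:9234/v1/health/ready    # Embedding NIM (step 2)
curl -s http://localhost:9235/v1/health/ready    # Reranker NIM (step 3)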
Inside the local_deployment folder, edit the .env file and populate the NGC_API_KEY field. In this file, the VLM is set to VILA-1.5.
After setting your API key in the .env file, you can launch VSS with docker compose.
docker compose up
Once VSS has loaded, you can access the UI at port 9100.
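The usual docker compose lifecycle commands apply to this and the other deployment examples, for example:
docker compose ps        # check container status
docker compose logs -f   # follow the logs
docker compose down      # stop and remove the deployment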