Deploy Using Docker Compose#
The Video Search and Summarization Agent blueprint can be deployed through docker compose to allow for more customization and flexibility.
Deployment Components#
To fully launch the VSS blueprint, the following components are required:
LLM
Embedding Model
Reranking Model
VLM
VSS Engine
The Helm chart deployment automatically launches all of these components. When deploying VSS with docker compose, however, only the VSS Engine is launched, which does not include the LLM, Embedding, or Reranker models. These models must be launched separately or configured to use remote endpoints.
For the VLM in VSS there are three options:
Use the built-in VLM
For the lowest latency experience, VSS supports a built-in VLM. When using this VLM, the model is tightly integrated with the video decoding pipeline, leading to higher throughput and lower summarization latency. If configured, VSS automatically pulls the VLM and builds an optimized TensorRT engine to run in the video ingestion pipeline. For example, set VLM_MODEL_TO_USE=nvila or VLM_MODEL_TO_USE=vila-1.5 in the .env file along with the required MODEL_PATH. For more details, see VLM Model Path.
Use an OpenAI Compatible VLM
Any OpenAI compatible VLM can be used with VSS. This could be a proprietary VLM running in the cloud or a local VLM launched with a third-party framework such as vLLM or SGLang. For example, set VLM_MODEL_TO_USE=openai-compat in the .env file.
Use a Custom VLM
To use a VLM that does not provide an OpenAI compatible REST API interface, follow the Custom Models section for how to add support for a new VLM.
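As a rough sketch of how the first two options map to the .env file (the MODEL_PATH value below is a placeholder; see VLM Model Path for the supported formats):
#.env file
# Option 1: built-in VLM
VLM_MODEL_TO_USE=nvila
MODEL_PATH=<path or URI of the VLM model>             # placeholder; see VLM Model Path
# Option 2: any OpenAI compatible VLM
#VLM_MODEL_TO_USE=openai-compat
#VIA_VLM_ENDPOINT=<OpenAI compatible endpoint>        # Optional
#VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=<model name>    # Optional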
Configuration Options#
Some options in the docker compose deployment can be configured.
These options include:
VSS deployment time configurations: VSS Deployment-Time Configuration Glossary.
Changing the default models in the VSS docker compose deployment: Plug-and-Play Overview.
Deployment Scenarios#
Multi-GPU Deployment#
Three docker compose setups are provided on GitHub to quickly deploy VSS on a range of hardware platforms. They demonstrate how VSS can be configured to use both local and remote endpoints for the different components.
| Deployment Scenario | VLM (VILA-1.5 35B) | LLM (Llama 3.1 70B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | CV | Audio | Minimum GPU Requirement |
|---|---|---|---|---|---|---|---|
| Remote VLM Deployment | Remote* | Remote | Remote | Remote | Local | Remote | Minimum 8GB VRAM GPU (without CV); minimum 16GB VRAM GPU (with CV) |
| Remote LLM Deployment | Local | Remote | Remote | Remote | Local | Remote | 1xH100, 1xA100, 2xL40S |
| Local Deployment | Local | Local | Local | Local | Local | Local | 4xH100, 8xA100, 8xL40S |
Deployments are not limited to the examples provided in the table. You can mix and match local and remote components, distribute components across multiple systems, and experiment with different LLMs and VLMs to find a suitable configuration for your hardware.
Note
* Remote VLM deployment uses openai/gpt-4o as the VLM.
Single GPU Deployment (Full Local Deployment)#
A docker compose setup is provided on GitHub to locally deploy VSS using VLM, LLM, Embedding and Reranker models on a single GPU.
| Deployment Scenario | VLM (NVILA 15B) | LLM (Llama 3.1 8B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | Minimum GPU Requirement |
|---|---|---|---|---|---|
| Fully Local Deployment: Single GPU | Local | Local | Local | Local | 1xH100 |
In addition to the minimum GPU requirements, the following prerequisites must be met:
Ubuntu 22.04
NVIDIA driver 535.161.08 (Recommended minimum version)
CUDA 12.2+ (CUDA driver installed with NVIDIA driver)
NVIDIA Container Toolkit 1.13.5+
Docker 27.5.1+
Docker Compose 2.32.4
Note
All docker commands on this page should be run without sudo. Ensure you can run docker without sudo by following the Docker post-installation steps here. Using sudo may break the way environment variables are passed into the container.
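If you have not done this yet, the usual sequence from the official Docker post-installation guide looks like the following (log out and back in, or run newgrp docker, for the group change to take effect):
sudo groupadd docker           # may report that the group already exists
sudo usermod -aG docker $USER
newgrp docker                  # or log out and log back in
docker run hello-world         # verify docker runs without sudo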
To get started, first clone the GitHub repository to get the docker compose samples:
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
cd video-search-and-summarization/deploy/docker
Log into NGC so all containers are accessible.
docker login nvcr.io
Then supply your NGC API Key:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
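Alternatively, if your NGC API key is already exported as an environment variable, the login can be done non-interactively (the single quotes keep $oauthtoken from being expanded by the shell):
export NGC_API_KEY=<PASTE_API_KEY_HERE>
echo "$NGC_API_KEY" | docker login nvcr.io -u '$oauthtoken' --password-stdin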
Based on your available hardware, follow one of the sections below to launch VSS with docker compose.
Note
For an exhaustive list of all configuration options, see VSS Deployment-Time Configuration Glossary.
Remote VLM Deployment#
The remote_vlm_deployment folder contains an example of how to launch VSS using remote endpoints for the VLM, LLM, Embedding, and Reranker models. This allows VSS to run with minimal hardware requirements. A modern GPU with at least 8GB of VRAM is recommended.
To run this deployment, first get an NVIDIA API key from build.nvidia.com and an OpenAI API Key to use GPT-4o as the remote VLM. Any OpenAI compatible VLM can also be used for this.
cd remote_vlm_deployment
If you look in the config.yaml file, you will see that the LLM, Reranker, and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_vlm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY, NGC_API_KEY, and OPENAI_API_KEY fields. Optionally, VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME can be adjusted to use a different model, and VIA_VLM_ENDPOINT can be adjusted to point to any OpenAI compatible VLM. To enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true. For example:
#.env file
NVIDIA_API_KEY=nvapi-***
OPENAI_API_KEY=def456***
NGC_API_KEY=abc123***
#VIA_VLM_ENDPOINT=http://192.168.1.100:8000 # Optional
#VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=gpt-4o # Optional
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
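Optionally, before launching VSS, you can sanity check the remote VLM endpoint. The example below assumes the default OpenAI endpoint and the gpt-4o model; adjust the URL, model name, and key if VIA_VLM_ENDPOINT points elsewhere.
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <PASTE_OPENAI_API_KEY_HERE>" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Reply with OK if you can read this."}],
    "max_tokens": 5
  }'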
Warning
The .env file will store your private API keys in plain text. Ensure proper security and permission settings are in place for this file or use a secrets manager to pass API key environment variables in a production environment.
After setting your API keys in the .env file, you can launch VSS with docker compose.
docker compose up
Once VSS has loaded you can access the UI at port 9100.
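The first startup can take a while as models are downloaded and engines are built. To monitor progress or confirm that the containers are running, the standard docker compose commands can be used from the deployment directory:
docker compose ps        # check container status
docker compose logs -f   # follow the VSS startup logs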
Stopping the Deployment#
To stop any of the docker compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by docker compose.
Remote LLM Deployment#
The remote_llm_deployment folder contains an example of how to launch VSS using the local built-in VLM and remote endpoints for the LLM, Embedding, and Reranker models.
This requires at least 80GB of VRAM to load the built-in VILA-1.5 35B model. It can be run on systems with 2xL40S, 1xA100 80GB, or 1xH100 GPUs.
cd remote_llm_deployment
If you look in the config.yaml file, you will see that the LLM, Reranker, and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_llm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY and NGC_API_KEY fields. You can see in this file that the VLM is set to VILA-1.5. Optionally, to enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true.
#.env file
NVIDIA_API_KEY=abc123***
NGC_API_KEY=def456***
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
After setting your API keys in the .env
file, you can launch VSS with docker compose.
docker compose up
Once VSS has loaded you can access the UI at port 9100.
Stopping the Deployment#
To stop any of the docker compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by docker compose.
Local Deployment#
The local_deployment
folder contains an example of how to launch VSS using the local built in VLM and local endpoints for the LLM, Embedding and Reranker models.
Create a docker network for the local VSS deployment.
docker system prune
docker network create via-engine-${USER}
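You can optionally confirm that the network exists before continuing:
docker network ls --filter name=via-engine-${USER}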
For details on the minimum GPU requirements, see Deployment Scenarios.
The example configuration below assumes a 4xH100 deployment where
GPU 0 is used for VLM
GPUs 1,2 are used for LLM
GPU 3 is used for Embedding, Reranker and RIVA ASR
To run this example, you must first launch local instances of the following components:
LLM
Embedding
Reranker
Step 1: Deploy LLM#
VSS can be configured to point to any OpenAI compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. To deploy an LLM, the Llama 3.1 70B NIM is recommended. The LLM could also be served by a third party framework such as vLLM or SGLang.
For summarization only use cases, models as small as Llama 3.1 8B can work well. However, for interactive Q&A, larger models are recommended to properly interface with the Graph Database.
Go to build.nvidia.com to explore available LLM NIMs. Each LLM NIM will have a “Deploy” tab with an example docker run command to launch the NIM. LLM NIMs have different hardware requirements depending on the model’s size. Visit the LLM NIM Documentation for more details on deployment and GPU requirements.
To launch the recommended Llama 3.1 70B NIM, first log in to nvcr.io.
docker login nvcr.io
Then supply your API Key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Once logged in, modify the snippet below to add your API Key and run the command to launch the Llama 3.1 70B NIM on GPUs 1 and 2.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
--gpus '"device=1,2"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:1.3.3
The first time this command is run, it will take some time to download and deploy the model. Ensure the LLM works by running a sample curl command.
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.1-70b-instruct",
"messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
"max_tokens": 64
}'
You should now have a live LLM endpoint at port 8000 for VSS.
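If the sample request fails at first because the model is still loading, you can follow the container logs and, assuming the NIM exposes the standard health route, poll the readiness endpoint until it reports ready:
docker logs -f <container_id>
curl http://0.0.0.0:8000/v1/health/ready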
Step 2: Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API Key and run the command to launch embedding NIM on GPU 3.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
--gpus '"device=3"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9234:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0
The first time this command is run, it will take some time to download and deploy the model. Ensure the Embedding NIM works by running a sample curl command.
curl -X "POST" \
"http://0.0.0.0:9234/v1/embeddings" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"input": ["Hello world"],
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input_type": "query"
}'
You should now have a live embedding endpoint at port 9234 for VSS.
Step 3: Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API Key and run the command to launch the reranker NIM on GPU 3. (Both the embedding and reranker NIMs can fit on the same H100.)
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
--gpus '"device=3"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9235:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.3.0
The first time this command is run, it will take some time to download and deploy the model. Ensure the reranker NIM works by running a sample curl command.
curl -X "POST" \
"http://0.0.0.0:9235/v1/ranking" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
"query": {"text": "which way did the traveler go?"},
"passages": [
{"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
{"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
{"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
{"text": "i shall be telling this with a sigh somewhere ages and ages hense: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
],
"truncate": "END"
}'
You should now have a live reranking endpoint at port 9235 for VSS.
Step 4: Deploy RIVA ASR NIM (Optional)#
VSS requires the RIVA ASR NIM to transcribe audio. Set ENABLE_AUDIO=true in the .env file to enable audio transcription.
Note
Refer to the section Using Riva ASR NIM from build.nvidia.com for how to obtain the RIVA_ASR_SERVER_API_KEY to set in the .env file.
To launch the RIVA ASR NIM, modify the snippet below to add your API Key and run the command to launch the RIVA ASR NIM on GPU 3.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export CONTAINER_NAME=parakeet-ctc-asr
docker run -d -it --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus '"device=3"' \
--shm-size=8GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-e NIM_TAGS_SELECTOR=name=parakeet-0-6b-ctc-riva-en-us,mode=all \
--network=via-engine-${USER} \
nvcr.io/nim/nvidia/parakeet-0-6b-ctc-en-us:2.0.0
You should now have a live RIVA ASR endpoint at port 9000 for VSS.
Wait until the Riva ASR service is ready:
RIVA_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' parakeet-ctc-asr)
curl -X 'GET' http://${RIVA_IP}:9000/v1/health/ready
Step 5: Enable CV Pipeline (Optional)#
To enable the CV pipeline, set DISABLE_CV_PIPELINE=false
and INSTALL_PROPRIETARY_CODECS=true
in the .env file. This will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
DISABLE_CV_PIPELINE=false
INSTALL_PROPRIETARY_CODECS=true
Step 6: Deploy VSS#
Once all of the above NIMs have been deployed, you can run the local_deployment example.
cd local_deployment
If you look in the config.yaml
file, you will see that the LLM, Reranker and Embedding model configurations have been set to use local endpoints to the NIMs launched in steps 1-3. If you launched these NIMs across different systems, you can modify the base_url
parameter to point to another system running the NIM.
Inside the local_deployment folder, edit the .env file and populate the NGC_API_KEY field. You can see in this file that the VLM is set to VILA-1.5.
#.env file
NGC_API_KEY=abc123***
.
.
.
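Optionally, before launching VSS, verify that the local NIM endpoints from steps 1-3 are reachable. Assuming each NIM exposes the standard /v1/health/ready route, a quick check looks like:
curl http://0.0.0.0:8000/v1/health/ready   # LLM NIM
curl http://0.0.0.0:9234/v1/health/ready   # Embedding NIM
curl http://0.0.0.0:9235/v1/health/ready   # Reranker NIM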
After setting your API key in the .env
file, you can launch VSS with docker compose.
docker compose up
Once VSS has loaded you can access the UI at port 9100 and the backend REST API endpoint at port 8100.
Next, test the deployment by summarizing a sample video.
Stopping the Deployment#
To stop any of the docker compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by docker compose.
Fully Local Deployment: Single GPU#
A single GPU deployment recipe, using non-default low memory modes and smaller LLMs, is available below. It has been verified on 1xH100, 1xH200, and 1xA100 (80GB+, HBM) machines.
The local_deployment_single_gpu folder contains an example of how to launch VSS on a single GPU.
This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on a single GPU.
The example configuration assumes a 1xH100 (80GB) deployment.
To run this example, you must first launch local instances of the following components:
LLM
Embedding
Reranker
Note
CV and audio-related features are currently not supported in the single GPU deployment.
Step 1: Deploy LLM#
VSS can be configured to point to any OpenAI compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. To deploy an LLM, the Llama 3.1 8B NIM is recommended. The LLM could also be served by a third party framework such as vLLM or SGLang.
Go to build.nvidia.com to explore available LLM NIMs and the LLM NIM Documentation for more details on deployment and GPU requirements.
To launch the recommended Llama 3.1 8B NIM, first log in to nvcr.io.
docker login nvcr.io
Then supply your API Key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Once logged in, modify the snippet below to add your API Key and run the command to launch the Llama 3.1 8B NIM on GPU 0.
For single GPU deployment, the NIM_LOW_MEMORY_MODE
and NIM_RELAX_MEM_CONSTRAINTS
environment variables are required to start LLM NIM in low memory mode.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8007:8000 -e NIM_LOW_MEMORY_MODE=1 -e NIM_RELAX_MEM_CONSTRAINTS=1 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
You should now have a live LLM endpoint at port 8007 for VSS.
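As in the multi-GPU recipe, you can verify the endpoint with a sample request; note that the port is 8007 and the model name differs:
curl -X 'POST' \
  'http://0.0.0.0:8007/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
    "max_tokens": 64
  }'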
Step 2: Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API Key and run the command to launch embedding NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8006:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.3.0
You should now have a live embedding endpoint at port 8006 for VSS.
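You can verify it with the same sample request used in the multi-GPU recipe, adjusted to port 8006:
curl -X "POST" \
  "http://0.0.0.0:8006/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["Hello world"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query"
  }'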
Step 3: Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API Key and run the command to launch the reranker NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8005:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.3.0
You should now have a live reranking endpoint at port 8005 for VSS.
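You can verify it with a sample ranking request on port 8005 (the passages below are just placeholder text):
curl -X "POST" \
  "http://0.0.0.0:8005/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
    "query": {"text": "which way did the traveler go?"},
    "passages": [
      {"text": "two roads diverged in a yellow wood"},
      {"text": "i took the one less traveled by"}
    ],
    "truncate": "END"
  }'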
Step 4: Deploy VSS#
Once all of the above NIMs have been deployed, you can run the local_deployment_single_gpu example.
cd local_deployment_single_gpu
If you look in the config.yaml
file, you will see that the LLM, Reranker and Embedding model configurations have been set to use local endpoints to the NIMs launched in steps 1-3. If you launched these NIMs across different systems, you can modify the base_url
parameter to point to another system running the NIM.
Inside the local_deployment_single_gpu folder, edit the .env file and populate the NGC_API_KEY field. You can see in this file that the VLM is set to NVILA-15B.
#.env file
NGC_API_KEY=abc123***
.
.
.
After setting your API key in the .env file, you can launch VSS with docker compose.
docker compose up
VSS is ready when the startup logs indicate that the VSS server has finished loading.
Once VSS has loaded you can access the UI at port 9100 and the backend REST API endpoint at port 8100.
Next, test the deployment by summarizing a sample video.
Stopping the Deployment#
To stop any of the docker compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by docker compose.
Note
To enable the live stream preview, set INSTALL_PROPRIETARY_CODECS=true in the .env file.
To get the logs from containers running in detached mode, run docker logs <container_id>.
To remove the containers, run docker rm <container_id>.
Configuring GPU Allocation#
To customize the docker compose deployment for various GPU configurations:
For the VSS container, modify the NVIDIA_VISIBLE_DEVICES variable in the .env file.
For other containers launched using the docker run command, modify the --gpus '"device=0"' argument.
The GPU device IDs must be set based on:
The number of GPUs available on the system.
GPU requirements for each service.
When using the VILA-1.5 VLM, VSS requires at least one GPU with 80+ GB of memory (A100, H100, H200) or at least two 48 GB GPUs (L40S).
When using NVILA VLM or a remote VLM endpoint, VSS requires at least 1 GPU (A100, H100, H200, L40s).
Embedding and Reranking require 1 GPU each but can share a GPU with VSS on an 80+ GB GPU.
Check NVIDIA NIM for Large Language Models (LLMs) documentation for LLM GPU requirements.
GPUs can be shared even further by using the low memory modes and smaller VLM and LLM models as shown in Fully Local Deployment: Single GPU.
If using remote endpoints for any of the services, GPUs will not be used by these services and GPU requirements will be further reduced.
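For example, a hypothetical split on a 4-GPU system (the device IDs are illustrative and should match your own GPU layout) might look like:
#.env for the VSS container
NVIDIA_VISIBLE_DEVICES=0
# docker run for the LLM NIM
docker run ... --gpus '"device=1,2"' ...
# docker run for the Embedding, Reranker, and RIVA ASR NIMs
docker run ... --gpus '"device=3"' ...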