Deploy Using Docker Compose (x86)#
The Video Search and Summarization Agent blueprint can be deployed through Docker Compose to allow for more customization and flexibility.
Prerequisites#
In addition to the minimum GPU requirements as per the deployment scenarios described below, the following prerequisites must be met:
Ubuntu 22.04
NVIDIA driver 580.65.06 (Recommended minimum version)
CUDA 13.0+ (CUDA driver installed with NVIDIA driver)
NVIDIA Container Toolkit 1.13.5+
Docker 27.5.1+
Docker Compose 2.32.4
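You can quickly confirm the installed versions before proceeding. The commands below are a sanity-check sketch for an Ubuntu 22.04 host with the NVIDIA stack installed:
nvidia-smi              # reports the NVIDIA driver and CUDA versions
nvidia-ctk --version    # NVIDIA Container Toolkit version
docker --version
docker compose version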
Deployment Components#
To fully launch the VSS blueprint the following parts are required:
LLM
Embedding Model
Reranking Model
VLM
VSS Engine
The Helm Chart deployment launches all of these components automatically. The Docker Compose deployment, however, launches only the VSS Engine, which does not include the LLM, Embedding, or Reranker models. These models must be launched separately or configured to use remote endpoints.
For the VLM in VSS there are three options:
Use the built-in VLM
For the lowest latency experience, VSS supports a built-in VLM. When using this VLM, the model is tightly integrated with the video decoding pipeline, leading to higher throughput and lower summarization latency. If configured, VSS automatically pulls the VLM and a built-in optimized TensorRT engine to run in the video ingestion pipeline. For example, set VLM_MODEL_TO_USE=cosmos-reason1, VLM_MODEL_TO_USE=nvila, or VLM_MODEL_TO_USE=vila-1.5 in the .env file along with the required MODEL_PATH. For more details, refer to VLM Model Path.
Use an OpenAI Compatible VLM
Any OpenAI compatible VLM can be used with VSS. This could be a proprietary VLM running in the cloud or a local VLM launched with a third-party framework such as vLLM or SGLang. For example, set VLM_MODEL_TO_USE=openai-compat in the .env file (see the sketch after these options).
Use a Custom VLM
To use a VLM that does not provide an OpenAI compatible REST API interface, follow the Custom Models section to add support for a new VLM.
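For the OpenAI compatible option, a minimal sketch is shown below, assuming the VLM is served locally with vLLM. The model name, host, and port are placeholders; the .env variables are the same ones used in the remote deployment example later on this page.
# Serve an OpenAI compatible VLM locally, for example with vLLM (example model; any compatible VLM works)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8001

#.env file
VLM_MODEL_TO_USE=openai-compat
VIA_VLM_ENDPOINT=http://<vlm-host>:8001
VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=Qwen/Qwen2.5-VL-7B-Instruct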
Configuration Options#
Several options in the Docker Compose deployment can be configured, including:
VSS deployment time configurations: VSS Deployment-Time Configuration Glossary.
Changing the default models in the VSS Docker Compose deployment: Plug-and-Play Overview.
Deployment Scenarios#
Multi-GPU Deployment#
Three Docker Compose setups are provided on GitHub to deploy VSS on a range of hardware platforms and to demonstrate how VSS can be configured to use both local and remote endpoints for the different components.
| Deployment Scenario | VLM (Cosmos-Reason1-7B) | LLM (Llama 3.1 70B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | CV | Audio | Minimum GPU Requirement |
|---|---|---|---|---|---|---|---|
| Remote Deployment | Remote* | Remote | Remote | Remote | Local | Remote | Minimum 8GB VRAM GPU (without CV), Minimum 16GB VRAM GPU (with CV) |
| Hybrid Deployment | Local | Remote | Remote | Remote | Local | Remote | 1xB200, 1xH100, 1xA100, 1xL40S, 1xRTX PRO 6000 Blackwell SE |
| Local Deployment | Local | Local | Local | Local | Local | Local | 4xB200, 4xH100, 8xA100, 8xL40S, 4xRTX PRO 6000 Blackwell SE |
Deployments are not limited to the examples provided in the table. You can mix and match local and remote components, distribute components across multiple systems, and experiment with different LLMs and VLMs to find a suitable configuration for your hardware.
Note
Cosmos-Reason1 7B FP8 (default) is not supported on L40S. Use Cosmos-Reason1 7B FP16 instead by setting MODEL_PATH to git:https://huggingface.co/nvidia/Cosmos-Reason1-7B in the .env file.
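For example, the relevant .env entries on L40S would look like the following illustrative sketch, using the variables described above:
#.env file (L40S example)
VLM_MODEL_TO_USE=cosmos-reason1
MODEL_PATH=git:https://huggingface.co/nvidia/Cosmos-Reason1-7B   # FP16 variant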
Note
The Remote Deployment scenario (marked * in the table above) uses openai/gpt-4o as the VLM.
Single GPU Deployment (Full Local Deployment)#
A Docker Compose setup is provided on GitHub to deploy VSS locally with the VLM, LLM, Embedding, and Reranker models on a single GPU.
| Deployment Scenario | VLM (Cosmos-Reason1-7B) | LLM (Llama 3.1 8B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | Minimum GPU Requirement |
|---|---|---|---|---|---|
| Single GPU Deployment | Local | Local | Local | Local | 1xH100, 1xRTX PRO 6000 Blackwell SE |
Note
All Docker commands on this page must be run without sudo. Ensure you can run Docker without sudo by following the Docker post-installation steps. Using sudo can break the way environment variables are passed into the container.
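The standard Docker post-installation steps (from the official Docker documentation) are sketched below; log out and back in, or run newgrp docker, for the group change to take effect.
sudo groupadd docker             # the group may already exist
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world           # verify Docker runs without sudo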
To get started, clone the GitHub repository to get the Docker Compose samples:
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
cd video-search-and-summarization/deploy/docker
Log into NGC so all containers are accessible.
docker login nvcr.io
Supply your NGC API Key.
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Based on your available hardware, follow one of the sections below to launch VSS with Docker Compose.
Note
For a list of all configuration options, refer to VSS Deployment-Time Configuration Glossary.
Remote Deployment#
The remote_vlm_deployment folder contains an example of how to launch VSS using remote endpoints for the VLM, LLM, Embedding, and Reranker models. This allows VSS to run with minimal hardware requirements; a modern GPU with at least 8GB of VRAM is recommended.
To run this deployment, get an NVIDIA API key from build.nvidia.com and an OpenAI API Key to use GPT-4o as the remote VLM. Any OpenAI compatible VLM can also be used for this.
cd remote_vlm_deployment
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_vlm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY, NGC_API_KEY, and OPENAI_API_KEY fields. Optionally, VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME can be adjusted to use a different model. VIA_VLM_ENDPOINT can also be adjusted to point to any OpenAI compatible VLM. Optionally, to enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true. For example:
#.env file
NVIDIA_API_KEY=nvapi-***
OPENAI_API_KEY=def456***
NGC_API_KEY=abc123***
#VIA_VLM_ENDPOINT=http://192.168.1.100:8000 # Optional
#VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=gpt-4o # Optional
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
Warning
The .env file will store your private API keys in plain text. Ensure proper security and permission settings are in place for this file or use a secrets manager to pass API key environment variables in a production environment.
After setting your API keys in the .env file, you can launch VSS with Docker Compose.
docker compose up
After VSS has loaded, you can access the UI at port 9100.
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Hybrid Deployment#
The remote_llm_deployment folder contains an example of how to launch VSS using the local built-in VLM and remote endpoints for the LLM, Embedding, and Reranker models.
This requires at least 40GB of VRAM to load the built-in Cosmos-Reason1 7B model, and can be run on systems with 1xL40S, 1xA100 80GB, 1xH100, 1xB200, or 1xRTX PRO 6000 Blackwell SE GPUs.
cd remote_llm_deployment
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_llm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY and NGC_API_KEY fields. You can observe in this file that the VLM is set to Cosmos-Reason1-7B. Optionally, to enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true.
#.env file
NVIDIA_API_KEY=abc123***
NGC_API_KEY=def456***
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
After setting your API keys in the .env file, you can launch VSS with Docker Compose.
docker compose up
After VSS has loaded, you can access the UI at port 9100.
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Local Deployment#
The local_deployment folder contains an example of how to launch VSS using the local built-in VLM and local endpoints for the LLM, Embedding, and Reranker models.
Create a Docker network for the local VSS deployment:
docker system prune
docker network create via-engine-${USER}
For details on minimum GPU requirements, see Deployment Scenarios.
The example configuration below assumes a 4xH100 deployment where:
GPU 0 is used for VLM
GPUs 1,2 are used for LLM
GPU 3 is used for Embedding, Reranker, and RIVA ASR
To run this example, you must launch local instances of the following components:
LLM
Embedding
Reranker
Deploy LLM#
VSS can be configured to point to any OpenAI compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. To deploy an LLM, the Llama 3.1 70B NIM is recommended. The LLM could also be served by a third party framework such as vLLM or SGLang.
For summarization-only use cases, models as small as Llama 3.1 8B can work well. However, for interactive Q&A, larger models are recommended to properly interface with the Graph Database.
Go to build.nvidia.com to explore available LLM NIMs. Each LLM NIM will have a Deploy tab with an example Docker run command to launch the NIM. LLM NIMs have different hardware requirements depending on the model’s size. Visit the LLM NIM Documentation for more details on deployment and GPU requirements.
To launch the recommended Llama 3.1 70B NIM:
Login to nvcr.io:
docker login nvcr.io
Supply your API Key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Modify the snippet below to add your API key and run the command to launch the Llama 3.1 70B NIM on GPUs 1 and 2:
Note
For optimal performance on specific hardware platforms, consider using hardware-optimized NIM model profiles. Profiles can vary depending on the number of GPUs and the hardware platform. See NIM Model Profile Optimization for detailed guidance on profile selection and configuration. Important: RTX PRO 6000 users must use NIM version 1.13.1 and the specific profile for llama-3.1-70b-instruct.
For RTX PRO 6000 Blackwell only (skip for other platforms), set the following environment variables:
# Only for RTX PRO 6000 Blackwell
export NIM_LLM_TAG=1.13.1
export NIM_MODEL_PROFILE="22bf424d6572fda243f890fe1a7c38fb0974c3ab27ebbbc7e2a2848d7af82bd6"   # 2 GPU profile
Launch the Llama 3.1 70B NIM:
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
    --gpus '"device=1,2"' \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.1-70b-instruct:${NIM_LLM_TAG:-1.10.1}
The first time this command is run, it will take some time to download and deploy the model. Ensure the LLM works by running a sample curl command:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-70b-instruct",
    "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
    "max_tokens": 64
  }'
Verify that you have a live LLM endpoint at port 8000 for VSS.
Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API Key and run the command to launch the embedding NIM on GPU 3.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
--gpus '"device=3"' \
--shm-size=16GB \
-e NGC_API_KEY \
-e NIM_TRT_ENGINE_HOST_CODE_ALLOWED=1 \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9234:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.9.0
The first time this command is run, it will take some time to download and deploy the model. Ensure the Embedding NIM works by running a sample curl command:
curl -X "POST" \
"http://0.0.0.0:9234/v1/embeddings" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"input": ["Hello world"],
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input_type": "query"
}'
Verify that you have a live embedding endpoint at port 9234 for VSS.
Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API key and run the command to launch the reranker NIM on GPU 3. (Both the embedding and reranker NIMs can fit on the same H100.)
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
--gpus '"device=3"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9235:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.7.0
The first time this command is run, it will take some time to download and deploy the model. Ensure the reranker NIM works by running a sample curl command.
curl -X "POST" \
"http://0.0.0.0:9235/v1/ranking" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
"query": {"text": "which way did the traveler go?"},
"passages": [
{"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
{"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
{"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
{"text": "i shall be telling this with a sigh somewhere ages and ages hense: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
],
"truncate": "END"
}'
Verify that you have a live reranking endpoint at port 9235 for VSS.
Deploy RIVA ASR NIM (Optional)#
VSS requires the RIVA ASR NIM to transcribe audio.
Set ENABLE_AUDIO=true in the .env file to enable audio transcription.
Note
Refer to the section Using Riva ASR NIM from build.nvidia.com for obtaining the RIVA_ASR_SERVER_API_KEY to be set in the .env file.
To launch the RIVA ASR NIM, modify the snippet below to add your API Key and run the command to launch the RIVA ASR NIM on GPU 3:
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export CONTAINER_NAME=parakeet-ctc-asr
docker run -d -it --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus '"device=3"' \
--shm-size=8GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-e NIM_TAGS_SELECTOR=name=parakeet-0-6b-ctc-riva-en-us,mode=all \
--network=via-engine-${USER} \
nvcr.io/nim/nvidia/parakeet-0-6b-ctc-en-us:2.0.0
Wait for the Riva ASR service to be ready, and verify that you have a live RIVA ASR endpoint at port 9000 for VSS:
RIVA_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' parakeet-ctc-asr)
curl -X 'GET' http://${RIVA_IP}:9000/v1/health/ready
Enable CV Pipeline (Optional)#
To enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true in the .env file. This will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
DISABLE_CV_PIPELINE=false
INSTALL_PROPRIETARY_CODECS=true
Deploy VSS#
After all of the NIMs have been deployed, you can run the local_deployment example.
cd local_deployment
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use local endpoints for the NIMs launched in the previous steps. If you launched these NIMs across different systems, you can modify the base_url parameter to point to another system running the NIM.
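For example, if the LLM NIM runs on a different machine, its entry in config.yaml might be adjusted along these lines. This is an illustrative sketch; keep the key names and structure already present in your config.yaml.
# config.yaml (illustrative excerpt)
llm:
  model: meta/llama-3.1-70b-instruct
  base_url: http://<llm-host>:8000/v1   # point to the system running the LLM NIM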
Inside the local_deployment folder, edit the .env file and populate the NGC_API_KEY field. You can observe in this file that the VLM is set to Cosmos-Reason1-7B.
#.env file
NGC_API_KEY=abc123***
.
.
.
After setting your API key in the .env file, you can launch VSS with Docker Compose.
docker compose up
After VSS has loaded, you can access the UI at port 9100 and the backend REST API endpoint at port 8100.
Test the deployment by summarizing a sample video.
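If you prefer to check from the command line first, you can inspect the containers and follow the VSS logs; the service name depends on the compose file, so use whatever docker compose ps reports.
docker compose ps                              # list the running services and their state
docker compose logs -f <vss-service-name>      # follow startup logs until the UI is reported ready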
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Fully Local Deployment: Single GPU#
A single GPU deployment recipe using non-default low-memory modes and smaller LLMs, verified on 1xB200, 1xH200, and 1xA100 (80GB+, HBM) machines, is described below.
The local_deployment_single_gpu folder contains an example of how to launch VSS on a single GPU.
This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on one GPU.
The example configuration assumes a 1xH100 (80GB) deployment.
To run this example, you must launch local instances of the following components:
LLM
Embedding
Reranker
Note
CV and audio-related features are currently not supported in the single GPU deployment.
Deploy LLM#
VSS can be configured to point to any OpenAI compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. To deploy an LLM, the Llama 3.1 8B NIM is recommended. The LLM could also be served by a third party framework such as vLLM or SGLang.
Go to build.nvidia.com to explore available LLM NIMs, and see the LLM NIM Documentation for more details on deployment and GPU requirements.
To launch the recommended Llama 3.1 8B NIM, first login to nvcr.io:
docker login nvcr.io
Then supply your API Key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
After logging in, modify the snippet below to add your API Key and run the command to launch the Llama 3.1 8B NIM on GPU 0.
For single GPU deployment, the NIM_LOW_MEMORY_MODE and NIM_RELAX_MEM_CONSTRAINTS environment variables are required to start the LLM NIM in low memory mode.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8007:8000 -e NIM_LOW_MEMORY_MODE=1 -e NIM_RELAX_MEM_CONSTRAINTS=1 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.12.0
Verify that you have a live LLM endpoint at port 8007 for VSS.
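As in the multi-GPU section, you can sanity-check the endpoint with a sample curl request once the model has finished downloading:
curl -X 'POST' \
  'http://0.0.0.0:8007/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
    "max_tokens": 64
  }'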
Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API Key and run the command to launch the embedding NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8006:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
-e NIM_TRT_ENGINE_HOST_CODE_ALLOWED=1 \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.9.0
Verify that you have a live embedding endpoint at port 8006 for VSS.
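A quick check mirrors the embedding curl from the multi-GPU section, pointed at port 8006:
curl -X "POST" \
  "http://0.0.0.0:8006/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["Hello world"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query"
  }'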
Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API Key and run the command to launch the reranker NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8005:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.7.0
You should now have a live reranking endpoint at port 8005 for VSS.
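You can verify it with a shortened version of the ranking request used earlier, pointed at port 8005:
curl -X "POST" \
  "http://0.0.0.0:8005/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
    "query": {"text": "which way did the traveler go?"},
    "passages": [{"text": "two roads diverged in a yellow wood"}],
    "truncate": "END"
  }'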
Deploy VSS#
After all of the NIMs have been deployed, you can run the local_deployment_single_gpu example.
cd local_deployment_single_gpu
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use local endpoints for the NIMs launched in the previous steps. If you launched these NIMs across different systems, you can modify the base_url parameter to point to another system running the NIM.
Inside the local_deployment_single_gpu folder, edit the .env file and populate the NGC_API_KEY field. You can observe in this file that the VLM is set to Cosmos-Reason1-7B.
#.env file
NGC_API_KEY=abc123***
.
.
.
After setting your API key in the .env file, you can launch VSS with Docker Compose.
docker compose up
Note
Guardrails are disabled for the single GPU deployment because of accuracy issues with the llama-3.1-8b-instruct model. If required, guardrails can be enabled by removing the DISABLE_GUARDRAILS environment variable from the .env file.
VSS is ready when the container logs show that startup has completed.

After VSS has loaded, you can access the UI at port 9100 and the backend REST API endpoint at port 8100.
Test the deployment by summarizing a sample video.
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Note
To enable the live stream preview, set INSTALL_PROPRIETARY_CODECS=true in the .env file.
To get the logs from containers running in detached mode, run docker logs <container_id>.
To remove the containers, run docker rm <container_id>.
Configuring GPU Allocation#
To customize the Docker Compose deployment for various GPU configurations:
For the VSS container, modify the NVIDIA_VISIBLE_DEVICES variable in the .env file.
For other containers launched using the docker run command, modify the --gpus '"device=0"' argument.
An example assignment is sketched at the end of this section.
The GPU device IDs must be set based on:
The number of GPUs available on the system.
GPU requirements for each service.
When using the VILA-1.5 VLM, VSS requires at least one 80+ GB GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE) or at least two 48 GB GPUs (L40S).
When using the Cosmos-Reason1 or NVILA VLM, or a remote VLM endpoint, VSS requires at least 1 GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE, L40S).
Embedding and Reranking require 1 GPU each but can share a GPU with VSS on an 80+ GB GPU.
RIVA ASR requires 1 GPU but can share a GPU with Embedding and Reranking on an 80+ GB GPU.
Check NVIDIA NIM for Large Language Models (LLMs) documentation for LLM GPU requirements.
GPUs can be shared even further by using the low memory modes and smaller VLM and LLM models as shown in Fully Local Deployment: Single GPU.
If using remote endpoints for any of the services, GPUs will not be used by these services and GPU requirements will be further reduced.
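For example, on the 4xH100 layout described in the Local Deployment section (GPU 0 for the VLM, GPUs 1 and 2 for the LLM, GPU 3 for Embedding, Reranker, and RIVA ASR), the assignments could look like the sketch below; the exact placement of each setting depends on the deployment folder and commands you are using.
#.env file: GPU visible to the VSS container (VLM and pipeline)
NVIDIA_VISIBLE_DEVICES=0

# docker run arguments for the separately launched NIMs
--gpus '"device=1,2"'   # LLM NIM
--gpus '"device=3"'     # Embedding, Reranker, and RIVA ASR NIMs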