Deploy Using Docker Compose (x86)#
The Video Search and Summarization Agent blueprint can be deployed through Docker Compose to allow for more customization and flexibility.
Prerequisites#
In addition to the minimum GPU requirements as per the deployment scenarios described below, the following prerequisites must be met:
Ubuntu 22.04
NVIDIA driver 580.65.06 (Recommended minimum version)
CUDA 13.0+ (CUDA driver installed with NVIDIA driver)
NVIDIA Container Toolkit 1.13.5+
Docker 27.5.1+
Docker Compose 2.32.4
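You can quickly confirm the installed versions before proceeding. The commands below are a sanity-check sketch for an Ubuntu 22.04 host with the NVIDIA stack installed:
nvidia-smi              # reports the NVIDIA driver and CUDA versions
nvidia-ctk --version    # NVIDIA Container Toolkit version
docker --version
docker compose version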
Deployment Components#
To fully launch the VSS blueprint the following parts are required:
LLM
Embedding Model
Reranking Model
VLM
VSS Engine
The Helm Chart deployment launches all of these components automatically. The Docker Compose deployment, however, launches only the VSS Engine, which does not include the LLM, Embedding, or Reranker models. These models must be launched separately or configured to use remote endpoints.
For the VLM in VSS there are three options:
Use the built-in VLM
For the lowest latency experience, VSS supports a built-in VLM. When using this VLM, the model is tightly integrated with the video decoding pipeline, leading to higher throughput and lower summarization latency. If configured, VSS automatically pulls the VLM and a built-in optimized TensorRT engine to run in the video ingestion pipeline. For example, set VLM_MODEL_TO_USE=cosmos-reason1, VLM_MODEL_TO_USE=nvila, or VLM_MODEL_TO_USE=vila-1.5 in the .env file along with the required MODEL_PATH. For more details, refer to VLM Model Path.
Use an OpenAI Compatible VLM
Any OpenAI compatible VLM can be used with VSS. This could be a proprietary VLM running in the cloud or a local VLM launched with a third-party framework such as vLLM or SGLang. For example, set VLM_MODEL_TO_USE=openai-compat in the .env file (see the sketch after these options).
Use a Custom VLM
To use a VLM that does not provide an OpenAI compatible REST API interface, follow the Custom Models section to add support for a new VLM.
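For the OpenAI compatible option, a minimal sketch is shown below, assuming the VLM is served locally with vLLM. The model name, host, and port are placeholders; the .env variables are the same ones used in the remote deployment example later on this page.
# Serve an OpenAI compatible VLM locally, for example with vLLM (example model; any compatible VLM works)
vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8001

#.env file
VLM_MODEL_TO_USE=openai-compat
VIA_VLM_ENDPOINT=http://<vlm-host>:8001
VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=Qwen/Qwen2.5-VL-7B-Instruct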
Configuration Options#
Several options in the Docker Compose deployment can be configured, including:
VSS deployment time configurations: VSS Deployment-Time Configuration Glossary.
Changing the default models in the VSS Docker Compose deployment: Plug-and-Play Overview.
Deployment Scenarios#
Multi-GPU Deployment#
Three Docker Compose setups are provided on GitHub to deploy VSS on a range of hardware platforms and to demonstrate how VSS can be configured to use both local and remote endpoints for the different components.
| Deployment Scenario | VLM (Cosmos-Reason1-7B) | LLM (Llama 3.1 70B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | CV | Audio | Minimum GPU Requirement |
|---|---|---|---|---|---|---|---|
| Remote Deployment | Remote* | Remote | Remote | Remote | Local | Remote | Minimum 8GB VRAM GPU (without CV), Minimum 16GB VRAM GPU (with CV) |
| Hybrid Deployment | Local | Remote | Remote | Remote | Local | Remote | 1xB200, 1xH100, 1xA100, 1xL40S, 1xRTX PRO 6000 Blackwell SE |
| Local Deployment | Local | Local | Local | Local | Local | Local | 4xB200, 4xH100, 8xA100, 8xL40S, 4xRTX PRO 6000 Blackwell SE |
Deployments are not limited to the examples provided in the table. You can mix and match local and remote components, distribute components across multiple systems, and experiment with different LLMs and VLMs to find a suitable configuration for your hardware.
Note
Cosmos-Reason1 7B FP8 (default) is not supported on L40S. Use Cosmos-Reason1 7B FP16 instead by setting MODEL_PATH to git:https://huggingface.co/nvidia/Cosmos-Reason1-7B in the .env file.
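For example, the relevant .env entries on L40S would look like the following illustrative sketch, using the variables described above:
#.env file (L40S example)
VLM_MODEL_TO_USE=cosmos-reason1
MODEL_PATH=git:https://huggingface.co/nvidia/Cosmos-Reason1-7B   # FP16 variant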
Note
The Remote Deployment scenario (marked * in the table above) uses openai/gpt-4o as the VLM.
Single GPU Deployment (Full Local Deployment)#
A Docker Compose setup is provided on GitHub to deploy VSS locally with the VLM, LLM, Embedding, and Reranker models on a single GPU.
| Deployment Scenario | VLM (Cosmos-Reason1-7B) | LLM (Llama 3.1 8B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | Minimum GPU Requirement |
|---|---|---|---|---|---|
| Single GPU Deployment | Local | Local | Local | Local | 1xH100, 1xRTX PRO 6000 Blackwell SE |
Note
All Docker commands on this page must be run without sudo. Ensure you can run Docker without sudo by following the Docker post-installation steps. Using sudo can break the way environment variables are passed into the container.
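The standard Docker post-installation steps (from the official Docker documentation) are sketched below; log out and back in, or run newgrp docker, for the group change to take effect.
sudo groupadd docker             # the group may already exist
sudo usermod -aG docker $USER
newgrp docker
docker run hello-world           # verify Docker runs without sudo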
To get started, clone the GitHub repository to get the Docker Compose samples:
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
cd video-search-and-summarization/deploy/docker
Log into NGC so all containers are accessible.
docker login nvcr.io
Supply your NGC API Key.
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Based on your available hardware, follow one of the sections below to launch VSS with Docker Compose.
Note
For a list of all configuration options, refer to VSS Deployment-Time Configuration Glossary.
Remote Deployment#
The remote_vlm_deployment folder contains an example of how to launch VSS using remote endpoints for the VLM, LLM, Embedding, and Reranker models. This allows VSS to run with minimal hardware requirements; a modern GPU with at least 8GB of VRAM is recommended.
To run this deployment, get an NVIDIA API key from build.nvidia.com and an OpenAI API Key to use GPT-4o as the remote VLM. Any OpenAI compatible VLM can also be used for this.
cd remote_vlm_deployment
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_vlm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY, NGC_API_KEY, and OPENAI_API_KEY fields. Optionally, VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME can be adjusted to use a different model. VIA_VLM_ENDPOINT can also be adjusted to point to any OpenAI compatible VLM. Optionally, to enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true. For example:
#.env file
NVIDIA_API_KEY=nvapi-***
OPENAI_API_KEY=def456***
NGC_API_KEY=abc123***
#VIA_VLM_ENDPOINT=http://192.168.1.100:8000 # Optional
#VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=gpt-4o # Optional
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
Warning
The .env file will store your private API keys in plain text. Ensure proper security and permission settings are in place for this file or use a secrets manager to pass API key environment variables in a production environment.
After setting your API keys in the .env file, you can launch VSS with Docker Compose.
docker compose up
After VSS has loaded, you can access the UI at port 9100.
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Hybrid Deployment#
The remote_llm_deployment folder contains an example of how to launch VSS using the local built-in VLM and remote endpoints for the LLM, Embedding, and Reranker models.
This requires at least 40GB of VRAM to load the built-in Cosmos-Reason1 7B model, and can be run on systems with 1xL40S, 1xA100 80GB, 1xH100, 1xB200, or 1xRTX PRO 6000 Blackwell SE GPUs.
cd remote_llm_deployment
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use remote endpoints from build.nvidia.com.
Inside the remote_llm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY and NGC_API_KEY fields. You can observe in this file that the VLM is set to Cosmos-Reason1-7B. Optionally, to enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true.
#.env file
NVIDIA_API_KEY=abc123***
NGC_API_KEY=def456***
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
After setting your API keys in the .env file, you can launch VSS with Docker Compose.
docker compose up
After VSS has loaded, you can access the UI at port 9100.
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Local Deployment#
The local_deployment folder contains an example of how to launch VSS using the local built-in VLM and local endpoints for the LLM, Embedding, and Reranker models.
Create a Docker network for the local VSS deployment:
docker system prune
docker network create via-engine-${USER}
For details on minimum GPU requirements, see Deployment Scenarios.
The example configuration below assumes a 4xH100 deployment where:
GPU 0 is used for VLM
GPUs 1,2 are used for LLM
GPU 3 is used for Embedding, Reranker, and RIVA ASR
To run this example, you must launch local instances of the following components:
LLM
Embedding
Reranker
Deploy LLM#
VSS can be configured to point to any OpenAI compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. To deploy an LLM, the Llama 3.1 70B NIM is recommended. The LLM could also be served by a third party framework such as vLLM or SGLang.
For summarization-only use cases, models as small as Llama 3.1 8B can work well. However, for interactive Q&A, larger models are recommended to properly interface with the Graph Database.
Go to build.nvidia.com to explore available LLM NIMs. Each LLM NIM will have a Deploy tab with an example Docker run command to launch the NIM. LLM NIMs have different hardware requirements depending on the model’s size. Visit the LLM NIM Documentation for more details on deployment and GPU requirements.
To launch the recommended Llama 3.1 70B NIM:
Login to nvcr.io:
docker login nvcr.io
Supply your API Key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Modify the snippet below to add your API key and run the command to launch the Llama 3.1 70B NIM on GPUs 1 and 2:
Note
For optimal performance on specific hardware platforms, consider using hardware-optimized NIM model profiles. Profiles can vary depending on the number of GPUs and the hardware platform. See NIM Model Profile Optimization for detailed guidance on profile selection and configuration. Important: RTX PRO 6000 users must use NIM version 1.13.1 and the specific profile for llama-3.1-70b-instruct.
For RTX PRO 6000 Blackwell only (skip for other platforms), set the following environment variables:
# Only for RTX PRO 6000 Blackwell
export NIM_LLM_TAG=1.13.1
export NIM_MODEL_PROFILE="22bf424d6572fda243f890fe1a7c38fb0974c3ab27ebbbc7e2a2848d7af82bd6"   # 2 GPU profile
Launch the Llama 3.1 70B NIM:
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
    --gpus '"device=1,2"' \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_MODEL_PROFILE \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.1-70b-instruct:${NIM_LLM_TAG:-1.10.1}
The first time this command is run, it will take some time to download and deploy the model. Ensure the LLM works by running a sample curl command:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-70b-instruct",
    "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
    "max_tokens": 64
  }'
Verify that you have a live LLM endpoint at port 8000 for VSS.
Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API Key and run the command to launch the embedding NIM on GPU 3.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
--gpus '"device=3"' \
--shm-size=16GB \
-e NGC_API_KEY \
-e NIM_TRT_ENGINE_HOST_CODE_ALLOWED=1 \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9234:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.9.0
The first time this command is run, it will take some time to download and deploy the model. Ensure the Embedding NIM works by running a sample curl command:
curl -X "POST" \
"http://0.0.0.0:9234/v1/embeddings" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"input": ["Hello world"],
"model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
"input_type": "query"
}'
Verify that you have a live embedding endpoint at port 9234 for VSS.
Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API key and run the command to launch the reranker NIM on GPU 3. (Both the embedding and reranker NIMs can fit on the same H100.)
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it \
--gpus '"device=3"' \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 9235:8000 \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.7.0
The first time this command is run, it will take some time to download and deploy the model. Ensure the reranker NIM works by running a sample curl command.
curl -X "POST" \
"http://0.0.0.0:9235/v1/ranking" \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
"query": {"text": "which way did the traveler go?"},
"passages": [
{"text": "two roads diverged in a yellow wood, and sorry i could not travel both and be one traveler, long i stood and looked down one as far as i could to where it bent in the undergrowth;"},
{"text": "then took the other, as just as fair, and having perhaps the better claim because it was grassy and wanted wear, though as for that the passing there had worn them really about the same,"},
{"text": "and both that morning equally lay in leaves no step had trodden black. oh, i marked the first for another day! yet knowing how way leads on to way i doubted if i should ever come back."},
{"text": "i shall be telling this with a sigh somewhere ages and ages hense: two roads diverged in a wood, and i, i took the one less traveled by, and that has made all the difference."}
],
"truncate": "END"
}'
Verify that you have a live reranking endpoint at port 9235 for VSS.
Deploy RIVA ASR NIM (Optional)#
VSS requires the RIVA ASR NIM to transcribe audio.
Set ENABLE_AUDIO=true in the .env file to enable audio transcription.
Note
Refer to the section Using Riva ASR NIM from build.nvidia.com for obtaining the RIVA_ASR_SERVER_API_KEY to be set in the .env file.
To launch the RIVA ASR NIM, modify the snippet below to add your API Key and run the command to launch the RIVA ASR NIM on GPU 3:
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export CONTAINER_NAME=parakeet-ctc-asr
docker run -d -it --name=$CONTAINER_NAME \
--runtime=nvidia \
--gpus '"device=3"' \
--shm-size=8GB \
-e NGC_API_KEY=$NGC_API_KEY \
-e NIM_HTTP_API_PORT=9000 \
-e NIM_GRPC_API_PORT=50051 \
-e NIM_TAGS_SELECTOR=name=parakeet-0-6b-ctc-riva-en-us,mode=all \
--network=via-engine-${USER} \
nvcr.io/nim/nvidia/parakeet-0-6b-ctc-en-us:2.0.0
Wait for the Riva ASR service to be ready, and verify that you have a live RIVA ASR endpoint at port 9000 for VSS:
RIVA_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' parakeet-ctc-asr)
curl -X 'GET' http://${RIVA_IP}:9000/v1/health/ready
Enable CV Pipeline (Optional)#
To enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true in the .env file. This will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.
DISABLE_CV_PIPELINE=false
INSTALL_PROPRIETARY_CODECS=true
Deploy VSS#
After all of the NIMs have been deployed, you can run the local_deployment example.
cd local_deployment
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use local endpoints for the NIMs launched in the previous steps. If you launched these NIMs across different systems, you can modify the base_url parameter to point to another system running the NIM.
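For example, if the LLM NIM runs on a different machine, its entry in config.yaml might be adjusted along these lines. This is an illustrative sketch; keep the key names and structure already present in your config.yaml.
# config.yaml (illustrative excerpt)
llm:
  model: meta/llama-3.1-70b-instruct
  base_url: http://<llm-host>:8000/v1   # point to the system running the LLM NIM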
Inside the local_deployment folder, edit the .env file and populate the NGC_API_KEY field. You can observe in this file that the VLM is set to Cosmos-Reason1-7B.
#.env file
NGC_API_KEY=abc123***
.
.
.
After setting your API key in the .env file, you can launch VSS with Docker Compose.
docker compose up
After VSS has loaded, you can access the UI at port 9100 and the backend REST API endpoint at port 8100.
Test the deployment by summarizing a sample video.
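If you prefer to check from the command line first, you can inspect the containers and follow the VSS logs; the service name depends on the compose file, so use whatever docker compose ps reports.
docker compose ps                              # list the running services and their state
docker compose logs -f <vss-service-name>      # follow startup logs until the UI is reported ready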
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Fully Local Deployment: Single GPU#
A single GPU deployment recipe using non-default low-memory modes and smaller LLMs, verified on 1xB200, 1xH200, and 1xA100 (80GB+, HBM) machines, is described below.
The local_deployment_single_gpu folder contains an example of how to launch VSS on a single GPU.
This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on one GPU.
The example configuration assumes a 1xH100 (80GB) deployment.
To run this example, you must launch local instances of the following components:
LLM
Embedding
Reranker
Note
CV and audio-related features are currently not supported in the single GPU deployment.
Deploy LLM#
VSS can be configured to point to any OpenAI compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. To deploy an LLM, the Llama 3.1 8B NIM is recommended. The LLM could also be served by a third party framework such as vLLM or SGLang.
Go to build.nvidia.com to explore available LLM NIMs, and see the LLM NIM Documentation for more details on deployment and GPU requirements.
To launch the recommended Llama 3.1 8B NIM, first login to nvcr.io:
docker login nvcr.io
Then supply your API Key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
After logging in, modify the snippet below to add your API Key and run the command to launch the Llama 3.1 8B NIM on GPU 0.
For single GPU deployment, the NIM_LOW_MEMORY_MODE and NIM_RELAX_MEM_CONSTRAINTS environment variables are required to start the LLM NIM in low memory mode.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8007:8000 -e NIM_LOW_MEMORY_MODE=1 -e NIM_RELAX_MEM_CONSTRAINTS=1 \
nvcr.io/nim/meta/llama-3.1-8b-instruct:1.12.0
Verify that you have a live LLM endpoint at port 8007 for VSS.
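As in the multi-GPU section, you can sanity-check the endpoint with a sample curl request once the model has finished downloading:
curl -X 'POST' \
  'http://0.0.0.0:8007/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
    "max_tokens": 64
  }'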
Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API Key and run the command to launch the embedding NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8006:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
-e NIM_TRT_ENGINE_HOST_CODE_ALLOWED=1 \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.9.0
Verify that you have a live embedding endpoint at port 8006 for VSS.
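A quick check mirrors the embedding curl from the multi-GPU section, pointed at port 8006:
curl -X "POST" \
  "http://0.0.0.0:8006/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": ["Hello world"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query"
  }'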
Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API Key and run the command to launch the reranker NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8005:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.7.0
You should now have a live reranking endpoint at port 8005 for VSS.
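You can verify it with a shortened version of the ranking request used earlier, pointed at port 8005:
curl -X "POST" \
  "http://0.0.0.0:8005/v1/ranking" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
    "query": {"text": "which way did the traveler go?"},
    "passages": [{"text": "two roads diverged in a yellow wood"}],
    "truncate": "END"
  }'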
Deploy VSS#
After all of the NIMs have been deployed, you can run the local_deployment_single_gpu example.
cd local_deployment_single_gpu
If you look in the config.yaml file, you will observe that the LLM, Reranker, and Embedding model configurations have been set to use local endpoints for the NIMs launched in the previous steps. If you launched these NIMs across different systems, you can modify the base_url parameter to point to another system running the NIM.
Inside the local_deployment_single_gpu folder, edit the .env file and populate the NGC_API_KEY field. You can observe in this file that the VLM is set to Cosmos-Reason1-7B.
#.env file
NGC_API_KEY=abc123***
.
.
.
After setting your API key in the .env file, you can launch VSS with Docker Compose.
docker compose up
Note
Guardrails are disabled for the single GPU deployment because of accuracy issues with the llama-3.1-8b-instruct model. If required, guardrails can be enabled by removing the DISABLE_GUARDRAILS environment variable from the .env file.
VSS is ready when the container logs show that startup has completed.

After VSS has loaded, you can access the UI at port 9100 and the backend REST API endpoint at port 8100.
Test the deployment by summarizing a sample video.
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Note
To enable the live stream preview, set INSTALL_PROPRIETARY_CODECS=true in the .env file.
To get the logs from containers running in detached mode, run docker logs <container_id>.
To remove the containers, run docker rm <container_id>.
Configuring GPU Allocation#
To customize the Docker Compose deployment for various GPU configurations:
For the VSS container, modify the NVIDIA_VISIBLE_DEVICES variable in the .env file.
For other containers launched using the docker run command, modify the --gpus '"device=0"' argument.
An example assignment is sketched at the end of this section.
The GPU device IDs must be set based on:
The number of GPUs available on the system.
GPU requirements for each service.
When using the VILA-1.5 VLM, VSS requires at least one 80+ GB GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE) or at least two 48 GB GPUs (L40S).
When using the Cosmos-Reason1 or NVILA VLM, or a remote VLM endpoint, VSS requires at least 1 GPU (A100, H100, H200, B200, RTX PRO 6000 Blackwell SE, L40S).
Embedding and Reranking require 1 GPU each but can share a GPU with VSS on an 80+ GB GPU.
RIVA ASR requires 1 GPU but can share a GPU with Embedding and Reranking on an 80+ GB GPU.
Check NVIDIA NIM for Large Language Models (LLMs) documentation for LLM GPU requirements.
GPUs can be shared even further by using the low memory modes and smaller VLM and LLM models as shown in Fully Local Deployment: Single GPU.
If using remote endpoints for any of the services, GPUs will not be used by these services and GPU requirements will be further reduced.
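For example, on the 4xH100 layout described in the Local Deployment section (GPU 0 for the VLM, GPUs 1 and 2 for the LLM, GPU 3 for Embedding, Reranker, and RIVA ASR), the assignments could look like the sketch below; the exact placement of each setting depends on the deployment folder and commands you are using.
#.env file: GPU visible to the VSS container (VLM and pipeline)
NVIDIA_VISIBLE_DEVICES=0

# docker run arguments for the separately launched NIMs
--gpus '"device=1,2"'   # LLM NIM
--gpus '"device=3"'     # Embedding, Reranker, and RIVA ASR NIMs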