Deploy Using Docker Compose ARM#
Deployment Scenarios#
NVIDIA Jetson Thor and NVIDIA DGX Spark support the following deployment scenarios:
| Deployment Scenario | VLM | LLM (Llama 3.1 70B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) | CV | Audio |
|---|---|---|---|---|---|---|
| Remote Deployment | Remote (OpenAI gpt-4o) | Remote | Remote | Remote | Local | Remote |
| Hybrid Deployment | Local (Cosmos-Reason1-7B) | Remote | Remote | Remote | Local | Remote |
| Local VLM only | Local (Cosmos-Reason1-7B) | NA | NA | NA | NA | NA |
NVIDIA DGX Spark additionally supports the Full Local Deployment scenario:
[DGX Spark only] Single GPU Deployment (Full Local Deployment)
A Docker Compose setup is provided on GitHub to locally deploy VSS using VLM, LLM, Embedding, and Reranker models on a single GPU.
| Deployment Scenario | VLM (Cosmos-Reason1-7B) | LLM (Llama 3.1 8B) | Embedding (llama-3.2-nv-embedqa-1b-v2) | Reranker (llama-3.2-nv-rerankqa-1b-v2) |
|---|---|---|---|---|
| Single GPU Deployment (Full Local Deployment) | Local | Local | Local | Local |
Note
All Docker commands on this page must be run without sudo. Ensure you can run Docker without sudo by following the Docker post installation steps. Using sudo can break the way environment variables are passed into the container.
Refer to Prerequisites (NVIDIA Jetson Thor) and Prerequisites (NVIDIA DGX Spark) for the necessary prerequisites.
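If Docker currently requires sudo on your system, the usual post-installation steps (as documented in the official Docker documentation) look like the sketch below:
# Add your user to the docker group so Docker can run without sudo.
sudo groupadd docker          # the group may already exist
sudo usermod -aG docker $USER
newgrp docker                 # or log out and back in for the change to take effect

# Verify that Docker works without sudo.
docker run --rm hello-world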
To get started, clone the GitHub repository to get the Docker Compose samples:
git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
cd video-search-and-summarization/deploy/docker
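The Docker Compose samples referenced on this page live under deploy/docker; a quick listing should show the folders used in the sections below:
# List the Docker Compose sample folders used on this page.
ls
# Expected to include, among others:
#   remote_vlm_deployment  remote_llm_deployment  local_deployment_single_gpu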
Run the cache cleaner script:
# In another terminal, start the cache cleaner script.
# Alternatively, append " &" to the end of the command to run it in the background.
sudo sh <video-search-and-summarization>/deploy/scripts/sys_cache_cleaner.sh
Log into NGC so all containers are accessible.
docker login nvcr.io
Then supply your NGC API Key.
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
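Alternatively, the key can be piped to docker login non-interactively, which is useful for scripts. This sketch assumes the key is exported as NGC_API_KEY:
# Non-interactive NGC login; assumes NGC_API_KEY is set in the environment.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin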
Follow one of the sections below to launch VSS with Docker Compose on Jetson Thor or DGX Spark:
Remote Deployment
Hybrid Deployment
Fully Local Deployment: Single GPU (DGX Spark only)
Note
For a list of all configuration options, refer to VSS Deployment-Time Configuration Glossary.
Remote Deployment#
The remote_vlm_deployment folder contains an example of how to launch VSS using remote endpoints for the VLM, LLM, Embedding, and Reranker models. This allows VSS to run with minimal hardware requirements; a modern GPU with at least 8 GB of VRAM is recommended.
To run this deployment, get an NVIDIA API key from build.nvidia.com and an OpenAI API key to use GPT-4o as the remote VLM. Any OpenAI-compatible VLM can be used instead.
cd remote_vlm_deployment
The config.yaml file in this folder sets the LLM, Reranker, and Embedding model configurations to use remote endpoints from build.nvidia.com.
Inside the remote_vlm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY, NGC_API_KEY, and OPENAI_API_KEY fields. Optionally, VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME can be adjusted to use a different model, and VIA_VLM_ENDPOINT can be adjusted to point to any OpenAI-compatible VLM. To enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true. For example:
#.env file
NVIDIA_API_KEY=nvapi-***
OPENAI_API_KEY=def456***
NGC_API_KEY=abc123***
#VIA_VLM_ENDPOINT=http://192.168.1.100:8000 # Optional
#VIA_VLM_OPENAI_MODEL_DEPLOYMENT_NAME=gpt-4o # Optional
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
Warning
The .env file will store your private API keys in plain text. Ensure proper security and permission settings are in place for this file or use a secrets manager to pass API key environment variables in a production environment.
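At a minimum, restrict the file so that only your user can read it, for example:
# Restrict the .env file to the owner only.
chmod 600 .env
ls -l .env   # should show -rw-------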
After setting your API keys in the .env file, you can launch VSS with Docker Compose.
# export IS_SBSA=1 # For DGX Spark
docker compose up
After VSS has loaded, you can access the UI at port 9100.
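As a quick readiness check, you can poll the UI port from a shell until it responds. This is a minimal sketch assuming the default UI port 9100 used on this page:
# Wait until the VSS UI answers on port 9100.
until curl -sf -o /dev/null http://localhost:9100; do
  echo "Waiting for VSS to come up..."
  sleep 10
done
echo "VSS UI is reachable at http://localhost:9100"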
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Hybrid Deployment#
The remote_llm_deployment folder contains an example of how to launch VSS using the local built-in VLM and remote endpoints for the LLM, Embedding, and Reranker models.
This requires at least 40 GB of VRAM to load the built-in Cosmos-Reason1-7B model. It can be run on Jetson Thor or DGX Spark.
cd remote_llm_deployment
The config.yaml file in this folder sets the LLM, Reranker, and Embedding model configurations to use remote endpoints from build.nvidia.com.
Inside the remote_llm_deployment folder, edit the .env file and populate the NVIDIA_API_KEY and NGC_API_KEY fields. In this file, the VLM is set to Cosmos-Reason1-7B. Optionally, to enable the CV pipeline, set DISABLE_CV_PIPELINE=false and INSTALL_PROPRIETARY_CODECS=true.
#.env file
NVIDIA_API_KEY=abc123***
NGC_API_KEY=def456***
DISABLE_CV_PIPELINE=true # Set to false to enable CV
INSTALL_PROPRIETARY_CODECS=false # Set to true to enable CV
.
.
.
After setting your API keys in the .env file, you can launch VSS with Docker Compose.
# export IS_SBSA=1 # For DGX Spark
docker compose up
After VSS has loaded, you can access the UI at port 9100.
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Fully Local Deployment: Single GPU#
A Single GPU Deployment recipe that uses non-default low-memory modes and smaller LLMs, verified on DGX Spark, is described below.
The local_deployment_single_gpu folder contains an example of how to launch VSS on a single GPU.
This deployment downloads and runs the VLM, LLM, Embedding, and Reranker models locally on a single GPU.
The example configuration assumes a DGX Spark deployment.
To run this example, you must launch local instances of the following components:
LLM
Embedding
Reranker
Note
CV and audio related features are currently not supported in Single GPU deployment. Additionally, when using a locally deployed LLM (Llama 3.1 8B) in this configuration, you may experience increased latency and reduced accuracy compared to cloud-hosted or larger LLM deployments.
Deploy LLM#
VSS can be configured to point to any OpenAI-compatible LLM. This could be an LLM running on your local system through a NIM or a proprietary LLM deployed in the cloud. For this deployment, the Llama 3.1 8B NIM is recommended. The LLM could also be served by a third-party framework such as vLLM or SGLang.
To launch the recommended Llama 3.1 8B NIM, first login to nvcr.io:
docker login nvcr.io
Then supply your API Key from build.nvidia.com:
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
After logging in, modify the snippet below to add your API key and run the command to launch the Llama 3.1 8B NIM on GPU 0.
For deployment on DGX Spark, the NIM_GPU_MEM_FRACTION environment variable is required to limit the memory consumption of the LLM NIM. Set it to 0.2 for optimal performance.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8007:8000 -e NIM_GPU_MEM_FRACTION=0.2 \
nvcr.io/nim/meta/llama-3.1-8b-instruct-dgx-spark:1.0
Verify that you have a live LLM endpoint at port 8007 for VSS.
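For example, you can issue a quick sanity check against the OpenAI-compatible API exposed by the NIM. The model name below is the id typically reported by this NIM; confirm it in the /v1/models output first:
# List the models served by the LLM NIM.
curl -s http://localhost:8007/v1/models

# Send a small test chat completion (model name taken from the /v1/models output).
curl -s http://localhost:8007/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32
      }'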
Deploy Embedding NIM#
VSS requires the llama-3.2-nv-embedqa-1b-v2 embedding NIM to power interactive Q&A. View the NeMo Retriever Embedding documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-embedqa-1b-v2 embedding NIM, modify the snippet below to add your API key and run the command to launch the embedding NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8006:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
-e NIM_TRT_ENGINE_HOST_CODE_ALLOWED=1 \
nvcr.io/nim/nvidia/llama-3.2-nv-embedqa-1b-v2:1.9.0
Verify that you have a live embedding endpoint at port 8006 for VSS.
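A minimal readiness and embedding request sketch is shown below; the field names follow the NeMo Retriever embedding NIM's OpenAI-style API, so confirm them against its documentation:
# Check that the embedding NIM is ready.
curl -s http://localhost:8006/v1/health/ready

# Request an embedding for a short query string.
curl -s http://localhost:8006/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": ["What happens in the warehouse video?"],
        "input_type": "query"
      }'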
Deploy Reranker NIM#
VSS requires the llama-3.2-nv-rerankqa-1b-v2 reranker NIM to power interactive Q&A. View the NeMo Retriever Reranking documentation for more details on deployment and GPU requirements.
To launch the llama-3.2-nv-rerankqa-1b-v2 reranker NIM, modify the snippet below to add your API key and run the command to launch the reranker NIM on GPU 0.
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -u $(id -u) -it \
--gpus '"device=0"' --shm-size=16GB \
-e NGC_API_KEY=$NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-p 8005:8000 -e NIM_SERVER_PORT=8000 \
-e NIM_MODEL_PROFILE="f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f" \
nvcr.io/nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:1.7.0
Verify that you have a live reranker endpoint at port 8005 for VSS.
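A quick check against the reranker NIM is sketched below; the request shape follows the NeMo Retriever reranking NIM's /v1/ranking API, and the example query and passages are placeholders:
# Check that the reranker NIM is ready.
curl -s http://localhost:8005/v1/health/ready

# Rank two candidate passages against a query.
curl -s http://localhost:8005/v1/ranking \
  -H "Content-Type: application/json" \
  -d '{
        "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
        "query": {"text": "forklift near the loading dock"},
        "passages": [
          {"text": "A forklift moves pallets near the loading dock."},
          {"text": "An empty parking lot at night."}
        ]
      }'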
Deploy VSS#
After all of the NIMs have been deployed, you can run the local_deployment_single_gpu example from the cloned repository.
cd local_deployment_single_gpu
The config.yaml file sets the LLM, Reranker, and Embedding model configurations to use the local endpoints of the NIMs launched in the previous steps. If you launched these NIMs on different systems, modify the base_url parameter to point to the system running each NIM.
Additionally, for DGX Spark, modify the config.yaml to enable the following line:
...
ingestion_function:
type: graph_ingestion
params:
...
disable_entity_extraction: True # uncomment this line for DGX Spark
Inside the local_deployment_single_gpu folder, edit the .env file and populate the NGC_API_KEY field. In this file, the VLM is set to Cosmos-Reason1-7B.
For deployment on DGX Spark, the VLLM_GPU_MEMORY_UTILIZATION and VLM_BATCH_SIZE environment variables are required to limit the memory consumption of the vLLM-based VLM. Set them to 0.3 and 32, respectively, for optimal performance.
#.env file
NGC_API_KEY=abc123***
VLLM_GPU_MEMORY_UTILIZATION=0.3
VLM_BATCH_SIZE=32
.
.
.
After setting your API key in the .env file, you can launch VSS with Docker Compose.
export IS_SBSA=1 # For DGX Spark
docker compose up
Note
Guardrails is disabled for the single GPU deployment because of accuracy issues with the llama-3.1-8b-instruct model. If required, it can be enabled by removing the DISABLE_GUARDRAILS environment variable from the .env file.
After VSS has loaded, you can access the UI at port 9100 and the backend REST API endpoint at port 8100.
Test the deployment by summarizing a sample video.
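As a first smoke test from the command line, you can confirm the backend is up and see which model it reports. The endpoint paths below are assumed readiness and model-listing routes on the VSS REST API at port 8100; refer to the VSS API documentation for the full request flow (uploading a file and requesting a summary):
# Check that the VSS backend is ready (assumed readiness route).
curl -s http://localhost:8100/health/ready

# List the model(s) the backend reports (assumed model-listing route).
curl -s http://localhost:8100/models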
Stopping the Deployment#
To stop any of the Docker Compose deployments, run the following command from the deployment directory:
docker compose down
This will stop and remove all containers created by Docker Compose.
Note
To enable the live stream preview, set INSTALL_PROPRIETARY_CODECS=true in the .env file.
To get the logs from containers running in detached mode, run docker logs <container_id>.
To remove the containers, run docker rm <container_id>.
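For example, to find the container IDs of the running services and follow their logs:
# List running containers with their IDs, names, and status.
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"

# Follow the logs of a specific container.
docker logs -f <container_id>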