Get Started With the NVIDIA RAG Blueprint#
Use the following documentation to get started quickly with the NVIDIA RAG Blueprint. In this walkthrough, you deploy the NVIDIA RAG Blueprint with Docker Compose on a single node, using self-hosted on-premises models. For other deployment options, refer to Deployment Options.
Tip
If you want to run the RAG Blueprint with NVIDIA AI Workbench, use Quickstart for NVIDIA AI Workbench.
Tip
Looking for a simpler setup without Docker? Check out the Containerless Deployment (Lite Mode) for a Python-only deployment using Milvus Lite and NVIDIA cloud APIs.
Prerequisites#
Warning
This deployment requires at least 200GB of free disk space to download and cache models, store vector databases, and run all required services. Ensure that you have sufficient storage available before you proceed.
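The disk space requirement above can be checked before you start. The following is a minimal sketch, not part of the blueprint: has_free_gb is a hypothetical helper, and checking the home directory is an assumption (Docker images and the model cache may live on other filesystems on your system).

```shell
# has_free_gb PATH GB: succeed when the filesystem holding PATH has at
# least GB gigabytes available. Hypothetical helper for illustration.
has_free_gb() {
  path=$1
  need_gb=$2
  # df -Pk prints one header line, then one data line; column 4 is the
  # available space in 1024-byte blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Assumption: the model cache will be created under the home directory.
if has_free_gb "${HOME:-/}" 200; then
  echo "OK: at least 200GB free under ${HOME:-/}"
else
  echo "WARNING: less than 200GB free under ${HOME:-/}"
fi
```

Repeat the check for any other path that stores images or vector database volumes on your machine.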
Install Docker Engine. For more information, see Ubuntu.
Install Docker Compose. For more information, see install the Compose plugin.
a. Ensure the Docker Compose plugin version is 2.29.1 or later.
b. After you install the Docker Compose plugin, run docker compose version to confirm.
To pull images required by the blueprint from NGC, you must first authenticate Docker with nvcr.io. Use the NGC API Key you created in the first step.
export NGC_API_KEY="nvapi-..."
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
Containers that are enabled with GPU acceleration, such as Milvus and NVIDIA NIMs, are deployed on-prem. To configure Docker for GPU-accelerated containers, install the NVIDIA Container Toolkit.
Ensure you meet the hardware requirements.
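The Docker Compose version requirement can also be checked with a script. This is a sketch under stated assumptions: version_ge is a hypothetical helper that relies on sort -V from GNU coreutils, and the check is skipped when docker is not on the PATH.

```shell
# version_ge A B: succeed when version A is greater than or equal to B.
# Hypothetical helper; relies on `sort -V` (GNU coreutils).
version_ge() {
  # After a version sort, the smaller version comes first. A >= B holds
  # exactly when B is that first line.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v docker >/dev/null 2>&1; then
  # `docker compose version --short` prints only the version number.
  current=$(docker compose version --short 2>/dev/null | sed 's/^v//')
  if [ -n "$current" ] && version_ge "$current" 2.29.1; then
    echo "Docker Compose $current meets the 2.29.1 minimum"
  else
    echo "Docker Compose 2.29.1 or later is required (found: ${current:-none})"
  fi
fi
```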
Clone the RAG Blueprint Git repository#
You can clone the RAG Blueprint repository to create a local working copy that you can run and modify, with full git history and an easy way to update from upstream.
Confirm that Git is installed on your machine.
Open a terminal and navigate to the directory where you want the project.
Clone the repository:
git clone https://github.com/NVIDIA-AI-Blueprints/rag.git
Change into the cloned directory:
cd rag
Fetch all remote branches and tags (optional but useful):
git fetch --all --tags
Check out the latest release branch:
git checkout release-<latest-release>
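If you are unsure which release branch is the latest, you can list the remote branches and pick the highest one. The sketch below assumes that release branches share the release- prefix shown above and that their suffixes sort correctly as versions. Neither the exact naming scheme nor the latest_release helper comes from the blueprint.

```shell
# latest_release: read branch names on stdin (for example the output of
# `git branch -r`) and print the highest branch that starts with release-.
# Hypothetical helper; relies on `sort -V` (GNU coreutils).
latest_release() {
  sed 's#^ *origin/##' | grep '^release-' | sort -V | tail -n 1
}

# Usage, from inside the cloned repository:
#   git branch -r | latest_release
#   git checkout "$(git branch -r | latest_release)"
```

Confirm the result against the repository's release page before you check it out.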
Start services using self-hosted on-premises models#
Use the following procedure to start all containers needed for this blueprint.
Create a directory to cache the models and export the path to the cache as an environment variable.
mkdir -p ~/.cache/model-cache
export MODEL_DIRECTORY=~/.cache/model-cache
Export all the required environment variables to use on-prem models. Verify that the section Endpoints for using cloud NIMs is commented out in the deploy/compose/.env file.
source deploy/compose/.env
(For A100 SXM and B200 platforms only) Run the following code to allocate 2 available GPUs before you continue with the following steps.
export LLM_MS_GPU_ID=1,2
Start all required NIMs by running the following code.
Warning
Do not attempt this step unless you have completed the previous steps.
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d
Check the status of the deployment by running the following code. Wait until all services are up and the nemotron-ranking-ms, nemotron-embedding-ms, and nim-llm-ms NIMs are in a healthy state before you proceed.
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
Your output should look similar to the following.
NAMES                         STATUS
nim-llm-ms                    Up 4 minutes (healthy)
nemotron-ranking-ms           Up 4 minutes (healthy)
compose-graphic-elements-1    Up 4 minutes
compose-page-elements-1       Up 4 minutes
nemotron-embedding-ms         Up 4 minutes (healthy)
compose-nemoretriever-ocr-1   Up 4 minutes
compose-table-structure-1     Up 4 minutes
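As an alternative to watching the table by eye, the wait can be scripted. The following sketch polls the health status that Docker reports for each named container. wait_healthy is a hypothetical helper, and the timeout and interval in the usage line are assumptions that you should tune to how long the models take to download and load on your system.

```shell
# wait_healthy TIMEOUT_SECONDS INTERVAL_SECONDS NAME...
# Poll until every named container reports the health status "healthy",
# or fail once TIMEOUT_SECONDS have elapsed. Hypothetical helper.
wait_healthy() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while :; do
    pending=""
    for name in "$@"; do
      # Prints starting, healthy, or unhealthy for containers that define
      # a health check; errors are treated as "not healthy yet".
      status=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
      if [ "$status" != "healthy" ]; then
        pending="$pending $name"
      fi
    done
    if [ -z "$pending" ]; then
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out waiting for:$pending" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}

# Usage (the 30 minute timeout and 10 second interval are assumptions):
#   wait_healthy 1800 10 nemotron-ranking-ms nemotron-embedding-ms nim-llm-ms
```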
Start the vector db containers from the repo root.
docker compose -f deploy/compose/vectordb.yaml up -d
Start the ingestion containers from the repo root. This pulls the prebuilt containers from NGC and deploys them on your system.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
You can check the status of the ingestor-server by running the following code.
curl -X 'GET' 'http://workstation_ip:8082/v1/health?check_dependencies=true' -H 'accept: application/json'
You should see output similar to the following.
{
  "message": "Service is up.",
  "databases": [ ... ],
  "object_storage": [ ... ],
  "nim": [
    { "service": "Embeddings", "status": "healthy", ... },
    { "service": "Summary LLM", "status": "healthy", ... }
  ],
  "processing": [
    { "service": "NeMo Retriever Library", "status": "healthy", ... }
  ],
  "task_management": [
    { "service": "Redis", "status": "healthy", ... }
  ]
}
Start the RAG containers from the repo root. This pulls the prebuilt containers from NGC and deploys them on your system.
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
You can check the status of the rag-server by running the following code.
curl -X 'GET' 'http://workstation_ip:8081/v1/health?check_dependencies=true' -H 'accept: application/json'
You should see output similar to the following.
{
  "message": "Service is up.",
  "databases": [ ... ],
  "object_storage": [ ... ],
  "nim": [
    { "service": "LLM", "status": "healthy", ... },
    { "service": "Embeddings", "status": "healthy", ... },
    { "service": "Ranking", "status": "healthy", ... }
  ]
}
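The health responses can also be checked in a script instead of read by eye. The sketch below scans a response for "status" fields and fails if any of them is not "healthy". health_ok is a hypothetical helper, the text-based matching is a simplification of real JSON parsing, and treating every value other than "healthy" as a failure is an assumption about the API.

```shell
# health_ok: read a health response on stdin. Succeed only when it has at
# least one "status" field and every "status" value is "healthy".
# Hypothetical helper; uses grep -o rather than a real JSON parser.
health_ok() {
  body=$(cat)
  # Extract every "status": "<value>" pair, one per line.
  statuses=$(printf '%s\n' "$body" |
    grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"') || return 1
  # Fail if any extracted pair has a value other than "healthy".
  ! printf '%s\n' "$statuses" | grep -qv '"healthy"$'
}

# Usage (replace workstation_ip as in the curl commands above):
#   curl -s 'http://workstation_ip:8081/v1/health?check_dependencies=true' \
#     -H 'accept: application/json' | health_ok && echo "rag-server is healthy"
```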
Check the status of the deployment by running the following code.
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
You should see output similar to the following. Confirm all the following containers are running.
CONTAINER ID   NAMES                            STATUS
88181d20ba30   rag-frontend                     Up 2 minutes
5cf93ea91d4e   rag-server                       Up 2 minutes
03ff43bd4f53   compose-nv-ingest-ms-runtime-1   Up 2 minutes (healthy)
fcc703631b71   ingestor-server                  Up 2 minutes
77f64a4a5146   compose-redis-1                  Up 2 minutes
902445432dde   milvus-standalone                Up 3 minutes (healthy)
340bc8210a0d   milvus-minio                     Up 3 minutes (healthy)
0be702b87ad6   milvus-etcd                      Up 3 minutes (healthy)
62eabf1d9f65   nim-llm-ms                       Up 10 minutes (healthy)
fe2751bfa734   nemotron-ranking-ms              Up 10 minutes (healthy)
7b5ddabf8be7   compose-graphic-elements-1       Up 10 minutes
ecfaa5190302   compose-page-elements-1          Up 10 minutes
ea8c7fdf20d1   nemotron-embedding-ms            Up 10 minutes (healthy)
6d62008a9b42   compose-nemoretriever-ocr-1      Up 10 minutes
969b9f5c987c   compose-table-structure-1        Up 10 minutes
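To confirm that the expected containers are running without comparing the table line by line, you can diff a list of names against the output of docker ps. This is a sketch: missing_containers is a hypothetical helper, and the names in the usage line are copied from the sample output above, so adjust them if your deployment differs.

```shell
# missing_containers NAME...: print each expected container name that does
# not appear among the running containers. Hypothetical helper.
missing_containers() {
  running=$(docker ps --format '{{.Names}}')
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a fixed string.
    if ! printf '%s\n' "$running" | grep -qxF "$name"; then
      echo "$name"
    fi
  done
}

# Usage (names taken from the sample output above):
#   missing_containers rag-frontend rag-server ingestor-server \
#     milvus-standalone milvus-minio milvus-etcd nim-llm-ms \
#     nemotron-ranking-ms nemotron-embedding-ms
```

Empty output means that every listed container is running.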
Experiment with the Web User Interface#
After the RAG Blueprint is deployed, you can use the RAG UI to start experimenting with it.
Open a web browser and access the RAG UI. You can start experimenting by uploading docs and asking questions. For details, see User Interface for NVIDIA RAG Blueprint.
Experiment with the Ingestion API Usage Notebook#
After the RAG Blueprint is deployed, you can use the Ingestion API Usage notebook to start experimenting with it. For details, refer to Experiment with the Ingestion API Usage Notebook.
Shut down services#
To stop all running services, run the following code.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml down
docker compose -f deploy/compose/nims.yaml down
docker compose -f deploy/compose/docker-compose-rag-server.yaml down
docker compose -f deploy/compose/vectordb.yaml down
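The shutdown commands can be wrapped in a loop that continues even if one stack fails to stop. This is a convenience sketch, not part of the blueprint: stop_all is a hypothetical helper, and it assumes that you run it from the repository root.

```shell
# stop_all: bring down each compose stack in the same order as the
# commands above, warning instead of aborting when one of them fails.
# Hypothetical helper; run it from the repository root.
stop_all() {
  for f in \
    deploy/compose/docker-compose-ingestor-server.yaml \
    deploy/compose/nims.yaml \
    deploy/compose/docker-compose-rag-server.yaml \
    deploy/compose/vectordb.yaml
  do
    docker compose -f "$f" down || echo "warning: failed to stop $f" >&2
  done
}

# Usage:
#   stop_all
```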
Service Port and GPU Reference#
For a complete reference of all services, their port mappings, and GPU assignments, see Service Port and GPU Reference.
Advanced Deployment Considerations#
After the first time you deploy the RAG Blueprint successfully, you can consider the following advanced deployment options:
For information about advanced settings, see Best Practices for Common Settings.
To turn on the recommended configurations for the accuracy-optimized profile, set additional configs by running the following code:
source deploy/compose/accuracy_profile.env
To turn on the recommended configurations for the performance-optimized profile, set additional configs by running the following code:
source deploy/compose/perf_profile.env
To start just the services specific to RAG or ingestion, add the --profile rag or --profile ingest flag to the command. For example:
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml --profile rag up -d
If you make code changes and want to redeploy services, add the --build flag to your command. In the following example, replace * with rag or ingestor to select the server that you want to rebuild:
docker compose -f deploy/compose/docker-compose-*-server.yaml up -d --build
By default, GPU-accelerated Milvus DB is deployed. You can choose the GPU ID to allocate by using the following environment variable. For all service port mappings and GPU assignments, see Service Port and GPU Reference.
VECTORSTORE_GPU_DEVICE_ID=0
For improved accuracy, consider enabling reasoning mode. For details, refer to Enable thinking.
NeMo Retriever Library OCR is now the default OCR service. To use legacy Paddle OCR instead, refer to OCR Configuration Guide.
For advanced users who need direct filesystem access to extraction results, refer to Ingestor Server Volume Mounting.
A single NVIDIA A100-80GB, H100-80GB, or B200 GPU can be used to start the non-LLM NIMs (nemotron-embedding-ms, nemotron-ranking-ms, and ingestion services such as page-elements, ocr, graphic-elements, and table-structure) for ingestion and RAG workflows. You can control which GPU is used for each service by setting the following environment variables in the deploy/compose/.env file before launching. For a complete list of all services and their default GPU assignments, see Service Port and GPU Reference.
EMBEDDING_MS_GPU_ID=0
RANKING_MS_GPU_ID=0
YOLOX_MS_GPU_ID=0
YOLOX_GRAPHICS_MS_GPU_ID=0
YOLOX_TABLE_MS_GPU_ID=0
OCR_MS_GPU_ID=0
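If you want to pin all of the non-LLM services to one GPU for a single shell session rather than edit the file, the variables can be exported in a loop. This is a sketch under stated assumptions: set_non_llm_gpu is a hypothetical helper, and it assumes that you call it after you source deploy/compose/.env so that its values take precedence.

```shell
# set_non_llm_gpu GPU_ID: export every non-LLM GPU variable listed above
# with the same GPU ID. Hypothetical helper for illustration.
set_non_llm_gpu() {
  gpu_id=$1
  for var in EMBEDDING_MS_GPU_ID RANKING_MS_GPU_ID YOLOX_MS_GPU_ID \
             YOLOX_GRAPHICS_MS_GPU_ID YOLOX_TABLE_MS_GPU_ID OCR_MS_GPU_ID
  do
    export "$var=$gpu_id"
  done
}

# Usage (GPU 0 matches the defaults shown above):
#   source deploy/compose/.env
#   set_non_llm_gpu 0
```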
If the NIMs are deployed on a different workstation or outside the nvidia-rag docker network on the same system, replace the host address in the following URLs with the workstation IP.
APP_EMBEDDINGS_SERVERURL="workstation_ip:8000"
APP_LLM_SERVERURL="workstation_ip:8000"
APP_RANKING_SERVERURL="workstation_ip:8000"
OCR_GRPC_ENDPOINT="workstation_ip:8001"
YOLOX_GRPC_ENDPOINT="workstation_ip:8001"
YOLOX_GRAPHIC_ELEMENTS_GRPC_ENDPOINT="workstation_ip:8001"
YOLOX_TABLE_STRUCTURE_GRPC_ENDPOINT="workstation_ip:8001"
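When all NIMs run on one remote host, the substitution can be generated rather than typed by hand. The sketch below prints export lines for a given host. nim_endpoint_exports is a hypothetical helper, the IP in the usage line is a placeholder, and the ports are copied from the listing above, so adjust them if your NIMs are published on different ports.

```shell
# nim_endpoint_exports HOST: print export lines for the endpoint variables
# listed above, pointing at HOST. Hypothetical helper; ports 8000 and 8001
# are copied from the listing and may differ in your deployment.
nim_endpoint_exports() {
  host=$1
  for var in APP_EMBEDDINGS_SERVERURL APP_LLM_SERVERURL APP_RANKING_SERVERURL
  do
    printf 'export %s="%s:8000"\n' "$var" "$host"
  done
  for var in OCR_GRPC_ENDPOINT YOLOX_GRPC_ENDPOINT \
             YOLOX_GRAPHIC_ELEMENTS_GRPC_ENDPOINT \
             YOLOX_TABLE_STRUCTURE_GRPC_ENDPOINT
  do
    printf 'export %s="%s:8001"\n' "$var" "$host"
  done
}

# Usage (192.0.2.10 is a placeholder address):
#   nim_endpoint_exports 192.0.2.10        # review the generated lines
#   eval "$(nim_endpoint_exports 192.0.2.10)"
```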