Get Started With NVIDIA RAG Blueprint#
Use the following documentation to get started quickly with the NVIDIA RAG Blueprint. In this walkthrough, you deploy the NVIDIA RAG Blueprint with Docker Compose as a single-node deployment that uses self-hosted on-premises models. For other deployment options, refer to Deployment Options.
Tip
If you want to run the RAG Blueprint with NVIDIA AI Workbench, use Quickstart for NVIDIA AI Workbench.
Prerequisites#
Install Docker Engine. For more information, see Ubuntu.
Install Docker Compose. For more information, see install the Compose plugin.
a. Ensure the Docker Compose plugin version is 2.29.1 or later.
b. After you have the Docker Compose plugin installed, run docker compose version to confirm.
To pull images required by the blueprint from NGC, you must first authenticate Docker with nvcr.io. Use the NGC API Key you created in the first step.
export NGC_API_KEY="nvapi-..."
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
This blueprint deploys GPU-accelerated containers on-prem, such as Milvus and the NVIDIA NIMs. To configure Docker for GPU-accelerated containers, install the NVIDIA Container Toolkit.
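After installing the toolkit, you can optionally confirm that Docker can see your GPUs. The following is a minimal sanity check, not part of the blueprint itself; it assumes the ubuntu image is available locally or can be pulled, and that the toolkit has been configured for Docker.
# Run nvidia-smi inside a throwaway container to confirm GPU access
docker run --rm --gpus all ubuntu nvidia-smi
If this command lists your GPUs, GPU-accelerated containers should work on this host.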
Ensure you meet the hardware requirements.
Start services using self-hosted on-premises models#
Use the following procedure to start all containers needed for this blueprint.
Create a directory to cache the models and export the path to the cache as an environment variable.
mkdir -p ~/.cache/model-cache
export MODEL_DIRECTORY=~/.cache/model-cache
Export all the required environment variables to use on-prem models. Verify that the Endpoints for using cloud NIMs section is commented out in this file.
source deploy/compose/.env
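Optionally, you can spot-check that the on-prem endpoints were picked up by your shell. This is a hedged example; the variable names below are assumptions based on the endpoint variables referenced later in this guide (such as APP_LLM_SERVERURL) and may differ in your .env file.
# Print a few endpoint variables to confirm the .env file was sourced
echo "LLM endpoint:       ${APP_LLM_SERVERURL}"
echo "Embedding endpoint: ${APP_EMBEDDINGS_SERVERURL}"
echo "Ranking endpoint:   ${APP_RANKING_SERVERURL}"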
(For A100 SXM and B200 platforms only) Run the following code to allocate 2 available GPUs before you continue with the following steps.
export LLM_MS_GPU_ID=1,2
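If you are unsure which GPU IDs are available, a quick way to list them is with nvidia-smi. This is a general sketch; the IDs you pass to LLM_MS_GPU_ID depend on your own system topology.
# List GPU index, name, and memory so you can pick the IDs to allocate
nvidia-smi --query-gpu=index,name,memory.total,memory.used --format=csv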
List the available model profiles for your hardware by running the following code.
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml run nim-llm list-model-profiles
The output depends on your hardware. The following example output is for an H100-NVL with 1 GPU allocated.
MODEL PROFILES
- Compatible with system and runnable:
  - d4910... (vllm-bf16-tp1-pp1-32c3...)
  - e2f00... (vllm)
  - e759b... (tensorrt_llm-h100_nvl-fp8-tp1-pp1-throughput-2321:10de-6343e...)
  - 668b5... (tensorrt_llm)
  - 50e13... (sglang)
Using the list of model profiles from the previous step, set the NIM_MODEL_PROFILE environment variable. Select one of the tensorrt_llm profiles for best performance. Because of a known issue, vLLM-based profiles are selected by default, so we recommend that you manually select a tensorrt_llm profile before you start the nim-llm service.
export NIM_MODEL_PROFILE="......" # Populate your profile name as per hardware
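If the profile list is long, you can filter it for the tensorrt_llm entries before exporting NIM_MODEL_PROFILE. This sketch simply pipes the same list-model-profiles command through grep and assumes the profile descriptions contain the string tensorrt_llm, as in the example output above.
# Show only the TensorRT-LLM profiles compatible with this system
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml run nim-llm list-model-profiles | grep tensorrt_llm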
Start all required NIMs by running the following code.
Warning
Do not attempt this step unless you have completed the previous steps.
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d
The NIM LLM service can take up to 30 minutes to start the first time, while the model is downloaded and cached. Subsequent deployments can take 2-5 minutes, depending on the GPU profile.
Tip
The models are downloaded and cached in the path specified by MODEL_DIRECTORY.
Check the status of the deployment by running the following code. Wait until all services are up, and the nemoretriever-ranking-ms, nemoretriever-embedding-ms, and nim-llm-ms NIMs are in a healthy state, before proceeding further.
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
Your output should look similar to the following.
NAMES                        STATUS
nemoretriever-ranking-ms     Up 14 minutes (healthy)
compose-page-elements-1      Up 14 minutes
compose-paddle-1             Up 14 minutes
compose-graphic-elements-1   Up 14 minutes
compose-table-structure-1    Up 14 minutes
nemoretriever-embedding-ms   Up 14 minutes (healthy)
nim-llm-ms                   Up 14 minutes (healthy)
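If a NIM stays in a starting or unhealthy state for much longer than expected, inspecting its logs is usually the quickest way to see whether the model is still downloading or has hit an error. The container name below comes from the example output above.
# Follow the LLM NIM logs while it downloads and loads the model
docker logs -f nim-llm-ms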
Start the vector db containers from the repo root.
docker compose -f deploy/compose/vectordb.yaml up -d
Start the ingestion containers from the repo root. This pulls the prebuilt containers from NGC and deploys them on your system.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d
You can check the status of the ingestor-server by running the following code.
curl -X 'GET' 'http://workstation_ip:8082/v1/health?check_dependencies=true' -H 'accept: application/json'
You should see output similar to the following.
{ "message": "Service is up.", "databases": [ ... ], "object_storage": [ ... ], "nim": [ { "service": "Embeddings", "status": "healthy", ... }, { "service": "Summary LLM", "status": "healthy", ... } ], "processing": [ { "service": "NV-Ingest", "status": "healthy", ... } ], "task_management": [ { "service": "Redis", "status": "healthy", ... } ] }
Start the RAG containers from the repo root. This pulls the prebuilt containers from NGC and deploys them on your system.
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d
You can check the status of the rag-server by running the following code.
curl -X 'GET' 'http://workstation_ip:8081/v1/health?check_dependencies=true' -H 'accept: application/json'
You should see output similar to the following.
{ "message": "Service is up.", "databases": [ ... ], "object_storage": [ ... ], "nim": [ { "service": "LLM", "status": "healthy", ... }, { "service": "Embeddings", "status": "healthy", ... }, { "service": "Ranking", "status": "healthy", ... } ] }
Check the status of the deployment by running the following code.
docker ps --format "table {{.ID}}\t{{.Names}}\t{{.Status}}"
You should see output similar to the following. Confirm all the following containers are running.
NAMES                            STATUS
compose-nv-ingest-ms-runtime-1   Up 5 minutes (healthy)
ingestor-server                  Up 5 minutes
compose-redis-1                  Up 5 minutes
rag-frontend                     Up 9 minutes
rag-server                       Up 9 minutes
milvus-standalone                Up 36 minutes
milvus-minio                     Up 35 minutes (healthy)
milvus-etcd                      Up 35 minutes (healthy)
nemoretriever-ranking-ms         Up 38 minutes (healthy)
compose-page-elements-1          Up 38 minutes
compose-paddle-1                 Up 38 minutes
compose-graphic-elements-1       Up 38 minutes
compose-table-structure-1        Up 38 minutes
nemoretriever-embedding-ms       Up 38 minutes (healthy)
nim-llm-ms                       Up 38 minutes (healthy)
Experiment with the Web User Interface#
After the RAG Blueprint is deployed, you can use the RAG UI to start experimenting with it.
Open a web browser and access the RAG UI. You can start experimenting by uploading docs and asking questions. For details, see User Interface for NVIDIA RAG Blueprint.
Experiment with the Ingestion API Usage Notebook#
After the RAG Blueprint is deployed, you can use the Ingestion API Usage notebook to start experimenting with it. For details, refer to Experiment with the Ingestion API Usage Notebook.
Shut down services#
To stop all running services, run the following code.
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml down
docker compose -f deploy/compose/nims.yaml down
docker compose -f deploy/compose/docker-compose-rag-server.yaml down
docker compose -f deploy/compose/vectordb.yaml down
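If you also want to discard data stored in any named volumes declared by these compose files (for example vector database contents), you can add the --volumes flag when stopping a service. This is an optional cleanup sketch; the model cache in MODEL_DIRECTORY is a host directory and is not removed by this command.
# Stop the vector database and also remove its named volumes
docker compose -f deploy/compose/vectordb.yaml down --volumes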
Advanced Deployment Considerations#
After the first time you deploy the RAG Blueprint successfully, you can consider the following advanced deployment options:
For information about advanced settings, see Best Practices for Common Settings.
To turn on the recommended configurations for the accuracy-optimized profile, set additional configurations by running the following code:
source deploy/compose/accuracy_profile.env
To turn on the recommended configurations for the performance-optimized profile, set additional configurations by running the following code:
source deploy/compose/perf_profile.env
To start just the services specific to RAG or ingestion, add the --profile rag or --profile ingest flag to the command. For example:
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d --profile rag
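Similarly, a hedged example for the ingestion-only case, using the --profile ingest flag mentioned above:
# Start only the NIMs needed for the ingestion workflow
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d --profile ingest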
If you make code changes and want to redeploy services, add the --build flag to the command. For example:
docker compose -f deploy/compose/docker-compose-*-server.yaml up -d --build
By default, a GPU-accelerated Milvus DB is deployed. You can choose the GPU ID to allocate by using the following environment variable.
VECTORSTORE_GPU_DEVICE_ID=0
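For example, on a multi-GPU system you could place Milvus on a different GPU than the LLM before starting the vector database containers. GPU 1 here is only an illustration; use an ID that is free on your system.
# Pin the GPU-accelerated Milvus DB to GPU 1 (example value), then start it
export VECTORSTORE_GPU_DEVICE_ID=1
docker compose -f deploy/compose/vectordb.yaml up -d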
For improved accuracy, consider enabling reasoning mode. For details, refer to Enable thinking.
To use NeMo Retriever OCR (Early Access) instead of Paddle OCR, refer to NeMo Retriever OCR.
For advanced users who need direct filesystem access to extraction results, refer to Ingestor Server Volume Mounting.
A single NVIDIA A100-80GB, H100-80GB, or B200 GPU can be used to start the non-LLM NIMs (nemoretriever-embedding-ms, nemoretriever-ranking-ms, and ingestion services such as page-elements, ocr, graphic-elements, and table-structure) for ingestion and RAG workflows. You can control which GPU is used for each service by setting these environment variables in the deploy/compose/.env file before launching:
EMBEDDING_MS_GPU_ID=0
RANKING_MS_GPU_ID=0
YOLOX_MS_GPU_ID=0
YOLOX_GRAPHICS_MS_GPU_ID=0
YOLOX_TABLE_MS_GPU_ID=0
OCR_MS_GPU_ID=0
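As an illustration only, on a two-GPU system you might keep the LLM on GPU 0 and move the non-LLM NIMs to GPU 1 by editing these values in deploy/compose/.env. The assignment below is a hypothetical example, not a required layout.
# Example: keep nim-llm on GPU 0, run the smaller services on GPU 1
EMBEDDING_MS_GPU_ID=1
RANKING_MS_GPU_ID=1
YOLOX_MS_GPU_ID=1
YOLOX_GRAPHICS_MS_GPU_ID=1
YOLOX_TABLE_MS_GPU_ID=1
OCR_MS_GPU_ID=1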
If the NIMs are deployed on a different workstation, or outside the nvidia-rag Docker network on the same system, replace the host address in the following URLs with the workstation IPs.
APP_EMBEDDINGS_SERVERURL="workstation_ip:8000"
APP_LLM_SERVERURL="workstation_ip:8000"
APP_RANKING_SERVERURL="workstation_ip:8000"
OCR_GRPC_ENDPOINT="workstation_ip:8001"
YOLOX_GRPC_ENDPOINT="workstation_ip:8001"
YOLOX_GRAPHIC_ELEMENTS_GRPC_ENDPOINT="workstation_ip:8001"
YOLOX_TABLE_STRUCTURE_GRPC_ENDPOINT="workstation_ip:8001"