Deploy Retrieval-Only Mode for NVIDIA RAG Blueprint#
This guide explains how to deploy the NVIDIA RAG Blueprint for retrieval-only use cases without deploying the LLM generation components. This deployment mode is ideal when you only need document search and retrieval capabilities, saving GPU resources by not running the LLM NIM.
Overview#
In retrieval-only mode, you deploy:
Embedding NIM - For converting queries to vectors
Reranking NIM - For reordering retrieved results by relevance
Vector Database - For storing and searching document embeddings
RAG Server - For handling /search API requests
You skip deploying:
LLM NIM (nim-llm-ms) - Not needed for retrieval-only workflows
This configuration allows you to use the /search API endpoint to retrieve relevant documents without generating LLM responses, significantly reducing GPU memory requirements.
Use Cases#
Retrieval-only deployments are useful for:
Search Applications: Building document search systems without answer generation
Retrieval Pipelines: Integrating with your own LLM or downstream processing
Resource-Constrained Environments: When GPU resources are limited
Custom Generation: Using retrieved documents with an external LLM service
Testing and Development: Validating retrieval quality before adding generation
Prerequisites#
Important
Before you deploy the RAG Blueprint, consider the following:
For self-hosted NIMs, ensure that you have at least 50-80GB of available disk space for embedding and reranking model caches (significantly less than full deployment).
First-time deployment takes 5-10 minutes for self-hosted NIMs, or 2-3 minutes for NVIDIA-hosted models.
Model downloads do not show progress bars.
For monitoring deployment progress, refer to Deploy on Kubernetes with Helm.
Install Docker Engine and Docker Compose. Ensure Docker Compose version is 2.29.1 or later.
Authenticate Docker with NGC:
export NGC_API_KEY="nvapi-..."
echo "${NGC_API_KEY}" | docker login nvcr.io -u '$oauthtoken' --password-stdin
Install the NVIDIA Container Toolkit.
Clone the RAG Blueprint Git repository to get the necessary deployment files.
Deploy Retrieval-Only Mode with Docker Compose#
Step 1: Set Up Environment#
Create a directory to cache the models:
mkdir -p ~/.cache/model-cache
export MODEL_DIRECTORY=~/.cache/model-cache
Export the required environment variables:
# For self-hosted NIMs
source deploy/compose/.env

# For NVIDIA-hosted NIMs
source deploy/compose/nvdev.env
Step 2: Start Retrieval NIMs Only#
Choose one of the following options based on your deployment preference.
Option A: Self-Hosted NIMs#
Instead of starting all NIMs, start only the embedding and reranking services:
USERID=$(id -u) docker compose -f deploy/compose/nims.yaml up -d nemoretriever-ranking-ms nemoretriever-embedding-ms
Note
Only nemoretriever-embedding-ms and nemoretriever-ranking-ms are started, which is sufficient for retrieval operations. The LLM NIM (nim-llm-ms) is not started, saving significant GPU memory.
Wait for the services to become healthy:
watch -n 2 'docker ps --format "table {{.Names}}\t{{.Status}}"'
Expected output:
NAMES STATUS
nemoretriever-ranking-ms Up 5 minutes (healthy)
nemoretriever-embedding-ms Up 5 minutes (healthy)
Option B: NVIDIA-Hosted NIMs#
For an even lighter deployment, use NVIDIA-hosted NIMs for embedding and reranking while running only the RAG server locally:
# Configure to use NVIDIA-hosted endpoints
export APP_EMBEDDINGS_SERVERURL=""
export APP_RANKING_SERVERURL=""
Note
When APP_EMBEDDINGS_SERVERURL and APP_RANKING_SERVERURL are empty, the RAG server uses NVIDIA-hosted API endpoints (requires valid NGC_API_KEY).
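Before starting the RAG server, you can sanity-check this configuration from a script. The following is a minimal sketch; the variable names come from this guide, and the nvapi- prefix check is an assumption based on the key format shown in the prerequisites:

```python
import os

# Hosted mode expects these to be empty (unset or "") so the RAG server
# falls back to NVIDIA-hosted endpoints.
for var in ("APP_EMBEDDINGS_SERVERURL", "APP_RANKING_SERVERURL"):
    value = os.environ.get(var, "")
    print(f"{var}={value!r} -> {'NVIDIA-hosted' if not value else 'self-hosted'}")

# A valid NGC API key is required for the hosted endpoints.
ngc_key = os.environ.get("NGC_API_KEY", "")
if not ngc_key:
    raise SystemExit("NGC_API_KEY is not set")
if not ngc_key.startswith("nvapi-"):  # prefix check is an assumption
    print("Warning: NGC_API_KEY does not start with 'nvapi-'")
```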
Step 3: Start the Vector Database#
docker compose -f deploy/compose/vectordb.yaml up -d
Step 4: Start the RAG Server#
docker compose -f deploy/compose/docker-compose-rag-server.yaml up -d rag-server
Verify the RAG server is running:
curl -X 'GET' 'http://localhost:8081/v1/health?check_dependencies=true' -H 'accept: application/json'
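If you prefer to wait for readiness from a script, the following minimal sketch polls the same health endpoint until it returns success; the URL and check_dependencies parameter are taken from the curl command above, and the timeout value is an arbitrary choice:

```python
import time
import requests

HEALTH_URL = "http://localhost:8081/v1/health?check_dependencies=true"

def wait_for_rag_server(timeout_s: int = 300, interval_s: int = 5) -> None:
    """Poll the RAG server health endpoint until it returns HTTP 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            response = requests.get(HEALTH_URL, timeout=10)
            if response.status_code == 200:
                print("RAG server is healthy:", response.json())
                return
        except requests.ConnectionError:
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    raise TimeoutError("RAG server did not become healthy in time")

wait_for_rag_server()
```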
Step 5: (Optional) Start the Ingestion Server#
If you need to ingest documents, start the ingestion server:
docker compose -f deploy/compose/docker-compose-ingestor-server.yaml up -d ingestor-server
Tip
If you already have documents ingested from a previous deployment, you can skip this step and use the existing collections.
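To confirm which documents are already available before deciding whether to re-ingest, you can query the ingestion server. This minimal sketch reuses the documents listing endpoint shown in the troubleshooting section below; the collection name is a placeholder and the response schema is an assumption, so inspect the raw JSON if the keys differ:

```python
import requests

INGESTOR_URL = "http://localhost:8082"  # ingestion server from this guide
collection = "my_collection"            # placeholder collection name

response = requests.get(
    f"{INGESTOR_URL}/v1/documents",
    params={"collection_name": collection},
)
response.raise_for_status()

# The exact response schema may differ; print the raw payload to verify.
print(f"Documents in '{collection}':")
print(response.json())
```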
Using the Search API#
The /search endpoint retrieves relevant documents without LLM generation. This is the primary API for retrieval-only mode.
Basic Search Request#
import requests

url = "http://localhost:8081/v1/search"
payload = {
    "query": "What are the key features of the product?",
    "collection_names": ["my_collection"],
    "enable_reranker": True
}

response = requests.post(url, json=payload)
results = response.json()

# Process retrieved documents
for doc in results.get("citations", []):
    print(f"Source: {doc['source']}")
    print(f"Content: {doc['content'][:200]}...")
    print(f"Score: {doc.get('score', 'N/A')}")
    print("---")
Search with Metadata Filtering#
payload = {
    "query": "What are the key features of the product?",
    "collection_names": ["my_collection"],
    "enable_reranker": True,
    # Filter by custom metadata
    "filter_expr": 'content_metadata["category"] == "electronics"'
}

# POST to the same /search endpoint as in the basic example
response = requests.post(url, json=payload)
Using the CLI Script#
You can also use the provided CLI script for search operations:
# Basic search
python scripts/retriever_api_usage.py --mode search "Tell me about the product features"
# Search with specific collection
python scripts/retriever_api_usage.py \
--mode search \
--payload-json '{"collection_names":["my_collection"], "reranker_top_k": 5}' \
"What is the return policy?"
# Save results to file
python scripts/retriever_api_usage.py \
--mode search \
--output-json results.json \
"Technical specifications"
Deploy with Helm (Kubernetes)#
For Kubernetes deployments, configure the Helm chart to disable the LLM NIM:
helm upgrade --install rag nvidia-blueprint-rag \
--namespace rag \
--set nimOperator.nim-llm.enabled=false \
--set nimOperator.nvidia-nim-llama-32-nv-embedqa-1b-v2.enabled=true \
--set nimOperator.nvidia-nim-llama-32-nv-rerankqa-1b-v2.enabled=true \
--set imagePullSecret.password=$NGC_API_KEY \
--set ngcApiSecret.password=$NGC_API_KEY
Or modify values.yaml:
# Disable LLM NIM for retrieval-only deployment
nimOperator:
  nim-llm:
    enabled: false
  # Keep embedding and reranking NIMs enabled
  nvidia-nim-llama-32-nv-embedqa-1b-v2:
    enabled: true
  nvidia-nim-llama-32-nv-rerankqa-1b-v2:
    enabled: true
Integration with External LLMs#
After retrieving documents, you can send them to your own LLM for generation:
import requests

# Step 1: Retrieve relevant documents
search_url = "http://localhost:8081/v1/search"
search_payload = {
    "query": "What are the key features of the product?",
    "reranker_top_k": 5,
    "collection_names": ["my_collection"],
    "enable_reranker": True
}
search_response = requests.post(search_url, json=search_payload)
citations = search_response.json().get("citations", [])

# Step 2: Format context from retrieved documents
context = "\n\n".join([
    f"[Source: {doc['source']}]\n{doc['content']}"
    for doc in citations
])

# Step 3: Send to your LLM
prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: Tell me more about the feature XYZ of the product?

Answer:"""

# Use your preferred LLM API (OpenAI, Claude, local model, etc.)
llm_response = your_llm_client.generate(prompt)
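The your_llm_client call above is a placeholder. As one hedged example, the sketch below sends the same prompt to an OpenAI-compatible chat completions endpoint; the base URL, model name, and API key variables are assumptions and should be replaced with your own service's values:

```python
import os
import requests

# Assumed OpenAI-compatible endpoint; replace with your own LLM service.
LLM_BASE_URL = os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1")
LLM_API_KEY = os.environ.get("LLM_API_KEY", "")
LLM_MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")  # placeholder model name

llm_response = requests.post(
    f"{LLM_BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {LLM_API_KEY}"},
    json={
        "model": LLM_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    },
    timeout=60,
)
llm_response.raise_for_status()
answer = llm_response.json()["choices"][0]["message"]["content"]
print(answer)
```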
GPU Resource Comparison#
| Deployment Mode | Required GPUs | Memory Usage |
|---|---|---|
| Full RAG (with LLM) | 2-4 GPUs | ~160GB+ |
| Retrieval-Only | 1 GPU | ~24GB |
| Cloud-Hosted NIMs | 0 GPUs | N/A |
Note
GPU requirements depend on the specific embedding and reranking models used. The values above are estimates for the default models.
Troubleshooting#
Generate endpoint returns error#
This is expected behavior in retrieval-only mode. The /generate endpoint requires an LLM, which is not deployed. Use the /search endpoint instead.
Embedding service not healthy#
Check the embedding NIM logs:
docker logs nemoretriever-embedding-ms
Ensure the model cache directory has proper permissions:
chmod -R 755 ~/.cache/model-cache
Search returns empty results#
Verify documents are ingested in the collection:
curl -X GET "http://localhost:8082/v1/documents?collection_name=my_collection"
Check that the collection name in the search request matches the ingested collection.
Try increasing vdb_top_k to retrieve more candidates.
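To combine these checks in one place, the sketch below first lists the documents in the collection and then retries the search with a larger vdb_top_k; the endpoints and parameter names come from this guide, and the top-k value is an arbitrary example:

```python
import requests

collection = "my_collection"  # replace with your collection name

# 1. Confirm the collection actually contains documents.
docs = requests.get(
    "http://localhost:8082/v1/documents",
    params={"collection_name": collection},
).json()
print("Ingested documents:", docs)

# 2. Retry the search with more candidates from the vector database.
search_payload = {
    "query": "What are the key features of the product?",
    "collection_names": [collection],
    "enable_reranker": True,
    "vdb_top_k": 100,  # arbitrary example; increase to retrieve more candidates
}
results = requests.post("http://localhost:8081/v1/search", json=search_payload).json()
print("Citations returned:", len(results.get("citations", [])))
```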
Shut Down Services#
To stop all retrieval-only services:
docker compose -f deploy/compose/docker-compose-rag-server.yaml down
docker compose -f deploy/compose/vectordb.yaml down
docker compose -f deploy/compose/nims.yaml down