Data Catalog for NVIDIA RAG Blueprint#

The Data Catalog feature enables comprehensive metadata management for collections and documents in the NVIDIA RAG Blueprint. This feature provides organization, governance, and discovery capabilities for your knowledge base through collection-level and document-level catalog metadata.

After you have deployed the blueprint, the Data Catalog endpoints are automatically available. No additional configuration is required.

Overview#

The Data Catalog provides two types of metadata:

Collection Catalog Metadata: Organizational metadata for entire collections (description, tags, owner, business domain, status)
Document Catalog Metadata: Metadata for individual documents within collections (description, tags)

Additionally, the system automatically populates content metrics such as number_of_files, has_tables, has_charts, and has_images to help you understand what content each collection contains.

Collection Catalog Metadata#

Supported Fields#

Field	Type	Description	Example
`description`	string	Human-readable description of the collection	“Q4 2024 Financial Reports”
`tags`	array[string]	Tags for categorization and discovery	`["finance", "q4-2024"]`
`owner`	string	Team or person responsible	“Finance Team”
`created_by`	string	User who created the collection	“john.doe@company.com”
`business_domain`	string	Business domain or department	“Finance”, “Legal”, “Engineering”
`status`	string	Collection lifecycle status	“Active”, “Archived”, “Deprecated”
`date_created`	timestamp	Automatically set on creation	“2024-11-18T10:30:00+00:00”
`last_updated`	timestamp	Automatically updated on changes	“2024-11-18T15:45:00+00:00”

Auto-Populated Content Metrics#

The system automatically analyzes ingested content and provides these metrics:

Metric	Type	Description
`number_of_files`	integer	Total number of documents in the collection
`last_indexed`	timestamp	Last time documents were ingested
`ingestion_status`	string	Current ingestion status
`has_tables`	boolean	Whether collection contains table content
`has_charts`	boolean	Whether collection contains charts/diagrams
`has_images`	boolean	Whether collection contains images

Creating Collections with Catalog Metadata#

Using the API#

import requests

url = "http://localhost:8082/v1/collection"

data = {
    "collection_name": "financial_reports_2024",
    "embedding_dimension": 2048,
    "description": "Q4 2024 Financial Reports and Analysis",
    "tags": ["finance", "reports", "q4-2024"],
    "owner": "Finance Team",
    "created_by": "john.doe@company.com",
    "business_domain": "Finance",
    "status": "Active",
    "metadata_schema": []  # Add custom metadata schema if needed
}

response = requests.post(url, json=data)
print(response.json())

Using the Python Client#

from nvidia_rag import NvidiaRAGIngestor

ingestor = NvidiaRAGIngestor()

result = ingestor.create_collection(
    collection_name="financial_reports_2024",
    vdb_endpoint="http://localhost:19530",
    description="Q4 2024 Financial Reports and Analysis",
    tags=["finance", "reports", "q4-2024"],
    owner="Finance Team",
    created_by="john.doe@company.com",
    business_domain="Finance",
    status="Active"
)

Note

All catalog metadata fields are optional. If not provided, they will be empty strings or empty arrays by default.

Updating Collection Metadata#

You can update collection catalog metadata at any time without re-ingesting documents:

import requests

url = "http://localhost:8082/v1/collections/financial_reports_2024/metadata"

updates = {
    "description": "Q4 2024 Financial Reports - Final Version",
    "tags": ["finance", "reports", "q4-2024", "final", "approved"],
    "status": "Archived",
    "business_domain": "Finance"
}

response = requests.patch(url, json=updates)
print(response.json())

Note

The PATCH endpoint performs a merge update. Only provided fields are updated; omitted fields retain their current values.

Document Catalog Metadata#

Updating Document Metadata#

After ingesting documents, you can add descriptive metadata to individual documents:

import requests

url = "http://localhost:8082/v1/collections/financial_reports_2024/documents/annual_report.pdf/metadata"

updates = {
    "description": "Annual Financial Report 2024 - Comprehensive Overview",
    "tags": ["annual", "comprehensive", "board-approved"]
}

response = requests.patch(url, json=updates)
print(response.json())

Retrieving Collections with Catalog Data#

Using the API#

import requests

url = "http://localhost:8082/v1/collections"
response = requests.get(url)
result = response.json()

for collection in result.get("collections", []):
    info = collection.get('collection_info', {})
    print(f"Collection: {collection['collection_name']}")
    print(f"  Description: {info.get('description', 'N/A')}")
    print(f"  Tags: {info.get('tags', [])}")
    print(f"  Owner: {info.get('owner', 'N/A')}")
    print(f"  Status: {info.get('status', 'N/A')}")
    print(f"  Files: {info.get('number_of_files', 0)}")
    print(f"  Has Tables: {info.get('has_tables', False)}")
    print(f"  Has Charts: {info.get('has_charts', False)}")
    print(f"  Has Images: {info.get('has_images', False)}")
    print()

Example Response#

{
  "collections": [
    {
      "collection_name": "financial_reports_2024",
      "num_entities": 1250,
      "metadata_schema": [],
      "collection_info": {
        "description": "Q4 2024 Financial Reports - Final Version",
        "tags": ["finance", "reports", "q4-2024", "final"],
        "owner": "Finance Team",
        "created_by": "john.doe@company.com",
        "business_domain": "Finance",
        "status": "Archived",
        "date_created": "2024-11-18T10:30:00+00:00",
        "last_updated": "2024-11-18T15:45:00+00:00",
        "number_of_files": 15,
        "last_indexed": "2024-11-18T14:20:00+00:00",
        "ingestion_status": "completed",
        "has_tables": true,
        "has_charts": true,
        "has_images": false
      }
    }
  ],
  "total_collections": 1,
  "message": "Collections listed successfully."
}

Use Cases#

Data Governance and Compliance#

Track ownership, business domain, and lifecycle status of collections for compliance and auditing requirements:

# Mark collections for different governance stages
ingestor.update_collection_metadata(
    collection_name="legal_contracts",
    status="Active",
    owner="Legal Team",
    business_domain="Legal"
)

Knowledge Base Organization#

Use tags and descriptions to organize and discover collections:

# Tag collections by project, team, or topic
ingestor.create_collection(
    collection_name="project_apollo_docs",
    description="Project Apollo Technical Documentation",
    tags=["apollo", "engineering", "technical", "2024"],
    business_domain="Engineering"
)

Lifecycle Management#

Manage collection lifecycles by updating status as collections evolve:

# Archive completed project documentation
ingestor.update_collection_metadata(
    collection_name="project_apollo_docs",
    status="Archived",
    tags=["apollo", "engineering", "technical", "2024", "completed"]
)

Content Analysis#

Use auto-populated metrics to understand collection content types:

# Query collections to find those with tables for structured data extraction
collections = ingestor.get_collections()
table_collections = [
    c for c in collections 
    if c['collection_info'].get('has_tables', False)
]

Data Catalog vs Custom Metadata#

The RAG Blueprint provides two complementary metadata systems:

Feature	Data Catalog (This Document)	Custom Metadata
Purpose	Collection/document management and governance	Document content filtering for retrieval
Scope	Entire collections and documents	Individual document chunks
Schema	Fixed catalog fields (description, tags, owner, etc.)	User-defined per collection (flexible)
Updates	Update anytime via PATCH endpoints	Set during ingestion only
Use Case	“Which collections does Finance own?”	“Show documents with priority > 5”
Filtering	Organization and discovery	Semantic search and retrieval

When to Use:

Use Data Catalog for collection organization, governance, and discovery
Use Custom Metadata for filtering document chunks during retrieval
Use Both together for comprehensive data management

Vector Database Support#

Data Catalog is supported on both Milvus and Elasticsearch with full feature parity:

Feature	Milvus	Elasticsearch
Collection Catalog Metadata	✅	✅
Document Catalog Metadata	✅	✅
Auto-Populated Metrics	✅	✅
Runtime Metadata Updates	✅	✅

API Reference#

For complete API specifications including request/response schemas and error codes, see the API - Ingestor Server Schema.

Endpoints#

POST /v1/collection: Create collection with catalog metadata
PATCH /v1/collections/{collection_name}/metadata: Update collection metadata
PATCH /v1/collections/{collection_name}/documents/{document_name}/metadata: Update document metadata
GET /v1/collections: Get all collections with catalog data

Data Catalog for NVIDIA RAG Blueprint#

Overview#

Collection Catalog Metadata#

Supported Fields#

Auto-Populated Content Metrics#

Creating Collections with Catalog Metadata#

Using the API#

Using the Python Client#

Updating Collection Metadata#

Document Catalog Metadata#

Updating Document Metadata#

Retrieving Collections with Catalog Data#

Using the API#

Example Response#

Use Cases#

Data Governance and Compliance#

Knowledge Base Organization#

Lifecycle Management#

Content Analysis#

Data Catalog vs Custom Metadata#

Vector Database Support#

API Reference#

Endpoints#

Related Documentation#