Data Catalog for NVIDIA RAG Blueprint#

The Data Catalog feature enables comprehensive metadata management for collections and documents in the NVIDIA RAG Blueprint. This feature provides organization, governance, and discovery capabilities for your knowledge base through collection-level and document-level catalog metadata.

After you have deployed the blueprint, the Data Catalog endpoints are automatically available. No additional configuration is required.

Overview#

The Data Catalog provides two types of metadata:

  1. Collection Catalog Metadata: Organizational metadata for entire collections (description, tags, owner, business domain, status)

  2. Document Catalog Metadata: Metadata for individual documents within collections (description, tags)

Additionally, the system automatically populates content metrics such as number_of_files, has_tables, has_charts, and has_images to help you understand what content each collection contains.

Collection Catalog Metadata#

Supported Fields#

Field

Type

Description

Example

description

string

Human-readable description of the collection

“Q4 2024 Financial Reports”

tags

array[string]

Tags for categorization and discovery

["finance", "q4-2024"]

owner

string

Team or person responsible

“Finance Team”

created_by

string

User who created the collection

“john.doe@company.com”

business_domain

string

Business domain or department

“Finance”, “Legal”, “Engineering”

status

string

Collection lifecycle status

“Active”, “Archived”, “Deprecated”

date_created

timestamp

Automatically set on creation

“2024-11-18T10:30:00+00:00”

last_updated

timestamp

Automatically updated on changes

“2024-11-18T15:45:00+00:00”

Auto-Populated Content Metrics#

The system automatically analyzes ingested content and provides these metrics:

Metric

Type

Description

number_of_files

integer

Total number of documents in the collection

last_indexed

timestamp

Last time documents were ingested

ingestion_status

string

Current ingestion status

has_tables

boolean

Whether collection contains table content

has_charts

boolean

Whether collection contains charts/diagrams

has_images

boolean

Whether collection contains images

Creating Collections with Catalog Metadata#

Using the API#

import requests

url = "http://localhost:8082/v1/collection"

data = {
    "collection_name": "financial_reports_2024",
    "embedding_dimension": 2048,
    "description": "Q4 2024 Financial Reports and Analysis",
    "tags": ["finance", "reports", "q4-2024"],
    "owner": "Finance Team",
    "created_by": "john.doe@company.com",
    "business_domain": "Finance",
    "status": "Active",
    "metadata_schema": []  # Add custom metadata schema if needed
}

response = requests.post(url, json=data)
print(response.json())

Using the Python Client#

from nvidia_rag import NvidiaRAGIngestor

ingestor = NvidiaRAGIngestor()

result = ingestor.create_collection(
    collection_name="financial_reports_2024",
    vdb_endpoint="http://localhost:19530",
    description="Q4 2024 Financial Reports and Analysis",
    tags=["finance", "reports", "q4-2024"],
    owner="Finance Team",
    created_by="john.doe@company.com",
    business_domain="Finance",
    status="Active"
)

Note

All catalog metadata fields are optional. If not provided, they will be empty strings or empty arrays by default.

Updating Collection Metadata#

You can update collection catalog metadata at any time without re-ingesting documents:

import requests

url = "http://localhost:8082/v1/collections/financial_reports_2024/metadata"

updates = {
    "description": "Q4 2024 Financial Reports - Final Version",
    "tags": ["finance", "reports", "q4-2024", "final", "approved"],
    "status": "Archived",
    "business_domain": "Finance"
}

response = requests.patch(url, json=updates)
print(response.json())

Note

The PATCH endpoint performs a merge update. Only provided fields are updated; omitted fields retain their current values.

Document Catalog Metadata#

Updating Document Metadata#

After ingesting documents, you can add descriptive metadata to individual documents:

import requests

url = "http://localhost:8082/v1/collections/financial_reports_2024/documents/annual_report.pdf/metadata"

updates = {
    "description": "Annual Financial Report 2024 - Comprehensive Overview",
    "tags": ["annual", "comprehensive", "board-approved"]
}

response = requests.patch(url, json=updates)
print(response.json())

Retrieving Collections with Catalog Data#

Using the API#

import requests

url = "http://localhost:8082/v1/collections"
response = requests.get(url)
result = response.json()

for collection in result.get("collections", []):
    info = collection.get('collection_info', {})
    print(f"Collection: {collection['collection_name']}")
    print(f"  Description: {info.get('description', 'N/A')}")
    print(f"  Tags: {info.get('tags', [])}")
    print(f"  Owner: {info.get('owner', 'N/A')}")
    print(f"  Status: {info.get('status', 'N/A')}")
    print(f"  Files: {info.get('number_of_files', 0)}")
    print(f"  Has Tables: {info.get('has_tables', False)}")
    print(f"  Has Charts: {info.get('has_charts', False)}")
    print(f"  Has Images: {info.get('has_images', False)}")
    print()

Example Response#

{
  "collections": [
    {
      "collection_name": "financial_reports_2024",
      "num_entities": 1250,
      "metadata_schema": [],
      "collection_info": {
        "description": "Q4 2024 Financial Reports - Final Version",
        "tags": ["finance", "reports", "q4-2024", "final"],
        "owner": "Finance Team",
        "created_by": "john.doe@company.com",
        "business_domain": "Finance",
        "status": "Archived",
        "date_created": "2024-11-18T10:30:00+00:00",
        "last_updated": "2024-11-18T15:45:00+00:00",
        "number_of_files": 15,
        "last_indexed": "2024-11-18T14:20:00+00:00",
        "ingestion_status": "completed",
        "has_tables": true,
        "has_charts": true,
        "has_images": false
      }
    }
  ],
  "total_collections": 1,
  "message": "Collections listed successfully."
}

Use Cases#

Data Governance and Compliance#

Track ownership, business domain, and lifecycle status of collections for compliance and auditing requirements:

# Mark collections for different governance stages
ingestor.update_collection_metadata(
    collection_name="legal_contracts",
    status="Active",
    owner="Legal Team",
    business_domain="Legal"
)

Knowledge Base Organization#

Use tags and descriptions to organize and discover collections:

# Tag collections by project, team, or topic
ingestor.create_collection(
    collection_name="project_apollo_docs",
    description="Project Apollo Technical Documentation",
    tags=["apollo", "engineering", "technical", "2024"],
    business_domain="Engineering"
)

Lifecycle Management#

Manage collection lifecycles by updating status as collections evolve:

# Archive completed project documentation
ingestor.update_collection_metadata(
    collection_name="project_apollo_docs",
    status="Archived",
    tags=["apollo", "engineering", "technical", "2024", "completed"]
)

Content Analysis#

Use auto-populated metrics to understand collection content types:

# Query collections to find those with tables for structured data extraction
collections = ingestor.get_collections()
table_collections = [
    c for c in collections 
    if c['collection_info'].get('has_tables', False)
]

Data Catalog vs Custom Metadata#

The RAG Blueprint provides two complementary metadata systems:

Feature

Data Catalog (This Document)

Custom Metadata

Purpose

Collection/document management and governance

Document content filtering for retrieval

Scope

Entire collections and documents

Individual document chunks

Schema

Fixed catalog fields (description, tags, owner, etc.)

User-defined per collection (flexible)

Updates

Update anytime via PATCH endpoints

Set during ingestion only

Use Case

“Which collections does Finance own?”

“Show documents with priority > 5”

Filtering

Organization and discovery

Semantic search and retrieval

When to Use:

  • Use Data Catalog for collection organization, governance, and discovery

  • Use Custom Metadata for filtering document chunks during retrieval

  • Use Both together for comprehensive data management

Vector Database Support#

Data Catalog is supported on both Milvus and Elasticsearch with full feature parity:

Feature

Milvus

Elasticsearch

Collection Catalog Metadata

Document Catalog Metadata

Auto-Populated Metrics

Runtime Metadata Updates

API Reference#

For complete API specifications including request/response schemas and error codes, see the API - Ingestor Server Schema.

Endpoints#

  • POST /v1/collection: Create collection with catalog metadata

  • PATCH /v1/collections/{collection_name}/metadata: Update collection metadata

  • PATCH /v1/collections/{collection_name}/documents/{document_name}/metadata: Update document metadata

  • GET /v1/collections: Get all collections with catalog data