# nv-ingest Integration Testing Framework
A configurable, dataset-agnostic testing framework for end-to-end validation of nv-ingest pipelines. This framework uses structured YAML configuration for type safety, validation, and parameter management.
## Quick Start

### Prerequisites
- Docker and Docker Compose running
- Python environment with nv-ingest-client
- Access to test datasets
### Run Your First Test

```bash
# 1. Navigate to the nv-ingest-harness directory
cd tools/harness

# 2. Install dependencies
uv sync

# 3. Run with a pre-configured dataset (assumes services are running)
uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Or use a custom path that uses the "active" configuration
uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/data

# With managed infrastructure (starts/stops services)
uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
```
## Configuration System

### YAML Configuration (`test_configs.yaml`)
The framework uses a structured YAML file for all test configuration. Configuration is organized into logical sections:
#### Active Configuration

The `active` section contains your current test settings. Edit these values directly for your test runs:
```yaml
active:
  # Dataset
  dataset_dir: /path/to/your/dataset
  test_name: null  # Auto-generated if null

  # API Configuration
  api_version: v1  # v1 or v2
  pdf_split_page_count: null  # V2 only: pages per chunk (null = default 32)

  # Infrastructure
  hostname: localhost
  readiness_timeout: 600
  profiles: [retrieval]

  # Runtime
  sparse: true
  gpu_search: false
  embedding_model: auto

  # Extraction
  extract_text: true
  extract_tables: true
  extract_charts: true
  extract_images: false
  extract_infographics: true
  text_depth: page
  table_output_format: markdown

  # Pipeline (optional steps)
  enable_caption: false
  enable_split: false
  split_chunk_size: 1024
  split_chunk_overlap: 150

  # Storage
  spill_dir: /tmp/spill
  artifacts_dir: null
  collection_name: null
```
#### Pre-Configured Datasets

Each dataset includes its path, extraction settings, and recall evaluator in one place:
```yaml
datasets:
  bo767:
    path: /raid/jioffe/bo767
    extract_text: true
    extract_tables: true
    extract_charts: true
    extract_images: false
    extract_infographics: false
    recall_dataset: bo767  # Evaluator for recall testing

  bo20:
    path: /raid/jioffe/bo20
    extract_text: true
    extract_tables: true
    extract_charts: true
    extract_images: true
    extract_infographics: false
    recall_dataset: null  # bo20 does not have recall

  earnings:
    path: /raid/jioffe/earnings_consulting
    extract_text: true
    extract_tables: true
    extract_charts: true
    extract_images: false
    extract_infographics: false
    recall_dataset: earnings  # Evaluator for recall testing
```
**Automatic Configuration:** When you use `--dataset=bo767`, the framework automatically:

- Sets the dataset path
- Applies the correct extraction settings (text, tables, charts, images, infographics)
- Configures the recall evaluator (if applicable)
Usage:

```bash
# Single dataset - configs applied automatically
uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Multiple datasets (sweeping) - each gets its own config
uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20

# Custom path still works (uses active section config)
uv run nv-ingest-harness-run --case=e2e --dataset=/custom/path
```
**Dataset Extraction Settings:**

| Dataset | Text | Tables | Charts | Images | Infographics | Recall |
|---|---|---|---|---|---|---|
| `bo767` | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| `earnings` | ✅ | ✅ | ✅ | ❌ | ❌ | ✅ |
| `bo20` | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| `financebench` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| `single` | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
### Configuration Precedence

Settings are applied in order of priority:

```
Environment variables > Dataset-specific config (path + extraction + recall_dataset) > YAML active config
```

**Note:** CLI arguments are used only for runtime decisions (which test to run, which dataset, execution mode). All configuration values come from YAML or environment variables.
**Example:**

```bash
# YAML active section has api_version: v1
# Dataset bo767 has extract_images: false

# Override via environment variable (highest priority)
EXTRACT_IMAGES=true API_VERSION=v2 uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Result: uses the bo767 path, but extract_images=true (env override) and api_version=v2 (env override)
```
**Precedence Details:**

1. **Environment variables** - Highest priority, useful for CI/CD overrides
2. **Dataset-specific config** - Applied automatically when using `--dataset=<name>`
   - Includes: path, extraction settings, `recall_dataset`
   - Only applies if the dataset is defined in the `datasets` section
3. **YAML active config** - Base configuration, used as a fallback
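A minimal sketch of this precedence chain (function and variable names here are illustrative, not the actual `config.py` internals):

```python
import os

def resolve(field: str, env_var: str, dataset_cfg: dict, active_cfg: dict):
    """Return the effective value for one config field.
    Note: real loading also parses env strings into typed values."""
    if env_var in os.environ:          # 1. env override (highest priority)
        return os.environ[env_var]
    if field in dataset_cfg:           # 2. dataset-specific config
        return dataset_cfg[field]
    return active_cfg.get(field)       # 3. YAML active fallback

active = {"extract_images": True, "api_version": "v1"}
bo767 = {"extract_images": False}
print(resolve("extract_images", "EXTRACT_IMAGES", bo767, active))  # False: dataset wins over active
```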
### Configuration Options Reference

#### Core Options

- `dataset_dir` (string, required): Path to dataset directory
- `test_name` (string): Test identifier (auto-generated if null)
- `api_version` (string): API version - `v1` or `v2`
- `pdf_split_page_count` (integer): PDF splitting page count (V2 only, 1-128); see the sketch below
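As a quick sanity check on `pdf_split_page_count`: the number of chunks a document produces is just ceiling division. The arithmetic below is ours for illustration; the actual splitting happens in the service.

```python
import math

def pdf_chunk_count(total_pages: int, pdf_split_page_count: int = 32) -> int:
    """Chunks a PDF is split into under V2 page splitting (illustrative only)."""
    return math.ceil(total_pages / pdf_split_page_count)

print(pdf_chunk_count(100, 32))  # 4 chunks: 32 + 32 + 32 + 4 pages
print(pdf_chunk_count(100, 8))   # 13 chunks
```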
#### Extraction Options

- `extract_text`, `extract_tables`, `extract_charts`, `extract_images`, `extract_infographics` (boolean): Content extraction toggles
- `text_depth` (string): Text extraction granularity - `page`, `document`, `block`, `line`, etc.
- `table_output_format` (string): Table output format - `markdown`, `html`, `latex`, `pseudo_markdown`, `simple`
#### Pipeline Options

- `enable_caption` (boolean): Enable image captioning (requires the VLM profile to be running)
- `caption_prompt` (string): Override the user prompt sent to the captioning VLM. Defaults to `"Caption the content of this image:"`.
- `caption_reasoning` (boolean): Enable reasoning mode for the captioning VLM. `True` enables reasoning, `False` disables it. Defaults to `null` (service default, typically disabled).
- `enable_split` (boolean): Enable text chunking
- `split_chunk_size` (integer): Chunk size for text splitting
- `split_chunk_overlap` (integer): Overlap for text splitting (see the chunking sketch below)
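To see how `split_chunk_size` and `split_chunk_overlap` interact, here is a minimal sketch of sliding-window chunking. It is character-based purely for illustration; the pipeline's actual splitting unit may differ.

```python
def split_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 150) -> list[str]:
    """Sliding-window chunking: each chunk starts chunk_size - chunk_overlap
    after the previous one, so consecutive chunks share chunk_overlap characters.
    Assumes chunk_overlap < chunk_size."""
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - chunk_overlap, 1), step)]

chunks = split_text("x" * 3000)
print(len(chunks), [len(c) for c in chunks])  # 4 chunks; the last one is shorter
```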
#### Infrastructure Options

- `hostname` (string): Service hostname
- `readiness_timeout` (integer): Docker startup timeout in seconds
- `profiles` (list): Docker compose profiles
#### Runtime Options

- `sparse` (boolean): Use sparse embeddings
- `gpu_search` (boolean): Use GPU for search
- `embedding_model` (string): Embedding model name (`auto` for auto-detection)
- `llm_summarization_model` (string): LLM model for summarization (used by `e2e_with_llm_summary`)
#### Storage Options

- `spill_dir` (string): Temporary processing directory
- `artifacts_dir` (string): Test output directory (auto-generated if null)
- `collection_name` (string): Milvus collection name (auto-generated as `{test_name}_multimodal` if null; deterministic - no timestamp)
#### Valid Configuration Values

- `text_depth`: block, body, document, header, line, nearby_block, other, page, span
- `table_output_format`: html, image, latex, markdown, pseudo_markdown, simple
- `api_version`: v1, v2
Configuration is validated on load with helpful error messages.
## Running Tests

### Basic Usage

```bash
# Run with default YAML configuration (assumes services are running)
uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# With document-level analysis
uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --doc-analysis

# With managed infrastructure (starts/stops services)
uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
```
### Dataset Sweeping

Run multiple datasets in a single command - each dataset automatically gets its native extraction configuration:

```bash
# Sweep multiple datasets
uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20

# Each dataset runs sequentially with its own:
# - Extraction settings (from dataset config)
# - Artifact directory (timestamped per dataset)
# - Results summary at the end

# With managed infrastructure (services start once, shared across all datasets)
uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20 --managed

# E2E+Recall sweep (each dataset ingests then evaluates recall)
uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767,earnings

# Recall-only sweep (evaluates existing collections)
uv run nv-ingest-harness-run --case=recall --dataset=bo767,earnings
```
**Sweep Behavior:**

- Services start once (if `--managed`) before the sweep
- Each dataset gets its own artifact directory
- Each dataset automatically applies its extraction config from the `datasets` section
- A summary is printed at the end showing success/failure for each dataset
- Services stop once at the end (unless `--keep-up`)
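Conceptually, the sweep is a simple loop. The sketch below is hypothetical (the real orchestration lives in `cli/run.py`), with stub functions standing in for the harness internals:

```python
# Hypothetical sketch of dataset sweeping; stubs replace real harness calls.
def start_services(): print("services up")
def stop_services(): print("services down")
def load_config(dataset: str) -> dict: return {"dataset": dataset}
def run_case(case: str, config: dict) -> int:
    print(f"running {case} on {config['dataset']}"); return 0

def run_sweep(case: str, dataset_names: list[str], managed: bool = False) -> dict[str, bool]:
    results: dict[str, bool] = {}
    if managed:
        start_services()                        # started once, shared across datasets
    try:
        for name in dataset_names:
            config = load_config(dataset=name)  # per-dataset extraction settings
            results[name] = run_case(case, config) == 0
    finally:
        if managed:
            stop_services()                     # skipped when --keep-up is set
    return results

print(run_sweep("e2e", ["bo767", "earnings", "bo20"], managed=True))
```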
### Using Environment Variables

```bash
# Override via environment (useful for CI/CD)
API_VERSION=v2 EXTRACT_TABLES=false uv run nv-ingest-harness-run --case=e2e

# Temporary changes without editing YAML
DATASET_DIR=/custom/path uv run nv-ingest-harness-run --case=e2e
```
## Test Scenarios

### Available Tests

| Name | Description | Configuration Needed | Status |
|---|---|---|---|
| `e2e` | Dataset-agnostic E2E ingestion | `active` section only | ✅ Primary (YAML config) |
| `e2e_with_llm_summary` | E2E with LLM summarization via UDF | `active` section only | ✅ Available (YAML config) |
| `recall` | Recall evaluation against existing collections | `active` + `recall` sections | ✅ Available (YAML config) |
| `e2e_recall` | Fresh ingestion + recall evaluation | `active` + `recall` sections | ✅ Available (YAML config) |
**Note:** Legacy test cases (`dc20_e2e`, `dc20_v2_e2e`) have been moved to `scripts/private_local`.
### Configuration Synergy

**For E2E-only users:**

- Only configure the `active` section
- `collection_name` in `active`: auto-generates from `test_name` or the dataset basename if null (deterministic, no timestamp)
- Collection name pattern: `{test_name}_multimodal` (e.g., `bo767_multimodal`, `earnings_consulting_multimodal`)
- The `recall` section is optional (not used unless running recall tests)
- Note: You can run recall later against the same collection created by `e2e`

**For Recall-only users:**

- Configure the `active` section: `hostname`, `sparse`, `gpu_search`, etc. (for evaluation)
- Configure the `recall` section: `recall_dataset` (required)
- Set `test_name` in `active` to match your existing collection (the collection must be `{test_name}_multimodal`)
- `collection_name` in `active` is ignored (recall generates `{test_name}_multimodal`)

**For E2E+Recall users:**

- Configure the `active` section: `dataset_dir`, `test_name`, extraction settings, etc.
- Configure the `recall` section: `recall_dataset` (required)
- Collection naming: `e2e_recall` automatically creates the `{test_name}_multimodal` collection
- `collection_name` in `active` is ignored (`e2e_recall` forces the `{test_name}_multimodal` pattern)
### Example Configurations

**V2 API with PDF Splitting:**

```yaml
# Edit test_configs.yaml active section:
active:
  api_version: v2
  pdf_split_page_count: 32
  extract_text: true
  extract_tables: true
  extract_charts: true
```

**Text-Only Processing:**

```yaml
active:
  extract_text: true
  extract_tables: false
  extract_charts: false
  extract_images: false
  extract_infographics: false
```

**RAG with Text Chunking:**

```yaml
active:
  enable_split: true
  split_chunk_size: 1024
  split_chunk_overlap: 150
```

**Multimodal with Image Extraction:**

```yaml
active:
  extract_text: true
  extract_tables: true
  extract_charts: true
  extract_images: true
  extract_infographics: true
  enable_caption: true
```
## Recall Testing

Recall testing evaluates retrieval accuracy against ground-truth query sets. Two test cases are available:

### Test Cases

**`recall`** - Recall-only evaluation against existing collections:

- Skips ingestion (assumes collections already exist)
- Loads existing collections from Milvus
- Evaluates recall using multimodal queries (all datasets are multimodal-only)
- Supports reranker comparison (no reranker, with reranker, or reranker-only)

**`e2e_recall`** - Fresh ingestion + recall evaluation:

- Performs the full ingestion pipeline
- Creates the multimodal collection during ingestion
- Evaluates recall immediately after ingestion
- Combines ingestion metrics with recall metrics
### Reranker Configuration

Three modes via the `reranker_mode` setting:

- **No reranker (default):** `reranker_mode: none` - Runs evaluation without the reranker only
- **Both modes:** `reranker_mode: both` - Runs evaluation twice: once without the reranker, once with it. Useful for comparing reranker impact.
- **Reranker only:** `reranker_mode: with` - Runs evaluation with the reranker only. Faster when you only need reranked results.
### Collection Naming

**Deterministic Collection Names (No Timestamps)**

All test cases use deterministic collection names (no timestamps) to enable:

- Reusing collections across test runs
- Running recall evaluation after e2e ingestion
- Consistent collection naming patterns

**Collection Name Patterns:**

All test cases use the same consistent pattern: `{test_name}_multimodal`
| Test Case | Pattern | Example |
|---|---|---|
| `e2e` | `{test_name}_multimodal` | `bo767_multimodal` |
| `e2e_with_llm_summary` | `{test_name}_multimodal` | `bo767_multimodal` |
| `e2e_recall` | `{test_name}_multimodal` | `bo767_multimodal` |
| `recall` | `{test_name}_multimodal` | `bo767_multimodal` |
**Benefits:**

- ✅ Run `e2e` then `recall` separately - they use the same collection
- ✅ Consistent naming across all test cases
- ✅ Deterministic names (no timestamps) enable collection reuse

**Recall Collections:**

- A single multimodal collection is created for recall evaluation
- Pattern: `{test_name}_multimodal`
- Example: `bo767_multimodal`
- All datasets evaluate against this multimodal collection (no modality-specific collections)

**Note:** Artifact directories still use timestamps for tracking over time (e.g., `bo767_20251106_180859_UTC`), but collection names are deterministic.
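The naming rule above is simple enough to state in a few lines. A minimal sketch (the helper name is hypothetical; the real logic lives in the harness):

```python
import os

def collection_name(test_name: str | None, dataset_dir: str) -> str:
    """Deterministic Milvus collection name: {test_name}_multimodal,
    falling back to the dataset directory basename when test_name is unset."""
    base = test_name or os.path.basename(os.path.normpath(dataset_dir))
    return f"{base}_multimodal"

print(collection_name(None, "/raid/jioffe/bo767"))   # bo767_multimodal
print(collection_name("earnings_consulting", "/x"))  # earnings_consulting_multimodal
```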
### Multimodal-Only Evaluation

All datasets use multimodal-only evaluation:

- Ground-truth queries contain all content types (text, tables, charts)
- A single collection contains all extracted content types
- Simplified evaluation interface (no modality filtering)
### Ground Truth Files

**bo767 dataset:**

- Ground truth file: `bo767_query_gt.csv` (consolidated multimodal queries)
- Located in the repo `data/` directory
- The default `ground_truth_dir: null` automatically uses the `data/` directory
- A custom path can be specified via the `ground_truth_dir` config

**Other datasets (finance_bench, earnings, audio):**

- Ground truth files must be obtained separately (not in the public repo)
- Set `ground_truth_dir` to point to your ground truth directory
- Dataset-specific evaluators are extensible (see `recall_utils.py`); a hypothetical sketch follows below
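As an illustration only - the names and CSV layout below are hypothetical and not the actual `recall_utils.py` interface - a dataset-specific evaluator mostly amounts to loading a query-to-document ground-truth mapping:

```python
import csv
from dataclasses import dataclass

@dataclass
class GroundTruthEntry:
    query: str
    expected_doc: str  # document the query should retrieve

def load_ground_truth(csv_path: str) -> list[GroundTruthEntry]:
    """Load a consolidated multimodal query set; assumes columns
    'query' and 'expected_doc' purely for illustration."""
    with open(csv_path, newline="") as f:
        return [GroundTruthEntry(row["query"], row["expected_doc"])
                for row in csv.DictReader(f)]
```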
### Configuration

Edit the `recall` section in `test_configs.yaml`:

```yaml
recall:
  # Reranker configuration
  reranker_mode: none  # Options: "none", "with", "both"

  # Recall evaluation settings
  recall_top_k: 10
  ground_truth_dir: null  # null = use repo data/ directory
  recall_dataset: bo767  # Required: must be explicitly set (bo767, finance_bench, earnings, audio)
```
### Usage Examples

**Recall-only (existing collections):**

```bash
# Evaluate existing bo767 collections (no reranker)
# recall_dataset automatically set from dataset config
uv run nv-ingest-harness-run --case=recall --dataset=bo767

# With reranker only (set reranker_mode in YAML recall section)
uv run nv-ingest-harness-run --case=recall --dataset=bo767

# Sweep multiple datasets for recall evaluation
uv run nv-ingest-harness-run --case=recall --dataset=bo767,earnings
```

**E2E + Recall (fresh ingestion):**

```bash
# Fresh ingestion with recall evaluation
# recall_dataset automatically set from dataset config
uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767

# Sweep multiple datasets (each ingests then evaluates)
uv run nv-ingest-harness-run --case=e2e_recall --dataset=bo767,earnings
```
**Dataset configuration:**

- Dataset path: automatically set from the `datasets` section when using `--dataset=<name>`
- Extraction settings: automatically applied from the `datasets` section
- `recall_dataset`: automatically set from the `datasets` section (e.g., bo767, earnings, finance_bench); can be overridden via the `RECALL_DATASET=bo767` environment variable
- `test_name`: auto-generated from the dataset name or the basename of the path (can be set in the YAML `active` section)
- Collection naming: `{test_name}_multimodal` (automatically generated for recall cases)
- All datasets evaluate against the same `{test_name}_multimodal` collection (multimodal-only)
### Output

Recall results are included in `results.json`:
```json
{
  "recall_results": {
    "no_reranker": {
      "1": 0.554,
      "3": 0.746,
      "5": 0.807,
      "10": 0.857
    },
    "with_reranker": {
      "1": 0.601,
      "3": 0.781,
      "5": 0.832,
      "10": 0.874
    }
  }
}
```
Metrics are also logged via `kv_event_log()`:

- `recall_multimodal_@{k}_no_reranker`
- `recall_multimodal_@{k}_with_reranker`
- `recall_eval_time_s_no_reranker`
- `recall_eval_time_s_with_reranker`
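For reference, recall@k here is the fraction of ground-truth queries whose expected document appears in the top-k retrieved results. A minimal sketch with an illustrative data layout (not the harness's internal representation):

```python
def recall_at_k(retrieved: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of queries whose expected document appears in the top-k hits.
    retrieved maps query -> ranked document ids; expected maps query -> gt doc id."""
    hits = sum(1 for q, gt in expected.items() if gt in retrieved.get(q, [])[:k])
    return hits / len(expected)

retrieved = {"q1": ["d3", "d1", "d7"], "q2": ["d5", "d2"]}
expected = {"q1": "d1", "q2": "d9"}
print(recall_at_k(retrieved, expected, 1))  # 0.0
print(recall_at_k(retrieved, expected, 3))  # 0.5 (q1 hit at rank 2)
```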
## Sweeping Parameters

### Dataset Sweeping (Recommended)

The easiest way to test multiple datasets is dataset sweeping:

```bash
# Test multiple datasets - each gets its native config automatically
uv run nv-ingest-harness-run --case=e2e --dataset=bo767,earnings,bo20

# Each dataset runs with its pre-configured extraction settings
# Results are organized in separate artifact directories
```
### Parameter Sweeping

To sweep through different parameter values:

1. Edit `test_configs.yaml` - update values in the `active` section
2. Run the test: `uv run nv-ingest-harness-run --case=e2e --dataset=<name>`
3. Analyze results in `artifacts/<test_name>_<timestamp>/`
4. Repeat steps 1-3 for the next parameter combination
Example parameter sweep workflow:

```bash
# Test 1: Baseline V1
vim test_configs.yaml  # Set: api_version=v1, extract_tables=true
uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Test 2: V2 with 32-page splitting
vim test_configs.yaml  # Set: api_version=v2, pdf_split_page_count=32
uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Test 3: V2 with 8-page splitting
vim test_configs.yaml  # Set: pdf_split_page_count=8
uv run nv-ingest-harness-run --case=e2e --dataset=bo767

# Test 4: Tables disabled (override via env var)
EXTRACT_TABLES=false uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

**Note:** Each test run creates a new timestamped artifact directory, so you can compare results across sweeps.
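To automate the workflow above instead of editing the YAML between runs, you can drive the sweep through the documented environment-variable overrides. A minimal sketch; `PDF_SPLIT_PAGE_COUNT` is an assumed variable name here - check the env mapping in `config.py` for the exact spelling:

```python
# Sketch: drive a parameter sweep via env-var overrides instead of YAML edits.
import os
import subprocess

CMD = ["uv", "run", "nv-ingest-harness-run", "--case=e2e", "--dataset=bo767"]

for overrides in (
    {"API_VERSION": "v1"},
    {"API_VERSION": "v2", "PDF_SPLIT_PAGE_COUNT": "32"},  # assumed env var name
    {"API_VERSION": "v2", "PDF_SPLIT_PAGE_COUNT": "8"},
    {"EXTRACT_TABLES": "false"},
):
    print(f"Running with {overrides}")
    subprocess.run(CMD, env={**os.environ, **overrides}, check=True)
```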
## Execution Modes

### Attach Mode (Default)

```bash
uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

- Default behavior: assumes services are already running
- Runs the test case only (no service management)
- Faster for iterative testing
- Use when Docker services are already up
- `--no-build` and `--keep-up` flags are ignored in attach mode
### Managed Mode

```bash
uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed
```

- Starts Docker services automatically
- Waits for service readiness (configurable timeout)
- Runs the test case
- Collects artifacts
- Stops services after the test (unless `--keep-up`)

Managed mode options:

```bash
# Skip Docker image rebuild (faster startup)
uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed --no-build

# Keep services running after test (useful for multi-test scenarios)
uv run nv-ingest-harness-run --case=e2e --dataset=bo767 --managed --keep-up
```
## Artifacts and Logging

All test outputs are collected in timestamped directories:

```
tools/harness/artifacts/<test_name>_<timestamp>_UTC/
├── results.json   # Consolidated test metadata and results
├── stdout.txt     # Complete test output
└── e2e.json       # Structured metrics and events
```

**Note:** Artifact directories use timestamps for tracking test runs over time, while collection names are deterministic (no timestamps) to enable collection reuse and recall evaluation.
### Results Structure

`results.json` contains:

- **Runner metadata:** case name, timestamp, git commit, infrastructure mode
- **Test configuration:** API version, extraction settings, dataset info
- **Test results:** chunks created, timing, performance metrics
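A small sketch of pulling numbers out of `results.json` for cross-run comparison. It assumes the `recall_results` layout shown in the Output section above; other keys may vary by test case:

```python
import json
from pathlib import Path

# Compare recall@10 across artifact directories.
for results_file in sorted(Path("tools/harness/artifacts").glob("*/results.json")):
    data = json.loads(results_file.read_text())
    recall = data.get("recall_results", {}).get("no_reranker", {})
    if recall:
        print(f"{results_file.parent.name}: recall@10 = {recall.get('10')}")
```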
### Document Analysis

Enable per-document element breakdown:

```bash
uv run nv-ingest-harness-run --case=e2e --doc-analysis
```

**Sample Output:**

```
Document Analysis:
document1.pdf: 44 elements (text: 15, tables: 13, charts: 15, images: 0, infographics: 1)
document2.pdf: 14 elements (text: 9, tables: 0, charts: 4, images: 0, infographics: 1)
```

This provides:

- Element counts by type for each document
- Useful for understanding dataset characteristics
- Helps identify processing bottlenecks
- Validates extraction completeness
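If you post-process extraction results yourself, a per-type tally like the sample output is a short `collections.Counter` exercise. The element records below are hypothetical, just to show the shape of the computation:

```python
from collections import Counter

# Hypothetical element records, as a pipeline result might expose them.
elements = [
    {"doc": "document1.pdf", "type": "text"},
    {"doc": "document1.pdf", "type": "table"},
    {"doc": "document2.pdf", "type": "chart"},
]

per_doc: dict[str, Counter] = {}
for el in elements:
    per_doc.setdefault(el["doc"], Counter())[el["type"]] += 1

for doc, counts in per_doc.items():
    print(f"{doc}: {sum(counts.values())} elements ({dict(counts)})")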
## Architecture

### Framework Components

**1. Configuration Layer**

- `test_configs.yaml` - Structured configuration file
  - Active test configuration (edit directly)
  - Dataset shortcuts for quick access
- `src/nv_ingest_harness/config.py` - Configuration management
  - YAML loading and parsing
  - Type-safe config dataclass
  - Validation logic with helpful errors
  - Environment variable override support

**2. Test Runner**

- `src/nv_ingest_harness/cli/run.py` - Main orchestration
  - Configuration loading with precedence chain
  - Docker service management (managed mode)
  - Test case execution with config injection
  - Artifact collection and consolidation

**3. Test Cases**

- `src/nv_ingest_harness/cases/e2e.py` - Primary E2E test (✅ YAML-based)
  - Accepts the config object directly
  - Type-safe parameter access
  - Full pipeline validation (extract → embed → VDB → retrieval)
  - Transparent configuration logging
- `cases/e2e_with_llm_summary.py` - E2E with LLM (✅ YAML-based)
  - Adds UDF-based LLM summarization
  - Same config-based architecture as `e2e.py`
- `src/nv_ingest_harness/cases/recall.py` - Recall evaluation (✅ YAML-based)
  - Evaluates retrieval accuracy against existing collections
  - Requires `recall_dataset` in config (from dataset config or env var)
  - Supports reranker comparison modes (none, with, both)
  - Multimodal-only evaluation against the `{test_name}_multimodal` collection
- `src/nv_ingest_harness/cases/e2e_recall.py` - E2E + Recall (✅ YAML-based)
  - Combines ingestion (via `e2e.py`) with recall evaluation (via `recall.py`)
  - Automatically creates the collection during ingestion
  - Requires `recall_dataset` in config (from dataset config or env var)
  - Merges ingestion and recall metrics in results

**4. Shared Utilities**

- `src/nv_ingest_harness/utils/interact.py` - Common testing utilities
  - `embed_info()` - Embedding model detection
  - `milvus_chunks()` - Vector database statistics
  - `segment_results()` - Result categorization by type
  - `kv_event_log()` - Structured logging
  - `pdf_page_count()` - Dataset page counting
### Configuration Flow

```
test_configs.yaml   →   load_config()      →   TestConfig object   →   test case
(active + datasets)     (applies dataset       (validated,
                         config)                type-safe)
        ↑                      ↑
  Env overrides          Dataset configs
    (highest)            (auto-applied)
```
**Configuration Loading:**

1. Start with the `active` section from YAML
2. If `--dataset=<name>` matches a configured dataset:
   - Apply the dataset path
   - Apply the dataset extraction settings
   - Apply the dataset `recall_dataset` (if set)
3. Apply environment variable overrides (if any)
4. Validate and create the `TestConfig` object

All test cases receive a validated `TestConfig` object with typed fields, eliminating string-parsing errors.
## Development Guide

### Adding New Test Cases

1. Create a test script in the `cases/` directory

2. Accept a config parameter:

   ```python
   def main(config, log_path: str = "test_results") -> int:
       """
       Test case entry point.

       Args:
           config: TestConfig object with all settings
           log_path: Path for structured logging

       Returns:
           Exit code (0 = success)
       """
       # Access config directly (type-safe)
       data_dir = config.dataset_dir
       api_version = config.api_version
       extract_text = config.extract_text
       # ...
   ```

3. Add transparent logging:

   ```python
   print("=== Test Configuration ===")
   print(f"Dataset: {config.dataset_dir}")
   print(f"API: {config.api_version}")
   print(f"Extract: text={config.extract_text}, tables={config.extract_tables}")
   print("=" * 60)
   ```

4. Use structured logging:

   ```python
   from interact import kv_event_log

   kv_event_log("ingestion_time_s", elapsed_time, log_path)
   kv_event_log("text_chunks", num_text_chunks, log_path)
   ```

5. Register the case in `run.py`:

   ```python
   CASES = ["e2e", "e2e_with_llm_summary", "your_new_case"]
   ```
### Extending Configuration

To add new configurable parameters:

1. Add to the `TestConfig` dataclass in `config.py`:

   ```python
   @dataclass
   class TestConfig:
       # ... existing fields
       new_param: bool = False  # Add with type and default
   ```

2. Add to the YAML `active` section:

   ```yaml
   active:
     # ... existing config
     new_param: false  # Match Python default
   ```

3. Add an environment variable mapping in `config.py` (if needed):

   ```python
   env_mapping = {
       # ... existing mappings
       "NEW_PARAM": ("new_param", parse_bool),
   }
   ```

4. Add validation (if needed) in `TestConfig.validate()`:

   ```python
   def validate(self) -> List[str]:
       errors = []
       # ... existing validation
       if self.new_param and self.some_other_field is None:
           errors.append("new_param requires some_other_field to be set")
       return errors
   ```

5. Update this README with a parameter description
### Testing Different Datasets

The framework is dataset-agnostic and supports multiple approaches:

**Option 1: Use a pre-configured dataset (Recommended)**

```bash
# Dataset configs automatically applied
uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

**Option 2: Add a new dataset to YAML**

```yaml
datasets:
  my_dataset:
    path: /path/to/your/dataset
    extract_text: true
    extract_tables: true
    extract_charts: true
    extract_images: false
    extract_infographics: false
    recall_dataset: null  # or set to evaluator name if applicable
```

```bash
uv run nv-ingest-harness-run --case=e2e --dataset=my_dataset
```

**Option 3: Use a custom path (uses active section config)**

```bash
uv run nv-ingest-harness-run --case=e2e --dataset=/path/to/your/dataset
```

**Option 4: Environment variable override**

```bash
# Override specific settings via env vars
EXTRACT_IMAGES=true uv run nv-ingest-harness-run --case=e2e --dataset=bo767
```

**Best Practice:** For repeated testing, add your dataset to the `datasets` section with its native extraction settings. This ensures consistent configuration and enables dataset sweeping.
## Additional Resources

- **Configuration:** See `config.py` for the complete field list and validation logic
- **Test utilities:** See `interact.py` for shared helper functions
- **Docker setup:** See the project root README for service management commands
- **API documentation:** See `docs/` for API version differences
The framework prioritizes clarity, type safety, and validation to support reliable testing of nv-ingest pipelines.