NeMo Retriever Extraction V2 API Guide: PDF Pre-Splitting
TL;DR: V2 API automatically splits large PDFs into chunks for faster parallel processing.
Python: Enable with `message_client_kwargs={"api_version": "v2"}` and configure the chunk size with `.pdf_split_config(pages_per_chunk=64)`.
CLI: Use `--api_version v2 --pdf_split_page_count 64`.
Table of Contents
- Quick Start - Get running in 5 minutes
- Configuration Guide - All configuration options
- How It Works - Architecture overview
- Migration from V1 - Upgrade existing code
Quick Start
What is V2 API?
The V2 API automatically splits large PDFs into smaller chunks before processing, enabling:
- Higher throughput - 1.3-1.5x faster for large documents
- Better parallelization - Distribute work across Ray workers
- Configurable chunk sizes - Tune to your infrastructure (1-128 pages)
Minimal Example
```python
from nv_ingest_client.client import Ingestor

# Two-step configuration
ingestor = Ingestor(
    message_client_hostname="http://localhost",
    message_client_port=7670,
    message_client_kwargs={"api_version": "v2"},  # ← Step 1: Enable V2
)

# Run with an optional chunk size override
results = (
    ingestor.files(["large_document.pdf"])
    .extract(extract_text=True, extract_tables=True)
    .pdf_split_config(pages_per_chunk=64)  # ← Step 2: Configure splitting
    .ingest()
)

print(f"Processed {results['metadata']['total_pages']} pages")
```
CLI Usage
```bash
nv-ingest-cli \
  --api_version v2 \
  --pdf_split_page_count 64 \
  --doc large_document.pdf \
  --task 'extract:{"document_type":"pdf", "extract_text":true}' \
  --output_directory ./results
```
That's it! PDFs larger than 64 pages will be automatically split and processed in parallel.
Configuration Guide
Two Key Settings
| Setting | Purpose | How to Set |
|---|---|---|
| API Version | Route requests to V2 endpoints | `message_client_kwargs={"api_version": "v2"}` |
| Chunk Size | Pages per chunk (optional) | `.pdf_split_config(pages_per_chunk=N)` |
Configuration Priority Chain
The chunk size is resolved in this order:
1. Client Override (HIGHEST) → `.pdf_split_config(pages_per_chunk=64)`
2. Server Environment Variable → `PDF_SPLIT_PAGE_COUNT=64` in `.env`
3. Hardcoded Default (FALLBACK) → 32 pages
Client override always wins - useful for per-request tuning.
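To make the chain concrete, here is a minimal sketch of the resolution order; `resolve_chunk_size` is a hypothetical helper for illustration, not the server's actual function:

```python
import os

DEFAULT_PAGES_PER_CHUNK = 32  # hardcoded fallback

def resolve_chunk_size(client_override: int | None) -> int:
    """Resolve pages_per_chunk using the priority chain above."""
    if client_override is not None:        # 1. client override (highest)
        return client_override
    env_value = os.getenv("PDF_SPLIT_PAGE_COUNT")
    if env_value is not None:              # 2. server environment variable
        return int(env_value)
    return DEFAULT_PAGES_PER_CHUNK         # 3. hardcoded default (fallback)
```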
Option 1: Client-Side Configuration (Recommended)
```python
# Full control over chunk size per request
ingestor = Ingestor(
    message_client_kwargs={"api_version": "v2"}
).files(pdf_files) \
 .extract(...) \
 .pdf_split_config(pages_per_chunk=64)  # Client override
```
Pros:
- ✅ Different chunk sizes for different workloads
- ✅ No server config changes needed
- ✅ Clear intent in code

Cons:
- ❌ Must specify in every request
Option 2: Server-Side Default
Set a cluster-wide default via Docker Compose .env:
```bash
# .env file
PDF_SPLIT_PAGE_COUNT=64
```

```yaml
# docker-compose.yaml (already configured)
services:
  nv-ingest-ms-runtime:
    environment:
      - PDF_SPLIT_PAGE_COUNT=${PDF_SPLIT_PAGE_COUNT:-32}
```
Pros:
- ✅ Set once, applies to all clients
- ✅ Different defaults per environment (dev/staging/prod)
- ✅ Clients don't need to specify

Cons:
- ❌ Requires server restart to change
- ❌ Less flexible than client override
Option 3: Use the Default
Simply enable V2 without configuring chunk size:
```python
# Uses the default of 32 pages per chunk
ingestor = Ingestor(
    message_client_kwargs={"api_version": "v2"}
).files(pdf_files).extract(...).ingest()
```
Configuration Matrix
| Client Config | Server Env Var | Effective Chunk Size | Use Case |
|---|---|---|---|
| `.pdf_split_config(64)` | Not set | 64 | Client controls everything |
| `.pdf_split_config(128)` | `PDF_SPLIT_PAGE_COUNT=32` | 128 | Client override wins |
| Not set | `PDF_SPLIT_PAGE_COUNT=48` | 48 | Server default applies |
| Not set | Not set | 32 | Hardcoded fallback |
Choosing Chunk Size
Note: We are developing an auto-tuning system that will automatically select chunk sizes based on document characteristics, available resources, and historical performance. This will eliminate manual tuning for most use cases.
Smaller chunks (16-32 pages):
- ✅ Maximum parallelism
- ✅ Lower GPU memory per worker
- ❌ More overhead from splitting/aggregation
- Best for: Limited GPU memory, many available workers

Medium chunks (32-64 pages):
- ✅ Balanced parallelism and overhead
- ✅ Good for most workloads
- Best for: General use (recommended starting point)

Larger chunks (64-128 pages):
- ✅ Minimal overhead
- ❌ Less parallelism
- Best for: Very large datasets, fewer workers

Very large chunks (128+ pages):
- ❌ Limited parallel benefits
- Best for: Testing or when splitting overhead is problematic
Valid range: 1-128 pages (the server clamps out-of-range values).
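A one-line sketch of that clamping behavior, assuming the documented 1-128 bounds (`clamp_chunk_size` is illustrative, not the server's actual code):

```python
def clamp_chunk_size(requested: int) -> int:
    """Clamp a requested chunk size to the documented 1-128 range."""
    return max(1, min(requested, 128))

assert clamp_chunk_size(1000) == 128  # oversized requests are clamped down
assert clamp_chunk_size(0) == 1       # undersized requests are clamped up
```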
How It Works
Architecture Flow
```
Client                       API Layer (V2)               Ray Workers
  │                               │                            │
  │ 1. Submit PDF                 │                            │
  ├──────────────────────────────►│                            │
  │    (200 pages)                │                            │
  │                               │ 2. Split into chunks       │
  │                               │    (64 pages each)         │
  │                               ├───────┐                    │
  │                               │       │ Chunk 1 (1-64)     │
  │                               │       │ Chunk 2 (65-128)   │
  │                               │       │ Chunk 3 (129-192)  │
  │                               │       └ Chunk 4 (193-200)  │
  │                               │                            │
  │                               │ 3. Process in parallel     │
  │                               ├───────────────────────────►│
  │                               │                            │ Worker A → Chunk 1
  │                               │                            │ Worker B → Chunk 2
  │                               │                            │ Worker C → Chunk 3
  │                               │                            │ Worker D → Chunk 4
  │                               │                            │
  │ 4. Fetch result               │ 5. Aggregate all chunks    │
  │◄──────────────────────────────┼────────────────────────────┤
  │ (all chunks combined)         │ (ordered by page)          │
```
Submission Phase
When you submit a PDF:
- Page Count Check: Server reads PDF metadata to get the total page count
- Split Decision: If `page_count > pages_per_chunk`, trigger splitting
- Chunk Creation: Use `pypdfium2` to split the PDF into page ranges
- Subjob Generation: Create subjobs with deterministic UUIDs
- Redis Storage: Store the parent→subjob mapping with metadata
- Queue Submission: Submit all chunks to the Ray task queue
Example:
```
Original: document.pdf (200 pages)
Config:   pages_per_chunk=64

Chunks created:
- document.pdf#pages_1-64    (64 pages)
- document.pdf#pages_65-128  (64 pages)
- document.pdf#pages_129-192 (64 pages)
- document.pdf#pages_193-200 (8 pages)
```
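The page ranges above follow from simple arithmetic. This sketch reproduces the example; `chunk_page_ranges` is a hypothetical helper, not the server's actual splitter:

```python
def chunk_page_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (start_page, end_page) tuples for each chunk."""
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]

print(chunk_page_ranges(200, 64))
# [(1, 64), (65, 128), (129, 192), (193, 200)]
```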
Processing Phase
Each chunk is processed independently:
- Parallel execution across available Ray workers
- Full pipeline runs on each chunk (extraction, embedding, etc.)
- Per-chunk telemetry emitted to trace/annotations
- Results stored in Redis with subjob IDs
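Conceptually, the fan-out looks like an ordinary Ray remote map. This is an illustrative sketch only; the real pipeline stages and function names live in the server codebase:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def process_chunk(chunk_id: str) -> dict:
    # Placeholder for the full per-chunk pipeline (extraction, embedding, ...)
    return {"chunk_id": chunk_id, "status": "done"}

chunk_ids = ["pages_1-64", "pages_65-128", "pages_129-192", "pages_193-200"]
results = ray.get([process_chunk.remote(c) for c in chunk_ids])  # runs in parallel
```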
Fetch Phase
When you fetch results:
- Parent Check: API checks if job has subjobs
- State Verification: Checks all subjob states in parallel (batched)
- Wait if Needed: Returns 202 if any chunks still processing
- Fetch All Results: Fetches all subjob results in parallel (batched)
- Aggregate Data: Combines all chunk data in original page order
- Compute Metrics: Calculates parent-level trace aggregations
- Return Response: Single unified response with all chunks
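If you fetch over raw HTTP rather than through the `Ingestor` client, the 202-then-retry behavior can be handled with a small polling loop. A sketch, assuming a `/v2/fetch_job/{job_id}` route (see `src/nv_ingest/api/v2/ingest.py` for the actual endpoint definitions):

```python
import time

import requests

def fetch_when_ready(base_url: str, job_id: str, poll_seconds: float = 2.0) -> dict:
    """Poll until every chunk is done, then return the aggregated response."""
    while True:
        resp = requests.get(f"{base_url}/v2/fetch_job/{job_id}", timeout=30)
        if resp.status_code == 202:   # some chunks are still processing
            time.sleep(poll_seconds)
            continue
        resp.raise_for_status()       # surface 4xx/5xx errors
        return resp.json()            # 200: all chunks aggregated, in page order
```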
Response Structure
Small PDFs (≤ chunk size):
```json
{
  "data": [...],
  "trace": {...},
  "annotations": {...},
  "metadata": {
    "total_pages": 15
  }
}
```
Large PDFs (split into chunks):
```json
{
  "data": [...],                    // All chunks combined in page order
  "status": "success",
  "trace": {
    // Parent-level aggregated metrics
    "trace::entry::pdf_extractor": 1000,
    "trace::exit::pdf_extractor": 5000,
    "trace::resident_time::pdf_extractor": 800
  },
  "annotations": {...},             // Merged from all chunks
  "metadata": {
    "parent_job_id": "abc-123",
    "total_pages": 200,
    "pages_per_chunk": 64,
    "original_source_id": "document.pdf",
    "subjobs_failed": 0,
    "chunks": [
      {
        "job_id": "chunk-1-uuid",
        "chunk_index": 1,
        "start_page": 1,
        "end_page": 64,
        "page_count": 64
      }
      // ... more chunks
    ],
    "trace_segments": [
      // Per-chunk trace details (for debugging)
    ]
  }
}
```
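A short sketch of reading these fields, assuming `result` is one parsed response dict with the structure shown above:

```python
meta = result.get("metadata", {})  # result: parsed V2 response (assumed)
if meta.get("chunks"):  # present only when the PDF was split
    print(
        f"{meta['original_source_id']}: {meta['total_pages']} pages in "
        f"{len(meta['chunks'])} chunks of up to {meta['pages_per_chunk']}"
    )
    for chunk in meta["chunks"]:
        print(f"  chunk {chunk['chunk_index']}: pages {chunk['start_page']}-{chunk['end_page']}")
else:
    print(f"Single job, {meta.get('total_pages')} pages")
```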
Trace Aggregation
Parent-level metrics computed from chunk traces:
- `trace::entry::<stage>` - Earliest entry across all chunks (when the first chunk started)
- `trace::exit::<stage>` - Latest exit across all chunks (when the last chunk finished)
- `trace::resident_time::<stage>` - Sum of all chunk durations (total compute time)
Example:
```
Chunk 1: entry=1000, exit=1100 → duration=100ms
Chunk 2: entry=2000, exit=2150 → duration=150ms
Chunk 3: entry=2100, exit=2300 → duration=200ms

Parent metrics:
entry         = min(1000, 2000, 2100) = 1000  ← First chunk started
exit          = max(1100, 2150, 2300) = 2300  ← Last chunk finished
resident_time = 100 + 150 + 200       = 450   ← Total compute

Wall-clock time = exit - entry = 1300ms (parallelization benefit!)
Compute time    = resident_time = 450ms (actual work done)
```
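The same arithmetic as a sketch, assuming each chunk trace is a flat dict of `trace::entry::<stage>` / `trace::exit::<stage>` timestamps (not the server's actual aggregation code):

```python
def aggregate_stage(chunk_traces: list[dict], stage: str) -> dict:
    entries = [t[f"trace::entry::{stage}"] for t in chunk_traces]
    exits = [t[f"trace::exit::{stage}"] for t in chunk_traces]
    return {
        f"trace::entry::{stage}": min(entries),   # first chunk started
        f"trace::exit::{stage}": max(exits),      # last chunk finished
        f"trace::resident_time::{stage}": sum(    # total compute time
            e - s for s, e in zip(entries, exits)
        ),
    }

chunks = [
    {"trace::entry::pdf_extractor": 1000, "trace::exit::pdf_extractor": 1100},
    {"trace::entry::pdf_extractor": 2000, "trace::exit::pdf_extractor": 2150},
    {"trace::entry::pdf_extractor": 2100, "trace::exit::pdf_extractor": 2300},
]
print(aggregate_stage(chunks, "pdf_extractor"))
# {'trace::entry::pdf_extractor': 1000, 'trace::exit::pdf_extractor': 2300,
#  'trace::resident_time::pdf_extractor': 450}
```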
Migration from V1
Minimal Migration (V2, No Splitting)
Smallest possible change - just route to V2 endpoints:
```python
# Before (V1)
ingestor = Ingestor(
    message_client_hostname="http://localhost",
    message_client_port=7670
)

# After (V2, identical behavior for PDFs ≤32 pages)
ingestor = Ingestor(
    message_client_hostname="http://localhost",
    message_client_port=7670,
    message_client_kwargs={"api_version": "v2"}  # Only change
)
```
Behavior: No splitting occurs, responses identical to V1.
Full V2 with Splitting
Enable splitting for large PDFs:
```python
# V2 with PDF splitting
results = Ingestor(
    message_client_hostname="http://localhost",
    message_client_port=7670,
    message_client_kwargs={"api_version": "v2"}
).files(pdf_files) \
 .extract(extract_text=True, extract_tables=True) \
 .pdf_split_config(pages_per_chunk=64) \
 .ingest()
```
Test Script Pattern
For test scripts like `tools/harness/src/nv_ingest_harness/cases/e2e.py`:

```python
import os

from nv_ingest_client.client import Ingestor

# Read from environment
api_version = os.getenv("API_VERSION", "v1")
pdf_split_page_count = int(os.getenv("PDF_SPLIT_PAGE_COUNT", "32"))

hostname = "localhost"  # adjust to your deployment
data_dir = "./data"     # directory of input PDFs

# Build ingestor kwargs
ingestor_kwargs = {
    "message_client_hostname": f"http://{hostname}",
    "message_client_port": 7670,
}

# Enable V2 if configured
if api_version == "v2":
    ingestor_kwargs["message_client_kwargs"] = {"api_version": "v2"}

# Create ingestor
ingestor = Ingestor(**ingestor_kwargs).files(data_dir)

# Configure splitting for V2
if api_version == "v2" and pdf_split_page_count:
    ingestor = ingestor.pdf_split_config(pages_per_chunk=pdf_split_page_count)

# Continue with the pipeline
results = ingestor.extract(...).ingest()
```
Backward Compatibility
V1 clients continue to work:
- Still route to `/v1/submit_job` and `/v1/fetch_job`
- No changes required
- No splitting occurs
V2 responses are V1-compatible:
- Top-level `data`, `trace`, and `annotations` have the same structure
- Additional metadata in the `metadata` object (ignored by V1 parsers)
- Existing response parsing code works unchanged
HTTP status codes:
| Code | Meaning | Action |
|---|---|---|
| 200 | All chunks complete | Parse results |
| 202 | Still processing | Poll again later |
| 404 | Job not found | Check job ID |
| 410 | Result consumed | Already fetched (destructive mode) |
| 500 | Server error | Check logs |
| 503 | Processing failed | Check `failed_subjobs` metadata |
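A sketch of dispatching on these codes, assuming `resp` is a `requests.Response` from a fetch call:

```python
def handle_fetch_response(resp) -> dict | None:
    if resp.status_code == 200:
        return resp.json()        # all chunks complete; parse results
    if resp.status_code == 202:
        return None               # still processing; poll again later
    if resp.status_code == 404:
        raise KeyError("Job not found; check the job ID")
    if resp.status_code == 410:
        raise RuntimeError("Result already consumed (destructive mode)")
    if resp.status_code == 503:
        raise RuntimeError("Processing failed; check failed_subjobs metadata")
    resp.raise_for_status()       # 500 and anything else
```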
Silent Clamping of Chunk Size
Symptom: Requested chunk size is not used.
Cause: The server clamps to the valid range (1-128).
Check server logs for:
```
WARNING: Client requested split_page_count=1000; clamped to 128
```
Solution: Use values within the 1-128 range.
Response Fields
All PDFs:
- `data` - Array of extracted content
- `trace` - Trace metrics
- `annotations` - Task annotations
- `metadata.total_pages` - Total page count
Split PDFs only:
- `metadata.parent_job_id` - Parent job UUID
- `metadata.pages_per_chunk` - Configured chunk size
- `metadata.chunks[]` - Chunk descriptors
- `metadata.trace_segments[]` - Per-chunk traces
- `metadata.failed_subjobs[]` - Failed chunk details
Trace Metrics
Parent-level (all jobs):
- `trace::entry::<stage>` - Earliest start time
- `trace::exit::<stage>` - Latest finish time
- `trace::resident_time::<stage>` - Total compute time
Chunk-level (split jobs only):
- `metadata.trace_segments[].trace` - Per-chunk traces
Key Files
Server Implementation:
- `src/nv_ingest/api/v2/ingest.py` - V2 endpoints
- `src/nv_ingest/framework/util/service/impl/ingest/redis_ingest_service.py` - Redis state management
Client Implementation:
- `client/src/nv_ingest_client/client/interface.py` - Ingestor class
- `client/src/nv_ingest_client/util/util.py` - Configuration utilities
- `client/src/nv_ingest_client/client/ingest_job_handler.py` - Job handling
Schemas:
- `api/src/nv_ingest_api/internal/schemas/meta/ingest_job_schema.py` - PdfConfigSchema
FAQ
Q: Do I need to specify chunk size every time?
A: No. If you don't call `.pdf_split_config()`, the server uses either the `PDF_SPLIT_PAGE_COUNT` env var or the hardcoded default (32 pages).
Q: When does splitting actually occur?
A: Only when `page_count > pages_per_chunk`. Smaller PDFs are processed as single jobs (no overhead).
Q: Will my V1 response parsing code work with V2?
A: Yes! Top-level `data`, `trace`, and `annotations` fields are identical. Additional metadata is added under `metadata` (which V1 parsers ignore).
Q: How do I know if splitting occurred?
A: Check `len(result["metadata"].get("chunks", [])) > 0` or look for server logs: "Splitting PDF ... into ... chunks".
Q: What happens if one chunk fails?
A: Other chunks still return results. Check `metadata.failed_subjobs[]` for details. The job returns `status: "failed"` but includes partial results.
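A minimal check, assuming `result` is a parsed response with the metadata fields documented above:

```python
failed = result.get("metadata", {}).get("failed_subjobs", [])  # split jobs only
if failed:
    print(f"{len(failed)} chunk(s) failed; the response contains partial results")
```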
Q: Does V2 work without splitting?
A: Yes! Just enable V2 without calling `.pdf_split_config()`. PDFs at or below the default chunk size behave identically to V1.