Python API Reference
The NeMo Retriever Library Python API provides a simple and flexible interface for processing and extracting information from various document types, including PDFs.
Note
NVIDIA Ingest (nv-ingest) has been renamed to the NeMo Retriever Library.
Tip
There is a Jupyter notebook available to help you get started with the Python API. For more information, refer to Python Client Quick Start Guide.
Summary of Key Methods
The main class in the NeMo Retriever Library Python API is Ingestor.
The Ingestor class provides an interface for building, managing, and running data ingestion jobs, enabling chainable task additions and job state tracking.
Ingestor Methods
The following table describes methods of the Ingestor class.
| Method | Description |
|---|---|
| all_tasks | Add a default set of tasks (extract, dedup, filter, split, embed, store_embed). |
| buffers | Add in-memory BytesIO buffers for processing (name, BytesIO pairs). |
| caption | Extract captions from images within the document. |
| cancelled_jobs | Return the count of jobs in the CANCELLED state. |
| completed_jobs | Return the count of jobs in the COMPLETED state. |
| dedup | Add a deduplication task (e.g. bbox dedup for structured + image extraction). |
| embed | Generate embeddings from extracted content. |
| extract | Add an extraction task (text, tables, charts, infographics). |
| failed_jobs | Return the count of jobs in the FAILED state. |
| files | Add document paths for processing. |
| filter | Add a filter task (e.g. by size/aspect ratio). |
| get_status | Return a per-document status dict (for use with async ingestion). |
| ingest | Submit jobs and retrieve results synchronously. |
| ingest_async | Submit jobs asynchronously; returns a Future that completes when done. |
| load | Ensure files are locally accessible (downloads if needed). |
| pdf_split_config | Configure V2 PDF splitting (e.g. pages per chunk). Refer to V2 API Guide. |
| remaining_jobs | Return the count of jobs not yet in a terminal state. |
| save_to_disk | Save ingestion results to disk instead of memory. |
| store | Persist extracted images/structured renderings to an fsspec-compatible backend. |
| store_embed | Add a store-embed task. |
| split | Split documents into smaller sections. Refer to Split Documents. |
| udf | Add a user-defined function (UDF) task. |
| vdb_upload | Push extraction results to Milvus vector database. Refer to Data Upload. |
Extract Method Options
The following table describes the extract_method options.
| Value | Status | Description |
|---|---|---|
| audio | Current | Extract information from audio files. |
| nemotron_parse | Current | NVIDIA Nemotron Parse extraction. |
| ocr | Current | Bypasses native text extraction and processes every page using the full OCR pipeline. Use this for fully scanned documents or when native text is corrupt. |
| pdfium | Current | Uses PDFium to extract native text. This is the default. This is the fastest method but does not capture text from scanned images/pages. |
| pdfium_hybrid | Current | A hybrid approach that uses PDFium for pages with native text and automatically switches to OCR for scanned pages. This offers a robust balance of speed and coverage for mixed documents. |
| adobe | Deprecated | Adobe PDF Services API extraction. |
| haystack | Deprecated | Haystack-based extraction. |
| llama_parse | Deprecated | LlamaParse extraction. |
| tika | Deprecated | Apache Tika extraction. |
| unstructured_io | Deprecated | Unstructured.io API extraction. |
| unstructured_local | Deprecated | Local Unstructured extraction. |
Caption images and control reasoning
The caption task can call a vision-language model (VLM) with the following optional controls:
- prompt (string): User prompt for captioning. Defaults to "Caption the content of this image:".
- reasoning (boolean): Enable reasoning mode. True enables reasoning, False disables it. Defaults to None (service default, typically disabled).
- context_text_max_chars (int, optional): Maximum number of characters of page text to include as context for the VLM. Omit or None for service default.
- temperature (float, optional): Sampling temperature for the VLM (e.g. 0.0–1.0). Omit or None for service default.
Note
The reasoning parameter maps to the VLM's system prompt: reasoning=True sets the system prompt to "/think", and reasoning=False sets it to "/no_think", according to the [Nemotron Nano 12B v2 VL model card](https://build.nvidia.com/nvidia/nemotron-nano-12b-v2-vl/modelcard).
Example:
from nemo_retriever.client.interface import Ingestor
ingestor = (
Ingestor()
.files("path/to/doc-with-images.pdf")
.extract(extract_images=True)
.caption(
prompt="Caption the content of this image:",
reasoning=True, # Enable reasoning
)
.ingest()
)
Track Job Progress
For large document batches, you can enable a progress bar by setting show_progress to True, as shown in the code example that follows the parameter table.
ingest() parameters and return values
ingest() supports the following parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| show_progress | bool | False | Show a progress bar. |
| return_failures | bool | False | If True, return a tuple (results, failures). |
| return_traces | bool | False | If True, return trace metrics (timing per stage) with results. |
| save_to_disk | bool | False | If True, save results to disk and return LazyLoadedList proxies (uses save_to_disk() config if already set). |
| timeout | int | 100 | Timeout in seconds for job processing. |
| max_job_retries | int | None | Maximum retries per job. |
| verbose | bool | False | Enable verbose logging. |
| enable_telemetry | bool | None | Enable or disable telemetry collection. |
| show_telemetry | bool | None | Print telemetry summary after ingest (env: NV_INGEST_CLIENT_SHOW_TELEMETRY). |
| include_parent_trace_ids | bool | False | If True, also return parent job trace IDs (V2 API). |
The return value depends on the following flags:
- Default: list of results.
- return_failures=True: (results, failures).
- return_traces=True: (results, traces).
- return_failures=True and return_traces=True: (results, failures, traces).
- With include_parent_trace_ids=True, an additional element (parent trace IDs) is appended to the tuple when applicable.
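Because the return shape varies with these flags, downstream code that unpacks the value can be brittle. The following self-contained sketch shows one way to normalize the documented shapes into a dict; unpack_ingest_result is a hypothetical helper written for illustration, not part of the library.

```python
def unpack_ingest_result(value, return_failures=False, return_traces=False):
    """Normalize the flag-dependent return value of ingest() into a dict.

    Mirrors the documented shapes: results alone, or a tuple of
    results followed by failures and/or traces, in that order.
    """
    if not (return_failures or return_traces):
        return {"results": value, "failures": None, "traces": None}
    parts = list(value)
    out = {"results": parts.pop(0), "failures": None, "traces": None}
    if return_failures:
        out["failures"] = parts.pop(0)
    if return_traces:
        out["traces"] = parts.pop(0)
    return out

# With both flags enabled, ingest() returns a 3-tuple; normalize it.
unpacked = unpack_ingest_result(
    (["doc1"], [], [{"stage": "extract"}]),
    return_failures=True,
    return_traces=True,
)
print(unpacked["results"])
```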
# Return only successes
results = ingestor.ingest(show_progress=True)
print(len(results), "successful documents")
Capture Job Failures
You can capture job failures by setting return_failures to True.
Use the following code.
# Return both successes and failures
results, failures = ingestor.ingest(show_progress=True, return_failures=True)
print(f"{len(results)} successful docs; {len(failures)} failures")
if failures:
print("Failures:", failures[:1])
To also obtain trace metrics (timing per stage), use return_traces=True:
# Return results and traces
results, traces = ingestor.ingest(show_progress=True, return_traces=True)
# Or with failures: results, failures, traces = ingestor.ingest(return_failures=True, return_traces=True)
When you use the vdb_upload method, uploads are performed after ingestion completes.
The behavior of the upload depends on the following values of return_failures:
- False – If any job fails, the ingest method raises a runtime error and does not upload any data (all-or-nothing data upload). This is the default setting.
- True – If any jobs succeed, the results from those jobs are uploaded, and no errors are raised (partial data upload). The ingest method returns a failures object that contains the details for any jobs that failed. You can inspect the failures object and selectively retry or remediate the failed jobs.
The following example uploads data to Milvus and returns any failures.
ingestor = (
Ingestor(client=client)
.files(["/path/doc1.pdf", "/path/doc2.pdf"])
.extract()
.embed()
.vdb_upload(collection_name="my_collection", milvus_uri="milvus.db")
)
# Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
results, failures = ingestor.ingest(return_failures=True)
print(f"Uploaded {len(results)} successful docs; {len(failures)} failures")
if failures:
print("Failures:", failures[:1])
Async ingestion and job status
Use ingest_async() to submit jobs without blocking. It returns a concurrent.futures.Future that completes when all jobs (and any VDB upload) have finished. The future's result has the same shape as ingest() depending on return_failures and return_traces.
future = ingestor.ingest_async(return_failures=True)
results, failures = future.result()
After calling ingest_async(), you can poll progress with:
- get_status() — Returns a dict mapping document identifier to status string: "pending", "submitted", "processing", "completed", "failed", "cancelled", or "unknown".
- completed_jobs() — Count of jobs in the COMPLETED state.
- failed_jobs() — Count of jobs in the FAILED state.
- cancelled_jobs() — Count of jobs in the CANCELLED state.
- remaining_jobs() — Count of jobs not yet in a terminal state.
Example:
future = ingestor.ingest_async()
while not future.done():
status = ingestor.get_status()
print(ingestor.completed_jobs(), "completed,", ingestor.remaining_jobs(), "remaining")
time.sleep(2)
results = future.result()
Quick Start: Extracting PDFs
The following example demonstrates how to initialize Ingestor, load a PDF file, and extract its contents.
The extract method enables different types of data to be extracted.
Extract a Single PDF
Use the following code to extract a single PDF file.
from nemo_retriever.client.interface import Ingestor
# Initialize Ingestor with a local PDF file
ingestor = Ingestor().files("path/to/document.pdf")
# Extract text, tables, and images
result = ingestor.extract().ingest()
print(result)
Extract Multiple PDFs
Use the following code to process multiple PDFs at one time.
ingestor = Ingestor().files(["path/to/doc1.pdf", "path/to/doc2.pdf"])
# Extract content from all PDFs
result = ingestor.extract().ingest()
for doc in result:
print(doc)
Add in-memory buffers
Use buffers() to process in-memory data (e.g. BytesIO objects) instead of file paths. Pass a single (name, BytesIO) tuple or a list of such tuples.
from io import BytesIO
pdf_bytes = BytesIO(open("path/to/doc1.pdf", "rb").read())  # Load the PDF into memory
results = Ingestor().buffers(("doc1.pdf", pdf_bytes)).extract().ingest()
Extract Specific Elements from PDFs
By default, the extract method extracts text, tables, charts, and images. Infographic extraction is not enabled by default (extract_infographics=False); set extract_infographics=True to include it.
You can customize the extraction behavior by using the following code.
ingestor = ingestor.extract(
extract_text=True, # Extract text
text_depth="page",
extract_tables=False, # Skip table extraction
extract_charts=True, # Extract charts
extract_infographics=True, # Extract infographic images
extract_images=False # Skip image extraction
)
Extract Non-standard Document Types
Use the following code to extract text from .md, .sh, and .html files.
ingestor = Ingestor().files(["path/to/doc1.md", "path/to/doc2.html"])
ingestor = ingestor.extract(
extract_text=True, # Only extract text
extract_tables=False,
extract_charts=False,
extract_infographics=False,
extract_images=False
)
result = ingestor.ingest()
Extract with Custom Document Type
Use the following code to specify a custom document type for extraction.
ingestor = ingestor.extract(document_type="pdf")
Extract Office Documents (DOCX and PPTX)
The NeMo Retriever Library offers the following two extraction methods for Microsoft Office documents (.docx and .pptx), to balance performance and layout fidelity:
- Native extraction
- Render as PDF
Native Extraction (Default)
The default methods (python_docx and python_pptx) extract content directly from the file structure.
This is generally faster, but you might lose some visual layout information.
# Uses default native extraction
ingestor = Ingestor().files(["report.docx", "presentation.pptx"]).extract()
Render as PDF
The render_as_pdf method uses LibreOffice to convert the document to a PDF before extraction.
We recommend this approach when preserving the visual layout is critical, or when you need to extract visual elements, such as tables and charts, that are better detected by using computer vision on a rendered page.
ingestor = Ingestor().files(["report.docx", "presentation.pptx"])
ingestor = ingestor.extract(
extract_text=True,
extract_tables=True,
extract_charts=True,
extract_infographics=True,
extract_method="render_as_pdf" # Convert to PDF first for improved visual extraction
)
PDF Extraction Strategies
The NeMo Retriever Library offers specialized strategies for PDF processing to handle various document qualities.
You can select the strategy by using the following extract_method parameter values.
For the full list of extract_method options, refer to Extract Method Options.
- ocr – Bypasses native text extraction and processes every page using the full OCR pipeline. Use this for fully scanned documents or when native text is corrupt.
- pdfium – Uses PDFium to extract native text. This is the default. This is the fastest method but does not capture text from scanned images/pages.
- pdfium_hybrid – A hybrid approach that uses PDFium for pages with native text and automatically switches to OCR for scanned pages. This offers a robust balance of speed and coverage for mixed documents.
ingestor = Ingestor().files("mixed_content.pdf")
# Use hybrid mode for mixed digital/scanned PDFs
ingestor = ingestor.extract(
document_type="pdf",
extract_method="pdfium_hybrid",
)
results = ingestor.ingest()
Work with Large Datasets: Save to Disk
By default, the NeMo Retriever Library stores the results from every document in system memory (RAM).
When you process a very large dataset with thousands of documents, you might encounter an Out-of-Memory (OOM) error.
The save_to_disk method configures the extraction pipeline to write the output for each document to a separate JSONL file on disk.
Basic Usage: Save to a Directory
To save results to disk, chain the save_to_disk method to your ingestion task.
You can pass an optional compression argument (default "gzip") to compress JSONL files; set compression=None to disable compression.
| Parameter | Type | Default | Description |
|---|---|---|---|
| output_directory | str | None | Directory for result files. Defaults to env NV_INGEST_CLIENT_SAVE_TO_DISK_OUTPUT_DIRECTORY or a temporary directory. |
| cleanup | bool | True | If True, remove the output directory when exiting the context manager. |
| compression | str | "gzip" | Compression for JSONL files. Use "gzip" or None to disable. Compressed files use a .gz suffix. |
When save_to_disk is configured, the ingest method returns a list of LazyLoadedList objects,
which are memory-efficient proxies that read from the result files on disk.
In the following example, the results are saved to a directory named my_ingest_results.
You are responsible for managing the created files.
ingestor = Ingestor().files("large_dataset/*.pdf")
# Use save_to_disk to configure the ingestor to save results to a specific directory.
# Set cleanup=False to ensure that the directory is not deleted by any automatic process.
ingestor.save_to_disk(output_directory="./my_ingest_results", cleanup=False) # Offload results to disk to prevent OOM errors
# 'results' is a list of LazyLoadedList objects that point to the new jsonl files.
results = ingestor.extract().ingest()
print("Ingestion results saved in ./my_ingest_results")
# You can now iterate over the results or inspect the files directly.
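If you want to inspect the saved files directly rather than go through the LazyLoadedList proxies, each result file is JSON lines, gzip-compressed by default with a .gz suffix (per the compression parameter above). The following stdlib-only sketch streams records from such a file without loading it all into memory; the file name and record fields here are illustrative, not the library's actual output schema.

```python
import gzip
import json
from pathlib import Path

def read_jsonl_gz(path):
    """Yield one record at a time from a gzip-compressed JSONL file,
    without loading the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Demo: write a tiny sample file in the same format, then stream it back.
sample = Path("sample_result.jsonl.gz")
with gzip.open(sample, "wt", encoding="utf-8") as fh:
    fh.write(json.dumps({"type": "text", "content": "hello"}) + "\n")
    fh.write(json.dumps({"type": "table", "content": "a,b"}) + "\n")

records = list(read_jsonl_gz(sample))
print(len(records), records[0]["type"])  # 2 text
sample.unlink()  # clean up the demo file
```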
Managing Disk Space with Automatic Cleanup
When you use save_to_disk, the NeMo Retriever Library creates intermediate files.
For workflows where these files are temporary, the NeMo Retriever Library provides two automatic cleanup mechanisms.
- Directory Cleanup with Context Manager — While not required for general use, the Ingestor can be used as a context manager (with statement). This enables automatic cleanup of the entire output directory when save_to_disk(cleanup=True) is set (which is the default).
- File Purge After VDB Upload — The vdb_upload method includes a purge_results_after_upload: bool = True parameter (the default). After a successful VDB upload, this feature deletes the individual .jsonl files that were just uploaded.
You can also configure the output directory by using the NV_INGEST_CLIENT_SAVE_TO_DISK_OUTPUT_DIRECTORY environment variable.
Example (Fully Automatic Cleanup)
Fully automatic cleanup is the recommended pattern for ingest-and-upload workflows where the intermediate files are no longer needed. The entire process is temporary, and no files are left on disk. The following example includes automatic file purge.
# After the 'with' block finishes,
# the temporary directory and all its contents are automatically deleted.
with (
Ingestor()
.files("/path/to/large_dataset/*.pdf")
.extract()
.embed()
.save_to_disk() # cleanup=True is the default, enables directory deletion on exit
.vdb_upload() # purge_results_after_upload=True is the default, deletes files after upload
) as ingestor:
results = ingestor.ingest()
Example (Preserve Results on Disk)
In scenarios where you need to inspect or use the intermediate jsonl files, you can disable the cleanup features.
The following example disables automatic file purge.
# After the 'with' block finishes,
# the './permanent_results' directory and all jsonl files are preserved for inspection or other uses.
with (
Ingestor()
.files("/path/to/large_dataset/*.pdf")
.extract()
.embed()
.save_to_disk(output_directory="./permanent_results", cleanup=False) # Specify a directory and disable directory-level cleanup
.vdb_upload(purge_results_after_upload=False) # Disable automatic file purge after the VDB upload
) as ingestor:
results = ingestor.ingest()
Extract Captions from Images
The caption method generates image captions by using a VLM.
You can use this to generate descriptions of unstructured images, infographics, and other visual content extracted from documents.
Note
To use the caption option, enable the vlm profile when you start the NeMo Retriever Library services. The default model used by caption is nvidia/llama-3.1-nemotron-nano-vl-8b-v1. For more information, refer to Profile Information in the Quickstart Guide.
Basic Usage
Tip
You can configure and use other vision language models for image captioning by specifying a different model_name and endpoint_url in the caption method. Choose a VLM that best fits your specific use case requirements.
ingestor = ingestor.caption()
To specify a different API endpoint, pass additional parameters to caption.
ingestor = ingestor.caption(
endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
model_name="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
api_key="nvapi-"
)
Captioning Infographics
Infographics are complex visual elements that combine text, charts, diagrams, and images to convey information. VLMs are particularly effective at generating descriptive captions for infographics because they can understand and summarize the visual content.
The following example extracts and captions infographics from a document:
ingestor = (
Ingestor()
.files("document_with_infographics.pdf")
.extract(
extract_text=True,
extract_tables=True,
extract_charts=True,
extract_infographics=True, # Extract infographics for captioning
extract_images=False,
)
.caption(
prompt="Describe the content and key information in this infographic:",
reasoning=True, # Enable reasoning for more detailed captions
)
)
results = ingestor.ingest()
Tip
For more information about working with infographics and multimodal content, refer to Use Multimodal Embedding.
Caption Images and Control Reasoning
The caption task can call a VLM with optional prompt and system prompt overrides:
- caption_prompt (user prompt): Defaults to "Caption the content of this image:".
- caption_system_prompt (system prompt): Defaults to "/no_think" (reasoning off). Set to "/think" to enable reasoning per the Nemotron Nano 12B v2 VL model card.
- context_text_max_chars (int, optional): Maximum characters of page text to include as context for the VLM.
- temperature (float, optional): Sampling temperature for the VLM.
Example:
from nemo_retriever.client.interface import Ingestor
ingestor = (
Ingestor()
.files("path/to/doc-with-images.pdf")
.extract(extract_images=True)
.caption(
prompt="Caption the content of this image:",
system_prompt="/think", # or "/no_think"
)
.ingest()
)
Extract Embeddings
The embed method in the NeMo Retriever Library generates text embeddings for document content.
ingestor = ingestor.embed()
Note
By default, embed uses the llama-nemotron-embed-1b-v2 model.
To use a different embedding model, such as nv-embedqa-e5-v5, specify a different model_name and endpoint_url.
ingestor = ingestor.embed(
endpoint_url="https://integrate.api.nvidia.com/v1",
model_name="nvidia/nv-embedqa-e5-v5",
api_key="nvapi-"
)
Store Extracted Images
The store method exports decoded images (unstructured images as well as structured renderings such as tables and charts) to any fsspec-compatible URI so you can inspect or serve the generated visuals.
ingestor = ingestor.store(
structured=True, # persist table/chart renderings
images=True, # persist unstructured images
storage_uri="file:///workspace/data/artifacts/store/images", # Supports file://, s3://, etc.
public_base_url="https://assets.example.com/images" # Optional CDN/base URL for download links
)
Store Method Parameters
| Parameter | Type | Description |
|---|---|---|
| structured | bool | Persist table and chart renderings. Default: False |
| images | bool | Persist unstructured images extracted from documents. Default: False |
| storage_uri | str | fsspec-compatible URI (file://, s3://, gs://, etc.). Defaults to server-side IMAGE_STORAGE_URI environment variable. |
| public_base_url | str | Optional HTTP(S) base URL for serving stored images. When set, metadata includes public download links. |
Supported Storage Backends
The store task uses fsspec for storage, supporting multiple backends:
| Backend | URI Format | Example |
|---|---|---|
| Local filesystem | file:// | file:///workspace/data/images |
| Amazon S3 | s3:// | s3://my-bucket/extracted-images |
| Google Cloud Storage | gs:// | gs://my-bucket/images |
| Azure Blob Storage | abfs:// | abfs://container@account.dfs.core.windows.net/images |
| MinIO (S3-compatible) | s3:// | s3://nemo-retriever/artifacts/store/images (default) |
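All of these URIs share the same scheme://bucket/path anatomy that fsspec uses to pick a backend. As a quick illustration of how the pieces break apart (this parsing helper is written for clarity here, not taken from the library), Python's standard urlparse is enough:

```python
from urllib.parse import urlparse

def describe_storage_uri(uri):
    """Split an fsspec-style storage URI into its backend scheme,
    bucket/host, and object path. Illustrative helper, not library code."""
    parsed = urlparse(uri)
    return {
        "scheme": parsed.scheme,   # selects the fsspec backend (s3, gs, file, ...)
        "bucket": parsed.netloc,   # bucket or container name; empty for file://
        "path": parsed.path.lstrip("/"),
    }

print(describe_storage_uri("s3://my-bucket/extracted-images"))
print(describe_storage_uri("file:///workspace/data/images"))
```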
Tip
storage_uri defaults to the server-side IMAGE_STORAGE_URI environment variable (commonly s3://nemo-retriever/...). If you change that variable—for example to a host-mounted file:// path—restart the NeMo Retriever Library runtime so the container picks up the new value.
When public_base_url is provided, the metadata returned from ingest() surfaces that HTTP(S) link while still recording the underlying storage URI. Leave it unset when the storage endpoint itself is already publicly reachable.
Docker Volume Mounts for Local Storage
When running the NeMo Retriever Library via Docker and using file:// storage URIs, the path must be within a mounted volume for files to persist on the host machine.
By default, the docker-compose.yaml mounts a single volume:
volumes:
- ${DATASET_ROOT:-./data}:/workspace/data
This means:
| Container Path | Host Path | Works with file://? |
|---|---|---|
| /workspace/data/... | ${DATASET_ROOT}/... (default: ./data/...) | ✅ Yes |
| /tmp/... | (container only) | ❌ No - files lost on restart |
| /raid/custom/path | (container only) | ❌ No - path not mounted |
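The mapping in the table above is just prefix substitution on the single mounted volume. The following small helper (hypothetical, for illustration only) makes the rule explicit: a container path persists on the host only when it sits under the mount point.

```python
def container_to_host(container_path, dataset_root="./data",
                      mount="/workspace/data"):
    """Translate a container path into the corresponding host path,
    mirroring the single default volume mount.

    Returns None when the path is outside the mounted volume,
    meaning files written there do not persist on the host."""
    if container_path == mount or container_path.startswith(mount + "/"):
        return dataset_root + container_path[len(mount):]
    return None

print(container_to_host("/workspace/data/artifacts/images"))  # ./data/artifacts/images
print(container_to_host("/tmp/results"))                      # None
```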
Example: Save to host filesystem
# Files save to ./data/artifacts/images on the host
ingestor = ingestor.store(
structured=True,
images=True,
storage_uri="file:///workspace/data/artifacts/images"
)
Example: Use a custom host directory
# Set DATASET_ROOT before starting services
export DATASET_ROOT=/raid/my-project/nemo-retriever-data
docker compose up -d
# Now /workspace/data maps to /raid/my-project/nemo-retriever-data
ingestor = ingestor.store(
structured=True,
images=True,
storage_uri="file:///workspace/data/extracted-images"
)
# Files save to /raid/my-project/nemo-retriever-data/extracted-images on host
For more information on environment variables, refer to Environment Variables.
Extract Audio
Use the following code to extract mp3 audio content.
from nemo_retriever.client import Ingestor
ingestor = Ingestor().files("audio_file.mp3")
ingestor = ingestor.extract(
document_type="mp3",
extract_text=True,
extract_tables=False,
extract_charts=False,
extract_images=False,
extract_infographics=False,
).split(
tokenizer="meta-llama/Llama-3.2-1B",
chunk_size=150,
chunk_overlap=0,
params={"split_source_types": ["mp3"], "hf_access_token": "hf_***"}
)
results = ingestor.ingest()