nv_ingest_api.interface package#
Submodules#
nv_ingest_api.interface.extract module#
- nv_ingest_api.interface.extract.extract_chart_data_from_image(
- *,
- df_ledger: DataFrame,
- yolox_endpoints: Tuple[str, str],
- paddle_endpoints: Tuple[str, str],
- yolox_protocol: str = 'grpc',
- paddle_protocol: str = 'grpc',
- auth_token: str = '',
Public interface to extract chart data from a ledger DataFrame.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame containing metadata required for chart extraction.
yolox_endpoints (Tuple[str, str]) – YOLOX inference server endpoints.
paddle_endpoints (Tuple[str, str]) – PaddleOCR inference server endpoints.
yolox_protocol (str, optional) – Protocol for YOLOX inference (default “grpc”).
paddle_protocol (str, optional) – Protocol for PaddleOCR inference (default “grpc”).
auth_token (str, optional) – Authentication token for inference services.
execution_trace_log (list, optional) – Execution trace logs.
- Returns:
Updated DataFrame after chart extraction.
- Return type:
pd.DataFrame
- Raises:
Exception – If an error occurs during extraction.
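A minimal usage sketch. The ledger contents and endpoint addresses below are hypothetical, and the call itself is shown commented out because it requires running YOLOX and PaddleOCR inference services:

```python
import pandas as pd

# Hypothetical one-row ledger; column names follow the schema documented
# elsewhere in this module (base64 payload elided).
df = pd.DataFrame({
    "source_id": ["chart1"],
    "source_name": ["chart1.png"],
    "content": ["<base64-encoded-image>"],
    "document_type": ["png"],
    "metadata": [{"content": "<base64-encoded-image>",
                  "content_metadata": {"type": "structured"}}],
})

# Requires reachable YOLOX and PaddleOCR services (addresses are examples):
# result_df = extract_chart_data_from_image(
#     df_ledger=df,
#     yolox_endpoints=("localhost:8001", "http://localhost:8000/v1/infer"),
#     paddle_endpoints=("localhost:8011", "http://localhost:8010/v1/infer"),
# )
```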
- nv_ingest_api.interface.extract.extract_infographic_data_from_image(
- *,
- df_ledger: DataFrame,
- paddle_endpoints: Tuple[str, str] | None = None,
- paddle_protocol: str | None = None,
- auth_token: str | None = None,
Extract infographic data from a DataFrame using the configured infographic extraction pipeline.
This function creates a task configuration for infographic extraction, builds the extraction configuration from the provided PaddleOCR endpoints, protocol, and authentication token (or uses the default values from InfographicExtractorConfigSchema if None), and then calls the internal extraction function to process the DataFrame. The unified exception handler decorator ensures that any errors are appropriately logged and managed.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame containing the images and associated metadata from which infographic data is to be extracted.
paddle_endpoints (Optional[Tuple[str, str]], default=None) – A tuple of PaddleOCR endpoint addresses (e.g., (gRPC_endpoint, HTTP_endpoint)) used for inference. If None, the default endpoints from InfographicExtractorConfigSchema are used.
paddle_protocol (Optional[str], default=None) – The protocol (e.g., “grpc” or “http”) for PaddleOCR inference. If None, the default protocol from InfographicExtractorConfigSchema is used.
auth_token (Optional[str], default=None) – The authentication token required for secure access to PaddleOCR inference services. If None, the default value from InfographicExtractorConfigSchema is used.
- Returns:
The updated DataFrame after infographic extraction has been performed.
- Return type:
pd.DataFrame
- Raises:
Exception – Propagates any exception raised during the extraction process, after being handled by the unified exception handler.
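A brief sketch of the None-fallback behavior described above. The ledger row is hypothetical, and the call is commented out because it requires a running PaddleOCR service:

```python
import pandas as pd

# Hypothetical ledger with one infographic image (base64 payload elided).
df = pd.DataFrame({
    "source_id": ["info1"],
    "source_name": ["info1.png"],
    "content": ["<base64-encoded-image>"],
    "document_type": ["png"],
    "metadata": [{"content": "<base64-encoded-image>",
                  "content_metadata": {"type": "structured"}}],
})

# Leaving endpoints/protocol/token as None falls back to the
# InfographicExtractorConfigSchema defaults:
# result_df = extract_infographic_data_from_image(df_ledger=df)
```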
- nv_ingest_api.interface.extract.extract_primitives_from_audio(
- *,
- df_ledger: DataFrame,
- audio_endpoints: Tuple[str, str],
- audio_infer_protocol: str = 'grpc',
- auth_token: str = None,
- use_ssl: bool = False,
- ssl_cert: str = None,
Extract audio primitives from a ledger DataFrame using the specified audio configuration.
This function builds an extraction configuration based on the provided audio endpoints, inference protocol, authentication token, and SSL settings. It then delegates the extraction work to the internal function extract_text_from_audio_internal using the constructed configuration and ledger DataFrame.
- Parameters:
df_ledger (pandas.DataFrame) – A DataFrame containing the ledger information required for audio extraction.
audio_endpoints (Tuple[str, str]) – A tuple of two strings representing the gRPC and HTTP endpoints of the audio inference service.
audio_infer_protocol (str, optional) – The protocol to use for audio inference (e.g., “grpc”). Default is “grpc”.
auth_token (str, optional) – Authentication token for the audio inference service. Default is None.
use_ssl (bool, optional) – Flag indicating whether to use SSL for secure connections. Default is False.
ssl_cert (str, optional) – Path to the SSL certificate file to use if use_ssl is True. Default is None.
- Returns:
The result of the audio extraction as returned by extract_text_from_audio_internal. The specific type depends on the internal implementation.
- Return type:
Any
- Raises:
Exception – Any exceptions raised during the extraction process will be handled by the @unified_exception_handler decorator.
Examples
>>> import pandas as pd
>>> # Create a sample DataFrame with ledger data
>>> df = pd.DataFrame({"audio_data": ["file1.wav", "file2.wav"]})
>>> result = extract_primitives_from_audio(
...     df_ledger=df,
...     audio_endpoints=("http://primary.endpoint", "http://secondary.endpoint"),
...     audio_infer_protocol="grpc",
...     auth_token="secret-token",
...     use_ssl=True,
...     ssl_cert="/path/to/cert.pem"
... )
- nv_ingest_api.interface.extract.extract_primitives_from_docx(
- *,
- df_ledger: DataFrame,
- extract_text: bool = True,
- extract_images: bool = True,
- extract_tables: bool = True,
- extract_charts: bool = True,
- extract_infographics: bool = True,
- yolox_endpoints: Tuple[str, str] | None = None,
- yolox_infer_protocol: str = 'grpc',
- auth_token: str = '',
Extract primitives from DOCX documents in a DataFrame.
This function configures and invokes the DOCX extraction process. It builds a task configuration using the provided extraction flags (for text, images, tables, charts, and infographics) and additional settings for YOLOX endpoints, inference protocol, and authentication. It then creates a DOCX extraction configuration (an instance of DocxExtractorSchema) and delegates the extraction to an internal function.
- Parameters:
df_ledger (pd.DataFrame) – The input DataFrame containing DOCX documents in base64 encoding. The DataFrame is expected to include required columns such as “content” (with the base64-encoded DOCX) and optionally “source_id”.
extract_text (bool, optional) – Flag indicating whether to extract text content from the DOCX documents (default is True).
extract_images (bool, optional) – Flag indicating whether to extract images from the DOCX documents (default is True).
extract_tables (bool, optional) – Flag indicating whether to extract tables from the DOCX documents (default is True).
extract_charts (bool, optional) – Flag indicating whether to extract charts from the DOCX documents (default is True).
extract_infographics (bool, optional) – Flag indicating whether to extract infographics from the DOCX documents (default is True).
yolox_endpoints (Optional[Tuple[str, str]], optional) – A tuple containing YOLOX inference endpoints. If None, the default endpoints defined in the DOCX extraction configuration will be used.
yolox_infer_protocol (str, optional) – The inference protocol to use with the YOLOX endpoints (default is “grpc”).
auth_token (str, optional) – The authentication token for accessing the YOLOX inference service (default is an empty string).
- Returns:
A DataFrame containing the extracted DOCX primitives. Typically, the resulting DataFrame contains columns such as “document_type”, “metadata”, and “uuid”.
- Return type:
pd.DataFrame
- Raises:
Exception – If an error occurs during the DOCX extraction process, the exception is logged and re-raised.
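A short sketch of text-only DOCX extraction. The file contents are placeholders; disabling the image-derived flags so that no YOLOX endpoint is needed is an assumption to verify against your deployment, and the call is commented out because it requires the extraction backend:

```python
import base64
import pandas as pd

# Hypothetical DOCX ledger; "content" holds the base64-encoded document.
docx_b64 = base64.b64encode(b"<raw-docx-bytes>").decode("utf-8")
df = pd.DataFrame({
    "source_id": ["doc1"],
    "source_name": ["report.docx"],
    "content": [docx_b64],
    "document_type": ["docx"],
    "metadata": [{"content_metadata": {"type": "document"}}],
})

# Text-only extraction (image/table/chart/infographic flags disabled):
# result_df = extract_primitives_from_docx(
#     df_ledger=df,
#     extract_images=False,
#     extract_tables=False,
#     extract_charts=False,
#     extract_infographics=False,
# )
```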
- nv_ingest_api.interface.extract.extract_primitives_from_image(
- *,
- df_ledger: DataFrame,
- extract_text: bool = True,
- extract_images: bool = True,
- extract_tables: bool = True,
- extract_charts: bool = True,
- extract_infographics: bool = True,
- yolox_endpoints: Tuple[str, str] | None = None,
- yolox_infer_protocol: str = 'grpc',
- auth_token: str = '',
Extract primitives from image documents in a DataFrame. The parameters mirror those of extract_primitives_from_docx.
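A hypothetical usage sketch for extract_primitives_from_image; the ledger row and endpoint addresses are illustrative, and the call is commented out because it requires a running YOLOX service:

```python
import pandas as pd

# Hypothetical image ledger row (base64 payload elided).
df = pd.DataFrame({
    "source_id": ["img1"],
    "source_name": ["scan.png"],
    "content": ["<base64-encoded-image>"],
    "document_type": ["png"],
    "metadata": [{"content_metadata": {"type": "image"}}],
})

# Requires a reachable YOLOX service:
# result_df = extract_primitives_from_image(
#     df_ledger=df,
#     yolox_endpoints=("localhost:8001", "http://localhost:8000/v1/infer"),
# )
```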
- nv_ingest_api.interface.extract.extract_primitives_from_pdf(
- *,
- df_extraction_ledger: DataFrame,
- extract_method: str = 'pdfium',
- extract_text: bool = True,
- extract_images: bool = True,
- extract_infographics: bool = True,
- extract_tables: bool = True,
- extract_charts: bool = True,
- text_depth: str = 'page',
- adobe_client_id: str | None = None,
- adobe_client_secret: str | None = None,
- llama_api_key: str | None = None,
- yolox_auth_token: str | None = None,
- yolox_endpoints: Tuple[str | None, str | None] | None = None,
- yolox_infer_protocol: str = 'http',
- nemoretriever_parse_endpoints: Tuple[str, str] | None = None,
- nemoretriever_parse_protocol: str = 'http',
- nemoretriever_parse_model_name: str | None = None,
- unstructured_io_api_key: str | None = None,
- tika_server_url: str | None = None,
Extract text, images, tables, charts, and infographics from PDF documents.
This function serves as a unified interface for PDF primitive extraction, supporting multiple extraction engines (pdfium, adobe, llama, nemoretriever_parse, unstructured_io, and tika). It processes a DataFrame containing base64-encoded PDF data and returns a new DataFrame with structured information about the extracted elements.
The function uses a decorator pattern to dynamically validate configuration parameters and invoke the appropriate extraction pipeline. This design allows for flexible engine-specific configuration while maintaining a consistent interface.
- Parameters:
df_extraction_ledger (pd.DataFrame) –
DataFrame containing PDF documents to process. Must include the following columns:
- “content” : str
Base64-encoded PDF data
- “source_id” : str
Unique identifier for the document
- “source_name” : str
Name of the document (filename or descriptive name)
- “document_type” : str or enum
Document type identifier (should be “pdf” or related enum value)
- “metadata” : Dict[str, Any]
Dictionary containing additional metadata about the document
extract_method (str, default "pdfium") – The extraction engine to use. Valid options:
- “pdfium” : PDFium-based extraction (default)
- “adobe” : Adobe PDF Services API
- “llama” : LlamaParse extraction
- “nemoretriever_parse” : NVIDIA NemoRetriever Parse
- “unstructured_io” : Unstructured.io extraction
- “tika” : Apache Tika extraction
extract_text (bool, default True) – Whether to extract text content from the PDFs.
extract_images (bool, default True) – Whether to extract embedded images from the PDFs.
extract_infographics (bool, default True) – Whether to extract infographics from the PDFs.
extract_tables (bool, default True) – Whether to extract tables from the PDFs.
extract_charts (bool, default True) – Whether to extract charts and graphs from the PDFs.
text_depth (str, default "page") – Level of text granularity to extract. Options:
- “page” : Text extracted at page level
- “block” : Text extracted at block level
- “paragraph” : Text extracted at paragraph level
- “line” : Text extracted at line level
adobe_client_id (str, optional) – Client ID for Adobe PDF Services API. Required when extract_method=”adobe”.
adobe_client_secret (str, optional) – Client secret for Adobe PDF Services API. Required when extract_method=”adobe”.
llama_api_key (str, optional) – API key for LlamaParse service. Required when extract_method=”llama”.
yolox_auth_token (str, optional) – Authentication token for YOLOX inference services.
yolox_endpoints (tuple of (str, str), optional) – A tuple containing (gRPC endpoint, HTTP endpoint) for YOLOX services. At least one endpoint must be non-empty.
yolox_infer_protocol (str, default "http") – Protocol to use for YOLOX inference. Options: “http” or “grpc”.
nemoretriever_parse_endpoints (tuple of (str, str), optional) – A tuple containing (gRPC endpoint, HTTP endpoint) for NemoRetriever Parse. Required when extract_method=”nemoretriever_parse”.
nemoretriever_parse_protocol (str, default "http") – Protocol to use for NemoRetriever Parse. Options: “http” or “grpc”.
nemoretriever_parse_model_name (str, optional) – Model name for NemoRetriever Parse. Default is “nvidia/nemoretriever-parse”.
unstructured_io_api_key (str, optional) – API key for Unstructured.io services. Required when extract_method=”unstructured_io”.
tika_server_url (str, optional) – URL for Apache Tika server. Required when extract_method=”tika”.
- Returns:
A DataFrame containing the extracted primitives with the following columns:
- “document_type” : Type of the extracted element (e.g., “text”, “image”, “table”)
- “metadata” : Dictionary containing detailed information about the extracted element
- “uuid” : Unique identifier for the extracted element
- Return type:
pandas.DataFrame
- Raises:
ValueError – If an unsupported extraction method is specified. If required parameters for the specified extraction method are missing. If the input DataFrame does not have the required structure.
KeyError – If required columns are missing from the input DataFrame.
RuntimeError – If extraction fails due to processing errors.
Notes
The function uses a decorator pattern through extraction_interface_relay_constructor which dynamically processes the parameters and validates them against the appropriate configuration schema. The actual extraction work is delegated to the extract_primitives_from_pdf_internal function.
For each extraction method, specific parameters are required:
- pdfium: yolox_endpoints
- adobe: adobe_client_id, adobe_client_secret
- llama: llama_api_key
- nemoretriever_parse: nemoretriever_parse_endpoints
- unstructured_io: unstructured_io_api_key
- tika: tika_server_url
Examples
>>> import pandas as pd
>>> import base64
>>>
>>> # Read a PDF file and encode it as base64
>>> with open("document.pdf", "rb") as f:
...     pdf_content = base64.b64encode(f.read()).decode("utf-8")
>>>
>>> # Create a DataFrame with the PDF content
>>> df = pd.DataFrame({
...     "source_id": ["doc1"],
...     "source_name": ["document.pdf"],
...     "content": [pdf_content],
...     "document_type": ["pdf"],
...     "metadata": [{"content_metadata": {"type": "document"}}]
... })
>>>
>>> # Extract primitives using PDFium
>>> result_df = extract_primitives_from_pdf(
...     df_extraction_ledger=df,
...     extract_method="pdfium",
...     yolox_endpoints=(None, "http://localhost:8000/v1/infer")
... )
>>>
>>> # Display the types of extracted elements
>>> print(result_df["document_type"].value_counts())
- nv_ingest_api.interface.extract.extract_primitives_from_pdf_nemoretriever_parse(
- df_extraction_ledger: DataFrame,
- *,
- extract_text: bool = True,
- extract_images: bool = True,
- extract_tables: bool = True,
- extract_charts: bool = True,
- extract_infographics: bool = True,
- text_depth: str = 'page',
- yolox_auth_token: str | None = None,
- yolox_endpoints: Tuple[str | None, str | None] | None = None,
- yolox_infer_protocol: str = 'http',
- nemoretriever_parse_endpoints: Tuple[str, str] | None = None,
- nemoretriever_parse_protocol: str = 'http',
- nemoretriever_parse_model_name: str | None = None,
Extract primitives from PDF documents using the NemoRetriever Parse extraction method.
This function serves as a specialized wrapper around the general extract_primitives_from_pdf function, pre-configured to use NemoRetriever Parse as the extraction engine. It processes PDF documents to extract various content types including text, images, tables, charts, and infographics, returning the results in a structured DataFrame.
- Parameters:
df_extraction_ledger (pd.DataFrame) –
DataFrame containing PDF documents to process. Must include the following columns:
- “content” : str
Base64-encoded PDF data
- “source_id” : str
Unique identifier for the document
- “source_name” : str
Name of the document (filename or descriptive name)
- “document_type” : str or enum
Document type identifier (should be “pdf” or related enum value)
- “metadata” : Dict[str, Any]
Dictionary containing additional metadata about the document
extract_text (bool, default True) – Whether to extract text content from the PDFs. When True, the function will attempt to extract and structure all textual content according to the granularity specified by text_depth.
extract_images (bool, default True) – Whether to extract embedded images from the PDFs. When True, the function will identify, extract, and process images embedded within the document.
extract_tables (bool, default True) – Whether to extract tables from the PDFs. When True, the function will detect tabular structures and convert them into structured data.
extract_charts (bool, default True) – Whether to extract charts and graphs from the PDFs. When True, the function will detect and extract visual data representations.
extract_infographics (bool, default True) – Whether to extract infographics from the PDFs. When True, the function will identify and extract complex visual information displays.
text_depth (str, default "page") – Level of text granularity to extract. Options:
- “page” : Text extracted at page level (coarsest granularity)
- “block” : Text extracted at block level (groups of paragraphs)
- “paragraph” : Text extracted at paragraph level (semantic units)
- “line” : Text extracted at line level (finest granularity)
yolox_auth_token (Optional[str], default None) – Authentication token for YOLOX inference services used for image processing. Required if the YOLOX services need authentication.
yolox_endpoints (Optional[Tuple[Optional[str], Optional[str]]], default None) – A tuple containing (gRPC endpoint, HTTP endpoint) for YOLOX services. Used for image processing capabilities within the extraction pipeline. Format: (grpc_endpoint, http_endpoint) Example: (None, “http://localhost:8000/v1/infer”)
yolox_infer_protocol (str, default "http") – Protocol to use for YOLOX inference. Options: - “http” : Use HTTP protocol for YOLOX inference services - “grpc” : Use gRPC protocol for YOLOX inference services
nemoretriever_parse_endpoints (Optional[Tuple[str, str]], default None) – A tuple containing (gRPC endpoint, HTTP endpoint) for NemoRetriever Parse. Format: (grpc_endpoint, http_endpoint) Example: (None, “http://localhost:8015/v1/chat/completions”) Required for this extraction method.
nemoretriever_parse_protocol (str, default "http") – Protocol to use for NemoRetriever Parse. Options: - “http” : Use HTTP protocol for NemoRetriever Parse services - “grpc” : Use gRPC protocol for NemoRetriever Parse services
nemoretriever_parse_model_name (Optional[str], default None) – Model name for NemoRetriever Parse. Default is typically “nvidia/nemoretriever-parse” if None is provided.
- Returns:
A DataFrame containing the extracted primitives with the following columns:
- “document_type” : str
Type of the extracted element (e.g., “text”, “image”, “structured”)
- “metadata” : Dict[str, Any]
Dictionary containing detailed information about the extracted element including position, content, confidence scores, etc.
- “uuid” : str
Unique identifier for the extracted element
- Return type:
pd.DataFrame
- Raises:
ValueError – If nemoretriever_parse_endpoints is None or empty, or if the input DataFrame does not have the required structure.
KeyError – If required columns are missing from the input DataFrame
RuntimeError – If extraction fails due to service unavailability or processing errors
Examples
>>> import pandas as pd
>>> import base64
>>>
>>> # Read a PDF file and encode it as base64
>>> with open("document.pdf", "rb") as f:
...     pdf_content = base64.b64encode(f.read()).decode("utf-8")
>>>
>>> # Create a DataFrame with the PDF content
>>> df = pd.DataFrame({
...     "source_id": ["doc1"],
...     "source_name": ["document.pdf"],
...     "content": [pdf_content],
...     "document_type": ["pdf"],
...     "metadata": [{"content_metadata": {"type": "document"}}]
... })
>>>
>>> # Extract primitives using NemoRetriever Parse
>>> result_df = extract_primitives_from_pdf_nemoretriever_parse(
...     df_extraction_ledger=df,
...     nemoretriever_parse_endpoints=(None, "http://localhost:8015/v1/chat/completions")
... )
>>>
>>> # Display the types of extracted elements
>>> print(result_df["document_type"].value_counts())
Notes
NemoRetriever Parse excels at extracting structured data like tables from PDFs
For optimal results, ensure both NemoRetriever Parse and YOLOX services are properly configured and accessible
The extraction quality may vary depending on the complexity and quality of the input PDF
This function wraps the more general extract_primitives_from_pdf function with pre-configured parameters for NemoRetriever Parse extraction
- nv_ingest_api.interface.extract.extract_primitives_from_pdf_pdfium(
- df_extraction_ledger: DataFrame,
- *,
- extract_text: bool = True,
- extract_images: bool = True,
- extract_tables: bool = True,
- extract_charts: bool = True,
- extract_infographics: bool = True,
- text_depth: str = 'page',
- yolox_auth_token: str | None = None,
- yolox_endpoints: Tuple[str | None, str | None] | None = None,
- yolox_infer_protocol: str = 'http',
Extract primitives from PDF documents using the PDFium extraction method.
A simplified wrapper around the general extract_primitives_from_pdf function that defaults to using the PDFium extraction engine.
- Parameters:
df_extraction_ledger (pd.DataFrame) –
DataFrame containing PDF documents to process. Must include the following columns:
- “content” : str
Base64-encoded PDF data
- “source_id” : str
Unique identifier for the document
- “source_name” : str
Name of the document (filename or descriptive name)
- “document_type” : str or enum
Document type identifier (should be “pdf” or related enum value)
- “metadata” : Dict[str, Any]
Dictionary containing additional metadata about the document
extract_text (bool, default True) – Whether to extract text content
extract_images (bool, default True) – Whether to extract embedded images
extract_tables (bool, default True) – Whether to extract tables
extract_charts (bool, default True) – Whether to extract charts
extract_infographics (bool, default True) – Whether to extract infographics
text_depth (str, default "page") – Level of text granularity (page, block, paragraph, line)
yolox_auth_token (str, optional) – Authentication token for YOLOX inference services
yolox_endpoints (tuple of (str, str), optional) – Tuple containing (gRPC endpoint, HTTP endpoint) for YOLOX services
yolox_infer_protocol (str, default "http") – Protocol to use for YOLOX inference (“http” or “grpc”)
- Returns:
DataFrame containing the extracted primitives
- Return type:
pd.DataFrame
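The ledger rows this wrapper expects can be built with a small helper. The helper below is illustrative only (not part of nv_ingest_api), and the commented call requires a running YOLOX service:

```python
import base64
import pandas as pd

def make_pdf_ledger(pdf_bytes: bytes, source_id: str, source_name: str) -> pd.DataFrame:
    # Illustrative helper: builds a one-row extraction ledger with the
    # documented columns from raw PDF bytes.
    return pd.DataFrame({
        "source_id": [source_id],
        "source_name": [source_name],
        "content": [base64.b64encode(pdf_bytes).decode("utf-8")],
        "document_type": ["pdf"],
        "metadata": [{"content_metadata": {"type": "document"}}],
    })

# with open("document.pdf", "rb") as f:
#     df = make_pdf_ledger(f.read(), "doc1", "document.pdf")
# result_df = extract_primitives_from_pdf_pdfium(
#     df,
#     yolox_endpoints=(None, "http://localhost:8000/v1/infer"),
# )
```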
- nv_ingest_api.interface.extract.extract_primitives_from_pptx(
- *,
- df_ledger: DataFrame,
- extract_text: bool = True,
- extract_images: bool = True,
- extract_tables: bool = True,
- extract_charts: bool = True,
- extract_infographics: bool = True,
- yolox_endpoints: Tuple[str, str] | None = None,
- yolox_infer_protocol: str = 'grpc',
- auth_token: str = '',
Extract primitives from PPTX files provided in a DataFrame.
This function configures the PPTX extraction task by assembling a task configuration dictionary using the provided parameters. It then creates an extraction configuration object (e.g., an instance of PPTXExtractorSchema) and delegates the actual extraction process to the internal function extract_primitives_from_pptx_internal.
- Parameters:
df_ledger (pd.DataFrame) – A DataFrame containing base64-encoded PPTX files. The DataFrame is expected to include columns such as “content” (with the base64-encoded PPTX) and “source_id”.
extract_text (bool, default=True) – Flag indicating whether text should be extracted from the PPTX files.
extract_images (bool, default=True) – Flag indicating whether images should be extracted.
extract_tables (bool, default=True) – Flag indicating whether tables should be extracted.
extract_charts (bool, default=True) – Flag indicating whether charts should be extracted.
extract_infographics (bool, default=True) – Flag indicating whether infographics should be extracted.
yolox_endpoints (Optional[Tuple[str, str]], default=None) – Optional tuple containing endpoints for YOLOX inference, if needed for image analysis.
yolox_infer_protocol (str, default="grpc") – The protocol to use for YOLOX inference.
auth_token (str, default="") – Authentication token to be used with the PPTX extraction configuration.
- Returns:
A DataFrame containing the extracted primitives from the PPTX files. Expected columns include “document_type”, “metadata”, and “uuid”.
- Return type:
pd.DataFrame
Notes
This function is decorated with @unified_exception_handler to handle exceptions uniformly. The task configuration is assembled with two main keys:
“params”: Contains boolean flags for controlling which primitives to extract.
“pptx_extraction_config”: Contains additional settings for PPTX extraction (e.g., YOLOX endpoints, inference protocol, and auth token).
It then calls extract_primitives_from_pptx_internal with the DataFrame, the task configuration, and the extraction configuration.
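A minimal usage sketch; the ledger contents and endpoint addresses are hypothetical, and the call is commented out because it requires a reachable YOLOX service when image-derived primitives are enabled:

```python
import pandas as pd

# Hypothetical PPTX ledger row (base64 payload elided).
df = pd.DataFrame({
    "source_id": ["deck1"],
    "source_name": ["slides.pptx"],
    "content": ["<base64-encoded-pptx>"],
    "document_type": ["pptx"],
    "metadata": [{"content_metadata": {"type": "document"}}],
})

# result_df = extract_primitives_from_pptx(
#     df_ledger=df,
#     yolox_endpoints=("localhost:8001", "http://localhost:8000/v1/infer"),
#     yolox_infer_protocol="grpc",
# )
```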
- nv_ingest_api.interface.extract.extract_table_data_from_image(
- *,
- df_ledger: DataFrame,
- yolox_endpoints: Tuple[str, str] | None = None,
- paddle_endpoints: Tuple[str, str] | None = None,
- yolox_protocol: str | None = None,
- paddle_protocol: str | None = None,
- auth_token: str | None = None,
Public interface to extract table data from a ledger DataFrame.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame containing metadata required for table extraction.
yolox_endpoints (Optional[Tuple[str, str]], default=None) – YOLOX inference server endpoints. If None, the default defined in ChartExtractorConfigSchema is used.
paddle_endpoints (Optional[Tuple[str, str]], default=None) – PaddleOCR inference server endpoints. If None, the default defined in ChartExtractorConfigSchema is used.
yolox_protocol (Optional[str], default=None) – Protocol for YOLOX inference. If None, the default defined in ChartExtractorConfigSchema is used.
paddle_protocol (Optional[str], default=None) – Protocol for PaddleOCR inference. If None, the default defined in ChartExtractorConfigSchema is used.
auth_token (Optional[str], default=None) – Authentication token for inference services. If None, the default defined in ChartExtractorConfigSchema is used.
- Returns:
The updated DataFrame after table extraction.
- Return type:
pd.DataFrame
- Raises:
Exception – If an error occurs during extraction.
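A minimal usage sketch; the ledger row is hypothetical, and the call is commented out because it requires running YOLOX and PaddleOCR services (passing None for all optional parameters falls back to the schema defaults):

```python
import pandas as pd

# Hypothetical one-row ledger with a table image (base64 payload elided).
df = pd.DataFrame({
    "source_id": ["tbl1"],
    "source_name": ["tbl1.png"],
    "content": ["<base64-encoded-image>"],
    "document_type": ["png"],
    "metadata": [{"content": "<base64-encoded-image>",
                  "content_metadata": {"type": "structured"}}],
})

# All optional parameters left as None -> schema defaults:
# result_df = extract_table_data_from_image(df_ledger=df)
```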
nv_ingest_api.interface.mutate module#
- nv_ingest_api.interface.mutate.deduplicate_images(
- *,
- df_ledger: DataFrame,
- hash_algorithm: str = 'md5',
Deduplicate images in the DataFrame based on content hashes.
This function constructs a task configuration using the specified hashing algorithm and delegates the deduplication process to the internal function deduplicate_images_internal. The deduplication is performed by computing content hashes for each image in the DataFrame and then removing duplicate images.
- Parameters:
df_ledger (pd.DataFrame) –
A pandas DataFrame containing image metadata. The DataFrame must include at least the columns:
- “document_type” : A string representing the document type (e.g., “png”).
- “metadata” : A dictionary that contains image-related metadata. For example, it should include keys such as content (base64-encoded image data), source_metadata, and content_metadata.
hash_algorithm (str, optional) – The hashing algorithm to use for deduplication. Valid algorithms are those supported by Python’s hashlib.new() function (e.g., “md5”, “sha1”, “sha256”). Default is “md5”.
- Returns:
A deduplicated DataFrame in which duplicate images have been removed. The structure of the returned DataFrame is the same as the input, with duplicate rows eliminated.
- Return type:
pd.DataFrame
- Raises:
Exception – Propagates any exceptions encountered during the deduplication process.
Examples
>>> import pandas as pd
>>> # Example DataFrame with image metadata.
>>> df = pd.DataFrame({
...     "source_name": ["image1.png", "image2.png"],
...     "source_id": ["image1.png", "image2.png"],
...     "content": ["<base64-encoded-image-1>", "<base64-encoded-image-2>"],
...     "document_type": ["png", "png"],
...     "metadata": [{
...         "content": "<base64-encoded-image-1>",
...         "source_metadata": {"source_id": "image1.png", "source_name": "image1.png", "source_type": "png"},
...         "content_metadata": {"type": "image"},
...         "audio_metadata": None,
...         "text_metadata": None,
...         "image_metadata": {},
...         "raise_on_failure": False,
...     },
...     {
...         "content": "<base64-encoded-image-2>",
...         "source_metadata": {"source_id": "image2.png", "source_name": "image2.png", "source_type": "png"},
...         "content_metadata": {"type": "image"},
...         "audio_metadata": None,
...         "text_metadata": None,
...         "image_metadata": {},
...         "raise_on_failure": False,
...     }]
... })
>>> dedup_df = deduplicate_images(df_ledger=df, hash_algorithm="md5")
>>> dedup_df
- nv_ingest_api.interface.mutate.filter_images(
- *,
- df_ledger: DataFrame,
- min_size: int = 128,
- max_aspect_ratio: float | int = 5.0,
- min_aspect_ratio: float | int = 2.0,
Apply an image filter to the ledger DataFrame based on size and aspect ratio criteria.
This function builds a set of task parameters and then delegates the filtering work to filter_images_internal. If an exception occurs during filtering, the error is logged and re-raised with additional context.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame containing image metadata. It must include the columns ‘document_type’ and ‘metadata’.
min_size (int, optional) – Minimum average image size threshold. Images with an average size less than or equal to this value are considered for filtering. Default is 128.
max_aspect_ratio (float or int, optional) – Maximum allowed image aspect ratio. Images with an aspect ratio greater than or equal to this value are considered for filtering. Default is 5.0.
min_aspect_ratio (float or int, optional) – Minimum allowed image aspect ratio. Images with an aspect ratio less than or equal to this value are considered for filtering. Default is 2.0.
execution_trace_log (Optional[List[Any]], optional) – Execution trace logs.
- Returns:
The DataFrame after applying the image filter.
- Return type:
pd.DataFrame
- Raises:
Exception – If an error occurs during the filtering process.
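The three thresholds combine as follows. This predicate is an illustrative restatement of the documented criteria, not the library's internal implementation:

```python
def flagged_for_filtering(avg_size: float, aspect_ratio: float,
                          min_size: int = 128,
                          max_aspect_ratio: float = 5.0,
                          min_aspect_ratio: float = 2.0) -> bool:
    # An image is considered for filtering when its average size is at or
    # below min_size, or its aspect ratio is at or beyond either bound.
    return (
        avg_size <= min_size
        or aspect_ratio >= max_aspect_ratio
        or aspect_ratio <= min_aspect_ratio
    )
```

So with the defaults, only images larger than 128 on average whose aspect ratio lies strictly between 2.0 and 5.0 survive the filter.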
nv_ingest_api.interface.store module#
- nv_ingest_api.interface.store.store_embeddings(
- *,
- df_ledger: DataFrame,
- milvus_address: str | None = None,
- milvus_uri: str | None = None,
- milvus_host: str | None = None,
- milvus_port: int | None = None,
- milvus_collection_name: str | None = None,
- minio_access_key: str | None = None,
- minio_secret_key: str | None = None,
- minio_session_token: str | None = None,
- minio_endpoint: str | None = None,
- minio_bucket_name: str | None = None,
- minio_bucket_path: str | None = None,
- minio_secure: bool | None = None,
- minio_region: str | None = None,
Stores embeddings by configuring task parameters and invoking the internal storage routine.
If any of the connection or configuration parameters are None, they will be omitted from the task configuration, allowing default values defined in the storage schema to be used.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame containing the data whose embeddings need to be stored.
milvus_address (Optional[str], default=None) – The address of the Milvus service.
milvus_uri (Optional[str], default=None) – The URI for the Milvus service.
milvus_host (Optional[str], default=None) – The host for the Milvus service.
milvus_port (Optional[int], default=None) – The port for the Milvus service.
milvus_collection_name (Optional[str], default=None) – The name of the Milvus collection.
minio_access_key (Optional[str], default=None) – The access key for MinIO.
minio_secret_key (Optional[str], default=None) – The secret key for MinIO.
minio_session_token (Optional[str], default=None) – The session token for MinIO.
minio_endpoint (Optional[str], default=None) – The endpoint URL for MinIO.
minio_bucket_name (Optional[str], default=None) – The name of the MinIO bucket.
minio_bucket_path (Optional[str], default=None) – The bucket path where embeddings will be stored.
minio_secure (Optional[bool], default=None) – Whether to use a secure connection to MinIO.
minio_region (Optional[str], default=None) – The region of the MinIO service.
- Returns:
The updated DataFrame after embeddings have been stored.
- Return type:
pd.DataFrame
- Raises:
Exception – Propagates any exception raised during the storage process, wrapped with additional context.
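A minimal sketch of the None-fallback behavior. The metadata layout (an "embedding" key) and the service addresses are assumptions, and the call is commented out because it requires reachable Milvus and MinIO services:

```python
import pandas as pd

# Hypothetical ledger of rows carrying embeddings in their metadata.
df = pd.DataFrame({
    "document_type": ["text"],
    "metadata": [{"embedding": [0.1, 0.2, 0.3], "content": "example"}],
})

# Omitted (None) parameters fall back to the storage schema defaults:
# stored_df = store_embeddings(
#     df_ledger=df,
#     milvus_uri="http://localhost:19530",
#     milvus_collection_name="nv_ingest_collection",
#     minio_endpoint="localhost:9000",
#     minio_bucket_name="nv-ingest",
# )
```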
- nv_ingest_api.interface.store.store_images_to_minio(
- *,
- df_ledger: DataFrame,
- store_structured: bool = True,
- store_unstructured: bool = False,
- minio_access_key: str | None = None,
- minio_bucket_name: str | None = None,
- minio_endpoint: str | None = None,
- minio_region: str | None = None,
- minio_secret_key: str | None = None,
- minio_secure: bool = False,
- minio_session_token: str | None = None,
Store images to a MinIO storage backend.
This function prepares a flat configuration dictionary for storing images and structured data to a MinIO storage system. It determines which content types to store based on the provided flags and delegates the storage operation to the internal function store_images_to_minio_internal.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame containing ledger information with document metadata.
store_structured (bool, optional) – Flag indicating whether to store structured content. Defaults to True.
store_unstructured (bool, optional) – Flag indicating whether to store unstructured image content. Defaults to False.
minio_access_key (Optional[str], optional) – Access key for authenticating with MinIO. Defaults to None.
minio_bucket_name (Optional[str], optional) – Name of the MinIO bucket where images will be stored. Defaults to None.
minio_endpoint (Optional[str], optional) – Endpoint URL for the MinIO service. Defaults to None.
minio_region (Optional[str], optional) – Region identifier for the MinIO service. Defaults to None.
minio_secret_key (Optional[str], optional) – Secret key for authenticating with MinIO. Defaults to None.
minio_secure (bool, optional) – Whether to use a secure connection (HTTPS) with MinIO. Defaults to False.
minio_session_token (Optional[str], optional) – Session token for temporary credentials with MinIO. Defaults to None.
- Returns:
The updated DataFrame after uploading images if matching objects were found; otherwise, the original DataFrame is returned.
- Return type:
pd.DataFrame
- Raises:
Exception – Any exceptions raised during the image storage process will be handled by the unified_exception_handler decorator.
See also
store_images_to_minio_internal
Internal function that performs the actual image storage.
_upload_images_to_minio
Function that uploads images to MinIO and updates the ledger metadata.
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'document_type': ['IMAGE'],
...     'metadata': [{
...         'source_metadata': {'source_id': '123'},
...         'image_metadata': {'image_type': 'png'},
...         'content': 'base64_encoded_content'
...     }]
... })
>>> result = store_images_to_minio(
...     df_ledger=df,
...     minio_access_key='ACCESS_KEY',
...     minio_secret_key='SECRET_KEY',
...     minio_bucket_name='mybucket'
... )
nv_ingest_api.interface.transform module#
- nv_ingest_api.interface.transform.transform_image_create_vlm_caption(
- *,
- inputs: DataFrame | tuple | List[tuple],
- api_key: str | None = None,
- prompt: str | None = None,
- endpoint_url: str | None = None,
- model_name: str | None = None,
Extract captions for image content using the VLM model API.
This function processes image content for caption generation. It accepts input in one of three forms:
A pandas DataFrame with the following required structure:
- Columns:
source_name (str): Identifier for the source file.
source_id (str): Unique identifier for the file.
content (str): Base64-encoded string representing the file content.
document_type (str): A string representing the document type (e.g., DocumentTypeEnum.PNG).
metadata (dict): A dictionary containing at least:
content: Same as the base64-encoded file content.
source_metadata: Dictionary created via create_source_metadata().
content_metadata: Dictionary created via create_content_metadata().
image_metadata: For image files, initialized as an empty dict ({}); other metadata fields (audio_metadata, text_metadata, etc.) are typically None or empty.
raise_on_failure: Boolean flag (typically False).
A single tuple of the form (file_source, document_type):
- file_source: Either a file path (str) or a file-like object (e.g., BytesIO).
- document_type: A string representing the document type (e.g., DocumentTypeEnum.PNG).
A list of such tuples.
For non-DataFrame inputs, a DataFrame is constructed using the helper function build_dataframe_from_files(). When file_source is a file-like object, its content is converted to a base64-encoded string using read_bytesio_as_base64(); if it is a file path (str), read_file_as_base64() is used.
- Parameters:
inputs (Union[pd.DataFrame, tuple, List[tuple]]) – Input data representing image content. Accepted formats: a pandas DataFrame with the required structure described above, a single tuple (file_source, document_type), or a list of such tuples. In the tuples, file_source is either a file path (str) or a file-like object (e.g., BytesIO), and document_type is a string (typically one of the DocumentTypeEnum values).
api_key (Optional[str], default=None) – API key for authentication with the VLM endpoint. If not provided, defaults are used.
prompt (Optional[str], default=None) – Text prompt to guide caption generation.
endpoint_url (Optional[str], default=None) – URL of the VLM model HTTP endpoint.
model_name (Optional[str], default=None) – Name of the model to be used for caption generation.
- Returns:
A pandas DataFrame with generated captions inserted into the metadata.image_metadata.caption field for each image row.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the input is not a DataFrame, tuple, or list of tuples, or if any tuple is not of length 2.
Exception – Propagates any exception encountered during processing or caption extraction.
Examples
>>> # Example using a DataFrame:
>>> df = pd.DataFrame({
...     "source_name": ["image.png"],
...     "source_id": ["image.png"],
...     "content": ["<base64-string>"],
...     "document_type": ["png"],
...     "metadata": [{
...         "content": "<base64-string>",
...         "source_metadata": {...},
...         "content_metadata": {...},
...         "image_metadata": {},
...         "raise_on_failure": False,
...     }],
... })
>>> transform_image_create_vlm_caption(inputs=df, api_key="key", prompt="Caption the image:")
>>> # Example using a tuple:
>>> transform_image_create_vlm_caption(inputs=("image.png", DocumentTypeEnum.PNG), api_key="key", prompt="Caption the image:")
>>> # Example using a list of tuples with file paths:
>>> transform_image_create_vlm_caption(inputs=[("image.png", DocumentTypeEnum.PNG), ("image2.png", DocumentTypeEnum.PNG)], api_key="key", prompt="Caption the image:")
>>> # Example using a list of tuples with BytesIO objects:
>>> from io import BytesIO
>>> with open("image.png", "rb") as f:
...     bytes_io = BytesIO(f.read())
>>> transform_image_create_vlm_caption(inputs=[(bytes_io, DocumentTypeEnum.PNG)], api_key="key", prompt="Caption the image:")
- nv_ingest_api.interface.transform.transform_text_create_embeddings(
- *,
- inputs: DataFrame,
- api_key: str,
- batch_size: int | None = 8192,
- embedding_model: str | None = None,
- embedding_nim_endpoint: str | None = None,
- encoding_format: str | None = None,
- input_type: str | None = None,
- truncate: str | None = None,
Creates text embeddings using the provided configuration. Parameters provided as None will use the default values from EmbedExtractionsSchema.
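This entry has no example, so here is a sketch of a minimal input ledger for embedding creation. The metadata layout mirrors the structure used elsewhere in this interface; the endpoint URL and model name in the commented-out call are assumptions, and a reachable embedding NIM service would be required to actually run it:

```python
import pandas as pd

# Minimal ledger sketch: one text document with the metadata fields
# the pipeline expects (content plus content_metadata/source_metadata).
df = pd.DataFrame({
    "source_name": ["doc1.txt"],
    "source_id": ["doc1.txt"],
    "content": ["<base64-encoded text>"],
    "document_type": ["text"],
    "metadata": [{
        "content": "This is a document.",
        "content_metadata": {"type": "text"},
        "source_metadata": {"source_id": "doc1.txt"},
    }],
})

# Requires a live endpoint; the URL and model name below are placeholders.
# result = transform_text_create_embeddings(
#     inputs=df,
#     api_key="YOUR_API_KEY",
#     embedding_nim_endpoint="http://localhost:8000/v1",
# )
```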
- nv_ingest_api.interface.transform.transform_text_split_and_tokenize(
- *,
- inputs: DataFrame | str | List[str],
- tokenizer: str,
- chunk_size: int,
- chunk_overlap: int,
- split_source_types: List[str] | None = None,
- hugging_face_access_token: str | None = None,
Transform and tokenize text documents by splitting them into smaller chunks.
This function prepares the configuration parameters for text splitting and tokenization, and then delegates the splitting and asynchronous tokenization to an internal function.
The function accepts input in one of two forms:
A pandas DataFrame that already follows the required structure:
- Required DataFrame Structure:
source_name (str): Identifier for the source document.
source_id (str): Unique identifier for the document.
content (str): The document content (typically as a base64-encoded string).
document_type (str): For plain text, set to DocumentTypeEnum.TXT.
- metadata (dict): Must contain:
content: The original text content.
content_metadata: A dictionary with a key “type” (e.g., “text”).
source_metadata: A dictionary with source-specific metadata (e.g., file path, timestamps).
Other keys (audio_metadata, image_metadata, etc.) set to None or empty as appropriate.
raise_on_failure: Boolean (typically False).
A plain text string or a list of plain text strings. In this case, the function converts each text into a BytesIO object (encoding it as UTF-8) and then uses the helper function build_dataframe_from_files to construct a DataFrame where:
source_name and source_id are generated as “text_0”, “text_1”, etc.
content is the base64-encoded representation of the UTF-8 encoded text.
document_type is set to DocumentTypeEnum.TXT.
metadata is constructed using helper functions (for source and content metadata), with content_metadata’s “type” set to “text”.
- Parameters:
inputs (Union[pd.DataFrame, str, List[str]]) – Either a DataFrame following the required structure, a single plain text string, or a list of plain text strings.
tokenizer (str) – Identifier or path of the tokenizer to be used (e.g., “bert-base-uncased”).
chunk_size (int) – Maximum number of tokens per chunk.
chunk_overlap (int) – Number of tokens to overlap between consecutive chunks.
split_source_types (Optional[List[str]], default=["text"]) – List of source types to filter for text splitting. If None or empty, defaults to [“text”].
hugging_face_access_token (Optional[str], default=None) – Access token for Hugging Face authentication, if required.
- Returns:
A DataFrame with the processed documents, where text content has been split into smaller chunks. The returned DataFrame retains the original columns and updates the “metadata” field with generated tokenized segments and embedding information.
- Return type:
pd.DataFrame
- Raises:
Exception – Propagates any exceptions encountered during text splitting and tokenization, with additional context provided by the unified exception handler.
Examples
>>> # Using a DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "source_name": ["doc1.txt"],
...     "source_id": ["doc1.txt"],
...     "content": ["<base64-encoded text>"],
...     "document_type": ["text"],
...     "metadata": [{
...         "content": "This is a document.",
...         "content_metadata": {"type": "text"},
...         "source_metadata": {"source_id": "doc1.txt", "source_name": "doc1.txt", "source_type": "txt"},
...         "audio_metadata": None,
...         "image_metadata": None,
...         "text_metadata": None,
...         "raise_on_failure": False,
...     }],
... })
>>> transform_text_split_and_tokenize(
...     inputs=df,
...     tokenizer="bert-base-uncased",
...     chunk_size=512,
...     chunk_overlap=50
... )
>>> # Using a single plain text string:
>>> transform_text_split_and_tokenize(
...     inputs="This is a plain text document.",
...     tokenizer="bert-base-uncased",
...     chunk_size=512,
...     chunk_overlap=50
... )
>>> # Using a list of plain text strings:
>>> texts = ["Document one text.", "Document two text."]
>>> transform_text_split_and_tokenize(
...     inputs=texts,
...     tokenizer="bert-base-uncased",
...     chunk_size=512,
...     chunk_overlap=50
... )
nv_ingest_api.interface.utility module#
- nv_ingest_api.interface.utility.build_dataframe_from_files(
- file_paths: List[str | BytesIO],
- source_names: List[str],
- source_ids: List[str],
- document_types: List[str],
Given lists of file paths (or BytesIO objects), source names, source IDs, and document types, reads each file (base64-encoding its contents) and constructs a DataFrame.
For image content, ‘image_metadata’ is initialized as an empty dict, so it can later be updated.
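The row construction described above can be sketched as follows. This is an illustrative reconstruction from the description, not the library's implementation, and it takes raw bytes rather than paths for simplicity:

```python
import base64

import pandas as pd

def sketch_build_dataframe(file_bytes_list, source_names, source_ids, document_types):
    """Illustrative reconstruction: base64-encode each file and build ledger rows."""
    rows = []
    for data, name, sid, dtype in zip(file_bytes_list, source_names, source_ids, document_types):
        content = base64.b64encode(data).decode("utf-8")
        rows.append({
            "source_name": name,
            "source_id": sid,
            "content": content,
            "document_type": dtype,
            "metadata": {
                "content": content,
                # image_metadata starts as {} for images so later stages can update it
                "image_metadata": {} if dtype in ("png", "jpeg") else None,
            },
        })
    return pd.DataFrame(rows)
```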
- nv_ingest_api.interface.utility.create_content_metadata(document_type: str) → dict[source]#
Creates a content metadata dictionary for a file based on its document type.
It maps the document type to the corresponding content type.
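The mapping can be sketched like this. The table below is a guess for illustration only; the real mapping lives inside the library and may differ:

```python
# Illustrative document-type → content-type mapping (assumed, not the real table).
DOC_TO_CONTENT_TYPE = {
    "png": "image",
    "jpeg": "image",
    "pdf": "structured",
    "txt": "text",
}

def sketch_content_metadata(document_type: str) -> dict:
    """Map a document type to a content-metadata dict with a 'type' key."""
    return {"type": DOC_TO_CONTENT_TYPE.get(document_type, "text")}
```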
- nv_ingest_api.interface.utility.create_source_metadata(
- source_name: str,
- source_id: str,
- document_type: str,
Creates a source metadata dictionary for a file.
The source_type is set to the provided document_type. The date_created and last_modified fields are set to the current ISO timestamp.
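A sketch of the resulting dictionary's shape, based on the description above; the exact field names are assumptions:

```python
from datetime import datetime, timezone

def sketch_source_metadata(source_name: str, source_id: str, document_type: str) -> dict:
    """Illustrative shape only; field names are inferred from the description."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "source_name": source_name,
        "source_id": source_id,
        "source_type": document_type,  # source_type is set to the document type
        "date_created": now,           # current ISO timestamp
        "last_modified": now,
    }
```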
- nv_ingest_api.interface.utility.read_bytesio_as_base64(file_io: BytesIO) → str[source]#
Reads a BytesIO object and returns its base64-encoded string.
- Parameters:
file_io (BytesIO) – A file-like object containing binary data.
- Returns:
The base64-encoded string representation of the file’s contents.
- Return type:
str
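The documented behavior amounts to a one-liner with the standard library; this sketch is an equivalent, not the library's source:

```python
import base64
from io import BytesIO

def read_bytesio_as_base64_sketch(file_io: BytesIO) -> str:
    """Return the base64-encoded string of a BytesIO object's contents."""
    return base64.b64encode(file_io.getvalue()).decode("utf-8")

read_bytesio_as_base64_sketch(BytesIO(b"abc"))  # "YWJj"
```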
Module contents#
- nv_ingest_api.interface.extraction_interface_relay_constructor(
- api_fn,
- task_keys: List[str] | None = None,
Decorator for constructing and validating configuration using Pydantic schemas.
This decorator wraps a user-facing interface function. It extracts common task parameters (using the provided task_keys, or defaults if not specified) and method-specific configuration parameters based on a required ‘extract_method’ keyword argument. It then uses the corresponding Pydantic schema (from the global CONFIG_SCHEMAS registry) to validate and build a method-specific configuration. The resulting composite configuration, along with the extraction ledger and execution trace log, is passed to the backend API function.
- Parameters:
api_fn (callable) –
The backend API function that will be called with the extraction ledger, the task configuration dictionary, the extractor configuration, and the execution trace log. This function must conform to the signature:
- extract_primitives_from_pdf_internal(df_extraction_ledger: pd.DataFrame, task_config: Dict[str, Any], extractor_config: Any, execution_trace_log: Optional[List[Any]] = None)
task_keys (list of str, optional) – A list of keyword names that should be extracted from the user function as common task parameters. If not provided, defaults to [“extract_text”, “extract_images”, “extract_tables”, “extract_charts”].
- Returns:
A wrapped function that builds and validates the configuration before invoking the backend API function.
- Return type:
callable
- Raises:
ValueError – If the extraction method specified is not supported (i.e., no corresponding Pydantic schema exists in CONFIG_SCHEMAS), if api_fn does not conform to the expected signature, or if the required ‘extract_method’ parameter is not provided.
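The validate-then-relay pattern described above can be sketched with a toy schema registry. Everything below (the registry contents, the lambda standing in for a Pydantic schema, and the decorator name) is illustrative rather than the library's actual implementation:

```python
import functools

# Toy registry standing in for the library's CONFIG_SCHEMAS; the real entries
# are Pydantic schemas keyed by extraction method name.
CONFIG_SCHEMAS = {"pdfium": lambda **kw: dict(kw)}

DEFAULT_TASK_KEYS = ["extract_text", "extract_images", "extract_tables", "extract_charts"]

def relay_constructor(api_fn, task_keys=None):
    """Sketch: split kwargs into task config and validated extractor config."""
    task_keys = task_keys or DEFAULT_TASK_KEYS

    def decorator(user_fn):
        @functools.wraps(user_fn)
        def wrapper(df_ledger, **kwargs):
            method = kwargs.pop("extract_method", None)
            if method not in CONFIG_SCHEMAS:
                raise ValueError(f"Unsupported extraction method: {method}")
            # Common task parameters are pulled out by name ...
            task_config = {k: kwargs.pop(k) for k in task_keys if k in kwargs}
            # ... and the remaining kwargs are validated by the method's schema.
            extractor_config = CONFIG_SCHEMAS[method](**kwargs)
            return api_fn(df_ledger, task_config, extractor_config, None)
        return wrapper
    return decorator
```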