nv_ingest_api.internal.extract.pdf.engines package

Submodules

nv_ingest_api.internal.extract.pdf.engines.adobe module

nv_ingest_api.internal.extract.pdf.engines.adobe.adobe_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → DataFrame

Helper function that uses the Adobe SDK to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extractor_config (dict) – A dictionary containing additional extraction parameters such as API credentials, row_data, text_depth, and other optional settings.

  • execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A string of extracted text.

Return type:

str

Raises:
  • RuntimeError – If the Adobe SDK is not installed.

  • ValueError – If required configuration parameters are missing or invalid.

  • SDKError – If there is an error during extraction.
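The RuntimeError case above describes an optional dependency that may be absent at call time. The following is a minimal sketch of that guard pattern, not the library's actual check: the helper name `require_module` and the module name `hypothetical_pdf_sdk` are illustrative assumptions.

```python
import importlib.util


def require_module(module_name: str, install_hint: str = "") -> None:
    """Raise RuntimeError when an optional top-level module cannot be found."""
    # find_spec returns None for a missing top-level module and does not import it.
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"Optional dependency '{module_name}' is not installed. {install_hint}".strip()
        )


def describe_availability(module_name: str) -> str:
    """Report whether an optional module is usable, without raising."""
    try:
        require_module(module_name)
    except RuntimeError as exc:
        return f"unavailable: {exc}"
    return "available"


# 'json' ships with Python; 'hypothetical_pdf_sdk' is a made-up name for illustration.
print(describe_availability("json"))
print(describe_availability("hypothetical_pdf_sdk"))
```

Checking up front lets a caller fall back to another engine before any PDF bytes are read.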

nv_ingest_api.internal.extract.pdf.engines.llama module

async nv_ingest_api.internal.extract.pdf.engines.llama.async_llama_parse(
pdf_stream: BytesIO,
api_key: str,
file_name: str = '_.pdf',
result_type: str = 'text',
check_interval_seconds: int = 1,
max_timeout_seconds: int = 2000,
) → str

Uses the LlamaParse API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • api_key (str) – API key from https://cloud.llamaindex.ai.

  • file_name (str) – Name of the PDF file.

  • result_type (str) – The result type for the parser. One of text or markdown.

  • check_interval_seconds (int) – The interval in seconds to check if the parsing is done.

  • max_timeout_seconds (int) – The maximum timeout in seconds to wait for the parsing to finish.

Returns:

A string of extracted text.

Return type:

str

nv_ingest_api.internal.extract.pdf.engines.llama.llama_parse_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → List[Dict[ContentTypeEnum, Dict[str, Any]]]

Helper function to use LlamaParse API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extractor_config (dict) –

    A dictionary containing additional extraction parameters including:
    • api_key: API key for LlamaParse.

    • result_type: Type of result to extract (default provided).

    • file_name: Name of the file (default provided).

    • check_interval: Interval for checking status (default provided).

    • max_timeout: Maximum timeout in seconds (default provided).

    • row_data: Row data for additional metadata.

    • metadata_column: Column name to extract metadata (default “metadata”).

  • execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A list of extracted data. Each item is a dictionary where the key is a ContentTypeEnum and the value is a dictionary containing content and metadata.

Return type:

List[Dict[ContentTypeEnum, Dict[str, Any]]]

Raises:

ValueError – If extractor_config is not a dict or required parameters are missing.
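The return shape described above, a list of single-key dictionaries keyed by content type, can be walked as in the sketch below. The `ContentType` enum is a hypothetical stand-in for ContentTypeEnum (its real members may differ), and the payload key names "content" and "metadata" are assumptions based on the wording of the Returns description.

```python
from enum import Enum
from typing import Any, Dict, List


class ContentType(Enum):
    """Hypothetical stand-in for ContentTypeEnum; the real members may differ."""

    TEXT = "text"
    IMAGE = "image"


def collect_text(items: List[Dict[ContentType, Dict[str, Any]]]) -> str:
    """Join the 'content' field of every TEXT entry, preserving order."""
    parts = []
    for item in items:
        for content_type, payload in item.items():
            if content_type is ContentType.TEXT:
                parts.append(str(payload.get("content", "")))
    return "\n".join(parts)


# Made-up sample data in the documented shape.
sample = [
    {ContentType.TEXT: {"content": "Page one.", "metadata": {"page": 0}}},
    {ContentType.IMAGE: {"content": "<base64>", "metadata": {"page": 0}}},
    {ContentType.TEXT: {"content": "Page two.", "metadata": {"page": 1}}},
]

print(collect_text(sample))
```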

nv_ingest_api.internal.extract.pdf.engines.nemoretriever module

nv_ingest_api.internal.extract.pdf.engines.nemoretriever.nemoretriever_parse_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extract_charts: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → str

Helper function to use nemoretriever_parse to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extract_charts (bool) – Specifies whether to extract charts.

  • execution_trace_log (Optional[List], optional) – Trace information for debugging purposes (default is None).

  • extractor_config (dict) –

    A dictionary containing additional extraction parameters. Expected keys include:
    • row_data : dict

    • text_depth : str, optional (default is “page”)

    • extract_tables_method : str, optional (default is “yolox”)

    • identify_nearby_objects : bool, optional (default is True)

    • paddle_output_format : str, optional (default is “pseudo_markdown”)

    • pdfium_config : dict, optional (configuration for PDFium)

    • nemoretriever_parse_config : dict, optional (configuration for NemoRetrieverParse)

    • metadata_column : str, optional (default is “metadata”)

Returns:

A string of extracted text.

Return type:

str

Raises:
  • ValueError – If required keys are missing in extractor_config or invalid values are provided.

  • KeyError – If required keys are missing in row_data.
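The defaults listed for extractor_config can be pictured as a base dictionary that caller-supplied values override. This sketch assumes that row_data is required (it is the only key listed without "optional") and that a missing key raises ValueError. The helper name `resolve_config` is hypothetical, and the real extractor may validate differently.

```python
from typing import Any, Dict

# Defaults copied from the parameter list above.
DEFAULTS: Dict[str, Any] = {
    "text_depth": "page",
    "extract_tables_method": "yolox",
    "identify_nearby_objects": True,
    "paddle_output_format": "pseudo_markdown",
    "metadata_column": "metadata",
}


def resolve_config(extractor_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with documented defaults filled in under caller values."""
    if not isinstance(extractor_config, dict):
        raise ValueError("extractor_config must be a dict")
    if "row_data" not in extractor_config:
        # Assumption: row_data is mandatory because it has no documented default.
        raise ValueError("extractor_config is missing required key 'row_data'")
    # Later keys win, so caller values override the defaults.
    return {**DEFAULTS, **extractor_config}


resolved = resolve_config({"row_data": {"source_id": "doc.pdf"}, "text_depth": "document"})
print(resolved["text_depth"], resolved["extract_tables_method"])
```

Building a new dictionary rather than mutating the input keeps the caller's configuration reusable across documents.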

nv_ingest_api.internal.extract.pdf.engines.pdfium module

nv_ingest_api.internal.extract.pdf.engines.pdfium.pdfium_extractor(
pdf_stream,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extract_charts: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → DataFrame

nv_ingest_api.internal.extract.pdf.engines.tika module

nv_ingest_api.internal.extract.pdf.engines.tika.tika_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_charts: bool,
extract_tables: bool,
extractor_config: Dict[str, Any],
execution_trace_log: List[Any] | None = None,
) → DataFrame

Extract text from a PDF using the Apache Tika server.

This function sends a PDF stream to the Apache Tika server and returns the extracted text. The image, infographic, chart, and table flags are accepted for consistency with the extractor interface, but this implementation currently supports only text extraction.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.

  • extract_text (bool) – Flag indicating whether text extraction is desired.

  • extract_images (bool) – Flag indicating whether image extraction is desired.

  • extract_infographics (bool) – Flag indicating whether infographic extraction is desired.

  • extract_charts (bool) – Flag indicating whether chart extraction is desired.

  • extract_tables (bool) – Flag indicating whether table extraction is desired.

  • extractor_config (dict) – A dictionary of additional configuration options for the extractor. This parameter is currently not used by this extractor.

Returns:

The extracted text from the PDF as returned by the Apache Tika server.

Return type:

str

Raises:

requests.RequestException – If the request to the Tika server fails.

Examples

>>> from io import BytesIO
>>> from nv_ingest_api.internal.extract.pdf.engines.tika import tika_extractor
>>> with open("document.pdf", "rb") as f:
...     pdf_stream = BytesIO(f.read())
>>> text = tika_extractor(pdf_stream, True, False, False, False, False, {})
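For orientation, the request that a Tika text extraction typically involves can be sketched without contacting a server. This example only builds the request object and never sends it. The URL `http://localhost:9998` and the `/tika` endpoint are assumptions about a default Tika server setup, `build_tika_request` is a hypothetical helper, and the standard-library urllib is used here for self-containment even though the Raises entry above implies that the extractor itself uses requests.

```python
from io import BytesIO
from urllib.request import Request


def build_tika_request(
    pdf_stream: BytesIO, base_url: str = "http://localhost:9998"
) -> Request:
    """Build (but do not send) a PUT request asking Tika for plain text."""
    pdf_stream.seek(0)  # Read from the start even if the stream was consumed earlier.
    return Request(
        url=f"{base_url.rstrip('/')}/tika",
        data=pdf_stream.read(),
        method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


# Placeholder bytes stand in for a real PDF file.
request = build_tika_request(BytesIO(b"%PDF-1.4 placeholder"))
print(request.get_method(), request.full_url)
```

Passing the object to `urllib.request.urlopen` would perform the call, which requires a running server and is therefore left out of the sketch.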

nv_ingest_api.internal.extract.pdf.engines.unstructured_io module

nv_ingest_api.internal.extract.pdf.engines.unstructured_io.unstructured_io_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_charts: bool,
extract_tables: bool,
extractor_config: Dict[str, Any],
execution_trace_log: List[Any] | None = None,
) → DataFrame

Helper function to use unstructured-io REST API to extract text from a bytestream PDF.

This function sends the provided PDF stream to the unstructured-io API and returns the extracted text. Additional parameters for the extraction are provided via the extractor_config dictionary. Note that although flags for image, table, and infographics extraction are provided, the underlying API may not support all of these features.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extract_charts (bool) – Specifies whether to extract charts.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extractor_config (dict) –

    A dictionary containing additional extraction parameters:
    • unstructured_api_key : API key for unstructured.io.

    • unstructured_url : URL for the unstructured.io API endpoint.

    • unstructured_strategy : Strategy for extraction (default: “auto”).

    • unstructured_concurrency_level : Concurrency level for PDF splitting.

    • row_data : Row data containing source information.

    • text_depth : Depth of text extraction (e.g., “page”).

    • identify_nearby_objects : Flag for identifying nearby objects.

    • metadata_column : Column name for metadata extraction.

Returns:

A string containing the extracted text.

Return type:

str

Raises:
  • ValueError – If an invalid text_depth value is provided.

  • SDKError – If there is an error during the extraction process.
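The ValueError for an invalid text_depth can be illustrated with an enum lookup. This is a sketch only: `TextDepth` is a hypothetical enum, "page" is the only value named above, and the "document" member is an assumed example of a second depth.

```python
from enum import Enum


class TextDepth(Enum):
    """Hypothetical set of depths; only "page" is named in the text above."""

    DOCUMENT = "document"
    PAGE = "page"


def parse_text_depth(value: str = "page") -> TextDepth:
    """Convert a config string to a TextDepth, raising ValueError if unknown."""
    try:
        # Enum lookup by value raises ValueError for anything not declared above.
        return TextDepth(value.lower())
    except ValueError:
        allowed = ", ".join(depth.value for depth in TextDepth)
        raise ValueError(
            f"Invalid text_depth {value!r}; expected one of: {allowed}"
        ) from None


print(parse_text_depth("PAGE"))
```

Validating the string before any network call means a configuration typo fails fast instead of after an upload.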

Module contents

nv_ingest_api.internal.extract.pdf.engines.adobe_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → DataFrame

Helper function that uses the Adobe SDK to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extractor_config (dict) – A dictionary containing additional extraction parameters such as API credentials, row_data, text_depth, and other optional settings.

  • execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A string of extracted text.

Return type:

str

Raises:
  • RuntimeError – If the Adobe SDK is not installed.

  • ValueError – If required configuration parameters are missing or invalid.

  • SDKError – If there is an error during extraction.

nv_ingest_api.internal.extract.pdf.engines.llama_parse_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → List[Dict[ContentTypeEnum, Dict[str, Any]]]

Helper function to use LlamaParse API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extractor_config (dict) –

    A dictionary containing additional extraction parameters including:
    • api_key: API key for LlamaParse.

    • result_type: Type of result to extract (default provided).

    • file_name: Name of the file (default provided).

    • check_interval: Interval for checking status (default provided).

    • max_timeout: Maximum timeout in seconds (default provided).

    • row_data: Row data for additional metadata.

    • metadata_column: Column name to extract metadata (default “metadata”).

  • execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A list of extracted data. Each item is a dictionary where the key is a ContentTypeEnum and the value is a dictionary containing content and metadata.

Return type:

List[Dict[ContentTypeEnum, Dict[str, Any]]]

Raises:

ValueError – If extractor_config is not a dict or required parameters are missing.

nv_ingest_api.internal.extract.pdf.engines.nemoretriever_parse_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extract_charts: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → str

Helper function to use nemoretriever_parse to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extract_charts (bool) – Specifies whether to extract charts.

  • execution_trace_log (Optional[List], optional) – Trace information for debugging purposes (default is None).

  • extractor_config (dict) –

    A dictionary containing additional extraction parameters. Expected keys include:
    • row_data : dict

    • text_depth : str, optional (default is “page”)

    • extract_tables_method : str, optional (default is “yolox”)

    • identify_nearby_objects : bool, optional (default is True)

    • paddle_output_format : str, optional (default is “pseudo_markdown”)

    • pdfium_config : dict, optional (configuration for PDFium)

    • nemoretriever_parse_config : dict, optional (configuration for NemoRetrieverParse)

    • metadata_column : str, optional (default is “metadata”)

Returns:

A string of extracted text.

Return type:

str

Raises:
  • ValueError – If required keys are missing in extractor_config or invalid values are provided.

  • KeyError – If required keys are missing in row_data.

nv_ingest_api.internal.extract.pdf.engines.pdfium_extractor(
pdf_stream,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_tables: bool,
extract_charts: bool,
extractor_config: dict,
execution_trace_log: List[Any] | None = None,
) → DataFrame

nv_ingest_api.internal.extract.pdf.engines.tika_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_charts: bool,
extract_tables: bool,
extractor_config: Dict[str, Any],
execution_trace_log: List[Any] | None = None,
) → DataFrame

Extract text from a PDF using the Apache Tika server.

This function sends a PDF stream to the Apache Tika server and returns the extracted text. The image, infographic, chart, and table flags are accepted for consistency with the extractor interface, but this implementation currently supports only text extraction.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.

  • extract_text (bool) – Flag indicating whether text extraction is desired.

  • extract_images (bool) – Flag indicating whether image extraction is desired.

  • extract_infographics (bool) – Flag indicating whether infographic extraction is desired.

  • extract_charts (bool) – Flag indicating whether chart extraction is desired.

  • extract_tables (bool) – Flag indicating whether table extraction is desired.

  • extractor_config (dict) – A dictionary of additional configuration options for the extractor. This parameter is currently not used by this extractor.

Returns:

The extracted text from the PDF as returned by the Apache Tika server.

Return type:

str

Raises:

requests.RequestException – If the request to the Tika server fails.

Examples

>>> from io import BytesIO
>>> from nv_ingest_api.internal.extract.pdf.engines import tika_extractor
>>> with open("document.pdf", "rb") as f:
...     pdf_stream = BytesIO(f.read())
>>> text = tika_extractor(pdf_stream, True, False, False, False, False, {})

nv_ingest_api.internal.extract.pdf.engines.unstructured_io_extractor(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_infographics: bool,
extract_charts: bool,
extract_tables: bool,
extractor_config: Dict[str, Any],
execution_trace_log: List[Any] | None = None,
) → DataFrame

Helper function to use unstructured-io REST API to extract text from a bytestream PDF.

This function sends the provided PDF stream to the unstructured-io API and returns the extracted text. Additional parameters for the extraction are provided via the extractor_config dictionary. Note that although flags for image, table, and infographics extraction are provided, the underlying API may not support all of these features.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_infographics (bool) – Specifies whether to extract infographics.

  • extract_charts (bool) – Specifies whether to extract charts.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extractor_config (dict) –

    A dictionary containing additional extraction parameters:
    • unstructured_api_key : API key for unstructured.io.

    • unstructured_url : URL for the unstructured.io API endpoint.

    • unstructured_strategy : Strategy for extraction (default: “auto”).

    • unstructured_concurrency_level : Concurrency level for PDF splitting.

    • row_data : Row data containing source information.

    • text_depth : Depth of text extraction (e.g., “page”).

    • identify_nearby_objects : Flag for identifying nearby objects.

    • metadata_column : Column name for metadata extraction.

Returns:

A string containing the extracted text.

Return type:

str

Raises:
  • ValueError – If an invalid text_depth value is provided.

  • SDKError – If there is an error during the extraction process.