nv_ingest_api.internal.extract.pdf.engines package

Subpackages

nv_ingest_api.internal.extract.pdf.engines.pdf_helpers package
- Module contents

Submodules

nv_ingest_api.internal.extract.pdf.engines.adobe module

nv_ingest_api.internal.extract.pdf.engines.adobe.adobe_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → DataFrame

Helper function to use the Adobe SDK to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) – A dictionary containing additional extraction parameters such as API credentials, row_data, text_depth, and other optional settings.
execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A string of extracted text.

Return type:

str

Raises:

RuntimeError – If the Adobe SDK is not installed.
ValueError – If required configuration parameters are missing or invalid.
SDKError – If there is an error during extraction.

nv_ingest_api.internal.extract.pdf.engines.llama module

async nv_ingest_api.internal.extract.pdf.engines.llama.async_llama_parse(pdf_stream: BytesIO, api_key: str, file_name: str = '_.pdf', result_type: str = 'text', check_interval_seconds: int = 1, max_timeout_seconds: int = 2000) → str

Uses the LlamaParse API to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream PDF.
api_key (str) – API key from https://cloud.llamaindex.ai.
file_name (str) – Name of the PDF file.
result_type (str) – The result type for the parser. One of text or markdown.
check_interval_seconds (int) – The interval in seconds to check if the parsing is done.
max_timeout_seconds (int) – The maximum timeout in seconds to wait for the parsing to finish.

Returns:

A string of extracted text.

Return type:

str
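The check_interval_seconds and max_timeout_seconds parameters describe a poll-until-done loop. The following is a minimal sketch of that pattern only, not the library's implementation: poll_until_done, check_status, and make_fake_job are hypothetical names, and no LlamaParse endpoint is called.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def poll_until_done(
    check_status: Callable[[], Awaitable[Optional[str]]],
    check_interval_seconds: float = 1,
    max_timeout_seconds: float = 2000,
) -> str:
    """Hypothetical helper: await check_status() until it returns a result.

    check_status is an assumed coroutine function that returns None while the
    job is still running and the parsed text once it has finished.
    """
    deadline = time.monotonic() + max_timeout_seconds
    while True:
        result = await check_status()
        if result is not None:
            return result
        # Give up if the next sleep would carry us past the deadline.
        if time.monotonic() + check_interval_seconds > deadline:
            raise TimeoutError("parsing did not finish before max_timeout_seconds")
        await asyncio.sleep(check_interval_seconds)


def make_fake_job(ready_after_calls: int) -> Callable[[], Awaitable[Optional[str]]]:
    """Stand-in for a remote job that finishes after a few status checks."""
    calls = {"n": 0}

    async def check_status() -> Optional[str]:
        calls["n"] += 1
        return "extracted text" if calls["n"] >= ready_after_calls else None

    return check_status


text = asyncio.run(
    poll_until_done(make_fake_job(3), check_interval_seconds=0.01, max_timeout_seconds=1)
)
```

A shorter interval returns sooner after the job finishes but issues more status checks; the timeout bounds the total wait either way.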

nv_ingest_api.internal.extract.pdf.engines.llama.llama_parse_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → List[Dict[ContentTypeEnum, Dict[str, Any]]]

Helper function to use LlamaParse API to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extractor_config (dict) –
A dictionary containing additional extraction parameters including:
- api_key: API key for LlamaParse.
- result_type: Type of result to extract (default provided).
- file_name: Name of the file (default provided).
- check_interval: Interval for checking status (default provided).
- max_timeout: Maximum timeout in seconds (default provided).
- row_data: Row data for additional metadata.
- metadata_column: Column name to extract metadata (default “metadata”).
execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A list of extracted data. Each item is a dictionary where the key is a ContentTypeEnum and the value is a dictionary containing content and metadata.

Return type:

List[Dict[ContentTypeEnum, Dict[str, Any]]]

Raises:

ValueError – If extractor_config is not a dict or required parameters are missing.
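The return shape described above (a list of single-key dictionaries keyed by content type) can be consumed along these lines. This is a sketch under stated assumptions: the ContentTypeEnum below is a stand-in with assumed members, not the library's enum, collect_by_type is a hypothetical helper, and the "content" and "metadata" payload keys are inferred from the description.

```python
from enum import Enum
from typing import Any, Dict, List


class ContentTypeEnum(Enum):
    """Stand-in for the library's ContentTypeEnum; the members are assumed."""

    TEXT = "text"
    IMAGE = "image"


def collect_by_type(
    extracted: List[Dict[ContentTypeEnum, Dict[str, Any]]],
    wanted: ContentTypeEnum,
) -> List[Dict[str, Any]]:
    """Return the payload dictionaries stored under the wanted content type."""
    payloads = []
    for item in extracted:
        if wanted in item:
            payloads.append(item[wanted])
    return payloads


# Hand-built sample in the documented shape, not real extractor output.
sample = [
    {ContentTypeEnum.TEXT: {"content": "Page one text", "metadata": {"page": 0}}},
    {ContentTypeEnum.IMAGE: {"content": "<base64>", "metadata": {"page": 0}}},
    {ContentTypeEnum.TEXT: {"content": "Page two text", "metadata": {"page": 1}}},
]

text_payloads = collect_by_type(sample, ContentTypeEnum.TEXT)
```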

nv_ingest_api.internal.extract.pdf.engines.nemoretriever module

nv_ingest_api.internal.extract.pdf.engines.nemoretriever.nemoretriever_parse_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extract_charts: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → str

Helper function to use nemoretriever_parse to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
execution_trace_log (Optional[List], optional) – Trace information for debugging purposes (default is None).
extractor_config (dict) –
A dictionary containing additional extraction parameters. Expected keys include:
- row_data : dict
- text_depth : str, optional (default is “page”)
- extract_tables_method : str, optional (default is “yolox”)
- identify_nearby_objects : bool, optional (default is True)
- paddle_output_format : str, optional (default is “pseudo_markdown”)
- pdfium_config : dict, optional (configuration for PDFium)
- nemoretriever_parse_config : dict, optional (configuration for NemoRetrieverParse)
- metadata_column : str, optional (default is “metadata”)

Returns:

A string of extracted text.

Return type:

str

Raises:

ValueError – If required keys are missing in extractor_config or invalid values are provided.
KeyError – If required keys are missing in row_data.
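To make the documented defaults for extractor_config concrete, here is a small sketch that fills them in before a call. with_documented_defaults is a hypothetical helper, not part of the package, and the contents of row_data are an assumption because the source does not list its keys.

```python
from typing import Any, Dict

# Defaults as listed in the parameter description above.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "text_depth": "page",
    "extract_tables_method": "yolox",
    "identify_nearby_objects": True,
    "paddle_output_format": "pseudo_markdown",
    "metadata_column": "metadata",
}


def with_documented_defaults(extractor_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fill in documented defaults without mutating the input."""
    if not isinstance(extractor_config, dict):
        raise ValueError("extractor_config must be a dict")
    if "row_data" not in extractor_config:
        raise ValueError("extractor_config is missing the required key 'row_data'")
    # Caller-supplied values take precedence over the defaults.
    return {**DOCUMENTED_DEFAULTS, **extractor_config}


# "source_id" and the overridden column name are illustrative values only.
config = with_documented_defaults(
    {"row_data": {"source_id": "doc-1"}, "metadata_column": "meta"}
)
```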

nv_ingest_api.internal.extract.pdf.engines.pdfium module

nv_ingest_api.internal.extract.pdf.engines.pdfium.pdfium_extractor(pdf_stream, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extract_charts: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → DataFrame

nv_ingest_api.internal.extract.pdf.engines.tika module

nv_ingest_api.internal.extract.pdf.engines.tika.tika_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_charts: bool, extract_tables: bool, extractor_config: Dict[str, Any], execution_trace_log: List[Any] | None = None) → DataFrame

Extract text from a PDF using the Apache Tika server.

This function sends a PDF stream to the Apache Tika server and returns the extracted text. The flags for text, image, and table extraction are provided for consistency with the extractor interface; however, this implementation currently only supports text extraction.

Parameters:

pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Flag indicating whether text extraction is desired.
extract_images (bool) – Flag indicating whether image extraction is desired.
extract_infographics (bool) – Flag indicating whether infographic extraction is desired.
extract_charts (bool) – Flag indicating whether chart extraction is desired.
extract_tables (bool) – Flag indicating whether table extraction is desired.
extractor_config (dict) – A dictionary of additional configuration options for the extractor. This parameter is currently not used by this extractor.

Returns:

The extracted text from the PDF as returned by the Apache Tika server.

Return type:

str

Raises:

requests.RequestException – If the request to the Tika server fails.

Examples

>>> from io import BytesIO
>>> from nv_ingest_api.internal.extract.pdf.engines.tika import tika_extractor
>>> with open("document.pdf", "rb") as f:
...     pdf_stream = BytesIO(f.read())
>>> text = tika_extractor(pdf_stream, True, False, False, False, False, {})
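For readers who want to see what "sending a PDF stream to the Apache Tika server" looks like at the HTTP level, the sketch below builds such a request without sending it. It is an illustration rather than this module's code: build_tika_request is a hypothetical helper, the localhost URL is an assumption (9998 is Tika server's default port), and the standard library's urllib is used here to keep the example dependency-free, whereas the Raises section above indicates the module itself relies on requests.

```python
from io import BytesIO
from urllib.request import Request

# Assumed address of a locally running Tika server.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_stream: BytesIO, tika_url: str = TIKA_URL) -> Request:
    """Hypothetical helper: build (but do not send) a PUT request for Tika."""
    return Request(
        tika_url,
        data=pdf_stream.getvalue(),
        method="PUT",
        # Ask Tika for plain text and declare the payload as a PDF.
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )


request = build_tika_request(BytesIO(b"%PDF-1.4 placeholder bytes"))

# Sending the request needs a running server, for example:
#   from urllib.request import urlopen
#   with urlopen(request, timeout=120) as response:
#       text = response.read().decode("utf-8")
```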

nv_ingest_api.internal.extract.pdf.engines.unstructured_io module

nv_ingest_api.internal.extract.pdf.engines.unstructured_io.unstructured_io_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_charts: bool, extract_tables: bool, extractor_config: Dict[str, Any], execution_trace_log: List[Any] | None = None) → DataFrame

Helper function to use unstructured-io REST API to extract text from a bytestream PDF.

This function sends the provided PDF stream to the unstructured-io API and returns the extracted text. Additional parameters for the extraction are provided via the extractor_config dictionary. Note that although flags for image, table, and infographics extraction are provided, the underlying API may not support all of these features.

Parameters:

pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) –
A dictionary containing additional extraction parameters:
- unstructured_api_key : API key for unstructured.io.
- unstructured_url : URL for the unstructured.io API endpoint.
- unstructured_strategy : Strategy for extraction (default: “auto”).
- unstructured_concurrency_level : Concurrency level for PDF splitting.
- row_data : Row data containing source information.
- text_depth : Depth of text extraction (e.g., “page”).
- identify_nearby_objects : Flag for identifying nearby objects.
- metadata_column : Column name for metadata extraction.

Returns:

A string containing the extracted text.

Return type:

str

Raises:

ValueError – If an invalid text_depth value is provided.
SDKError – If there is an error during the extraction process.
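The extractor_config keys listed above can be assembled as in the following sketch. build_unstructured_config is a hypothetical helper, the environment variable names are assumptions, and every value other than the documented "auto" strategy default and the "page" text depth example is illustrative.

```python
import os
from typing import Any, Dict


def build_unstructured_config(row_data: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: assemble the documented extractor_config keys."""
    return {
        # UNSTRUCTURED_API_KEY and UNSTRUCTURED_URL are assumed variable names.
        "unstructured_api_key": os.environ.get("UNSTRUCTURED_API_KEY", ""),
        "unstructured_url": os.environ.get("UNSTRUCTURED_URL", ""),
        "unstructured_strategy": "auto",  # documented default
        "unstructured_concurrency_level": 4,  # illustrative value
        "row_data": row_data,
        "text_depth": "page",  # example value given in the description
        "identify_nearby_objects": True,  # illustrative value
        "metadata_column": "metadata",  # illustrative value
    }


# The contents of row_data are an assumption; the source does not list its keys.
config = build_unstructured_config({"source_id": "doc-1"})
```

Reading the credentials from the environment keeps the API key out of source control; the resulting dictionary would be passed as the extractor_config argument.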

Module contents

nv_ingest_api.internal.extract.pdf.engines.adobe_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → DataFrame

Helper function to use the Adobe SDK to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) – A dictionary containing additional extraction parameters such as API credentials, row_data, text_depth, and other optional settings.
execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A string of extracted text.

Return type:

str

Raises:

RuntimeError – If the Adobe SDK is not installed.
ValueError – If required configuration parameters are missing or invalid.
SDKError – If there is an error during extraction.

nv_ingest_api.internal.extract.pdf.engines.llama_parse_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → List[Dict[ContentTypeEnum, Dict[str, Any]]]

Helper function to use LlamaParse API to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extractor_config (dict) –
A dictionary containing additional extraction parameters including:
- api_key: API key for LlamaParse.
- result_type: Type of result to extract (default provided).
- file_name: Name of the file (default provided).
- check_interval: Interval for checking status (default provided).
- max_timeout: Maximum timeout in seconds (default provided).
- row_data: Row data for additional metadata.
- metadata_column: Column name to extract metadata (default “metadata”).
execution_trace_log (optional) – Trace information for debugging purposes.

Returns:

A list of extracted data. Each item is a dictionary where the key is a ContentTypeEnum and the value is a dictionary containing content and metadata.

Return type:

List[Dict[ContentTypeEnum, Dict[str, Any]]]

Raises:

ValueError – If extractor_config is not a dict or required parameters are missing.

nv_ingest_api.internal.extract.pdf.engines.nemoretriever_parse_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extract_charts: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → str

Helper function to use nemoretriever_parse to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
execution_trace_log (Optional[List], optional) – Trace information for debugging purposes (default is None).
extractor_config (dict) –
A dictionary containing additional extraction parameters. Expected keys include:
- row_data : dict
- text_depth : str, optional (default is “page”)
- extract_tables_method : str, optional (default is “yolox”)
- identify_nearby_objects : bool, optional (default is True)
- paddle_output_format : str, optional (default is “pseudo_markdown”)
- pdfium_config : dict, optional (configuration for PDFium)
- nemoretriever_parse_config : dict, optional (configuration for NemoRetrieverParse)
- metadata_column : str, optional (default is “metadata”)

Returns:

A string of extracted text.

Return type:

str

Raises:

ValueError – If required keys are missing in extractor_config or invalid values are provided.
KeyError – If required keys are missing in row_data.

nv_ingest_api.internal.extract.pdf.engines.pdfium_extractor(pdf_stream, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_tables: bool, extract_charts: bool, extractor_config: dict, execution_trace_log: List[Any] | None = None) → DataFrame

nv_ingest_api.internal.extract.pdf.engines.tika_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_charts: bool, extract_tables: bool, extractor_config: Dict[str, Any], execution_trace_log: List[Any] | None = None) → DataFrame

Extract text from a PDF using the Apache Tika server.

Parameters:

pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Flag indicating whether text extraction is desired.
extract_images (bool) – Flag indicating whether image extraction is desired.
extract_infographics (bool) – Flag indicating whether infographic extraction is desired.
extract_charts (bool) – Flag indicating whether chart extraction is desired.
extract_tables (bool) – Flag indicating whether table extraction is desired.
extractor_config (dict) – A dictionary of additional configuration options for the extractor. This parameter is currently not used by this extractor.

Returns:

The extracted text from the PDF as returned by the Apache Tika server.

Return type:

str

Raises:

requests.RequestException – If the request to the Tika server fails.

Examples

>>> from io import BytesIO
>>> from nv_ingest_api.internal.extract.pdf.engines import tika_extractor
>>> with open("document.pdf", "rb") as f:
...     pdf_stream = BytesIO(f.read())
>>> text = tika_extractor(pdf_stream, True, False, False, False, False, {})

nv_ingest_api.internal.extract.pdf.engines.unstructured_io_extractor(pdf_stream: BytesIO, extract_text: bool, extract_images: bool, extract_infographics: bool, extract_charts: bool, extract_tables: bool, extractor_config: Dict[str, Any], execution_trace_log: List[Any] | None = None) → DataFrame

Helper function to use unstructured-io REST API to extract text from a bytestream PDF.

Parameters:

pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) –
A dictionary containing additional extraction parameters:
- unstructured_api_key : API key for unstructured.io.
- unstructured_url : URL for the unstructured.io API endpoint.
- unstructured_strategy : Strategy for extraction (default: “auto”).
- unstructured_concurrency_level : Concurrency level for PDF splitting.
- row_data : Row data containing source information.
- text_depth : Depth of text extraction (e.g., “page”).
- identify_nearby_objects : Flag for identifying nearby objects.
- metadata_column : Column name for metadata extraction.

Returns:

A string containing the extracted text.

Return type:

str

Raises:

ValueError – If an invalid text_depth value is provided.
SDKError – If there is an error during the extraction process.