nv_ingest_api.internal.extract.pdf.engines package
Subpackages
Submodules
nv_ingest_api.internal.extract.pdf.engines.adobe module
- nv_ingest_api.internal.extract.pdf.engines.adobe.adobe_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extractor_config: dict,
- execution_trace_log: List[Any] | None = None,
Helper function to use the Adobe SDK to extract text from a bytestream PDF.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) – A dictionary containing additional extraction parameters such as API credentials, row_data, text_depth, and other optional settings.
execution_trace_log (optional) – Trace information for debugging purposes.
- Returns:
A string of extracted text.
- Return type:
str
- Raises:
RuntimeError – If the Adobe SDK is not installed.
ValueError – If required configuration parameters are missing or invalid.
SDKError – If there is an error during extraction.
nv_ingest_api.internal.extract.pdf.engines.llama module
- async nv_ingest_api.internal.extract.pdf.engines.llama.async_llama_parse(
- pdf_stream: BytesIO,
- api_key: str,
- file_name: str = '_.pdf',
- result_type: str = 'text',
- check_interval_seconds: int = 1,
- max_timeout_seconds: int = 2000,
Uses the LlamaParse API to extract text from a bytestream PDF.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream PDF.
api_key (str) – API key from https://cloud.llamaindex.ai.
file_name (str) – Name of the PDF file.
result_type (str) – The result type for the parser. One of text or markdown.
check_interval_seconds (int) – The interval in seconds to check if the parsing is done.
max_timeout_seconds (int) – The maximum timeout in seconds to wait for the parsing to finish.
- Returns:
A string of extracted text.
- Return type:
str
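How check_interval_seconds and max_timeout_seconds interact can be illustrated with a generic polling loop. The sketch below is not the LlamaParse client code: poll_until_done and make_fake_status are hypothetical names, and the job status is simulated locally instead of being requested from the API.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def poll_until_done(
    is_done: Callable[[], Awaitable[bool]],
    check_interval_seconds: float = 1,
    max_timeout_seconds: float = 2000,
) -> bool:
    """Return True once is_done() reports completion, False on timeout."""
    deadline = time.monotonic() + max_timeout_seconds
    while time.monotonic() < deadline:
        if await is_done():
            return True
        # Wait one interval before asking again.
        await asyncio.sleep(check_interval_seconds)
    return False


def make_fake_status(ready_after_calls: int) -> Callable[[], Awaitable[bool]]:
    """Hypothetical stand-in for a remote job-status check."""
    calls = 0

    async def fake_status() -> bool:
        nonlocal calls
        calls += 1
        return calls >= ready_after_calls

    return fake_status
```

A shorter interval notices completion sooner at the cost of more status requests; the timeout bounds the total wait regardless of the interval.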
- nv_ingest_api.internal.extract.pdf.engines.llama.llama_parse_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extractor_config: dict,
- execution_trace_log: List[Any] | None = None,
Helper function to use LlamaParse API to extract text from a bytestream PDF.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extractor_config (dict) –
- A dictionary containing additional extraction parameters including:
api_key: API key for LlamaParse.
result_type: Type of result to extract (default provided).
file_name: Name of the file (default provided).
check_interval: Interval for checking status (default provided).
max_timeout: Maximum timeout in seconds (default provided).
row_data: Row data for additional metadata.
metadata_column: Column name to extract metadata (default “metadata”).
execution_trace_log (optional) – Trace information for debugging purposes.
- Returns:
A list of extracted data. Each item is a dictionary where the key is a ContentTypeEnum and the value is a dictionary containing content and metadata.
- Return type:
List[Dict[ContentTypeEnum, Dict[str, Any]]]
- Raises:
ValueError – If extractor_config is not a dict or required parameters are missing.
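The extractor_config keys listed above can be assembled with a small helper before calling llama_parse_extractor. This is a minimal sketch: build_llama_parse_config is a hypothetical name, not part of nv_ingest_api, and rejecting undocumented keys is a choice made by this helper, not documented library behavior.

```python
from typing import Any, Dict

# Keys listed in the extractor_config description above.
DOCUMENTED_KEYS = {
    "api_key", "result_type", "file_name", "check_interval",
    "max_timeout", "row_data", "metadata_column",
}


def build_llama_parse_config(
    api_key: str, row_data: Dict[str, Any], **optional: Any
) -> Dict[str, Any]:
    """Hypothetical helper: assemble an extractor_config dictionary."""
    if not api_key:
        raise ValueError("api_key must be a non-empty string")
    unknown = set(optional) - DOCUMENTED_KEYS
    if unknown:
        # Catch misspelled keys early instead of silently ignoring them.
        raise ValueError(f"undocumented keys: {sorted(unknown)}")
    return {"api_key": api_key, "row_data": row_data, **optional}
```

Keys left out of the dictionary fall back to the defaults the extractor provides. The contents of row_data depend on the calling pipeline and are not specified here.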
nv_ingest_api.internal.extract.pdf.engines.nemoretriever module
- nv_ingest_api.internal.extract.pdf.engines.nemoretriever.nemoretriever_parse_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extract_charts: bool,
- extractor_config: dict,
- execution_trace_log: List[Any] | None = None,
Helper function to use nemoretriever_parse to extract text from a bytestream PDF.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
execution_trace_log (Optional[List], optional) – Trace information for debugging purposes (default is None).
extractor_config (dict) –
- A dictionary containing additional extraction parameters. Expected keys include:
row_data : dict
text_depth : str, optional (default is “page”)
extract_tables_method : str, optional (default is “yolox”)
identify_nearby_objects : bool, optional (default is True)
paddle_output_format : str, optional (default is “pseudo_markdown”)
pdfium_config : dict, optional (configuration for PDFium)
nemoretriever_parse_config : dict, optional (configuration for NemoRetrieverParse)
metadata_column : str, optional (default is “metadata”)
- Returns:
A string of extracted text.
- Return type:
str
- Raises:
ValueError – If required keys are missing in extractor_config or invalid values are provided.
KeyError – If required keys are missing in row_data.
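The defaults documented for extractor_config can be made explicit by merging them into a caller-supplied dictionary. The sketch below assumes row_data is the only required key, since it is the only one not marked optional above; with_documented_defaults is a hypothetical helper, not a library function.

```python
from typing import Any, Dict

# Defaults as stated in the extractor_config description above.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "text_depth": "page",
    "extract_tables_method": "yolox",
    "identify_nearby_objects": True,
    "paddle_output_format": "pseudo_markdown",
    "metadata_column": "metadata",
}


def with_documented_defaults(extractor_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: fill in documented defaults without mutating the input."""
    if not isinstance(extractor_config, dict):
        raise ValueError("extractor_config must be a dict")
    if "row_data" not in extractor_config:
        # Assumption: row_data is required because it is not marked optional.
        raise ValueError("extractor_config is missing 'row_data'")
    # Caller-supplied values take precedence over the defaults.
    return {**DOCUMENTED_DEFAULTS, **extractor_config}
```

Printing the merged dictionary before a call is a quick way to see which settings the extractor is expected to use when a key is omitted.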
nv_ingest_api.internal.extract.pdf.engines.pdfium module
nv_ingest_api.internal.extract.pdf.engines.tika module
- nv_ingest_api.internal.extract.pdf.engines.tika.tika_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_charts: bool,
- extract_tables: bool,
- extractor_config: Dict[str, Any],
- execution_trace_log: List[Any] | None = None,
Extract text from a PDF using the Apache Tika server.
This function sends a PDF stream to the Apache Tika server and returns the extracted text. The image, infographic, chart, and table flags are accepted for consistency with the extractor interface; this implementation currently supports text extraction only.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Flag indicating whether text extraction is desired.
extract_images (bool) – Flag indicating whether image extraction is desired.
extract_infographics (bool) – Flag indicating whether infographic extraction is desired.
extract_charts (bool) – Flag indicating whether chart extraction is desired.
extract_tables (bool) – Flag indicating whether table extraction is desired.
extractor_config (dict) – A dictionary of additional configuration options for the extractor. This parameter is currently not used by this extractor.
- Returns:
The extracted text from the PDF as returned by the Apache Tika server.
- Return type:
str
- Raises:
requests.RequestException – If the request to the Tika server fails.
Examples
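>>> from io import BytesIO
>>> from nv_ingest_api.internal.extract.pdf.engines.tika import tika_extractor
>>> with open("document.pdf", "rb") as f:
...     pdf_stream = BytesIO(f.read())
>>> # Flags in signature order: text, images, infographics, charts, tables.
>>> text = tika_extractor(pdf_stream, True, False, False, False, False, {})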
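For readers unfamiliar with Tika, the request that a text extraction involves can be sketched with the standard library. This is an illustration, not the implementation of tika_extractor: the documented exception type suggests the function uses requests, whereas this sketch swaps in urllib. The helper name build_tika_request and the URL are assumptions; port 9998 and the /tika endpoint are the Tika server's usual defaults, and a deployment may use a different address.

```python
import urllib.request
from io import BytesIO


def build_tika_request(
    pdf_stream: BytesIO,
    tika_url: str = "http://localhost:9998/tika",  # assumed address
) -> urllib.request.Request:
    """Hypothetical helper: prepare a PUT request carrying the PDF bytes."""
    return urllib.request.Request(
        tika_url,
        data=pdf_stream.getvalue(),
        method="PUT",
        # Ask the server for plain text instead of its default output.
        headers={"Accept": "text/plain"},
    )
```

Passing the prepared request to urllib.request.urlopen sends it and requires a running Tika server; the response body is the extracted text.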
nv_ingest_api.internal.extract.pdf.engines.unstructured_io module
- nv_ingest_api.internal.extract.pdf.engines.unstructured_io.unstructured_io_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_charts: bool,
- extract_tables: bool,
- extractor_config: Dict[str, Any],
- execution_trace_log: List[Any] | None = None,
Helper function to use unstructured-io REST API to extract text from a bytestream PDF.
This function sends the provided PDF stream to the unstructured-io API and returns the extracted text. Additional parameters for the extraction are provided via the extractor_config dictionary. Note that although flags for image, table, and infographics extraction are provided, the underlying API may not support all of these features.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) –
- A dictionary containing additional extraction parameters:
unstructured_api_key : API key for unstructured.io.
unstructured_url : URL for the unstructured.io API endpoint.
unstructured_strategy : Strategy for extraction (default: “auto”).
unstructured_concurrency_level : Concurrency level for PDF splitting.
row_data : Row data containing source information.
text_depth : Depth of text extraction (e.g., “page”).
identify_nearby_objects : Flag for identifying nearby objects.
metadata_column : Column name for metadata extraction.
- Returns:
A string containing the extracted text.
- Return type:
str
- Raises:
ValueError – If an invalid text_depth value is provided.
SDKError – If there is an error during the extraction process.
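The extractors in these submodules do not share one positional order for their flags: adobe_extractor and llama_parse_extractor take no extract_charts argument, nemoretriever_parse_extractor places extract_tables before extract_charts, and tika_extractor and unstructured_io_extractor place extract_charts first. Passing flags by keyword avoids order mistakes. The sketch below shows one way to do that; call_with_supported_flags is a hypothetical helper, and fake_adobe and fake_tika are stand-ins that only mirror the documented parameter names.

```python
import inspect
from typing import Any, Callable, Dict


def call_with_supported_flags(
    extractor: Callable[..., Any],
    pdf_stream: Any,
    flags: Dict[str, bool],
    extractor_config: dict,
) -> Any:
    """Pass, by keyword, only the extract_* flags the extractor declares."""
    accepted = inspect.signature(extractor).parameters
    kwargs = {name: value for name, value in flags.items() if name in accepted}
    return extractor(pdf_stream=pdf_stream, extractor_config=extractor_config, **kwargs)


# Stand-ins mirroring two documented signatures; not the real extractors.
def fake_adobe(pdf_stream, extract_text, extract_images, extract_infographics,
               extract_tables, extractor_config, execution_trace_log=None):
    return ("adobe", extract_tables)


def fake_tika(pdf_stream, extract_text, extract_images, extract_infographics,
              extract_charts, extract_tables, extractor_config,
              execution_trace_log=None):
    return ("tika", extract_charts, extract_tables)
```

This relies on inspect.signature reporting the real parameter names, which may not hold if an extractor is wrapped by a decorator that hides its signature.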
Module contents
- nv_ingest_api.internal.extract.pdf.engines.adobe_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extractor_config: dict,
- execution_trace_log: List[Any] | None = None,
Helper function to use the Adobe SDK to extract text from a bytestream PDF.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) – A dictionary containing additional extraction parameters such as API credentials, row_data, text_depth, and other optional settings.
execution_trace_log (optional) – Trace information for debugging purposes.
- Returns:
A string of extracted text.
- Return type:
str
- Raises:
RuntimeError – If the Adobe SDK is not installed.
ValueError – If required configuration parameters are missing or invalid.
SDKError – If there is an error during extraction.
- nv_ingest_api.internal.extract.pdf.engines.llama_parse_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extractor_config: dict,
- execution_trace_log: List[Any] | None = None,
Helper function to use LlamaParse API to extract text from a bytestream PDF.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extractor_config (dict) –
- A dictionary containing additional extraction parameters including:
api_key: API key for LlamaParse.
result_type: Type of result to extract (default provided).
file_name: Name of the file (default provided).
check_interval: Interval for checking status (default provided).
max_timeout: Maximum timeout in seconds (default provided).
row_data: Row data for additional metadata.
metadata_column: Column name to extract metadata (default “metadata”).
execution_trace_log (optional) – Trace information for debugging purposes.
- Returns:
A list of extracted data. Each item is a dictionary where the key is a ContentTypeEnum and the value is a dictionary containing content and metadata.
- Return type:
List[Dict[ContentTypeEnum, Dict[str, Any]]]
- Raises:
ValueError – If extractor_config is not a dict or required parameters are missing.
- nv_ingest_api.internal.extract.pdf.engines.nemoretriever_parse_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extract_charts: bool,
- extractor_config: dict,
- execution_trace_log: List[Any] | None = None,
Helper function to use nemoretriever_parse to extract text from a bytestream PDF.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream PDF.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_tables (bool) – Specifies whether to extract tables.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
execution_trace_log (Optional[List], optional) – Trace information for debugging purposes (default is None).
extractor_config (dict) –
- A dictionary containing additional extraction parameters. Expected keys include:
row_data : dict
text_depth : str, optional (default is “page”)
extract_tables_method : str, optional (default is “yolox”)
identify_nearby_objects : bool, optional (default is True)
paddle_output_format : str, optional (default is “pseudo_markdown”)
pdfium_config : dict, optional (configuration for PDFium)
nemoretriever_parse_config : dict, optional (configuration for NemoRetrieverParse)
metadata_column : str, optional (default is “metadata”)
- Returns:
A string of extracted text.
- Return type:
str
- Raises:
ValueError – If required keys are missing in extractor_config or invalid values are provided.
KeyError – If required keys are missing in row_data.
- nv_ingest_api.internal.extract.pdf.engines.pdfium_extractor(
- pdf_stream,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_tables: bool,
- extract_charts: bool,
- extractor_config: dict,
- execution_trace_log: List[Any] | None = None,
- nv_ingest_api.internal.extract.pdf.engines.tika_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_charts: bool,
- extract_tables: bool,
- extractor_config: Dict[str, Any],
- execution_trace_log: List[Any] | None = None,
Extract text from a PDF using the Apache Tika server.
This function sends a PDF stream to the Apache Tika server and returns the extracted text. The image, infographic, chart, and table flags are accepted for consistency with the extractor interface; this implementation currently supports text extraction only.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Flag indicating whether text extraction is desired.
extract_images (bool) – Flag indicating whether image extraction is desired.
extract_infographics (bool) – Flag indicating whether infographic extraction is desired.
extract_charts (bool) – Flag indicating whether chart extraction is desired.
extract_tables (bool) – Flag indicating whether table extraction is desired.
extractor_config (dict) – A dictionary of additional configuration options for the extractor. This parameter is currently not used by this extractor.
- Returns:
The extracted text from the PDF as returned by the Apache Tika server.
- Return type:
str
- Raises:
requests.RequestException – If the request to the Tika server fails.
Examples
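>>> from io import BytesIO
>>> from nv_ingest_api.internal.extract.pdf.engines import tika_extractor
>>> with open("document.pdf", "rb") as f:
...     pdf_stream = BytesIO(f.read())
>>> # Flags in signature order: text, images, infographics, charts, tables.
>>> text = tika_extractor(pdf_stream, True, False, False, False, False, {})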
- nv_ingest_api.internal.extract.pdf.engines.unstructured_io_extractor(
- pdf_stream: BytesIO,
- extract_text: bool,
- extract_images: bool,
- extract_infographics: bool,
- extract_charts: bool,
- extract_tables: bool,
- extractor_config: Dict[str, Any],
- execution_trace_log: List[Any] | None = None,
Helper function to use unstructured-io REST API to extract text from a bytestream PDF.
This function sends the provided PDF stream to the unstructured-io API and returns the extracted text. Additional parameters for the extraction are provided via the extractor_config dictionary. Note that although flags for image, table, and infographics extraction are provided, the underlying API may not support all of these features.
- Parameters:
pdf_stream (io.BytesIO) – A bytestream representing the PDF to be processed.
extract_text (bool) – Specifies whether to extract text.
extract_images (bool) – Specifies whether to extract images.
extract_infographics (bool) – Specifies whether to extract infographics.
extract_charts (bool) – Specifies whether to extract charts.
extract_tables (bool) – Specifies whether to extract tables.
extractor_config (dict) –
- A dictionary containing additional extraction parameters:
unstructured_api_key : API key for unstructured.io.
unstructured_url : URL for the unstructured.io API endpoint.
unstructured_strategy : Strategy for extraction (default: “auto”).
unstructured_concurrency_level : Concurrency level for PDF splitting.
row_data : Row data containing source information.
text_depth : Depth of text extraction (e.g., “page”).
identify_nearby_objects : Flag for identifying nearby objects.
metadata_column : Column name for metadata extraction.
- Returns:
A string containing the extracted text.
- Return type:
str
- Raises:
ValueError – If an invalid text_depth value is provided.
SDKError – If there is an error during the extraction process.