nv_ingest.extraction_workflows.pdf package

Submodules

nv_ingest.extraction_workflows.pdf.adobe_helper module

nv_ingest.extraction_workflows.pdf.adobe_helper.adobe(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
**kwargs,
)

Helper function to use the Adobe PDF Services API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether or not to extract text.

  • extract_images (bool) – Specifies whether or not to extract images.

  • extract_tables (bool) – Specifies whether or not to extract tables.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A string of extracted text.

Return type:

str

Raises:

SDKError – If there is an error with the extraction.
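The helpers on this page share a calling convention: an in-memory PDF stream followed by boolean extraction flags and optional keyword arguments. A minimal sketch of preparing those arguments is shown here. `stub_extractor` is a hypothetical stand-in, not part of nv_ingest, and the `source_name` keyword is an assumption used only for illustration; the real helpers may require credentials and other keyword arguments that this page does not list.

```python
import io


def stub_extractor(pdf_stream, extract_text, extract_images, extract_tables, **kwargs):
    """Hypothetical stand-in with the same positional signature as the helpers."""
    data = pdf_stream.read()  # the whole PDF arrives as an in-memory byte stream
    return {
        "num_bytes": len(data),
        "is_pdf": data.startswith(b"%PDF-"),
        "flags": (extract_text, extract_images, extract_tables),
        "extra": dict(kwargs),
    }


# Placeholder bytes; in practice these would come from open(path, "rb").read().
pdf_bytes = b"%PDF-1.7\n% placeholder content, not a valid document\n"
pdf_stream = io.BytesIO(pdf_bytes)

# Request text only; images and tables are switched off.
summary = stub_extractor(pdf_stream, True, False, False, source_name="example.pdf")
print(summary)
```

Because the stream is consumed by `read()`, a caller that wants to pass the same PDF to a second helper would likely need to call `pdf_stream.seek(0)` or build a fresh `io.BytesIO` first.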

nv_ingest.extraction_workflows.pdf.llama_parse_helper module

async nv_ingest.extraction_workflows.pdf.llama_parse_helper.async_llama_parse(
pdf_stream: BytesIO,
api_key: str,
file_name: str = '_.pdf',
result_type: str = 'text',
check_interval_seconds: int = 1,
max_timeout_seconds: int = 2000,
) -> str

Uses the LlamaParse API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • api_key (str) – API key from https://cloud.llamaindex.ai.

  • file_name (str) – Name of the PDF file.

  • result_type (str) – The result type for the parser. One of text or markdown.

  • check_interval_seconds (int) – The interval in seconds to check if the parsing is done.

  • max_timeout_seconds (int) – The maximum timeout in seconds to wait for the parsing to finish.

Returns:

A string of extracted text.

Return type:

str
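The `check_interval_seconds` and `max_timeout_seconds` parameters describe a poll-until-done loop. The sketch below illustrates that pattern with plain `asyncio`; it is not the library's implementation. `poll_until_done` and `make_fake_job` are hypothetical names, and raising `TimeoutError` on expiry is an assumption, since this page does not say what the real function does when the timeout is reached.

```python
import asyncio
import time


async def poll_until_done(check, check_interval_seconds=1, max_timeout_seconds=2000):
    """Call `check` repeatedly until it returns a non-None result or time runs out."""
    deadline = time.monotonic() + max_timeout_seconds
    while True:
        result = await check()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            # Assumed behaviour for this sketch only.
            raise TimeoutError("parsing did not finish in time")
        await asyncio.sleep(check_interval_seconds)


def make_fake_job(ready_after_checks):
    """Hypothetical job whose result becomes available after a few status checks."""
    state = {"checks": 0}

    async def check():
        state["checks"] += 1
        if state["checks"] >= ready_after_checks:
            return "extracted text"
        return None

    return check, state


check, state = make_fake_job(ready_after_checks=3)
text = asyncio.run(
    poll_until_done(check, check_interval_seconds=0.01, max_timeout_seconds=1)
)
print(text, state["checks"])
```

A shorter interval returns results sooner at the cost of more status requests; the timeout bounds how long a stalled job can block the caller.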

nv_ingest.extraction_workflows.pdf.llama_parse_helper.llama_parse(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
**kwargs,
) -> List[Dict[ContentTypeEnum, Dict[str, Any]]]

Helper function to use the LlamaParse API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A list of extracted data. Each item in the list is a [document type, dictionary] pair, where the dictionary contains the content and metadata of the extracted PDF.

Return type:

List[List[ExtractedDocumentType, Dict[str, Any]]]
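To make the documented return shape concrete, here is a small sketch of consuming a list of [document type, dictionary] pairs. The `DocType` enum and the `"content"` and `"metadata"` keys are assumptions chosen for illustration; the actual enum members and dictionary schema are not given on this page.

```python
from enum import Enum


class DocType(Enum):
    """Hypothetical stand-in for the document type enum."""

    TEXT = "text"
    IMAGE = "image"


# Hypothetical result in the documented shape: [document type, dictionary] pairs.
extracted = [
    [DocType.TEXT, {"content": "Page one text", "metadata": {"page": 0}}],
    [DocType.IMAGE, {"content": "<base64 image>", "metadata": {"page": 0}}],
    [DocType.TEXT, {"content": "Page two text", "metadata": {"page": 1}}],
]

# Keep only the text entries and join their content in order.
text_parts = [
    entry["content"] for doc_type, entry in extracted if doc_type is DocType.TEXT
]
full_text = "\n".join(text_parts)
print(full_text)
```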

nv_ingest.extraction_workflows.pdf.nemoretriever_parse_helper module

nv_ingest.extraction_workflows.pdf.nemoretriever_parse_helper.nemoretriever_parse(
pdf_stream,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
extract_charts: bool,
trace_info: List | None = None,
**kwargs,
)

Helper function to use nemoretriever_parse to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extract_charts (bool) – Specifies whether to extract charts.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A string of extracted text.

Return type:

str

nv_ingest.extraction_workflows.pdf.pdfium_helper module

nv_ingest.extraction_workflows.pdf.pdfium_helper.extract_page_element_images(
annotation_dict,
original_image,
page_idx,
page_elements,
padding_offset=(0, 0),
)

Handle the extraction of page elements from the inference results and run additional model inference.

Parameters:
  • annotation_dict (dict) – A dictionary containing detected objects and their bounding boxes.

  • original_image (np.ndarray) – The original image from which objects were detected.

  • page_idx (int) – The index of the current page being processed.

  • page_elements (List[Tuple[int, ImageTable]]) – A list to which extracted page elements will be appended.

Notes

This function iterates over detected objects, crops the original image to the bounding boxes, and runs additional inference on the cropped images to extract detailed information about page elements.
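The cropping step described in this note can be sketched as follows. This is an illustration under stated assumptions, not the function's actual code: boxes are assumed to be (x_min, y_min, x_max, y_max) in pixel coordinates, `padding_offset` is assumed to be an (x, y) shift to subtract, and the image is a plain list of pixel rows so the sketch needs no third-party packages. `crop_boxes` is a hypothetical helper, and the follow-up model inference is omitted.

```python
def crop_boxes(annotation_dict, image, padding_offset=(0, 0)):
    """Crop each detected box out of `image` (a list of pixel rows).

    Assumes (x_min, y_min, x_max, y_max) pixel boxes and an (x, y) padding
    offset that is subtracted before cropping.
    """
    pad_x, pad_y = padding_offset
    height, width = len(image), len(image[0])
    crops = []
    for label, boxes in annotation_dict.items():
        for x_min, y_min, x_max, y_max in boxes:
            # Shift into unpadded coordinates and clamp to the image bounds.
            x0, x1 = max(x_min - pad_x, 0), min(x_max - pad_x, width)
            y0, y1 = max(y_min - pad_y, 0), min(y_max - pad_y, height)
            crop = [row[x0:x1] for row in image[y0:y1]]
            crops.append((label, crop))
    return crops


# A 4x6 "image" whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(4)]
annotation_dict = {"table": [(1, 1, 4, 3)], "chart": []}

page_elements = []
for label, crop in crop_boxes(annotation_dict, image):
    page_elements.append((0, label, crop))  # 0 is the page index
print(page_elements)
```

With a NumPy array the inner crop would be the slice `image[y0:y1, x0:x1]`.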

Examples

>>> annotation_dict = {"table": [], "chart": []}
>>> original_image = np.random.rand(1536, 1536, 3)
>>> page_elements = []
>>> extract_page_element_images(annotation_dict, original_image, 0, page_elements)

nv_ingest.extraction_workflows.pdf.pdfium_helper.extract_page_elements_using_image_ensemble(
pages: List[Tuple[int, ndarray, Tuple[int, int]]],
config: PDFiumConfigSchema,
trace_info: List | None = None,
) -> List[Tuple[int, object]]

Given a list of (page_index, image, padding_offset) tuples, this function calls the YOLOX-based inference service to extract page element annotations from all pages.

Returns:

For each page, returns (page_index, joined_content) where joined_content is the result of combining annotations from the inference.

Return type:

List[Tuple[int, object]]
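The input and output shapes can be illustrated with a short sketch that pairs each page index with the annotation returned for that page. `stub_inference` and `annotate_pages` are hypothetical; the real function talks to an inference service, and modelling it as a single batched call is an assumption made for this example.

```python
def stub_inference(images):
    """Hypothetical stand-in for the inference service: one dict per image."""
    return [{"table": [], "chart": [], "num_rows": len(img)} for img in images]


def annotate_pages(pages, infer=stub_inference):
    """Pair each page index with the annotation produced for that page's image."""
    page_indices = [page_idx for page_idx, _image, _padding in pages]
    images = [image for _page_idx, image, _padding in pages]
    annotations = infer(images)  # modelled here as one batched call
    return list(zip(page_indices, annotations))


# Two tiny placeholder "images" (lists of rows), each with a (0, 0) offset.
pages = [(0, [[0, 0], [0, 0]], (0, 0)), (3, [[0, 0, 0]], (0, 0))]
results = annotate_pages(pages)
print(results)
```

Carrying the page index alongside each result keeps the output usable even when the input pages are not contiguous.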

nv_ingest.extraction_workflows.pdf.pdfium_helper.pdfium_extractor(
pdf_stream,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
extract_charts: bool,
trace_info=None,
**kwargs,
)

nv_ingest.extraction_workflows.pdf.tika_helper module

nv_ingest.extraction_workflows.pdf.tika_helper.tika(
pdf_stream,
extract_text,
extract_images,
extract_tables,
**kwargs,
)

nv_ingest.extraction_workflows.pdf.unstructured_io_helper module

nv_ingest.extraction_workflows.pdf.unstructured_io_helper.unstructured_io(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
**kwargs,
)

Helper function to use the unstructured-io REST API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether or not to extract text.

  • extract_images (bool) – Specifies whether or not to extract images.

  • extract_tables (bool) – Specifies whether or not to extract tables.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A string of extracted text.

Return type:

str

Raises:

SDKError – If there is an error with the extraction.

Module contents

nv_ingest.extraction_workflows.pdf.adobe(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
**kwargs,
)

Helper function to use the Adobe PDF Services API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether or not to extract text.

  • extract_images (bool) – Specifies whether or not to extract images.

  • extract_tables (bool) – Specifies whether or not to extract tables.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A string of extracted text.

Return type:

str

Raises:

SDKError – If there is an error with the extraction.

nv_ingest.extraction_workflows.pdf.llama_parse(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
**kwargs,
) -> List[Dict[ContentTypeEnum, Dict[str, Any]]]

Helper function to use the LlamaParse API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A list of extracted data. Each item in the list is a [document type, dictionary] pair, where the dictionary contains the content and metadata of the extracted PDF.

Return type:

List[List[ExtractedDocumentType, Dict[str, Any]]]

nv_ingest.extraction_workflows.pdf.nemoretriever_parse(
pdf_stream,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
extract_charts: bool,
trace_info: List | None = None,
**kwargs,
)

Helper function to use nemoretriever_parse to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether to extract text.

  • extract_images (bool) – Specifies whether to extract images.

  • extract_tables (bool) – Specifies whether to extract tables.

  • extract_charts (bool) – Specifies whether to extract charts.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A string of extracted text.

Return type:

str

nv_ingest.extraction_workflows.pdf.pdfium(
pdf_stream,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
extract_charts: bool,
trace_info=None,
**kwargs,
)

nv_ingest.extraction_workflows.pdf.tika(
pdf_stream,
extract_text,
extract_images,
extract_tables,
**kwargs,
)

nv_ingest.extraction_workflows.pdf.unstructured_io(
pdf_stream: BytesIO,
extract_text: bool,
extract_images: bool,
extract_tables: bool,
**kwargs,
)

Helper function to use the unstructured-io REST API to extract text from a bytestream PDF.

Parameters:
  • pdf_stream (io.BytesIO) – A bytestream PDF.

  • extract_text (bool) – Specifies whether or not to extract text.

  • extract_images (bool) – Specifies whether or not to extract images.

  • extract_tables (bool) – Specifies whether or not to extract tables.

  • **kwargs – The keyword arguments are used for additional extraction parameters.

Returns:

A string of extracted text.

Return type:

str

Raises:

SDKError – If there is an error with the extraction.