nv_ingest_api.util.pdf package#

Submodules#

nv_ingest_api.util.pdf.pdfium module#

nv_ingest_api.util.pdf.pdfium.convert_bitmap_to_corrected_numpy( bitmap: PdfBitmap, ) → ndarray[source]#

Converts a PdfBitmap to a correctly formatted NumPy array, handling any necessary channel swapping based on the bitmap’s mode.

Parameters: bitmap (pdfium.PdfBitmap) – The bitmap object rendered from a PDF page.
Returns: A NumPy array representing the correctly formatted image data.
Return type: np.ndarray
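The channel swapping described above can be illustrated with a minimal sketch. The helper name `swap_channels_for_mode` is hypothetical, and the mode strings ("BGR", "BGRA", "BGRX", "L") are assumed to match those a PdfBitmap reports; this is not the library's actual implementation.

```python
import numpy as np


def swap_channels_for_mode(arr: np.ndarray, mode: str) -> np.ndarray:
    """Hypothetical helper: reorder channels to RGB(A) based on a mode string."""
    if mode in ("BGRA", "BGRX"):
        # Swap the blue and red channels; leave the fourth channel in place.
        return arr[..., [2, 1, 0, 3]]
    if mode == "BGR":
        # Reverse the channel axis: B, G, R becomes R, G, B.
        return arr[..., ::-1]
    # Grayscale ("L") or already-RGB data is returned unchanged.
    return arr


# One pixel stored as (B, G, R) and one stored as (B, G, R, A).
bgr_pixel = np.array([[[10, 20, 30]]], dtype=np.uint8)
bgra_pixel = np.array([[[1, 2, 3, 4]]], dtype=np.uint8)

rgb_pixel = swap_channels_for_mode(bgr_pixel, "BGR")
rgba_pixel = swap_channels_for_mode(bgra_pixel, "BGRA")
```

Indexing with a channel list returns a copy, while the reversed slice returns a view; call `np.ascontiguousarray` on the result if downstream code needs contiguous memory.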

nv_ingest_api.util.pdf.pdfium.convert_pdfium_position(pos, page_width, page_height)[source]#

Convert a PDFium bounding box (which typically has its origin at the bottom-left) to a more standard bounding-box format with y=0 at the top.

Note

This method assumes the PDF coordinate system follows the common convention where the origin is at the bottom-left. However, per the PDF specification, the coordinate system can theoretically be defined between any opposite corners, and its origin may not necessarily be (0,0). This implementation may not handle all edge cases where the coordinate system is arbitrarily transformed.

Further processing may be necessary downstream, particularly in filtering or deduplication stages, to account for variations in coordinate transformations and ensure consistent bounding-box comparisons.

See pypdfium2-team/pypdfium2#284.
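As a rough illustration of the common bottom-left case described in the note, the conversion amounts to flipping the y-axis against the page height. The helper `flip_bbox_to_top_left` and its (left, bottom, right, top) input layout are assumptions for this sketch; the actual input and return format of convert_pdfium_position is not shown in this reference.

```python
def flip_bbox_to_top_left(bbox, page_height):
    """Hypothetical helper: convert a bottom-left-origin box to top-left origin.

    bbox is assumed to be (left, bottom, right, top) in PDF units, with
    bottom < top. The result is (x0, y0, x1, y1) with y growing downward.
    """
    left, bottom, right, top = bbox
    # The PDF "top" edge is closest to the top of the page, so it becomes
    # the smaller y value once the axis is flipped.
    return (left, page_height - top, right, page_height - bottom)


# A 100 x 50 box near the top of a US Letter page (612 x 792 points).
flipped = flip_bbox_to_top_left((72, 692, 172, 742), page_height=792)
# flipped == (72, 50, 172, 100)
```

This sketch does not handle pages whose coordinate system has a non-zero origin or has been otherwise transformed, which is the limitation the note calls out.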

nv_ingest_api.util.pdf.pdfium.extract_forms_from_pdfium_page(page, **kwargs)[source]#

Extract bounding boxes for PDF form objects from a PDFium page, removing any bounding boxes that strictly enclose other boxes (i.e., are strict supersets).
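The strict-superset filtering can be sketched as follows. The helper names and the (x0, y0, x1, y1) box layout are assumptions made for illustration; the library's own data structures and comparison rules may differ.

```python
def strictly_encloses(outer, inner):
    """Return True if `outer` contains `inner` and the two boxes differ."""
    contains = (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )
    return contains and tuple(outer) != tuple(inner)


def drop_enclosing_boxes(boxes):
    """Hypothetical helper: remove every box that strictly encloses another box."""
    return [
        box
        for i, box in enumerate(boxes)
        if not any(
            strictly_encloses(box, other)
            for j, other in enumerate(boxes)
            if j != i
        )
    ]


boxes = [
    (0, 0, 100, 100),      # encloses the next box, so it is dropped
    (10, 10, 20, 20),
    (200, 200, 250, 250),  # unrelated box, kept
]
kept = drop_enclosing_boxes(boxes)
```

The pairwise comparison is quadratic in the number of boxes, which is usually acceptable for the handful of form objects on a single page.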

nv_ingest_api.util.pdf.pdfium.extract_image_like_objects_from_pdfium_page(page, merge=True, **kwargs)[source]#

nv_ingest_api.util.pdf.pdfium.extract_merged_images_from_pdfium_page(page, merge=True, **kwargs)[source]#

Extract bounding boxes of image objects from a PDFium page, with optional merging of bounding boxes that likely belong to the same compound image.

nv_ingest_api.util.pdf.pdfium.extract_merged_shapes_from_pdfium_page(page, merge=True, **kwargs)[source]#

Extract bounding boxes of path objects (shapes) from a PDFium page, and optionally merge those bounding boxes if they appear to be part of the same shape group. Also filters out shapes that occupy more than half the page area.
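The area-based filter for shapes can be sketched as below. The helper `drop_oversized_boxes`, the `max_fraction` parameter, and the (x0, y0, x1, y1) box layout are assumptions for illustration only; the merging step is not covered here.

```python
def drop_oversized_boxes(boxes, page_width, page_height, max_fraction=0.5):
    """Hypothetical helper: discard boxes covering more than `max_fraction` of the page."""
    page_area = page_width * page_height
    kept = []
    for x0, y0, x1, y1 in boxes:
        box_area = (x1 - x0) * (y1 - y0)
        # Keep boxes at or below the threshold; only strictly larger ones go.
        if box_area <= max_fraction * page_area:
            kept.append((x0, y0, x1, y1))
    return kept


shapes = [
    (0, 0, 100, 150),  # 15000 of 20000 units: more than half, dropped
    (0, 0, 100, 100),  # exactly half the page, kept
    (0, 0, 50, 50),    # small shape, kept
]
filtered = drop_oversized_boxes(shapes, page_width=100, page_height=200)
```

A filter of this kind is typically meant to exclude page-sized backgrounds and borders, which would otherwise swallow every smaller shape during merging.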

nv_ingest_api.util.pdf.pdfium.extract_nested_simple_images_from_pdfium_page(page)[source]#

nv_ingest_api.util.pdf.pdfium.extract_simple_images_from_pdfium_page(page, max_depth)[source]#

nv_ingest_api.util.pdf.pdfium.extract_top_level_simple_images_from_pdfium_page(page)[source]#

nv_ingest_api.util.pdf.pdfium.pdfium_pages_to_numpy( pages: List[PdfPage], render_dpi: int = 300, scale_tuple: Tuple[int, int] | None = None, padding_tuple: Tuple[int, int] | None = None, rotation: int = 0, ) → tuple[list[ndarray | ndarray[Any, dtype[Any]]], list[tuple[int, int]]][source]#

Converts a list of PdfPage objects to a list of NumPy arrays, where each array represents an image of the corresponding PDF page.

The function renders each page as a bitmap, converts it to a PIL image, applies any specified scaling using the thumbnail approach, and adds padding if requested. The DPI for rendering can be specified, with a default value of 300 DPI.

Parameters:

pages (List[pdfium.PdfPage]) – A list of PdfPage objects to be rendered and converted into NumPy arrays.
render_dpi (int, optional) – The DPI (dots per inch) at which to render the pages. Must be between 50 and 1200. Defaults to 300.
scale_tuple (Optional[Tuple[int, int]], optional) – A tuple (width, height) to resize the rendered image to using the thumbnail approach. Defaults to None.
padding_tuple (Optional[Tuple[int, int]], optional) – A tuple (width, height) to pad the image to. Defaults to None.
rotation (int, optional) – Defaults to 0.

Returns:

A tuple containing:

A list of NumPy arrays, where each array corresponds to an image of a PDF page. Each array is an independent copy of the rendered image data.
A list of padding offsets applied to each image, as tuples of (offset_width, offset_height).

Return type:

tuple

Raises:

ValueError – If the render_dpi is outside the allowed range (50-1200).
PdfiumError – If there is an issue rendering the page or converting it to a NumPy array.
IOError – If there is an error saving the image to disk.
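The arithmetic behind rendering, thumbnail scaling, and padding offsets can be sketched in plain Python. All three helpers are hypothetical. The DPI conversion relies on PDF units being 1/72 inch; the exact rounding and the centered placement of the padded image are assumptions, not confirmed behavior of pdfium_pages_to_numpy.

```python
def dpi_to_scale(render_dpi):
    """Convert a DPI value to a render scale factor (1.0 corresponds to 72 DPI)."""
    if not 50 <= render_dpi <= 1200:
        raise ValueError("render_dpi must be between 50 and 1200")
    return render_dpi / 72


def thumbnail_size(size, max_size):
    """Shrink `size` to fit inside `max_size`, preserving aspect ratio.

    Like a thumbnail operation, this never enlarges the image.
    """
    width, height = size
    ratio = min(max_size[0] / width, max_size[1] / height, 1.0)
    return (max(1, round(width * ratio)), max(1, round(height * ratio)))


def centered_padding_offsets(size, target):
    """Offsets (offset_width, offset_height) that center `size` within `target`."""
    return (max(0, (target[0] - size[0]) // 2), max(0, (target[1] - size[1]) // 2))


# A US Letter page (612 x 792 points) rendered at the default 300 DPI.
scale = dpi_to_scale(300)
rendered = (round(612 * scale), round(792 * scale))      # (2550, 3300)
scaled = thumbnail_size(rendered, (1024, 1024))          # (791, 1024)
offsets = centered_padding_offsets(scaled, (1024, 1024)) # (116, 0)
```

The offsets matter downstream: any bounding box detected on the padded image has to be shifted back by these values before it can be mapped onto the original page.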

nv_ingest_api.util.pdf.pdfium.pdfium_try_get_bitmap_as_numpy(image_obj) → ndarray[source]#

Attempts to retrieve the bitmap from a PdfImage object and convert it to a NumPy array, first with rendering enabled and then without rendering if the first attempt fails.

Parameters: image_obj (PdfImage) – The PdfImage object from which to extract the bitmap.
Returns: The extracted bitmap as a NumPy array.
Return type: np.ndarray
Raises: PdfiumError – If an exception occurs during bitmap retrieval and both attempts fail.

Notes

This function first tries to retrieve the bitmap with rendering enabled (render=True). If that fails or the bitmap returned is None, it attempts to retrieve the raw bitmap without rendering (render=False). Any errors encountered during these attempts are logged at the debug level.
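The two-step fallback in the notes can be sketched as follows. The function `get_bitmap_with_fallback` and the `FakeImage` stand-in are hypothetical; the sketch only assumes an object exposing `get_bitmap(render=...)`, and it omits the conversion to a NumPy array.

```python
import logging

logger = logging.getLogger(__name__)


def get_bitmap_with_fallback(image_obj):
    """Hypothetical helper: try a rendered bitmap first, then the raw bitmap."""
    try:
        bitmap = image_obj.get_bitmap(render=True)
        if bitmap is not None:
            return bitmap
    except Exception as exc:  # broad on purpose in this sketch
        logger.debug("Rendered bitmap failed, trying raw bitmap: %s", exc)
    # Second attempt; an exception raised here propagates to the caller.
    return image_obj.get_bitmap(render=False)


class FakeImage:
    """Stand-in for an image object, used only to exercise the sketch."""

    def __init__(self, fail_render):
        self.fail_render = fail_render
        self.calls = []

    def get_bitmap(self, render=False):
        self.calls.append(render)
        if render and self.fail_render:
            raise RuntimeError("render failed")
        return "rendered" if render else "raw"


broken = FakeImage(fail_render=True)
result = get_bitmap_with_fallback(broken)
```

With `FakeImage(fail_render=True)`, the first call raises, the error is logged at debug level, and the raw bitmap is returned from the second call.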

Module contents#