nv_ingest_api.util.image_processing package#

Submodules#

nv_ingest_api.util.image_processing.clustering module#

nv_ingest_api.util.image_processing.clustering.boxes_are_close_or_overlap(
b1: List[int],
b2: List[int],
threshold: float = 10.0,
) bool[source]#

Determine if two bounding boxes either overlap or are within a certain distance threshold.

The function expands each bounding box by threshold in all directions and checks if the expanded regions overlap on both the x-axis and y-axis.
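The expand-and-compare check described above can be sketched as follows. This is a minimal illustration, not the library's source: the name `expanded_boxes_overlap` is hypothetical, boxes are assumed to be `(xmin, ymin, xmax, ymax)`, and treating touching edges as overlapping (inclusive comparisons) is an assumption.

```python
def expanded_boxes_overlap(b1, b2, threshold=10.0):
    """Hypothetical sketch: expand both boxes by `threshold`, then test overlap."""
    xmin1, ymin1, xmax1, ymax1 = b1
    xmin2, ymin2, xmax2, ymax2 = b2

    # Expanded intervals must intersect on the x-axis...
    overlap_x = (xmax1 + threshold) >= (xmin2 - threshold) and (
        xmin1 - threshold
    ) <= (xmax2 + threshold)
    # ...and on the y-axis (inclusive comparison is an assumption).
    overlap_y = (ymax1 + threshold) >= (ymin2 - threshold) and (
        ymin1 - threshold
    ) <= (ymax2 + threshold)

    return overlap_x and overlap_y
```

Because both boxes are expanded in this sketch, two boxes separated by a gap of up to twice the threshold would still count as close.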

Parameters:
  • b1 (tuple) – The first bounding box.

  • b2 (tuple) – The second bounding box.

  • threshold (float, optional) – The distance (in pixels or points) by which to expand each bounding box before checking for overlap. Defaults to 10.0.

Returns:

True if the two bounding boxes overlap or are within the specified threshold distance of each other, False otherwise.

Return type:

bool

Example

>>> box1 = (100, 100, 150, 150)
>>> box2 = (160, 110, 200, 140)
>>> boxes_are_close_or_overlap(box1, box2, threshold=10)
True  # Because box2 is within 10 pixels of box1 along the x-axis

nv_ingest_api.util.image_processing.clustering.combine_groups_into_bboxes(
boxes: List[List[int]],
groups: List[List[int]],
min_num_components: int = 1,
) List[List[int]][source]#

Merge bounding boxes based on grouped indices.

Given:
  • A list of bounding boxes (boxes), each in the form (xmin, ymin, xmax, ymax).

  • A list of groups (groups), where each group is a list of indices referring to bounding boxes in boxes.

For each group, this function:
  1. Collects all bounding boxes in that group.

  2. Computes a single bounding box that tightly encompasses all of those bounding boxes by taking the minimum of all xmins and ymins, and the maximum of all xmaxs and ymaxs.

  3. If the group has fewer than min_num_components bounding boxes, it is skipped.
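The three steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's implementation: the name `merge_groups` is hypothetical and the input is assumed to be well-formed (every index in `groups` is valid for `boxes`).

```python
def merge_groups(boxes, groups, min_num_components=1):
    """Hypothetical sketch: one enclosing box per sufficiently large group."""
    merged = []
    for group in groups:
        # Skip groups that are too small to produce a merged box.
        if len(group) < min_num_components:
            continue
        members = [boxes[i] for i in group]
        # Tightest enclosing box: min of the mins, max of the maxes.
        merged.append(
            [
                min(b[0] for b in members),
                min(b[1] for b in members),
                max(b[2] for b in members),
                max(b[3] for b in members),
            ]
        )
    return merged
```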

Parameters:
  • boxes (list of tuple) – The original bounding boxes, each in (xmin, ymin, xmax, ymax) format.

  • groups (list of list of int) – A list of groups, where each group is a list of indices into boxes.

  • min_num_components (int, optional) – The minimum number of bounding boxes a group must have to produce a merged bounding box. Defaults to 1.

Returns:

A list of merged bounding boxes, one for each group that meets or exceeds min_num_components. Each bounding box is in the format (xmin, ymin, xmax, ymax).

Return type:

list of list of int

nv_ingest_api.util.image_processing.clustering.group_bounding_boxes(
boxes: List[List[int]],
threshold: float = 10.0,
max_num_boxes: int = 1000,
max_depth: int | None = None,
) List[List[int]][source]#

Group bounding boxes that either overlap or lie within a given proximity threshold.

This function first checks whether the number of bounding boxes exceeds max_num_boxes, returning an empty list if it does (to avoid excessive computation). Then, it builds an adjacency list by comparing each pair of bounding boxes (using boxes_are_close_or_overlap). Any bounding boxes determined to be within threshold distance (or overlapping) are treated as connected.

Using a Depth-First Search (DFS), we traverse these connections to form groups (connected components). Each group is a list of indices referencing bounding boxes in the original boxes list.
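The adjacency-list and DFS steps can be sketched as below. This is a simplified illustration, not the library's code: the name `group_close_boxes` is hypothetical, the closeness test is an inlined stand-in for `boxes_are_close_or_overlap` with assumed inclusive comparisons, and the `max_num_boxes` and `max_depth` handling is omitted.

```python
def group_close_boxes(boxes, threshold=10.0):
    """Hypothetical sketch: connected components of the 'close or overlapping' graph."""

    def close(b1, b2):
        # Stand-in closeness test: expand both boxes by threshold, check overlap.
        return (
            b1[2] + threshold >= b2[0] - threshold
            and b1[0] - threshold <= b2[2] + threshold
            and b1[3] + threshold >= b2[1] - threshold
            and b1[1] - threshold <= b2[3] + threshold
        )

    n = len(boxes)
    adjacency = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if close(boxes[i], boxes[j]):
                adjacency[i].append(j)
                adjacency[j].append(i)

    visited = [False] * n
    groups = []
    for start in range(n):
        if visited[start]:
            continue
        # Iterative DFS from each unvisited box collects one component.
        stack, group = [start], []
        visited[start] = True
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbor in adjacency[node]:
                if not visited[neighbor]:
                    visited[neighbor] = True
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups
```

Note that grouping is transitive in this sketch: two boxes that are not close to each other still land in the same group when a chain of close boxes connects them.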

Parameters:
  • boxes (list of tuple) – A list of bounding boxes in the format (xmin, ymin, xmax, ymax).

  • threshold (float, optional) – The distance threshold used to determine if two boxes are considered “close enough” to be in the same group. Defaults to 10.0.

  • max_num_boxes (int, optional) – The maximum number of bounding boxes to process. If the length of boxes exceeds this, a warning is logged and the function returns an empty list. Defaults to 1,000.

  • max_depth (int, optional) – The maximum depth for the DFS. If None, there is no limit to how many layers deep the search may go when forming connected components. If set, bounding boxes beyond that depth in the adjacency graph will not be included in the group. Defaults to None.

Returns:

Each element is a list (group) containing the indices of bounding boxes that are connected (overlapping or within threshold distance of each other).

Return type:

list of list of int

nv_ingest_api.util.image_processing.clustering.remove_superset_bboxes(
bboxes: List[List[int]],
) List[List[int]][source]#

Remove any bounding box that strictly contains another bounding box.

Specifically, for each bounding box box_a, if it fully encloses another bounding box box_b in all dimensions (with at least one edge strictly larger rather than exactly equal), then box_a is excluded from the results.
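The strict-containment rule can be sketched as follows. This is a minimal sketch, not the library's source: the name `drop_strict_supersets` is hypothetical, and "strictly contains" is read as "encloses on all four edges and is not identical".

```python
def drop_strict_supersets(bboxes):
    """Hypothetical sketch: keep only boxes that strictly contain no other box."""

    def strictly_contains(a, b):
        encloses = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]
        # Identical boxes enclose each other but are not strict supersets.
        return encloses and list(a) != list(b)

    kept = []
    for i, box_a in enumerate(bboxes):
        is_superset = any(
            i != j and strictly_contains(box_a, box_b)
            for j, box_b in enumerate(bboxes)
        )
        if not is_superset:
            kept.append(box_a)
    return kept
```

Under this reading, exact duplicates do not remove each other, since neither has an edge strictly larger than the other's.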

Parameters:

bboxes (List[List[int]]) – A list of bounding boxes, where each bounding box is a list or tuple of four integers in the format: [x_min, y_min, x_max, y_max].

Returns:

A new list of bounding boxes, excluding those that are strict supersets of any other bounding box in bboxes.

Return type:

List[List[int]]

Example

>>> bboxes = [
...     [0, 0, 5, 5],   # box A
...     [1, 1, 2, 2],   # box B
...     [3, 3, 4, 4]    # box C
... ]
>>> # Box A strictly encloses B and C, so it is removed
>>> remove_superset_bboxes(bboxes)
[[1, 1, 2, 2], [3, 3, 4, 4]]

nv_ingest_api.util.image_processing.processing module#

nv_ingest_api.util.image_processing.processing.extract_tables_and_charts_from_image(
annotation_dict,
original_image,
page_idx,
tables_and_charts,
)[source]#

Extract and process table and chart regions from the provided image based on detection annotations.

Parameters:
  • annotation_dict (dict) – A dictionary containing detected objects and their bounding boxes, e.g. keys “table” and “chart”.

  • original_image (np.ndarray) – The original image from which objects were detected.

  • page_idx (int) – The index of the current page being processed.

  • tables_and_charts (list of tuple) – A list to which extracted table/chart data will be appended. Each item is a tuple (page_idx, CroppedImageWithContent).

Notes

This function iterates over the detected table and chart objects. For each detected object, it:
  • Crops the original image based on the bounding box.

  • Converts the cropped image to a base64 encoded string.

  • Wraps the encoded image along with its bounding box and the image dimensions in a standardized data structure.

Additional model inference or post-processing can be added where needed.

Examples

>>> annotation_dict = {"table": [ [...], [...] ], "chart": [ [...], [...] ]}
>>> original_image = np.random.rand(1536, 1536, 3)
>>> tables_and_charts = []
>>> extract_tables_and_charts_from_image(annotation_dict, original_image, 0, tables_and_charts)

nv_ingest_api.util.image_processing.processing.extract_tables_and_charts_yolox(
pages: List[Tuple[int, ndarray]],
config: dict,
trace_info: List | None = None,
) List[Tuple[int, object]][source]#

Given a list of (page_index, image) tuples and a configuration dictionary, this function calls the YOLOX-based inference service to extract table and chart annotations from all pages.

Parameters:
  • pages (List[Tuple[int, np.ndarray]]) – A list of tuples containing the page index and the corresponding image.

  • config (dict) –

    A dictionary containing configuration parameters such as:
    • ’yolox_endpoints’

    • ’auth_token’

    • ’yolox_infer_protocol’

  • trace_info (Optional[List], optional) – Optional tracing information for logging/debugging purposes.

Returns:

For each page, returns a tuple (page_index, joined_content) where joined_content is the result of combining annotations from the inference.

Return type:

List[Tuple[int, object]]

nv_ingest_api.util.image_processing.table_and_chart module#

nv_ingest_api.util.image_processing.table_and_chart.assign_boxes(paddle_box, boxes, delta=2.0, min_overlap=0.25)[source]#

Assigns the closest bounding boxes to a reference paddle_box based on overlap.

Parameters:
  • paddle_box (list or numpy.ndarray) – Reference bounding box [x_min, y_min, x_max, y_max].

  • boxes (numpy.ndarray) – Array of candidate bounding boxes with shape (N, 4).

  • delta (float, optional) – Factor for matches relative to the best overlap. Defaults to 2.0.

  • min_overlap (float, optional) – Minimum required overlap for a match. Defaults to 0.25.

Returns:

Indices of the matched boxes sorted by decreasing overlap.

Returns an empty list if no matches are found.

Return type:

list

nv_ingest_api.util.image_processing.table_and_chart.build_markdown(df)[source]#

Convert a dataframe into a markdown table.

Parameters:

df (pandas DataFrame) – The dataframe to convert.

Returns:

A list of lists representing the markdown table.

Return type:

list[list]

nv_ingest_api.util.image_processing.table_and_chart.convert_paddle_response_to_psuedo_markdown(bboxes, texts)[source]#

nv_ingest_api.util.image_processing.table_and_chart.display_markdown(
data: list[list[str]],
use_header: bool = False,
) str[source]#

Convert a list of lists of strings into a markdown table.

Parameters:
  • data (list[list[str]]) – The table data. When use_header is True, the first sublist should contain the headers.

  • use_header (bool, optional) – Whether to use the first sublist as headers. Defaults to False.

Returns:

A markdown-formatted table as a string.

Return type:

str
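As an illustration of the conversion, the sketch below renders a list of lists as a pipe-delimited markdown table. The name `to_markdown_table` is hypothetical, and the exact output format of `display_markdown` (spacing, separator row, handling of missing headers) is not documented here, so the layout shown is an assumption.

```python
def to_markdown_table(data, use_header=False):
    """Hypothetical sketch: render rows of strings as a markdown table."""
    if not data:
        return ""

    def render(row):
        return "| " + " | ".join(row) + " |"

    lines = []
    rows = data
    if use_header:
        # Treat the first sublist as the header and add a separator row.
        header, rows = data[0], data[1:]
        lines.append(render(header))
        lines.append(render(["---"] * len(header)))
    lines.extend(render(row) for row in rows)
    return "\n".join(lines)
```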

nv_ingest_api.util.image_processing.table_and_chart.join_yolox_graphic_elements_and_paddle_output(
yolox_output,
paddle_boxes,
paddle_txts,
)[source]#

Matches boxes by associating PaddleOCR text with the detections. For each class and for each CACHED detection, the function looks for overlapping text bounding boxes with IoU > max_iou / delta, where max_iou is the largest overlap found. Matched texts are added to the class representation and removed from the set of texts still to be matched.

nv_ingest_api.util.image_processing.table_and_chart.join_yolox_table_structure_and_paddle_output(
yolox_cell_preds,
paddle_ocr_boxes,
paddle_ocr_txts,
)[source]#

nv_ingest_api.util.image_processing.table_and_chart.match_bboxes(
yolox_box,
paddle_ocr_boxes,
already_matched=None,
delta=2.0,
)[source]#

Associates a yolox-graphic-elements box with PaddleOCR bounding boxes by taking the overlapping boxes. The criterion is IoU > max_iou / delta, where max_iou is the largest overlap found. Boxes are expected in the format (x0, y0, x1, y1).

Parameters:
  • yolox_box (np array [4]) – Cached Bbox.

  • paddle_ocr_boxes (np array [n x 4]) – PaddleOCR boxes.

  • already_matched (list or None, optional) – Already matched ids to ignore.

  • delta (float, optional) – IoU delta for considering several boxes. Defaults to 2.0.

Returns:

Indices of the matched bboxes.

Return type:

np array or list
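The relative criterion IoU > max_iou / delta can be sketched in plain Python as below. This is an illustrative sketch, not the library's NumPy implementation: the names `iou` and `match_by_relative_iou` are hypothetical, and returning an empty list when nothing overlaps is an assumption.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_by_relative_iou(ref_box, candidates, already_matched=None, delta=2.0):
    """Hypothetical sketch: keep candidates whose IoU exceeds max_iou / delta."""
    skip = set(already_matched or [])
    # Already matched candidates are ignored by giving them zero overlap.
    ious = [0.0 if i in skip else iou(ref_box, c) for i, c in enumerate(candidates)]
    max_iou = max(ious, default=0.0)
    if max_iou == 0.0:
        return []
    return [i for i, value in enumerate(ious) if value > max_iou / delta]
```

With delta = 2.0, a candidate is kept when its IoU is more than half of the best IoU, so several text boxes can be attached to one detection.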

nv_ingest_api.util.image_processing.table_and_chart.merge_text_in_cell(df_cell)[source]#

Merges text from multiple rows into a single cell and recalculates its bounding box. Values are sorted by rounded (y, x) coordinates.

Parameters:

df_cell (pandas.DataFrame) – DataFrame containing cells to merge.

Returns:

Updated DataFrame with merged text and a single bounding box.

Return type:

pandas.DataFrame

nv_ingest_api.util.image_processing.table_and_chart.process_yolox_graphic_elements(yolox_text_dict)[source]#

Process the inference results from yolox-graphic-elements model.

Parameters:

yolox_text_dict (dict) – The result from the yolox model inference.

Returns:

The concatenated and processed chart content as a string.

Return type:

str

nv_ingest_api.util.image_processing.table_and_chart.remove_empty_row(mat)[source]#

Remove empty rows from a matrix.

Parameters:

mat (list[list]) – The matrix to remove empty rows from.

Returns:

The matrix with empty rows removed.

Return type:

list[list]
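A minimal sketch of this filter is shown below. The name `drop_empty_rows` is hypothetical, and the definition of "empty" (every cell is an empty string) is an assumption, since the documentation above does not spell it out.

```python
def drop_empty_rows(mat):
    """Hypothetical sketch: keep rows that have at least one non-empty cell."""
    # Assumption: a row is empty when every cell has length zero.
    return [row for row in mat if any(len(cell) > 0 for cell in row)]
```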

nv_ingest_api.util.image_processing.transforms module#

nv_ingest_api.util.image_processing.transforms.base64_to_numpy(base64_string: str) ndarray[source]#

Convert a base64-encoded image string to a NumPy array.

Parameters:

base64_string (str) – Base64-encoded string representing an image.

Returns:

NumPy array representation of the decoded image.

Return type:

numpy.ndarray

Raises:
  • ValueError – If the base64 string is invalid or cannot be decoded into an image.

  • ImportError – If required libraries are not installed.

Examples

>>> base64_str = '/9j/4AAQSkZJRgABAQAAAQABAAD/2wBD...'
>>> img_array = base64_to_numpy(base64_str)

nv_ingest_api.util.image_processing.transforms.check_numpy_image_size(
image: ndarray,
min_height: int,
min_width: int,
) bool[source]#

Checks whether the height and width of the image are at least the specified minimum values.

Parameters:
  • image (np.ndarray) – The image array (assumed to be in shape (H, W, C) or (H, W)).

  • min_height (int) – The minimum height required.

  • min_width (int) – The minimum width required.

Returns:

True if the image dimensions are larger than or equal to the minimum size, False otherwise.

Return type:

bool

nv_ingest_api.util.image_processing.transforms.crop_image(
array: array,
bbox: Tuple[int, int, int, int],
min_width: int = 1,
min_height: int = 1,
) ndarray | None[source]#

Crops a NumPy array representing an image according to the specified bounding box.

Parameters:
  • array (np.array) – The image as a NumPy array.

  • bbox (Tuple[int, int, int, int]) – The bounding box to crop the image to, given as (w1, h1, w2, h2).

  • min_width (int, optional) – The minimum allowable width for the cropped image. If the cropped width is smaller than this value, the function returns None. Default is 1.

  • min_height (int, optional) – The minimum allowable height for the cropped image. If the cropped height is smaller than this value, the function returns None. Default is 1.

Returns:

The cropped image as a NumPy array, or None if the bounding box is invalid or the cropped region is smaller than min_width or min_height.

Return type:

Optional[np.ndarray]
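The cropping behavior described above can be sketched with NumPy slicing. This is a sketch under stated assumptions, not the library's code: the name `crop_to_bbox` is hypothetical, and clamping the bounding box to the image bounds before slicing is an assumption.

```python
import numpy as np


def crop_to_bbox(array, bbox, min_width=1, min_height=1):
    """Hypothetical sketch: crop (w1, h1, w2, h2) from an (H, W[, C]) array."""
    w1, h1, w2, h2 = bbox
    # Assumption: clamp the box to the image bounds before slicing.
    h1, w1 = max(0, h1), max(0, w1)
    h2, w2 = min(array.shape[0], h2), min(array.shape[1], w2)

    # Reject empty, inverted, or too-small crops.
    if (w2 - w1) < min_width or (h2 - h1) < min_height:
        return None

    # Rows are indexed by height, columns by width.
    return array[h1:h2, w1:w2]
```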

nv_ingest_api.util.image_processing.transforms.ensure_base64_is_png(base64_image: str) str[source]#

Ensures the given base64-encoded image is in PNG format. Converts to PNG if necessary.

Parameters:

base64_image (str) – Base64-encoded image string.

Returns:

Base64-encoded PNG image string.

Return type:

str

nv_ingest_api.util.image_processing.transforms.normalize_image(
array: ndarray,
r_mean: float = 0.485,
g_mean: float = 0.456,
b_mean: float = 0.406,
r_std: float = 0.229,
g_std: float = 0.224,
b_std: float = 0.225,
) ndarray[source]#

Normalizes an RGB image by applying a mean and standard deviation to each channel.

Parameters:
  • array (np.ndarray) – The input image array, which can be either grayscale or RGB. The image should have a shape of (height, width, 3) for RGB images, or (height, width) or (height, width, 1) for grayscale images. If a grayscale image is provided, it will be converted to RGB format by repeating the grayscale values across all three channels (R, G, B).

  • r_mean (float, optional) – The mean to be subtracted from the red channel (default is 0.485).

  • g_mean (float, optional) – The mean to be subtracted from the green channel (default is 0.456).

  • b_mean (float, optional) – The mean to be subtracted from the blue channel (default is 0.406).

  • r_std (float, optional) – The standard deviation to divide the red channel by (default is 0.229).

  • g_std (float, optional) – The standard deviation to divide the green channel by (default is 0.224).

  • b_std (float, optional) – The standard deviation to divide the blue channel by (default is 0.225).

Returns:

A normalized image array with the same shape as the input, where the RGB channels have been normalized by the given means and standard deviations.

Return type:

np.ndarray

Notes

The input pixel values should be in the range [0, 255], and the function scales these values to [0, 1] before applying normalization.

If the input image is grayscale, it is converted to an RGB image by duplicating the grayscale values across the three color channels.
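The scaling and per-channel normalization can be sketched as follows. This is an illustrative sketch, not the library's implementation: the name `normalize_rgb` is hypothetical, the means and standard deviations are passed as tuples rather than six separate arguments, and the float32 output type is an assumption.

```python
import numpy as np


def normalize_rgb(array, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """Hypothetical sketch: scale to [0, 1], then apply (x - mean) / std per channel."""
    img = np.asarray(array, dtype=np.float32)

    # Grayscale input: add a channel axis, then repeat it across R, G, and B.
    if img.ndim == 2:
        img = img[..., np.newaxis]
    if img.shape[-1] == 1:
        img = np.repeat(img, 3, axis=-1)

    img = img / 255.0  # [0, 255] -> [0, 1]
    # Broadcasting applies each mean/std to its own channel.
    return (img - np.array(mean, dtype=np.float32)) / np.array(std, dtype=np.float32)
```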

nv_ingest_api.util.image_processing.transforms.numpy_to_base64(array: ndarray) str[source]#

Converts a NumPy array representing an image to a base64-encoded string.

The function takes a NumPy array, converts it to a PIL image, and then encodes the image as a PNG in a base64 string format. The input array is expected to be in a format that can be converted to a valid image, such as having a shape of (H, W, C) where C is the number of channels (e.g., 3 for RGB).

Parameters:

array (np.ndarray) – The input image as a NumPy array. Must have a shape compatible with image data.

Returns:

The base64-encoded string representation of the input NumPy array as a PNG image.

Return type:

str

Raises:
  • ValueError – If the input array cannot be converted into a valid image format.

  • RuntimeError – If there is an issue during the image conversion or base64 encoding process.

Examples

>>> array = np.random.randint(0, 255, (100, 100, 3), dtype=np.uint8)
>>> encoded_str = numpy_to_base64(array)
>>> isinstance(encoded_str, str)
True

nv_ingest_api.util.image_processing.transforms.pad_image(
array: ~numpy.ndarray,
target_width: int = 1024,
target_height: int = 1280,
background_color: int = 255,
dtype=<class 'numpy.uint8'>,
) Tuple[ndarray, Tuple[int, int]][source]#

Pads a NumPy array representing an image to the specified target dimensions.

If the target dimensions are smaller than the image dimensions, no padding will be applied in that dimension. If the target dimensions are larger, the image will be centered within a canvas of the specified target size, with the remaining space filled with background_color (255, i.e. white, by default).
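The centering behavior can be sketched with NumPy as below. This is a minimal sketch, not the library's code: the name `pad_to_canvas` is hypothetical, and using integer division for the offsets (so any odd leftover pixel goes to the bottom or right) is an assumption.

```python
import numpy as np


def pad_to_canvas(array, target_width, target_height, background_color=255):
    """Hypothetical sketch: center an (H, W[, C]) image on a larger canvas."""
    height, width = array.shape[:2]

    # A dimension already at or above its target is left unpadded.
    final_height = max(height, target_height)
    final_width = max(width, target_width)
    pad_height = (final_height - height) // 2
    pad_width = (final_width - width) // 2

    # Fill the canvas with the background color, then paste the image.
    canvas = np.full(
        (final_height, final_width) + array.shape[2:],
        background_color,
        dtype=array.dtype,
    )
    canvas[pad_height : pad_height + height, pad_width : pad_width + width] = array
    return canvas, (pad_width, pad_height)
```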

Parameters:
  • array (np.ndarray) – The input image as a NumPy array of shape (H, W, C).

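  • target_width (int, optional) – The desired target width of the padded image. Defaults to DEFAULT_MAX_WIDTH (1024).

  • target_height (int, optional) – The desired target height of the padded image. Defaults to DEFAULT_MAX_HEIGHT (1280).

  • background_color (int, optional) – The value used to fill the padded area. Defaults to 255.

  • dtype (optional) – The data type of the padded array. Defaults to numpy.uint8.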

Returns:

  • padded_array (np.ndarray) – The padded image as a NumPy array of shape (target_height, target_width, C).

  • padding_offsets (Tuple[int, int]) – A tuple containing the horizontal and vertical offsets (pad_width, pad_height) applied to center the image.

Notes

If the target dimensions are smaller than the current image dimensions, no padding will be applied in that dimension, and the image will retain its original size in that dimension.

Examples

>>> image = np.random.randint(0, 255, (600, 800, 3), dtype=np.uint8)
>>> padded_image, offsets = pad_image(image, target_width=1000, target_height=1000)
>>> padded_image.shape
(1000, 1000, 3)
>>> offsets
(100, 200)

nv_ingest_api.util.image_processing.transforms.scale_image_to_encoding_size(
base64_image: str,
max_base64_size: int = 180000,
initial_reduction: float = 0.9,
) Tuple[str, Tuple[int, int]][source]#

Decodes a base64-encoded image, resizes it if needed, and re-encodes it as base64. Ensures the final image size is within the specified limit.

Parameters:
  • base64_image (str) – Base64-encoded image string.

  • max_base64_size (int, optional) – Maximum allowable size for the base64-encoded image, by default 180,000 characters.

  • initial_reduction (float, optional) – Initial reduction step for resizing, by default 0.9.

Returns:

A tuple containing:
  • Base64-encoded PNG image string, resized if necessary.

  • The new size as a tuple (width, height).

Return type:

Tuple[str, Tuple[int, int]]

Raises:

Exception – If the image cannot be resized below the specified max_base64_size.

Module contents#

nv_ingest_api.util.image_processing.scale_image_to_encoding_size(
base64_image: str,
max_base64_size: int = 180000,
initial_reduction: float = 0.9,
) Tuple[str, Tuple[int, int]][source]#

Decodes a base64-encoded image, resizes it if needed, and re-encodes it as base64. Ensures the final image size is within the specified limit.

Parameters:
  • base64_image (str) – Base64-encoded image string.

  • max_base64_size (int, optional) – Maximum allowable size for the base64-encoded image, by default 180,000 characters.

  • initial_reduction (float, optional) – Initial reduction step for resizing, by default 0.9.

Returns:

A tuple containing:
  • Base64-encoded PNG image string, resized if necessary.

  • The new size as a tuple (width, height).

Return type:

Tuple[str, Tuple[int, int]]

Raises:

Exception – If the image cannot be resized below the specified max_base64_size.