nv_ingest_api.internal.mutate package

Submodules

nv_ingest_api.internal.mutate.deduplicate module

nv_ingest_api.internal.mutate.deduplicate.calculate_iou(
bbox1: Tuple[float, ...],
bbox2: Tuple[float, ...],
) → float

Calculate Intersection over Union (IoU) for two bounding boxes.

Boxes are in format (x1, y1, x2, y2) where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner.

Parameters:
  • bbox1 (tuple) – First bounding box as (x1, y1, x2, y2).

  • bbox2 (tuple) – Second bounding box as (x1, y1, x2, y2).

Returns:

IoU value between 0.0 and 1.0.

Return type:

float
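
As an illustration of the computation described above, here is a minimal sketch of IoU for two axis-aligned boxes in (x1, y1, x2, y2) format. The name `iou_sketch` is hypothetical and this is not the library's source; in particular, returning 0.0 when the union area is zero is an assumption about how degenerate boxes might be handled.

```python
from typing import Tuple


def iou_sketch(bbox1: Tuple[float, ...], bbox2: Tuple[float, ...]) -> float:
    """Illustrative IoU for boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(bbox1[0], bbox2[0])
    iy1 = max(bbox1[1], bbox2[1])
    ix2 = min(bbox1[2], bbox2[2])
    iy2 = min(bbox1[3], bbox2[3])

    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
    area2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
    union = area1 + area2 - inter

    # Assumption: degenerate (zero-area) input is reported as no overlap.
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one unit in each direction share an intersection of area 1 and a union of area 7, so their IoU is 1/7.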

nv_ingest_api.internal.mutate.deduplicate.deduplicate_by_bbox_internal(
df_ledger: DataFrame,
iou_threshold: float = 0.45,
prefer_structured: bool = True,
) → DataFrame

Remove duplicate visual elements based on bounding box overlap.

When an IMAGE element’s bounding box substantially overlaps with a STRUCTURED element (table/chart/infographic) on the same page, one is removed based on the prefer_structured flag.

Parameters:
  • df_ledger (pd.DataFrame) – DataFrame with document_type, metadata columns.

  • iou_threshold (float) – Minimum IoU to consider elements as duplicates (default 0.45).

  • prefer_structured (bool) – If True, keep structured elements and drop images when duplicates found. If False, keep images and drop structured elements.

Returns:

DataFrame with bbox-based duplicates removed.

Return type:

pd.DataFrame
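
The selection rule described above can be sketched on plain Python dictionaries. This is a simplified illustration, not the library's implementation: the real function operates on a DataFrame with document_type and metadata columns, whereas the `kind`, `page`, and `bbox` keys and the name `dedup_by_bbox_sketch` below are hypothetical. Treating an IoU equal to the threshold as a duplicate is also an assumption.

```python
from typing import Any, Dict, List, Tuple

Box = Tuple[float, float, float, float]


def _iou(a: Box, b: Box) -> float:
    # Intersection area, clamped to zero for disjoint boxes.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def dedup_by_bbox_sketch(
    elements: List[Dict[str, Any]],
    iou_threshold: float = 0.45,
    prefer_structured: bool = True,
) -> List[Dict[str, Any]]:
    """Drop one element of each overlapping image/structured pair on a page."""
    images = [i for i, e in enumerate(elements) if e["kind"] == "image"]
    structured = [i for i, e in enumerate(elements) if e["kind"] == "structured"]

    drop = set()
    for i in images:
        for s in structured:
            # Only elements on the same page can be duplicates of each other.
            if elements[i]["page"] != elements[s]["page"]:
                continue
            if _iou(elements[i]["bbox"], elements[s]["bbox"]) >= iou_threshold:
                # Keep the structured element or the image, depending on the flag.
                drop.add(i if prefer_structured else s)

    return [e for idx, e in enumerate(elements) if idx not in drop]
```

With `prefer_structured=True`, an image that covers nearly the same region as a table on the same page is dropped and the table is kept; an image with the same coordinates on a different page is unaffected.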

nv_ingest_api.internal.mutate.deduplicate.deduplicate_images_internal(
df_ledger: DataFrame,
task_config: Dict[str, Any],
mutate_config: ImageDedupSchema = ImageDedupSchema(raise_on_failure=False),
execution_trace_log: List[Any] | None = None,
) → DataFrame

Deduplicate images in a DataFrame based on content hashes and/or bounding box overlap.

The function processes rows where the ‘document_type’ is IMAGE, computes a content hash for each, and then either removes duplicates or marks them based on the ‘filter’ flag in task_config. A ‘hash_algorithm’ flag in task_config determines the algorithm used for hashing.

Additionally, if ‘enable_bbox_dedup’ is True, removes images that substantially overlap with structured elements (tables/charts) based on IoU threshold.

Parameters:
  • df_ledger (pd.DataFrame) – DataFrame containing at least ‘document_type’ and ‘metadata’ columns.

  • task_config (dict) –

    Configuration parameters, including:
    • “filter”: bool, if True duplicate rows are removed; if False, duplicates are marked.

    • “hash_algorithm”: str, the algorithm to use for hashing (default “md5”).

    • “enable_bbox_dedup”: bool, if True also deduplicate by bounding box overlap.

    • “iou_threshold”: float, IoU threshold for bbox dedup (default 0.45).

    • “bbox_dedup_prefer_structured”: bool, if True keep structured elements (default True).

  • mutate_config (ImageDedupSchema, optional)

  • execution_trace_log (Optional[List[Any]], optional)

Returns:

The DataFrame with duplicate images either removed or marked.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the required columns are missing.

  • Exception – For any other errors encountered during deduplication.
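
The hash-based part of the behavior described above (remove duplicates when “filter” is True, mark them otherwise) can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the library's code: the function name `dedup_by_hash_sketch`, the use of raw bytes as input, and the choice to keep the first occurrence of each hash are all hypothetical.

```python
import hashlib
from typing import List, Tuple


def dedup_by_hash_sketch(
    contents: List[bytes],
    filter_rows: bool = True,
    hash_algorithm: str = "md5",
) -> List[Tuple[bytes, bool]]:
    """Return (content, is_duplicate) pairs.

    When filter_rows is True, duplicates are omitted from the result;
    when False, every item is kept and duplicates are flagged.
    """
    seen = set()
    result = []
    for content in contents:
        # hashlib.new accepts the algorithm by name, e.g. "md5" or "sha256".
        digest = hashlib.new(hash_algorithm, content).hexdigest()
        is_duplicate = digest in seen
        seen.add(digest)
        if is_duplicate and filter_rows:
            continue  # Assumption: the first occurrence is the one kept.
        result.append((content, is_duplicate))
    return result
```

Given the inputs `[b"a", b"b", b"a"]`, the filtering mode returns the first two items, while the marking mode returns all three with only the last flagged as a duplicate.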

nv_ingest_api.internal.mutate.filter module

nv_ingest_api.internal.mutate.filter.filter_images_internal(
df_ledger: DataFrame,
task_config: Dict[str, Any],
mutate_config: ImageFilterSchema = ImageFilterSchema(raise_on_failure=False, cpu_only=False),
execution_trace_log: List[Any] | None = None,
) → DataFrame

Apply an image filtering operation to a DataFrame based on average image size and aspect ratio.

Parameters:
  • df_ledger (pd.DataFrame) – DataFrame to be filtered. Must contain ‘document_type’ and ‘metadata’ columns.

  • task_config (dict) –

    Dictionary with the following keys:
    • “min_size”: Minimum average image size threshold.

    • “max_aspect_ratio”: Maximum allowed aspect ratio.

    • “min_aspect_ratio”: Minimum allowed aspect ratio.

    • “filter”: If True, rows failing the criteria are dropped; if False, they are flagged.

  • mutate_config (ImageFilterSchema)

  • execution_trace_log (Optional[List[Any]], optional)

Returns:

The updated DataFrame after applying the image filter.

Return type:

pd.DataFrame

Raises:
  • ValueError – If required columns are missing or if parameters are invalid.

  • Exception – For other errors encountered during filtering.
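
To make the filter criteria above concrete, here is a minimal sketch operating on (width, height) pairs. It is not the library's implementation, and several details are assumptions: the name `filter_images_sketch`, the definition of average size as the mean of width and height, the definition of aspect ratio as width divided by height, inclusive bounds, and the example threshold values.

```python
from typing import List, Tuple


def filter_images_sketch(
    sizes: List[Tuple[int, int]],
    min_size: float,
    min_aspect_ratio: float,
    max_aspect_ratio: float,
    filter_rows: bool = True,
) -> List[Tuple[Tuple[int, int], bool]]:
    """Return ((width, height), passed) pairs.

    When filter_rows is True, failing images are dropped from the result;
    when False, every image is kept and failures are flagged with passed=False.
    """
    result = []
    for width, height in sizes:
        if height <= 0 or width <= 0:
            passed = False  # Degenerate dimensions cannot satisfy the criteria.
        else:
            average_size = (width + height) / 2  # Assumed definition.
            aspect_ratio = width / height  # Assumed definition.
            passed = (
                average_size >= min_size
                and min_aspect_ratio <= aspect_ratio <= max_aspect_ratio
            )
        if not passed and filter_rows:
            continue
        result.append(((width, height), passed))
    return result
```

With a hypothetical `min_size` of 32 and aspect ratio bounds of 0.2 and 5.0, a 100×100 image passes, a 10×10 image fails the size check, and a 1000×50 image fails the aspect ratio check.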

Module contents