nv_ingest_api.internal.mutate package

Submodules

nv_ingest_api.internal.mutate.deduplicate module

nv_ingest_api.internal.mutate.deduplicate.deduplicate_images_internal(
df_ledger: DataFrame,
task_config: Dict[str, Any],
mutate_config: ImageDedupSchema = ImageDedupSchema(raise_on_failure=False),
execution_trace_log: List[Any] | None = None,
) → DataFrame

Deduplicate images in a DataFrame based on content hashes.

The function processes rows whose ‘document_type’ is IMAGE and computes a content hash for each. Depending on the ‘filter’ flag in task_config, rows with a repeated hash are either removed or marked as duplicates. The ‘hash_algorithm’ key in task_config selects the hashing algorithm.

Parameters:
  • df_ledger (pd.DataFrame) – DataFrame containing at least ‘document_type’ and ‘metadata’ columns.

  • task_config (dict) –

    Configuration parameters, including:
    • “filter”: bool. If True, duplicate rows are removed; if False, duplicates are marked.

    • “hash_algorithm”: str. The algorithm to use for hashing (default “md5”).

  • mutate_config (ImageDedupSchema, optional)

  • execution_trace_log (Optional[List[Any]], optional)

Returns:

The DataFrame with duplicate images either removed or marked.

Return type:

pd.DataFrame

Raises:
  • ValueError – If the required columns are missing.

  • Exception – For any other errors encountered during deduplication.
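The hash-then-filter-or-mark behavior described above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: it works on plain dictionaries rather than a DataFrame, and the helper name dedupe_image_rows, the “content” field, the “is_duplicate” flag, and the default of False for “filter” are all assumptions made for the example.

```python
import hashlib
from typing import Any, Dict, List


def dedupe_image_rows(
    rows: List[Dict[str, Any]], task_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Illustrative stand-in for content-hash deduplication (hypothetical helper)."""
    # "md5" mirrors the documented default for "hash_algorithm".
    algorithm = task_config.get("hash_algorithm", "md5")
    # Assumption: treat a missing "filter" key as False (mark, do not drop).
    drop_duplicates = bool(task_config.get("filter", False))

    seen_digests = set()
    result = []
    for row in rows:
        # Only image rows take part in deduplication; others pass through.
        if row.get("document_type") != "image":
            result.append(dict(row))
            continue

        # Assumption: the raw image bytes live under a "content" key.
        digest = hashlib.new(algorithm, row["content"]).hexdigest()
        is_duplicate = digest in seen_digests
        seen_digests.add(digest)

        if is_duplicate and drop_duplicates:
            continue  # "filter": True removes repeated images

        out = dict(row)
        if not drop_duplicates:
            # "filter": False keeps every row and marks the repeats.
            out["is_duplicate"] = is_duplicate
        result.append(out)
    return result
```

Calling the helper with {"filter": True} keeps only the first occurrence of each image, while {"filter": False} returns every row and flags the repeats. The first occurrence is never treated as a duplicate in this sketch.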

nv_ingest_api.internal.mutate.filter module

nv_ingest_api.internal.mutate.filter.filter_images_internal(
df_ledger: DataFrame,
task_config: Dict[str, Any],
mutate_config: ImageFilterSchema = ImageFilterSchema(raise_on_failure=False, cpu_only=False),
execution_trace_log: List[Any] | None = None,
) → DataFrame

Filter the image rows of a DataFrame by average image size and aspect ratio.

Parameters:
  • df_ledger (pd.DataFrame) – DataFrame to be filtered. Must contain ‘document_type’ and ‘metadata’ columns.

  • task_config (dict) –

    Dictionary with the following keys:
    • “min_size”: Minimum average image size threshold.

    • “max_aspect_ratio”: Maximum allowed aspect ratio.

    • “min_aspect_ratio”: Minimum allowed aspect ratio.

    • “filter”: If True, rows failing the criteria are dropped; if False, they are flagged.

  • mutate_config (ImageFilterSchema)

  • execution_trace_log (Optional[List[Any]], optional)

Returns:

The updated DataFrame after applying the image filter.

Return type:

pd.DataFrame

Raises:
  • ValueError – If required columns are missing or if parameters are invalid.

  • Exception – For other errors encountered during filtering.
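As a rough illustration of how the four task_config keys interact, the following standard-library sketch applies size and aspect-ratio thresholds to plain dictionaries. It is not the library's implementation, and several details are assumptions: “average image size” is taken to be the mean of width and height, aspect ratio is width divided by height, and the “width”, “height”, and “filtered” fields and the helper name filter_image_rows are hypothetical.

```python
from typing import Any, Dict, List


def filter_image_rows(
    rows: List[Dict[str, Any]], task_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Illustrative stand-in for size/aspect-ratio filtering (hypothetical helper)."""
    min_size = task_config["min_size"]
    min_ratio = task_config["min_aspect_ratio"]
    max_ratio = task_config["max_aspect_ratio"]
    drop_failures = bool(task_config["filter"])

    if min_ratio > max_ratio:
        raise ValueError("min_aspect_ratio must not exceed max_aspect_ratio")

    result = []
    for row in rows:
        # Non-image rows are passed through untouched.
        if row.get("document_type") != "image":
            result.append(dict(row))
            continue

        width, height = row["width"], row["height"]
        if width <= 0 or height <= 0:
            raise ValueError("image dimensions must be positive")

        # Assumption: average size is the mean of the two dimensions.
        average_size = (width + height) / 2
        # Assumption: aspect ratio is width over height.
        aspect_ratio = width / height

        passes = average_size >= min_size and min_ratio <= aspect_ratio <= max_ratio

        if not passes and drop_failures:
            continue  # "filter": True drops rows that fail the criteria

        out = dict(row)
        if not drop_failures:
            # "filter": False keeps every row and flags the failures.
            out["filtered"] = not passes
        result.append(out)
    return result
```

With min_size=128, min_aspect_ratio=0.2, and max_aspect_ratio=5.0, a 256×256 image passes, a 1000×10 image fails on aspect ratio, and a 64×64 image fails on size.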

Module contents