nv_ingest_api.internal.mutate package
Submodules
nv_ingest_api.internal.mutate.deduplicate module
nv_ingest_api.internal.mutate.deduplicate.deduplicate_images_internal(
    df_ledger: DataFrame,
    task_config: Dict[str, Any],
    mutate_config: ImageDedupSchema = ImageDedupSchema(raise_on_failure=False),
    execution_trace_log: List[Any] | None = None,
)
Deduplicate images in a DataFrame based on content hashes.
The function processes rows whose ‘document_type’ is IMAGE, computes a content hash for each, and then either removes the duplicates or marks them, depending on the ‘filter’ flag in task_config. The ‘hash_algorithm’ entry in task_config selects the hashing algorithm.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame containing at least ‘document_type’ and ‘metadata’ columns.
task_config (dict) – Configuration parameters, including:
“filter”: bool; if True, duplicate rows are removed; if False, duplicates are marked.
“hash_algorithm”: str; the algorithm to use for hashing (default “md5”).
mutate_config (ImageDedupSchema, optional)
execution_trace_log (Optional[List[Any]], optional)
- Returns:
The DataFrame with duplicate images either removed or marked.
- Return type:
pd.DataFrame
- Raises:
ValueError – If the required columns are missing.
Exception – For any other errors encountered during deduplication.
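The behavior described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it uses a list of dicts in place of DataFrame rows so that it runs with the standard library alone, and it assumes that the image payload is a string stored under metadata["content"] and that IMAGE rows carry the string "image" in ‘document_type’. The helper name dedup_image_rows and the "is_duplicate" marker key are hypothetical.

```python
import hashlib
from typing import Any, Dict, List


def dedup_image_rows(
    rows: List[Dict[str, Any]], task_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Illustrative sketch: drop or mark image rows with repeated content hashes."""
    # "md5" is the documented default for "hash_algorithm".
    algorithm = task_config.get("hash_algorithm", "md5")
    drop_duplicates = task_config["filter"]

    seen_hashes = set()
    result = []
    for row in rows:
        # Assumption: IMAGE rows are identified by the string "image".
        if row["document_type"] != "image":
            result.append(row)
            continue

        # Assumption: the image payload lives under metadata["content"].
        payload = row["metadata"]["content"].encode("utf-8")
        digest = hashlib.new(algorithm, payload).hexdigest()

        is_duplicate = digest in seen_hashes
        seen_hashes.add(digest)

        if is_duplicate and drop_duplicates:
            continue  # filter=True: remove the repeated image
        # filter=False: keep the row and mark it ("is_duplicate" is hypothetical).
        result.append({**row, "is_duplicate": is_duplicate})
    return result


rows = [
    {"document_type": "image", "metadata": {"content": "AAAA"}},
    {"document_type": "text", "metadata": {"content": "AAAA"}},
    {"document_type": "image", "metadata": {"content": "AAAA"}},
    {"document_type": "image", "metadata": {"content": "BBBB"}},
]

removed = dedup_image_rows(rows, {"filter": True})
marked = dedup_image_rows(rows, {"filter": False, "hash_algorithm": "sha256"})
```

In this sketch the first occurrence of each hash is always kept, and non-image rows pass through untouched in both modes.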
nv_ingest_api.internal.mutate.filter module
nv_ingest_api.internal.mutate.filter.filter_images_internal(
    df_ledger: DataFrame,
    task_config: Dict[str, Any],
    mutate_config: ImageFilterSchema = ImageFilterSchema(raise_on_failure=False, cpu_only=False),
    execution_trace_log: List[Any] | None = None,
)
Apply an image filtering operation to a DataFrame based on average image size and aspect ratio.
- Parameters:
df_ledger (pd.DataFrame) – DataFrame to be filtered. Must contain ‘document_type’ and ‘metadata’ columns.
task_config (dict) – Dictionary with the following keys:
“min_size”: Minimum average image size threshold.
“max_aspect_ratio”: Maximum allowed aspect ratio.
“min_aspect_ratio”: Minimum allowed aspect ratio.
“filter”: If True, rows failing the criteria are dropped; if False, they are flagged.
mutate_config (ImageFilterSchema)
execution_trace_log (Optional[List[Any]], optional)
- Returns:
The updated DataFrame after applying the image filter.
- Return type:
pd.DataFrame
- Raises:
ValueError – If required columns are missing or if parameters are invalid.
Exception – For other errors encountered during filtering.
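The filtering criteria can be illustrated with a short standard-library sketch. It is not the library's implementation and rests on several assumptions: "average image size" is taken to be the mean of width and height in pixels, aspect ratio is width divided by height, the thresholds are inclusive, and the dimensions are stored under metadata["width"] and metadata["height"]. The helper names and the "filtered" flag key are hypothetical.

```python
from typing import Any, Dict, List


def passes_image_filter(width: int, height: int, task_config: Dict[str, Any]) -> bool:
    """Illustrative check of the size and aspect-ratio criteria."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    # Assumption: average size is the mean of the two dimensions.
    average_size = (width + height) / 2
    # Assumption: aspect ratio is width over height.
    aspect_ratio = width / height

    return (
        average_size >= task_config["min_size"]
        and task_config["min_aspect_ratio"] <= aspect_ratio <= task_config["max_aspect_ratio"]
    )


def filter_image_rows(
    rows: List[Dict[str, Any]], task_config: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Drop failing image rows when "filter" is True, otherwise flag them."""
    result = []
    for row in rows:
        # Assumption: IMAGE rows are identified by the string "image".
        if row["document_type"] != "image":
            result.append(row)
            continue

        meta = row["metadata"]
        if passes_image_filter(meta["width"], meta["height"], task_config):
            result.append(row)
        elif not task_config["filter"]:
            # "filtered" is a hypothetical flag key used only in this sketch.
            result.append({**row, "filtered": True})
    return result


sample_rows = [
    {"document_type": "image", "metadata": {"width": 640, "height": 480}},
    {"document_type": "image", "metadata": {"width": 32, "height": 32}},
    {"document_type": "image", "metadata": {"width": 2000, "height": 100}},
    {"document_type": "text", "metadata": {}},
]
base_config = {"min_size": 128, "min_aspect_ratio": 0.2, "max_aspect_ratio": 5.0}

dropped = filter_image_rows(sample_rows, {**base_config, "filter": True})
flagged = filter_image_rows(sample_rows, {**base_config, "filter": False})
```

Here the 32×32 image fails the size threshold and the 2000×100 image fails the aspect-ratio bound, so both are dropped with “filter” set to True and flagged with it set to False.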