nemo_curator.stages.image.deduplication.removal
Module Contents
Classes
API
Dataclass
Bases: ProcessingStage[ImageBatch, ImageBatch]
Filter stage that removes images whose IDs appear in a Parquet file.
The Parquet file must contain a column of image identifiers. By default this
column is assumed to be named id, matching the writer metadata. To use a
different column name, set duplicate_id_field.
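The filtering behaviour described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the stage's actual implementation: FakeImage and filter_batch are hypothetical names, the images are stand-in records that carry only an ID, and the set of IDs to remove is hard-coded where the real stage would read the duplicate_id_field column from Parquet.

```python
from dataclasses import dataclass


@dataclass
class FakeImage:
    """Hypothetical stand-in for an image record; only the ID matters here."""

    image_id: str


def filter_batch(images: list[FakeImage], ids_to_remove: set[str]) -> list[FakeImage]:
    """Keep only the images whose ID is not in ids_to_remove."""
    return [img for img in images if img.image_id not in ids_to_remove]


# Assumption: these values stand in for the ID column of the removal Parquet files.
ids_to_remove = {"img_002", "img_004"}

batch = [FakeImage(f"img_{i:03d}") for i in range(1, 6)]
kept = filter_batch(batch, ids_to_remove)
kept_ids = [img.image_id for img in kept]
print(kept_ids)  # ['img_001', 'img_003', 'img_005']
```

Holding the IDs in a set keeps each membership check at roughly constant time, so the cost of filtering a batch grows with the batch size rather than with the length of the removal list.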
Parameters:
removal_parquets_dir: Directory containing the Parquet files that list the image IDs to remove.
duplicate_id_field: Name of the column that contains the image IDs to remove.
verbose: Whether to log verbose output.
num_workers_per_node: Number of workers per node for this stage. Limiting it can help avoid out-of-memory (OOM) errors when several actors on the same node each load the same removal Parquet files into memory.
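To see why the number of workers per node affects memory, the following rough sketch estimates the footprint when every worker holds its own copy of the ID set. It is an approximation under assumptions: the helper names are hypothetical, the IDs are synthetic, and the stage may store its IDs in a different structure with a different size.

```python
import sys


def approx_id_set_bytes(ids: set[str]) -> int:
    """Rough size of a set of strings: the container plus each string object."""
    return sys.getsizeof(ids) + sum(sys.getsizeof(s) for s in ids)


def approx_node_bytes(ids: set[str], workers_on_node: int) -> int:
    """Assumption: every worker on the node keeps its own full copy of the ID set."""
    return workers_on_node * approx_id_set_bytes(ids)


# Hypothetical removal list of 100,000 synthetic IDs.
ids = {f"img_{i:08d}" for i in range(100_000)}

for workers in (1, 4, 16):
    mib = approx_node_bytes(ids, workers) / 2**20
    print(f"{workers:>2} workers: ~{mib:.1f} MiB")
```

Under this assumption the per-node footprint grows linearly with the worker count, which is why capping num_workers_per_node can help with large removal lists.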
_ids_to_remove
duplicate_id_field
name
num_workers_per_node
removal_parquets_dir
verbose