nemo_curator.stages.text.deduplication.removal
Removal stage for distributed deduplication pipeline.
This stage implements the removal phase of the distributed deduplication approach:
- Takes a DocumentBatch and determines the min/max ID range
- Reads from the parquet files only the IDs to remove that fall within this range
- Filters out documents based on the removal list
- Returns the filtered DocumentBatch
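The four steps above can be sketched with plain pandas. This is a minimal illustration, not the stage's actual implementation: the function name `remove_listed_ids` and the column name `doc_id` are hypothetical, and the removal list is held in memory here rather than read from parquet.

```python
import pandas as pd


def remove_listed_ids(
    batch: pd.DataFrame,
    removal: pd.DataFrame,
    id_field: str = "doc_id",        # hypothetical column name in the input batch
    duplicate_id_field: str = "id",  # column name in the removal list
) -> pd.DataFrame:
    """Drop rows of `batch` whose ID appears in `removal` (illustrative sketch)."""
    if batch.empty:
        return batch

    # Step 1: determine the min/max ID range covered by this batch.
    lo, hi = batch[id_field].min(), batch[id_field].max()

    # Step 2: keep only the removal IDs that fall inside that range.
    in_range = removal[removal[duplicate_id_field].between(lo, hi)]

    # Step 3: filter out documents whose ID is on the removal list.
    keep = ~batch[id_field].isin(in_range[duplicate_id_field])

    # Step 4: return the filtered batch.
    return batch[keep].reset_index(drop=True)


batch = pd.DataFrame({"doc_id": [10, 11, 12, 13], "text": ["a", "b", "a", "c"]})
removal = pd.DataFrame({"id": [3, 12, 40]})
filtered = remove_listed_ids(batch, removal)
print(filtered["doc_id"].tolist())  # [10, 11, 13]
```

Only ID 12 lies inside the batch's range of 10 to 13, so IDs 3 and 40 are discarded before the membership test and never affect the result.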
Module Contents
Classes
API
Bases: ProcessingStage[DocumentBatch, DocumentBatch]
Stage for removing duplicate documents based on pre-computed removal lists.
Parameters:
Path to parquet files containing IDs to remove
Field to use for deduplication within the input dataframe. Defaults to CURATOR_DEDUP_ID_STR.
Field to use for deduplication within the removal dataframe. Defaults to “id”.
Additional arguments for reading parquet files
Initialize parent class after dataclass initialization.
The deduplicator is expected to have written parquet files containing the IDs to remove. This method reads those IDs, drops the matching rows from the input dataframe, and returns the remaining rows. As an optimization, it does not load the full list of IDs to remove into memory; it loads only the IDs that fall within the ID range of the input dataframe.