modules.semantic_dedup.semdedup#

Module Contents#

Classes#

SemDedup

Base class for all NeMo Curator deduplication modules.

API#

class modules.semantic_dedup.semdedup.SemDedup(
config: nemo_curator.modules.config.SemDedupConfig,
input_column: str = 'text',
id_column: str = 'id',
perform_removal: bool = False,
logger: logging.Logger | str = './',
)#

Bases: nemo_curator.modules.base.BaseDeduplicationModule

Base class for all NeMo Curator deduplication modules.

Initialization

Initialize the SemDedup class.

Args:
config (SemDedupConfig): Configuration for SemDedup.
input_column (str): Column name from the data to be used for embedding generation. Default is “text”.
id_column (str): Column name used as the identifier in the dataset. Default is “id”.
perform_removal (bool): Whether to remove duplicates from the dataset. Default is False.
logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory. Default is “./”.

identify_duplicates(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset#

Identify duplicates in the dataset. Returns a DocumentDataset containing the IDs of documents that are duplicates of one another.
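To make "duplicates of one another" concrete, here is a minimal, self-contained sketch that flags semantic duplicates by comparing embedding vectors. It is not NeMo Curator's implementation: the toy embeddings, the `find_semantic_duplicates` helper, and the `threshold` parameter are hypothetical, and plain Python dictionaries stand in for a DocumentDataset.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_semantic_duplicates(embeddings, threshold=0.95):
    """Hypothetical helper: return the ids whose embedding is at least
    `threshold`-similar to an earlier, already-kept id.

    `embeddings` maps a document id to its embedding vector.
    """
    kept = []        # ids retained as the representative of their group
    duplicates = []  # ids flagged as near-copies of a kept id
    for doc_id, vector in embeddings.items():
        if any(cosine_similarity(vector, embeddings[k]) >= threshold for k in kept):
            duplicates.append(doc_id)
        else:
            kept.append(doc_id)
    return duplicates


# Toy 2-D "embeddings" (assumed values, for illustration only).
toy_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.99, 0.05],  # nearly parallel to "a"
    "c": [0.0, 1.0],    # orthogonal to "a"
}
duplicate_ids = find_semantic_duplicates(toy_embeddings)
```

In this sketch the first document seen in each group of near-identical embeddings is kept and later ones are flagged. A production system would typically avoid the all-pairs comparison used here.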

remove(
dataset: nemo_curator.datasets.DocumentDataset,
duplicates_to_remove: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset#

Remove the duplicates listed in duplicates_to_remove from the dataset.
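The removal step implied by this signature can be sketched with plain Python data. This is an illustration only: `remove_by_id` and the list-of-dicts layout are hypothetical stand-ins, not the library's DocumentDataset or its remove method. The "id" key mirrors the default id_column.

```python
def remove_by_id(documents, duplicate_ids, id_column="id"):
    """Hypothetical helper: return the documents whose id is not in duplicate_ids."""
    to_drop = set(duplicate_ids)  # set membership keeps the filter linear in len(documents)
    return [doc for doc in documents if doc[id_column] not in to_drop]


documents = [
    {"id": 1, "text": "The cat sat on the mat."},
    {"id": 2, "text": "A cat was sitting on the mat."},
    {"id": 3, "text": "Stock prices rose on Tuesday."},
]

# Assume an earlier identification step flagged id 2 as a duplicate of id 1.
deduplicated = remove_by_id(documents, duplicate_ids=[2])
```

The helper returns a new list and leaves the input untouched, so the identification step can be inspected or re-run before the removal is committed.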