modules.semantic_dedup.semdedup
Module Contents
Classes
SemDedup | Base class for all NeMo Curator deduplication modules.
API
- class modules.semantic_dedup.semdedup.SemDedup(
- config: nemo_curator.modules.config.SemDedupConfig,
- input_column: str = 'text',
- id_column: str = 'id',
- perform_removal: bool = False,
- logger: logging.Logger | str = './',
- )
Bases:
nemo_curator.modules.base.BaseDeduplicationModule
Base class for all NeMo Curator deduplication modules.
Initialization
Initialize the SemDedup class.
Args:
    config (SemDedupConfig): Configuration for SemDedup.
    input_column (str): Column name from the data to be used for embedding generation. Default is "text".
    id_column (str): Column name used as the identifier in the dataset. Default is "id".
    perform_removal (bool): Whether to remove duplicates from the dataset. Default is False.
    logger (Union[logging.Logger, str]): Existing logger to log to, or a path to a log directory. Default is "./".
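The logger argument accepts either an existing logging.Logger or a path to a log directory. The following is a minimal sketch of how such a union-typed argument could be resolved, using only the Python standard library. The helper name resolve_logger, the logger name, and the log file name are hypothetical and chosen for illustration; NeMo Curator's own handling may differ.

```python
import logging
import os
from typing import Union


def resolve_logger(logger: Union[logging.Logger, str], name: str = "SemDedup") -> logging.Logger:
    """Hypothetical helper, not part of NeMo Curator.

    Returns `logger` unchanged if it is already a Logger; otherwise treats it
    as a directory path and builds a Logger that writes to a file inside it.
    """
    if isinstance(logger, logging.Logger):
        return logger

    # `logger` is a directory path: create it if needed.
    os.makedirs(logger, exist_ok=True)
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)

    # Assumed file name "<name>.log"; attach a handler only once per logger.
    log_path = os.path.join(logger, f"{name}.log")
    if not log.handlers:
        log.addHandler(logging.FileHandler(log_path))
    return log
```

Passing an already configured logger keeps all output in one place, while passing a directory lets the module manage its own log file.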
- identify_duplicates(
- dataset: nemo_curator.datasets.DocumentDataset,
- )
Identify duplicates in the dataset. Returns a list of ids that are duplicates of each other.
- remove(
- dataset: nemo_curator.datasets.DocumentDataset,
- duplicates_to_remove: nemo_curator.datasets.DocumentDataset,
- )
Remove duplicates from the dataset.
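The two methods above form an identify-then-remove workflow, with perform_removal deciding whether the second step runs. The sketch below illustrates that contract on plain Python dictionaries. It is a toy under stated assumptions: the function bodies are hypothetical, the wrapper dedup is invented for illustration, and normalized exact-text matching stands in for the embedding-based similarity that semantic deduplication actually relies on.

```python
def identify_duplicates(records, input_column="text", id_column="id"):
    """Toy stand-in: return ids of records whose normalized text was already seen.

    Real semantic deduplication compares embeddings; exact matching on
    lowercased, whitespace-collapsed text is used here only for brevity.
    """
    seen = set()
    duplicate_ids = []
    for rec in records:
        key = " ".join(rec[input_column].lower().split())
        if key in seen:
            duplicate_ids.append(rec[id_column])
        else:
            seen.add(key)
    return duplicate_ids


def remove(records, duplicates_to_remove, id_column="id"):
    """Drop every record whose id appears in duplicates_to_remove."""
    drop = set(duplicates_to_remove)
    return [rec for rec in records if rec[id_column] not in drop]


def dedup(records, perform_removal=False):
    """Hypothetical wrapper: return duplicate ids, or the filtered records."""
    duplicates = identify_duplicates(records)
    if perform_removal:
        return remove(records, duplicates)
    return duplicates


records = [
    {"id": 1, "text": "Hello world"},
    {"id": 2, "text": "hello   WORLD"},
    {"id": 3, "text": "Something else"},
]

duplicate_ids = dedup(records)                   # ids flagged as duplicates
kept = dedup(records, perform_removal=True)      # records that survive removal
```

Keeping identification separate from removal lets the flagged ids be inspected before any data is dropped.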