modules.task#

Module Contents#

Classes#

TaskDecontamination

Removes segments of downstream evaluation tasks from a dataset.

API#

class modules.task.TaskDecontamination(
tasks: nemo_curator.tasks.downstream_task.DownstreamTask | collections.abc.Iterable[nemo_curator.tasks.downstream_task.DownstreamTask],
text_field: str = 'text',
max_ngram_size: int = 13,
max_matches: int = 10,
min_document_length: int = 200,
remove_char_each_side: int = 200,
max_splits: int = 10,
removed_dir: str | None = None,
)#

Bases: nemo_curator.modules.base.BaseModule

Base class for all NeMo Curator modules.

Handles validating that data lives on the correct device for each module.

Initialization

Removes segments of downstream evaluation tasks from a dataset.

Args:
    max_ngram_size: The maximum amount of task grams that are considered at once for contamination.
    max_matches: If an ngram is found more than max_matches times, it is considered too common and will not be removed from the dataset.
    min_document_length: When a document is split, if a split falls below this character length it is discarded.
    remove_char_each_side: The number of characters to remove on either side of the matching ngram.
    max_splits: The maximum number of times a document may be split before being entirely discarded.
    removed_dir: If not None, documents split too many times will be written to this directory using the filename in the dataset.
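To make the parameters above concrete, here is a minimal pure-Python sketch of the splitting step they control. This is an illustration only, not NeMo Curator's implementation: the function name `split_on_ngram` and its string-based matching are assumptions for the example; the real module operates on a DocumentDataset.

```python
def split_on_ngram(
    text: str,
    ngram: str,
    remove_char_each_side: int = 200,
    min_document_length: int = 200,
) -> list[str]:
    """Split `text` around each occurrence of `ngram`, dropping
    `remove_char_each_side` characters on both sides of the match and
    discarding any resulting split shorter than `min_document_length`.

    Illustrative sketch of the decontamination idea; not the library's code.
    """
    splits = []
    start = 0
    while True:
        idx = text.find(ngram, start)
        if idx == -1:
            break
        # Keep the text before the match, minus the removal margin.
        left_end = max(start, idx - remove_char_each_side)
        splits.append(text[start:left_end])
        # Resume after the match, skipping the removal margin on the right.
        start = idx + len(ngram) + remove_char_each_side
    splits.append(text[start:])
    # Discard splits that fall below the minimum document length.
    return [s for s in splits if len(s) >= min_document_length]
```

A document containing one matching ngram thus yields at most two surviving splits, each trimmed by the margin around the match.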

__call__(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset#

Performs an arbitrary operation on a dataset.

Args:
    dataset (DocumentDataset): The dataset to operate on

find_matching_ngrams(
task_ngrams: dict,
dataset: nemo_curator.datasets.DocumentDataset,
) → dict#

prepare_task_ngram_count() → dict#

Computes a dictionary of all ngrams in each task as keys and each value set to 0.
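The shape of that dictionary can be sketched in plain Python. This is a hedged illustration, not the library's code: the helper name `build_task_ngrams` and the word-level tokenization are assumptions; the real method derives ngrams from the configured DownstreamTask objects, bounded by max_ngram_size.

```python
def build_task_ngrams(task_texts: list[str], max_ngram_size: int = 13) -> dict:
    """Return a dict whose keys are every ngram of up to `max_ngram_size`
    words in each task text, with every value initialized to 0.

    Illustrative sketch of prepare_task_ngram_count's output shape.
    """
    ngrams = {}
    for text in task_texts:
        words = text.split()
        # Use the full text when it is shorter than the maximum ngram size.
        n = min(max_ngram_size, len(words))
        for i in range(len(words) - n + 1):
            ngrams[" ".join(words[i : i + n])] = 0
    return ngrams
```

The zero-initialized values are then filled in by the matching step, which counts occurrences of each ngram in the dataset.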

remove_matching_ngrams(
matched_ngrams: dict,
ngram_freq: list[tuple],
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset#
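The two methods above fit together as count-then-filter: first tally how often each task ngram appears in the dataset, then remove only the ngrams rare enough to plausibly be contamination (an ngram seen more than max_matches times is treated as too common). The following is a simplified pure-Python sketch of that logic over plain strings; the function names and list-of-strings "dataset" are assumptions for illustration, not the library's API.

```python
def count_matching_ngrams(task_ngrams: dict, documents: list[str]) -> dict:
    """Count occurrences of each task ngram across all documents.

    Illustrative sketch; the real method operates on a DocumentDataset.
    """
    counts = dict(task_ngrams)
    for doc in documents:
        for ngram in counts:
            counts[ngram] += doc.count(ngram)
    return counts


def rare_ngrams(counts: dict, max_matches: int = 10) -> set:
    """Keep only ngrams seen at least once but no more than `max_matches`
    times; anything more frequent is too common to remove."""
    return {ng for ng, c in counts.items() if 0 < c <= max_matches}
```

Only the ngrams returned by `rare_ngrams` would then be stripped from documents, with the splitting behavior governed by remove_char_each_side, min_document_length, and max_splits.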