modules.task#

Module Contents#

Classes#

TaskDecontamination

Removes segments of downstream evaluation tasks from a dataset.

API#

class modules.task.TaskDecontamination(
tasks: nemo_curator.tasks.downstream_task.DownstreamTask | collections.abc.Iterable[nemo_curator.tasks.downstream_task.DownstreamTask],
text_field: str = 'text',
max_ngram_size: int = 13,
max_matches: int = 10,
min_document_length: int = 200,
remove_char_each_side: int = 200,
max_splits: int = 10,
removed_dir: str | None = None,
)#

Bases: nemo_curator.modules.base.BaseModule

Base class for all NeMo Curator modules.

Handles validating that data lives on the correct device for each module.

Initialization

Removes segments of downstream evaluation tasks from a dataset.

Args:
    max_ngram_size: The maximum amount of task grams that are considered at once for contamination.
    max_matches: If an ngram is found more than max_matches times, it is considered too common and will not be removed from the dataset.
    min_document_length: When a document is split, if a split falls below this character length it is discarded.
    remove_char_each_side: The number of characters to remove on either side of the matching ngram.
    max_splits: The maximum number of times a document may be split before being entirely discarded.
    removed_dir: If not None, documents split too many times will be written to this directory using the filename in the dataset.
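To make the parameters above concrete, here is a minimal pure-Python sketch of the splitting step they control. This is an illustration only, not NeMo Curator's implementation: the function name `split_on_ngram` and its string-based matching are assumptions for the example; the real module operates on a DocumentDataset.

```python
def split_on_ngram(
    text: str,
    ngram: str,
    remove_char_each_side: int = 200,
    min_document_length: int = 200,
) -> list[str]:
    """Split `text` around each occurrence of `ngram`, dropping
    `remove_char_each_side` characters on both sides of the match and
    discarding any resulting split shorter than `min_document_length`.

    Illustrative sketch of the decontamination idea; not the library's code.
    """
    splits = []
    start = 0
    while True:
        idx = text.find(ngram, start)
        if idx == -1:
            break
        # Keep the text before the match, minus the removal margin.
        left_end = max(start, idx - remove_char_each_side)
        splits.append(text[start:left_end])
        # Resume after the match, skipping the removal margin on the right.
        start = idx + len(ngram) + remove_char_each_side
    splits.append(text[start:])
    # Discard splits that fall below the minimum document length.
    return [s for s in splits if len(s) >= min_document_length]
```

A document containing one matching ngram thus yields at most two surviving splits, each trimmed by the margin around the match.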

__call__(
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset#

Performs an arbitrary operation on a dataset.

Args:
    dataset (DocumentDataset): The dataset to operate on

find_matching_ngrams(
task_ngrams: dict,
dataset: nemo_curator.datasets.DocumentDataset,
) → dict#

prepare_task_ngram_count() → dict#

Computes a dictionary of all ngrams in each task as keys and each value set to 0.
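The shape of that dictionary can be sketched in plain Python. This is a hedged illustration, not the library's code: the helper name `build_task_ngrams` and the word-level tokenization are assumptions; the real method derives ngrams from the configured DownstreamTask objects, bounded by max_ngram_size.

```python
def build_task_ngrams(task_texts: list[str], max_ngram_size: int = 13) -> dict:
    """Return a dict whose keys are every ngram of up to `max_ngram_size`
    words in each task text, with every value initialized to 0.

    Illustrative sketch of prepare_task_ngram_count's output shape.
    """
    ngrams = {}
    for text in task_texts:
        words = text.split()
        # Use the full text when it is shorter than the maximum ngram size.
        n = min(max_ngram_size, len(words))
        for i in range(len(words) - n + 1):
            ngrams[" ".join(words[i : i + n])] = 0
    return ngrams
```

The zero-initialized values are then filled in by the matching step, which counts occurrences of each ngram in the dataset.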

remove_matching_ngrams(
matched_ngrams: dict,
ngram_freq: list[tuple],
dataset: nemo_curator.datasets.DocumentDataset,
) → nemo_curator.datasets.DocumentDataset#
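The two methods above fit together as count-then-filter: first tally how often each task ngram appears in the dataset, then remove only the ngrams rare enough to plausibly be contamination (an ngram seen more than max_matches times is treated as too common). The following is a simplified pure-Python sketch of that logic over plain strings; the function names and list-of-strings "dataset" are assumptions for illustration, not the library's API.

```python
def count_matching_ngrams(task_ngrams: dict, documents: list[str]) -> dict:
    """Count occurrences of each task ngram across all documents.

    Illustrative sketch; the real method operates on a DocumentDataset.
    """
    counts = dict(task_ngrams)
    for doc in documents:
        for ngram in counts:
            counts[ngram] += doc.count(ngram)
    return counts


def rare_ngrams(counts: dict, max_matches: int = 10) -> set:
    """Keep only ngrams seen at least once but no more than `max_matches`
    times; anything more frequent is too common to remove."""
    return {ng for ng, c in counts.items() if 0 < c <= max_matches}
```

Only the ngrams returned by `rare_ngrams` would then be stripped from documents, with the splitting behavior governed by remove_char_each_side, min_document_length, and max_splits.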