modules.task
Module Contents
Classes

| Class | Description |
|---|---|
| `TaskDecontamination` | Base class for all NeMo Curator modules. |
API
- class modules.task.TaskDecontamination(
- tasks: nemo_curator.tasks.downstream_task.DownstreamTask | collections.abc.Iterable[nemo_curator.tasks.downstream_task.DownstreamTask],
- text_field: str = 'text',
- max_ngram_size: int = 13,
- max_matches: int = 10,
- min_document_length: int = 200,
- remove_char_each_side: int = 200,
- max_splits: int = 10,
- removed_dir: str | None = None,
- )
Bases:
nemo_curator.modules.base.BaseModule
Base class for all NeMo Curator modules.
Handles validating that data lives on the correct device for each module.
Initialization
Removes segments of downstream evaluation tasks from a dataset.

Args:
- max_ngram_size: The maximum task n-gram size that is considered at once for contamination.
- max_matches: If an n-gram is found more than max_matches times, it is considered too common and will not be removed from the dataset.
- min_document_length: When a document is split, any resulting split below this character length is discarded.
- remove_char_each_side: The number of characters to remove on either side of the matching n-gram.
- max_splits: The maximum number of times a document may be split before being entirely discarded.
- removed_dir: If not None, documents that are split too many times will be written to this directory using the filename in the dataset.
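For orientation, a minimal construction sketch. It assumes downstream task implementations such as `Squad` and `Winogrande` are available in `nemo_curator.tasks` and that their no-argument constructors suffice; substitute whatever tasks your evaluation suite actually uses.

```python
from nemo_curator.modules.task import TaskDecontamination
from nemo_curator.tasks import Squad, Winogrande  # assumed task implementations

# Construct the module; `tasks` accepts a single DownstreamTask or an
# iterable of them. The keyword arguments below show the documented defaults.
decontaminator = TaskDecontamination(
    tasks=[Squad(), Winogrande()],
    text_field="text",          # column that holds each document's text
    max_ngram_size=13,          # longest task n-gram checked for contamination
    max_matches=10,             # n-grams found more often than this are too common and left alone
    min_document_length=200,    # discard splits shorter than this many characters
    remove_char_each_side=200,  # characters trimmed around each matching n-gram
    max_splits=10,              # documents split more often than this are dropped
    removed_dir=None,           # optionally write over-split documents here
)
```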
- call(
- dataset: nemo_curator.datasets.DocumentDataset,
- )
Performs an arbitrary operation on a dataset.
Args: dataset (DocumentDataset): The dataset to operate on.
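A hedged usage sketch, continuing from the constructor example above. It assumes modules inherit a callable interface from `BaseModule` that dispatches to `call`, and that `DocumentDataset.read_json` and `to_json` are available for I/O (paths are placeholders).

```python
from nemo_curator.datasets import DocumentDataset

# Load a JSONL corpus into a DocumentDataset.
dataset = DocumentDataset.read_json("input_data/")

# Calling the module is assumed to dispatch to `call(dataset)` via BaseModule,
# returning a new DocumentDataset with contaminated segments removed.
clean_dataset = decontaminator(dataset)

# Persist the result (writer arguments assumed; adjust to your setup).
clean_dataset.to_json("decontaminated/")
```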
- find_matching_ngrams(
- task_ngrams: dict,
- dataset: nemo_curator.datasets.DocumentDataset,
- )
- prepare_task_ngram_count() -> dict
Computes a dictionary whose keys are all of the n-grams in each task, with every value initialized to 0.
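Based on the description above, the returned mapping looks roughly like this; the concrete n-grams depend entirely on the configured tasks and are shown here only as illustration.

```python
# Keys are task n-grams, values start at 0 and are later used as counters.
task_ngrams = decontaminator.prepare_task_ngram_count()
# e.g. {"which of the following statements": 0, "the answer to the question": 0, ...}
```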
- remove_matching_ngrams(
- matched_ngrams: dict,
- ngram_freq: list[tuple],
- dataset: nemo_curator.datasets.DocumentDataset,
- )
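The lower-level methods can also be chained by hand instead of going through `call`. The sketch below assumes `find_matching_ngrams` returns both the matched n-grams and the frequency list that `remove_matching_ngrams` expects; the dictionary keys shown are illustrative assumptions, not documented on this page.

```python
# 1. Collect every task n-gram with a zeroed count.
task_ngrams = decontaminator.prepare_task_ngram_count()

# 2. Count how often each task n-gram occurs in the dataset.
found = decontaminator.find_matching_ngrams(task_ngrams, dataset)
matched_ngrams = found["matched-ngrams"]  # assumed key
ngram_freq = found["ngrams-freq"]         # assumed key

# 3. Remove the matching spans (plus `remove_char_each_side` characters on
#    each side), splitting or discarding documents as configured.
clean_dataset = decontaminator.remove_matching_ngrams(
    matched_ngrams, ngram_freq, dataset
)
```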