Within the NeMo Data Curator, users can use the prepare_task_data, find_matching_ngrams, and remove_matching_ngrams modules to remove any task data that might be contained within (i.e., that might "contaminate") their training data.
This can be accomplished as follows:
The prepare_task_data module requires an input configuration file that specifies the tasks of interest and describes how N-grams are formed from each task's data. An example configuration file is provided in config/lm_tasks.yaml. A number of tasks are already implemented within the NeMo Data Curator and can be found in ndc/deduplication/task/lmtask.py. Should users desire to add their own tasks, they can define their own classes similar to those in ndc/deduplication/task/lmtask.py. Once all N-grams have been computed, they are written as the keys of a dictionary that is saved to a pickle file.
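For illustration, a minimal sketch of what a user-defined task might look like is shown below, together with the dictionary-keyed pickle output format described above. The class and function names (MyCustomTask, compute_ngrams, write_task_ngrams) are hypothetical; the actual required interface should be taken from the existing classes in ndc/deduplication/task/lmtask.py.

```python
# Hypothetical sketch of a user-defined task; the real base class and
# interface live in ndc/deduplication/task/lmtask.py and may differ.
import json
import pickle


class MyCustomTask:
    def __init__(self, task_file, ngram_size=13):
        self.task_file = task_file    # JSONL file with a "text" field (assumed)
        self.ngram_size = ngram_size  # e.g., 13-grams as in Brown et al., 2020

    def compute_ngrams(self):
        """Collect all word-level N-grams found in the task's text."""
        ngrams = set()
        with open(self.task_file) as f:
            for line in f:
                words = json.loads(line)["text"].split()
                for i in range(len(words) - self.ngram_size + 1):
                    ngrams.add(" ".join(words[i:i + self.ngram_size]))
        return ngrams


def write_task_ngrams(tasks, output_file):
    """Store the union of task N-grams as dictionary keys and pickle them.
    Counts start at zero and are filled in later by the matching stage."""
    ngram_dict = {}
    for task in tasks:
        for ngram in task.compute_ngrams():
            ngram_dict[ngram] = 0
    with open(output_file, "wb") as f:
        pickle.dump(ngram_dict, f)
```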
Once users have computed the task N-grams, they can use the find_matching_ngrams module to search for matches within their corpus. This module takes as input the path to the user's dataset (consisting of JSONL files) as well as the precomputed task N-grams, and outputs a pickle file containing the count of how many times each task N-gram occurred within the training set. These N-gram counts are used in the final step to determine whether a document should be split and the N-gram removed.
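Conceptually, this stage can be pictured as the sketch below, which streams a JSONL corpus and counts occurrences of each precomputed task N-gram. It illustrates the data flow only, not the NeMo Data Curator's actual implementation; the function name and the "text" field are assumptions.

```python
# Illustrative sketch of the matching stage: count how often each task
# N-gram occurs across a JSONL corpus, then pickle the counts.
import json
import pickle


def find_matching_ngrams(jsonl_paths, task_ngram_file, output_file):
    with open(task_ngram_file, "rb") as f:
        counts = pickle.load(f)  # {ngram: 0, ...} from the previous stage
    for path in jsonl_paths:
        with open(path) as f:
            for line in f:
                text = json.loads(line)["text"]
                for ngram in counts:
                    counts[ngram] += text.count(ngram)
    with open(output_file, "wb") as f:
        pickle.dump(counts, f)
```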
As a final step in the task decontamination procedure, the remove_matching_ngrams module uses the counts associated with the matched N-grams to determine whether a particular N-gram should be removed from the training corpus. If an N-gram occurs more times than a user-defined threshold, it is not considered for removal. Otherwise, it is removed from the corpus. When an N-gram is removed, a user-defined character window extending from the N-gram in both directions is removed along with it, and the document is split into two separate documents at that point. Any resulting document that is too short after splitting is discarded, as is any document that is split more than a user-defined number of times. For more information on the task decontamination procedure, please see Brown et al., 2020 and Smith et al., 2021.
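The sketch below illustrates this logic for a single document. The function names and parameters (max_count, char_window, min_doc_length, max_splits) are hypothetical stand-ins for the user-defined settings described above, and for simplicity it removes only the first occurrence of each matched N-gram per document.

```python
# Illustrative sketch of the removal step for one document; names and
# parameters are hypothetical stand-ins for the user-defined settings.
def remove_ngram(text, ngram, char_window):
    """Split a document around one matched N-gram, also removing a
    character window on both sides of the match."""
    idx = text.find(ngram)
    if idx < 0:
        return [text]  # no match: leave the document intact
    left = text[:max(0, idx - char_window)]
    right = text[idx + len(ngram) + char_window:]
    return [left, right]


def decontaminate(text, matched_counts, max_count, char_window,
                  min_doc_length, max_splits):
    docs, n_splits = [text], 0
    for ngram, count in matched_counts.items():
        if count > max_count:
            continue  # above the threshold: not considered for removal
        new_docs = []
        for doc in docs:
            pieces = remove_ngram(doc, ngram, char_window)
            if len(pieces) > 1:
                n_splits += 1
            new_docs.extend(pieces)
        docs = new_docs
    if n_splits > max_splits:
        return []  # split too many times: drop the document entirely
    return [d for d in docs if len(d) >= min_doc_length]
```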
An example of how to run each of these stages to decontaminate task N-grams from training documents can be found in the task_deduplication.sh script in the examples directory.