Downstream Task Decontamination/Deduplication

After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. With large training datasets, there is a risk that this test data leaks into the model’s training dataset. Therefore, NeMo Data Curator follows the approach of OpenAI GPT-3 and Microsoft Turing NLG 530B and removes sections of documents in your dataset that are present in downstream tasks.

While the following steps can be run manually using the commands given, we also provide a SLURM script in the examples folder that follows the same procedure. It must be filled in with the necessary parameters described below before running.

Within the NeMo Data Curator, users can use the prepare_task_data, find_matching_ngrams and remove_matching_ngrams modules to remove any downstream-task data that might be contained within (i.e., “contaminate”) their training data. You will need a list of your downstream tasks to modify the task config (lm_tasks.yaml). If your task does not already exist as a class, you will need to construct a class that extends ndc.deduplication.task.lmtask.DownstreamTask.
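As a rough sketch, a custom task class might look like the following. This is a hypothetical illustration only: the stand-in base class and the build_ngrams helper are assumptions about the interface, and the actual definitions live in ndc/deduplication/task/lmtask.py.

```python
# Hypothetical sketch of adding a custom downstream task. The real base
# class is ndc.deduplication.task.lmtask.DownstreamTask; the stand-in
# below only illustrates the assumed interface (a dict of N-gram keys).
class DownstreamTask:
    """Stand-in for the real base class in lmtask.py."""
    def __init__(self):
        self.ngrams = {}  # keys are N-grams; values are unused


def build_ngrams(text, n):
    """Return all word-level n-grams of `text` as strings."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


class MyQATask(DownstreamTask):
    """Illustrative task built from a list of test documents."""
    def __init__(self, documents, n=8):
        super().__init__()
        for doc in documents:
            for gram in build_ngrams(doc, n):
                self.ngrams[gram] = None  # keys only, matching the pickle format
```

Consult the tasks already defined in ndc/deduplication/task/lmtask.py for the real method names and construction pattern.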

Then, you can start by constructing the N-grams from the task documents using the prepare_task_data module. This module requires an input configuration file that describes the tasks of interest and how to form N-grams from their data. An example of a configuration file is provided in config/lm_tasks.yaml. A number of tasks are already implemented within the NeMo Data Curator and can be found within ndc/deduplication/task/lmtask.py. Should users desire to add their own tasks, they can define their own class similar to those defined in ndc/deduplication/task/lmtask.py. Once all N-grams have been computed, they are stored as the keys of a dictionary that is saved to a pickle file. This step only needs to be done once per set of tasks, and the pickle file can be reused across datasets that share the same downstream tasks.


prepare_task_data \
  --task-config-file=./config/lm_tasks.yaml \
  --output-task-ngrams=./data/task_ngrams.pkl \
  --log-dir=./log/prepare_task_data
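For illustration, the resulting pickle file can be thought of as a plain dictionary keyed by N-gram. The file name matches the command above, but the contents here are made up:

```python
import pickle

# Illustrative contents: keys are task N-grams, values are unused.
task_ngrams = {
    "what is the capital of france": None,
    "the quick brown fox jumps over": None,
}
with open("task_ngrams.pkl", "wb") as f:
    pickle.dump(task_ngrams, f)

# The same file can later be reused for any dataset sharing these tasks.
with open("task_ngrams.pkl", "rb") as f:
    loaded = pickle.load(f)
```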

Once users have computed the task N-grams, they can use the find_matching_ngrams module to search for matches within their corpus. This module takes as input the path to the user's dataset consisting of JSONL files as well as the precomputed task N-grams, and as output provides a pickle file containing the count of how many times each task N-gram occurred within the training set. This N-gram count will be used in the final step to determine if a document should be split and the N-gram removed.


find_matching_ngrams \
  --input-data-dir=<Path to the input directory containing jsonl files> \
  --input-task-ngrams=./data/task_ngrams.pkl \
  --output-matched-ngram-data=./data/matched_ngrams.pkl \
  --log-dir=./log/find_matching_ngrams
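The matching step can be sketched as follows. This is a naive, illustrative implementation with an assumed function name; the real module streams JSONL files and is built for large corpora:

```python
# Naive, illustrative version of the matching step: count how many times
# each task N-gram occurs across the training documents.
def count_matching_ngrams(documents, task_ngrams):
    counts = {gram: 0 for gram in task_ngrams}
    for doc in documents:
        for gram in counts:
            counts[gram] += doc.count(gram)
    # Keep only N-grams that actually occur in the corpus.
    return {gram: c for gram, c in counts.items() if c > 0}
```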

As a final step in the task decontamination procedure, the counts associated with the matched N-grams are used to determine whether a particular N-gram should be removed from the training corpus. If an N-gram's count exceeds a user-defined threshold, it is left in place, as frequent N-grams are likely common phrases rather than contamination. Otherwise, it is removed from the corpus. When an N-gram is removed, a user-defined character window extending from the N-gram in both directions is also removed, and the document is split into two separate documents. If a resulting document is too short after splitting, it is removed. Additionally, documents that are split more than a user-defined number of times are removed from the corpus entirely. For more information on the task decontamination procedure, please see Brown et al., 2020 and Smith et al., 2021.
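The removal rule described above can be sketched as a small function. All parameter names and the min_doc_len default here are assumptions for illustration, not the tool's actual API:

```python
# Illustrative sketch of the removal rule: skip frequent N-grams, otherwise
# cut the N-gram plus a character window and split the document in two.
def remove_ngram(doc, ngram, char_window, count, max_count, min_doc_len=200):
    if count > max_count:
        return [doc]  # too frequent: likely a common phrase, leave intact
    idx = doc.find(ngram)
    if idx == -1:
        return [doc]
    # Drop the N-gram plus a character window on both sides, splitting the
    # document into the text before and after the removed span.
    left = doc[:max(0, idx - char_window)]
    right = doc[idx + len(ngram) + char_window:]
    # Discard resulting pieces that are too short.
    return [part for part in (left, right) if len(part) >= min_doc_len]
```

A full implementation would also track how many times each document has been split and drop documents exceeding the user-defined split limit.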


remove_matching_ngrams \
  --input-data-dir=<Path to the input directory containing jsonl files> \
  --input-matched-ngrams=./data/matched_ngrams.pkl \
  --output-task-deduped-dir=<Output directory containing task-deduped jsonl files> \
  --log-dir=./log/remove_matching_ngrams

© Copyright 2023-2024, NVIDIA. Last updated on Feb 22, 2024.