Downstream Task Decontamination#
Background#
After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for this test data to leak into the model's training dataset. Therefore, NVIDIA NeMo Curator follows the approach of OpenAI's GPT-3 and Microsoft's Turing NLG 530B and removes sections of documents in your dataset that are present in downstream tasks.
Usage#
The `TaskDecontamination` module provides the central functionality in NVIDIA NeMo Curator. Let's examine this small example:
```python
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.tasks import Winogrande, Squad, TriviaQA

# Load the dataset to decontaminate
files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
books = DocumentDataset.read_json(files, add_filename=True)

# Benchmark tasks whose test data must not appear in the training set
downstream_tasks = [
    Winogrande(),
    Squad(),
    TriviaQA(),
]

task_decontaminate = nc.TaskDecontamination(downstream_tasks)
decontaminated_books = task_decontaminate(books)

decontaminated_books.to_json("decontaminated_books/", write_to_filename=True)
```
Parameters#
The `TaskDecontamination` class accepts several parameters to control the decontamination process:
| Parameter | Default | Description |
|---|---|---|
| `tasks` | Required | A single task or list of `DownstreamTask` objects to decontaminate against |
| `text_field` | `"text"` | Field in the dataset containing the document text |
| `max_ngram_size` | `13` | Maximum size of n-grams to check for contamination |
| `max_matches` | `10` | If an n-gram appears more than this many times, it's considered too common and not removed |
| `min_document_length` | `200` | Minimum character length for split documents to be kept |
| `remove_char_each_side` | `200` | Number of characters to remove on either side of a matching n-gram |
| `max_splits` | `10` | Maximum number of splits allowed before discarding a document entirely |
| `removed_dir` | `None` | Optional directory to save discarded documents |
For example, to use more aggressive removal settings:
```python
task_decontaminate = nc.TaskDecontamination(
    tasks=downstream_tasks,
    max_ngram_size=10,          # Use smaller n-grams for matching
    max_matches=5,              # Treat n-grams seen in more than 5 documents as too common to remove
    remove_char_each_side=300,  # Remove more context around matches
    min_document_length=500,    # Keep only longer document fragments
)
```
Available Downstream Tasks#
NVIDIA NeMo Curator provides implementations for many common benchmark tasks, including the following:
| Task Category | Available Tasks |
|---|---|
| Question Answering | `Squad`, `TriviaQA`, `Quac`, `WebQA` |
| Reading Comprehension | `Race`, `COQA`, `Drop`, `MultiRC`, `Record` |
| Commonsense Reasoning | `PIQA`, `Copa`, `Winogrande`, `StoryCloze` |
| Natural Language Inference | `ANLI`, `CB`, `RTE` |
| Knowledge Tasks | `ArcEasy`, `ArcChallenge`, `OpenBookQA`, `BoolQ` |
| Multi-task Benchmarks | `MMLU`, `BigBenchHard`, `BigBenchLight` |
| Specialized Tasks | `WiC`, `Lambada`, `NumDasc`, `Multilingual` |
You can import these tasks directly from the `nemo_curator.tasks` module:
```python
from nemo_curator.tasks import Squad, TriviaQA, MMLU, Winogrande, ANLI
```
Task Decontamination Process#
If you'd like more fine-grained control over the task decontamination process, NVIDIA NeMo Curator provides several CLI tools you can apply manually. You can use the `prepare_task_data`, `find_matching_ngrams`, and `remove_matching_ngrams` scripts to remove any task data that might be contained within (that is, contaminate) your training data. You'll need a list of your downstream tasks to modify the task configuration file (`lm_tasks.yaml`). If your task doesn't already exist as a class, you'll need to construct a class that extends `nemo_curator.tasks.DownstreamTask`.
1. Prepare Task N-grams#
First, construct the n-grams from task documents using the `prepare_task_data` module:
```bash
prepare_task_data \
    --task-config-file=./config/lm_tasks.yaml \
    --output-task-ngrams=./data/task_ngrams.pkl
```
This module requires a configuration file that specifies how to form n-grams from the task data. An example configuration is provided in `config/lm_tasks.yaml`. This step only needs to be done once per set of tasks, and the resulting pickle file can be reused across datasets.
The n-gram generation process:

- Extracts text from each task's test examples
- Tokenizes the text into words
- Creates n-grams of varying sizes (up to `max_ngram_size`)
- Stores these n-grams in a dictionary
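Conceptually, the n-gram construction resembles the sketch below. This is an illustration only, not NeMo Curator's actual implementation: `build_task_ngrams` is a hypothetical helper, and the whitespace tokenization is an assumption.

```python
def build_task_ngrams(text: str, max_ngram_size: int = 13) -> dict[str, int]:
    """Collect n-grams from one piece of task text (illustrative sketch)."""
    words = text.split()  # assumption: simple whitespace tokenization
    ngrams: dict[str, int] = {}
    if len(words) < max_ngram_size:
        # Short examples contribute a single n-gram covering the whole text
        ngrams[" ".join(words)] = 0
    else:
        # Slide a window of max_ngram_size words over the text
        for i in range(len(words) - max_ngram_size + 1):
            ngrams[" ".join(words[i : i + max_ngram_size])] = 0
    return ngrams
```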
2. Find Matching N-grams#
Next, use the `find_matching_ngrams` module to search for matches within your corpus:
```bash
find_matching_ngrams \
    --input-data-dir=<Path to the input directory containing jsonl files> \
    --input-task-ngrams=./data/task_ngrams.pkl \
    --output-matched-ngram-data=./data/matched_ngrams.pkl
```
This module:

- Loads the precomputed task n-grams
- Searches each document in your dataset for these n-grams
- Counts occurrences of each n-gram across the entire corpus
- Outputs a dictionary of n-grams and their frequencies
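At its core, this step is a frequency count over the corpus. A minimal sketch of the idea, assuming the `task_ngrams` dictionary from step 1 and a single n-gram size (the `count_ngram_matches` helper is hypothetical, not part of the NeMo Curator API):

```python
def count_ngram_matches(
    documents: list[str], task_ngrams: dict[str, int], max_ngram_size: int = 13
) -> dict[str, int]:
    """Count how often each task n-gram occurs across the corpus (sketch)."""
    counts = dict.fromkeys(task_ngrams, 0)
    for doc in documents:
        words = doc.split()
        # Test every window of max_ngram_size words against the task n-grams
        for i in range(len(words) - max_ngram_size + 1):
            candidate = " ".join(words[i : i + max_ngram_size])
            if candidate in counts:
                counts[candidate] += 1
    return counts
```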
3. Remove Matching N-grams#
Finally, use the `remove_matching_ngrams` module to remove contaminated content:
```bash
remove_matching_ngrams \
    --input-data-dir=<Path to the input directory containing jsonl files> \
    --input-matched-ngrams=./data/matched_ngrams.pkl \
    --output-task-deduped-dir=<Output directory containing task-deduped jsonl files>
```
This module:

- Loads the matched n-grams and their frequencies
- Identifies n-grams that appear fewer than `max_matches` times (rare enough to be actual task contamination)
- For each document containing these n-grams:
  - Removes the n-gram and surrounding characters (up to `remove_char_each_side` on each side)
  - Splits the document at removal points
  - Keeps only the split fragments longer than `min_document_length`
  - Discards documents that require more than `max_splits` splits
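Taken together, the removal logic behaves roughly like the character-level sketch below. It is a simplified illustration using the default parameter values; the `remove_contamination` helper is hypothetical, and the real module operates on tokenized text and handles many more edge cases.

```python
def remove_contamination(
    doc: str,
    rare_ngrams: list[str],
    remove_char_each_side: int = 200,
    min_document_length: int = 200,
    max_splits: int = 10,
) -> list[str] | None:
    """Cut matched n-grams (plus surrounding context) out of one document."""
    pieces = [doc]
    splits = 0
    for ngram in rare_ngrams:
        new_pieces = []
        for piece in pieces:
            idx = piece.find(ngram)
            if idx == -1:
                new_pieces.append(piece)
                continue
            splits += 1
            # Remove the n-gram plus up to remove_char_each_side characters
            # on each side, splitting the document at the removal point
            new_pieces.append(piece[: max(0, idx - remove_char_each_side)])
            new_pieces.append(piece[idx + len(ngram) + remove_char_each_side :])
        pieces = new_pieces
    if splits > max_splits:
        return None  # too fragmented; discard the document entirely
    return [p for p in pieces if len(p) >= min_document_length]
```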
Creating Custom Downstream Tasks#
If you need to decontaminate against a task not included in NeMo Curator, you can create your own task class:
```python
from nemo_curator.tasks import DownstreamTask

class MyCustomTask(DownstreamTask):
    def __init__(self):
        super().__init__()
        self._task_name = "my_custom_task"

    def generate_ngrams(self):
        # Load your task's test data
        test_examples = load_my_test_data()

        # Process each example and update ngrams
        for example in test_examples:
            # If your task has multiple text fields, process each one
            self._update_ngrams(example["question"])
            self._update_ngrams(example["context"])

        return self._ngrams
```
You can then use this custom task with the `TaskDecontamination` module:

```python
task_decontaminate = nc.TaskDecontamination([MyCustomTask()])
```
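Custom tasks can also be mixed with the built-in ones in the same list:

```python
task_decontaminate = nc.TaskDecontamination([MyCustomTask(), Squad(), TriviaQA()])
```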
Performance Considerations#
Task decontamination can be computationally intensive for large datasets. Consider these optimization strategies:
- **Prioritize important tasks**: Start with the most critical benchmark tasks for your application
- **Process in batches**: Decontaminate your dataset in manageable chunks (see the sketch after this list)
- **Save intermediate results**: Store the results from each step of the CLI workflow
- **Adjust n-gram size**: Smaller values of `max_ngram_size` reduce computation but may increase false positives
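As one way to apply the batching advice above, you can reuse the earlier Python example and loop over groups of input files. This is a sketch; the `batch_size` value is an arbitrary assumption to be tuned to your hardware and dataset.

```python
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.tasks import Winogrande, Squad, TriviaQA
from nemo_curator.utils.file_utils import get_all_files_paths_under

files = get_all_files_paths_under("books_dataset/", keep_extensions="jsonl")
batch_size = 64  # assumption: number of files per batch

task_decontaminate = nc.TaskDecontamination([Winogrande(), Squad(), TriviaQA()])

# Decontaminate the corpus one group of files at a time
for start in range(0, len(files), batch_size):
    batch = DocumentDataset.read_json(
        files[start : start + batch_size], add_filename=True
    )
    task_decontaminate(batch).to_json("decontaminated_books/", write_to_filename=True)
```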