Classifier and Heuristic Quality Filtering

Large datasets often contain many documents considered to be “low quality”. In this context, “low quality” data simply means data we don’t want a downstream model to learn from, and “high quality” data is data that we do want a downstream model to learn from. The metrics that define quality can vary. Heuristics measure quality by gathering simple statistics, such as how many punctuation marks a document has, how long the document is, and how repetitive it is, and then filtering documents based on those statistics. In contrast, you may have a high-quality collection of data that you want a new dataset to align with. You could train a simple classifier to differentiate between documents that look similar to those high-quality documents and documents that do not.
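To make the heuristic side concrete, here is a rough sketch of the kind of per-document statistics such filters compute (illustrative only, not NeMo Curator code):

def heuristic_stats(text: str) -> dict:
    # Simple per-document statistics of the kind heuristic filters use.
    words = text.split()
    lines = text.splitlines()
    return {
        "word_count": len(words),
        "punctuation_ratio": sum(ch in ".,;:!?" for ch in text) / max(len(text), 1),
        "repeated_line_fraction": 1 - len(set(lines)) / max(len(lines), 1),
    }

print(heuristic_stats("A short document.\nA short document.\nSome new text."))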

NeMo Curator offers modules for both kinds of filtering, and it provides an easy interface for adding your own filters and combining them with existing ones. You can also use these modules to collect statistics and metadata on your documents without removing any of them. There are 30+ filters available for English, non-English, and code datasets.

The ScoreFilter module is at the center of filtering in NeMo Curator. Let’s examine this small example:

import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.file_utils import get_all_files_paths_under
from nemo_curator.filters import WordCountFilter

files = get_all_files_paths_under("books_dataset/")
books = DocumentDataset.read_json(files, add_filename=True)

filter_step = nc.ScoreFilter(
    WordCountFilter(min_words=80),
    text_field="text",
    score_field="word_count",
)

long_books = filter_step(books)

long_books.to_json("long_books/", write_to_filename=True)

The central part to focus on is the creation of filter_step. WordCountFilter(min_words=80) creates and configures a filter object. A filter object is an instance of a class that inherits from the abstract base class nemo_curator.filters.DocumentFilter. This base class requires the inheritor to implement two methods, score_document and keep_document. For this example, let’s look at a simplified version of the WordCountFilter.

class WordCountFilter(DocumentFilter):

    def __init__(self, min_words=50, max_words=100000, lang='en'):
        self._min_words = min_words
        self._max_words = max_words
        self._word_splitter = get_word_splitter(lang)
        self._name = 'word_count'

    def score_document(self, text: str):
        return len(self._word_splitter(text))

    def keep_document(self, score: int):
        return self._min_words <= score <= self._max_words

With this implementation, it becomes clear what each function is doing. score_document takes the text of a document and returns the number of words in the document. keep_document takes the score output by score_document (the number of words in this case) and returns True if the score indicates the document should be kept and False if the document should be removed. It’s important to note that WordCountFilter and DocumentFilter only operate on a single document. To apply the filter to a whole dataset, we must use ScoreFilter.
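Before looking at the dataset-level wrapper, note that the per-document methods can be exercised directly on a single string. Here is a quick sketch using the simplified filter above (the text is hypothetical, and get_word_splitter is assumed to be available as in the class definition):

wc_filter = WordCountFilter(min_words=5, max_words=100)

score = wc_filter.score_document("a short example document for scoring")
print(score)                           # 6 words
print(wc_filter.keep_document(score))  # True, since 5 <= 6 <= 100

ScoreFilter lifts this per-document logic to an entire dataset: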

filter_step = nc.ScoreFilter(
    WordCountFilter(min_words=80),
    text_field="text",
    score_field="word_count",
)

The construction of ScoreFilter creates a function that can be applied to a DocumentDataset instead of just a single document. text_field designates the field in the dataset that holds the documents that should get passed to the filter’s score_document function. score_field is an optional argument that allows you to record the score in the given metadata field of the document, and if specified, it will be written to disk with the rest of the metadata.

In some cases, the dataset may come with metadata that you want to filter directly. Or, you might want to simply add a new piece of metadata without filtering on it. The Filter and Score modules allow you to accomplish each of these tasks, respectively.

For example, if the dataset in the above example came pre-populated with the word_count field, you could rewrite it as follows:

books = DocumentDataset.read_json(files, add_filename=True)

filter_step = nc.Filter(
    WordCountFilter(min_words=80).keep_document,
    filter_field="word_count",
)

long_books = filter_step(books)

long_books.to_json("long_books/", write_to_filename=True)

Alternatively, if you simply want to track the word counts of the documents without filtering on them, you could rewrite it as follows:

books = DocumentDataset.read_json(files, add_filename=True)

filter_step = nc.Score(
    WordCountFilter(min_words=80).score_document,
    text_field="text",
    score_field="word_count",
)

annotated_books = filter_step(books)

annotated_books.to_json("annotated_books/", write_to_filename=True)

Batched Filtering

While the scoring and filtering functions defined above operate on single documents, NeMo Curator can take advantage of functions that operate in batches for improved performance. To accomplish this, you can annotate your functions with the batched decorator. This decorator causes a pandas Series of documents/scores to be passed to the function instead of a single document/score. Here is the WordCountFilter rewritten to use batches in keep_document.

import pandas as pd

from nemo_curator.utils.decorators import batched

class WordCountFilter(DocumentFilter):

    def __init__(self, min_words=50, max_words=100000, lang='en'):
        self._min_words = min_words
        self._max_words = max_words
        self._word_splitter = get_word_splitter(lang)
        self._name = 'word_count'

    def score_document(self, text: str):
        return len(self._word_splitter(text))

    @batched
    def keep_document(self, scores: pd.Series):
        pass_min = self._min_words <= scores
        pass_max = scores <= self._max_words
        return pass_min & pass_max
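If scoring is the bottleneck rather than the keep/remove decision, the batched decorator can be applied to score_document in the same way, in which case the method receives a pandas Series of document texts. A minimal sketch under that assumption (BatchedWordCountFilter is a hypothetical name):

import pandas as pd

from nemo_curator.utils.decorators import batched

class BatchedWordCountFilter(WordCountFilter):

    @batched
    def score_document(self, texts: pd.Series) -> pd.Series:
        # Receives a Series of document texts and returns a Series of word counts.
        return texts.apply(lambda text: len(self._word_splitter(text)))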

Classifier Filtering

The classifier-based filtering approach we have implemented closely follows that used in Brown et al., 2020, and trains a binary skip-gram classifier that can be used to distinguish between low- and high-quality documents. To implement this, we use the functions provided by fastText. Following the examples provided in the fastText documentation, we first create a file consisting of high- and low-quality training documents. We provide an example of how to train and use a model in examples/classifier_filtering.py.
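In outline, the fastText side of that example looks like the following (a minimal sketch with hypothetical file names; see examples/classifier_filtering.py for the complete workflow):

import fasttext

# Each line of the training file is "__label__hq <text>" or "__label__cc <text>".
model = fasttext.train_supervised(input="fasttext_samples.train")

# The trained model scores new text with a label and a probability.
labels, probs = model.predict("An example paragraph to score.")
print(labels, probs)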

We also provide CLI scripts for the same functionality. The prepare_fasttext_training_data script randomly samples documents from an input dataset and prepares them for use in training a fastText skip-gram classifier. For a high-quality dataset, we recommend sampling from either OpenWebText2 or Wikipedia; an unfiltered version of Common Crawl can be used as the low-quality dataset.

prepare_fasttext_training_data \
  --input-data-dir=<Specify the path to common-crawl/low-quality data> \
  --output-num-samples=<Specify the number of low-quality documents to be used for training> \
  --label='__label__cc' \
  --output-train-file=${res_dir}/cc_samples.txt

prepare_fasttext_training_data \
  --input-data-dir=<Specify the path to high-quality data> \
  --output-num-samples=<Specify the number of high-quality documents to be used for training> \
  --label='__label__hq' \
  --output-train-file=${res_dir}/hq_samples.txt

Once the samples have been prepared and written to .txt files, users can use the train_fasttext script to train a quality classifier. train_fasttext reads in all of the samples within the .txt files, splits the data into training and validation sets, and trains the binary skip-gram classifier. After training, it evaluates the model on the validation samples, writes the predictions to a .jsonl file, and prints the confusion matrix to stdout.

train_fasttext \
  --fasttext-files-dir=${res_dir} \
  --output-train-file=${res_dir}/fasttext_samples.train \
  --output-validation-file=${res_dir}/fasttext_samples.valid \
  --output-model=${res_dir}/cc_filter_test.bin \
  --output-predictions=${res_dir}/preds.jsonl

Finally, with the model trained and able to provide quality scores, it can be used for quality filtering. Similar to how filter_documents performs language identification with the fastText model lid.176.bin, we provide a default config that can be used for classifier-based quality filtering with a fastText model. Additionally, this filter implements the Pareto-based sampling approach described in Brown et al., 2020.

filter_documents \
  --input-data-dir=<Specify the path to common-crawl/uncurated data> \
  --filter-config-file=./config/fasttext_quality_filter.yaml \
  --output-retained-document-dir=<Output directory to which high-quality documents will be written> \
  --output-removed-document-dir=<Output directory to which low-quality documents will be written> \
  --log-dir=${log_dir}/fasttext_classifier
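For reference, the Pareto-based sampling in Brown et al., 2020 keeps a document when its classifier quality score beats a random draw from a Pareto distribution, so that lower-scoring documents are still occasionally retained. A sketch of that rule (alpha = 9 as in the paper; document_score is the classifier’s quality score in [0, 1]):

import numpy as np

def keep_by_pareto(document_score: float, alpha: float = 9.0) -> bool:
    # Keep the document when its score beats a Pareto-distributed draw.
    return np.random.pareto(alpha) > 1 - document_score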

Heuristic Filtering

As with other filtering steps, heuristic-based filtering in NeMo Curator can be carried out using ScoreFilter or the filter_documents utility. Filters can be chained in NeMo Curator using Sequential as follows.

filter_step = nc.Sequential([
    ScoreFilter(
        WordCountFilter(min_words=80),
        score_field="word_count",
    ),
    ScoreFilter(IncompleteStoryFilter()),
    ScoreFilter(RepeatingTopNGramsFilter(n=2, max_repeating_ngram_ratio=0.2)),
    ScoreFilter(RepeatingTopNGramsFilter(n=3, max_repeating_ngram_ratio=0.18)),
    ScoreFilter(RepeatingTopNGramsFilter(n=4, max_repeating_ngram_ratio=0.16)),
])
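As with a single ScoreFilter, the composed step is applied in one call; for example, reusing the books dataset from earlier:

filtered_books = filter_step(books)

filtered_books.to_json("filtered_books/", write_to_filename=True)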

The filter config file config/heuristic_filter.yaml provides a generic list of heuristic filters that have been tested and shown to yield documents that, when used for training, lead to improvements in language model downstream task performance. The filters are general enough that users should feel free to remove certain filters within the cascade and experiment with different filter configurations and parameters.

Additionally, these filters have been used for curating high-quality non-English documents. However, when applying them to non-English data, it is advised that users write out the document scores by specifying the --output-document-score-dir argument. This allows users to examine whether a particular filter is responsible for undesirably removing many documents from a corpus.

filter_documents \
  --input-data-dir=<Specify path to input dataset> \
  --filter-config-file=./config/heuristic_filter_en.yaml \
  --output-retained-document-dir=<Output directory to which high-quality documents will be written> \
  --output-removed-document-dir=<Output directory to which low-quality documents will be written> \
  --output-document-score-dir=<Output directory to which document scores will be written> \
  --log-dir=${log_dir}/heuristic_filter
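Once the scores are on disk, they can be examined to see how each filter behaves. A sketch that assumes the scores are written as JSONL with one column per filter score (the actual layout and paths may differ):

import pandas as pd

# Hypothetical path to one shard of the written document scores.
scores = pd.read_json("document_scores/scores_0.jsonl", lines=True)

# Distribution of each filter's score across the corpus.
print(scores.describe())

# Fraction of documents falling below a given word-count threshold.
print((scores["word_count"] < 80).mean())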
