Classifier and heuristic-based quality filtering

Using the filter_documents utility (explained in detail in 1_document_filtering.rst), the NeMo Data Curator offers both classifier-based and heuristic-based quality filtering of documents.

The classifier-based filtering approach we have implemented closely follows that used in Brown et al., 2020, and trains a binary skip-gram classifier that can be used to distinguish between low- and high-quality documents. To implement this, we use the functions provided by fastText. Following the examples provided in the fastText documentation, we first create a file consisting of high- and low-quality training documents. This can be achieved using the prepare_fasttext_training_data script, which randomly samples documents from an input dataset and prepares them to be used to train a fastText skip-gram classifier. For the high-quality dataset, we recommend sampling from OpenWebText2, Wikipedia, or Books3; an unfiltered version of Common Crawl can be used for the low-quality dataset.
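
To make the shape of such a training file concrete, here is a minimal sketch of sampling documents and writing them in fastText's labeled format (one example per line, prefixed with `__label__`). The function name, the `hq`/`lq` label names, and the toy documents are assumptions made for illustration; this is not the implementation of prepare_fasttext_training_data.

```python
import random

def make_fasttext_samples(documents, label, n_samples, seed=0):
    """Randomly sample documents and format each as one fastText training line."""
    rng = random.Random(seed)
    sampled = rng.sample(documents, min(n_samples, len(documents)))
    lines = []
    for text in sampled:
        # fastText reads one example per line, so collapse newlines and extra spaces.
        flat = " ".join(text.split())
        lines.append(f"__label__{label} {flat}")
    return lines

# Toy stand-ins for a curated corpus and an unfiltered crawl (hypothetical data).
high_quality = ["An encyclopedia article.\nIt spans two lines.", "A well edited book chapter."]
low_quality = ["click here click here", "buy now!!! free free free"]

training_lines = (
    make_fasttext_samples(high_quality, "hq", n_samples=2)
    + make_fasttext_samples(low_quality, "lq", n_samples=2)
)
```

Writing `"\n".join(training_lines)` to a .txt file would then give a file in the format fastText expects for labeled training data.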

Once the samples have been prepared and written to .txt files, users can use the train_fasttext script to train a quality classifier. train_fasttext reads in all of the samples within the .txt files, splits the data into training and validation sets, and trains the binary skip-gram classifier. After training, it evaluates the model on the validation samples, writes the predictions to a jsonl file, and prints the confusion matrix to stdout.
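
The split, train, and evaluate steps described above can be sketched roughly as follows. This is an illustration under assumptions, not the train_fasttext script itself: the helper names and the 10% validation fraction are hypothetical, and the training step assumes the `fasttext` Python package and its `train_supervised` and `predict` calls.

```python
import random
from collections import Counter

def split_samples(lines, validation_fraction=0.1, seed=0):
    """Shuffle labeled lines and split them into (train, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) label pairs; missing pairs read as 0."""
    return Counter(zip(true_labels, predicted_labels))

def train_and_evaluate(train_path, validation_lines):
    """Train on a labeled .txt file, then score the held-out lines."""
    import fasttext  # imported lazily so the helpers above work without it

    model = fasttext.train_supervised(input=train_path)
    true_labels, predicted = [], []
    for line in validation_lines:
        # Each line looks like "__label__hq some document text".
        label, _, text = line.partition(" ")
        labels, _probabilities = model.predict(text)
        true_labels.append(label)
        predicted.append(labels[0])
    return model, confusion_matrix(true_labels, predicted)
```

The off-diagonal entries of the returned counter, such as `("__label__hq", "__label__lq")`, are the misclassified validation samples.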

Finally, once the model has been trained and can provide quality scores, it can be used for quality filtering. Similar to how filter_documents performs language identification with the fastText model lid.176.bin, we provide a default config that can be used for classifier-based quality filtering with a fastText model. Additionally, this filter implements the Pareto-based sampling approach described in Brown et al., 2020.
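
For reference, Brown et al., 2020 keep a document when a Pareto-distributed draw exceeds one minus its quality score (they write this as np.random.pareto(α) > 1 − document_score, with α = 9). A minimal sketch of that rule using only the standard library is shown below; the function name and the example scores are illustrative, and whether the filter in NeMo Data Curator uses the same α is an assumption.

```python
import random

def pareto_keep(score, alpha=9.0, rng=random):
    """Return True if a document with this quality score is kept."""
    # random.paretovariate returns values >= 1; subtracting 1 gives the
    # zero-based form that numpy.random.pareto samples from.
    draw = rng.paretovariate(alpha) - 1.0
    return draw > 1.0 - score

rng = random.Random(0)
# High-scoring documents are kept often; low-scoring ones only occasionally.
kept_high = sum(pareto_keep(0.9, rng=rng) for _ in range(2000))
kept_low = sum(pareto_keep(0.1, rng=rng) for _ in range(2000))
```

The sampling is deliberately noisy: most high-scoring documents are retained, while a small share of low-scoring documents still gets through, so the filtered corpus is not limited to text that resembles the high-quality training set.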

For a complete example of classifier-based filtering and to try it out on your own datasets, please see the example provided in the script ./examples/

As with other filtering steps, the heuristic-based filtering in NeMo Data Curator can be carried out using the filter_documents utility. The filter config file config/heuristic_filter.yaml provides a generic cascaded heuristic filter that has been tested and shown to yield documents that, when used for training, lead to improvements in language model downstream task performance. The cascaded filter is general enough that users should feel free to remove individual filters from the cascade and experiment with different filter configurations and parameters.
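
As a rough illustration of what a cascade of heuristic filters means, the sketch below runs a document through a list of filters in order and rejects it at the first one that fails. The two heuristics, their thresholds, and all names here are hypothetical; they are not the contents of config/heuristic_filter.yaml.

```python
def word_count(text):
    """Number of whitespace-separated tokens in the document."""
    return len(text.split())

def non_alpha_fraction(text):
    """Fraction of non-whitespace characters that are not letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(not c.isalnum() for c in chars) / len(chars)

# Each entry: (filter name, score function, predicate deciding whether to keep).
CASCADE = [
    ("word_count", word_count, lambda s: s >= 5),
    ("non_alpha_fraction", non_alpha_fraction, lambda s: s <= 0.25),
]

def apply_cascade(text, cascade=CASCADE):
    """Return (kept, name of rejecting filter or None, scores computed so far)."""
    scores = {}
    for name, score_fn, keep_fn in cascade:
        scores[name] = score_fn(text)
        if not keep_fn(scores[name]):
            return False, name, scores  # later filters are skipped
    return True, None, scores

good = apply_cascade("This is a short but perfectly readable sentence about filtering.")
noisy = apply_cascade("buy now!!! $$$ ### @@@")
short = apply_cascade("too short")
```

Removing an entry from the list, or changing a threshold, is the kind of experimentation the paragraph above encourages.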

Additionally, these filters have been used for curating high-quality non-English documents. However, when applying them to non-English data, we advise users to write out the document scores by specifying the --document-score-dir argument. This allows users to examine whether a particular filter is responsible for undesirably removing many documents from a corpus.
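
One way to carry out that examination is to tally, per filter, how many documents fall outside its threshold. The sketch below assumes the scores have already been loaded into Python dictionaries; the record layout, filter names, and thresholds are hypothetical and do not describe the files written to --document-score-dir.

```python
from collections import Counter

# Hypothetical per-document scores, one dictionary per document.
records = [
    {"id": "doc-0", "scores": {"word_count": 120, "non_alpha_fraction": 0.05}},
    {"id": "doc-1", "scores": {"word_count": 30, "non_alpha_fraction": 0.10}},
    {"id": "doc-2", "scores": {"word_count": 40, "non_alpha_fraction": 0.40}},
    {"id": "doc-3", "scores": {"word_count": 45, "non_alpha_fraction": 0.02}},
]

# Hypothetical keep conditions for each filter.
thresholds = {
    "word_count": lambda s: s >= 50,
    "non_alpha_fraction": lambda s: s <= 0.25,
}

def removal_counts(records, thresholds):
    """Count, for each filter, the documents whose score fails its condition."""
    counts = Counter()
    for record in records:
        for name, keep_fn in thresholds.items():
            score = record["scores"].get(name)
            if score is not None and not keep_fn(score):
                counts[name] += 1
    return counts

counts = removal_counts(records, thresholds)
```

A filter whose count is far larger than the others, as `word_count` is in this toy data, is a candidate for a relaxed threshold or for removal when curating that language.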

© Copyright 2023, NVIDIA. Last updated on Sep 13, 2023.