Using the filter_documents
utility (explained in detail in 1_document_filtering.rst), the NeMo Data Curator offers
both classifier-based and heuristic-based quality filtering of documents.
The classifier-based filtering approach we have implemented closely follows that used in Brown et al., 2020,
and trains a binary skip-gram classifier that can be used to distinguish between low- and high-quality documents. To implement this, we use the
functions provided by fastText. Following the examples provided in the fastText documentation, we first create a file consisting of
high- and low-quality training documents. This can be achieved using the prepare_fasttext_training_data
script, which randomly samples documents
from an input dataset and prepares them for training a fastText skip-gram classifier. For the high-quality dataset, we recommend sampling from
OpenWebText2, Wikipedia, or Books3; an unfiltered version of Common Crawl can be used for the low-quality dataset.
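As a rough illustration of what the prepared samples look like, the sketch below writes documents from two sources into the one-example-per-line format that fastText's supervised mode expects. The label strings, input paths, and output filename here are placeholders chosen for this example; prepare_fasttext_training_data handles this step for you.

.. code-block:: python

    import json

    # Hypothetical input paths and label strings -- the actual labels and file
    # layout are produced by the prepare_fasttext_training_data script.
    sources = [
        ("__label__hq", "data/openwebtext2_sample.jsonl"),  # high-quality source
        ("__label__lq", "data/common_crawl_sample.jsonl"),  # low-quality source
    ]

    with open("fasttext_samples.train.txt", "w") as out:
        for label, path in sources:
            with open(path) as f:
                for line in f:
                    text = json.loads(line)["text"]
                    # fastText expects one example per line: "<label> <text>",
                    # so collapse all whitespace (including newlines).
                    out.write(f"{label} {' '.join(text.split())}\n")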
Once the samples have been prepared and written to .txt
files, users can use the train_fasttext
script to train a quality classifier. train_fasttext
reads in all of the samples within the .txt
files, splits the data into training and
validation sets, and trains the binary skip-gram classifier. After training, it evaluates the model on the validation samples, writes the predictions
to a jsonl file, and prints the confusion matrix to stdout.
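For readers who prefer to experiment directly with the fastText Python API, the following is a minimal sketch of the train-and-evaluate step. The file paths and hyperparameters are placeholders for illustration, not the defaults used by train_fasttext.

.. code-block:: python

    import fasttext

    # Train a supervised fastText classifier on the labeled samples produced
    # in the previous step. Hyperparameters here are illustrative only.
    model = fasttext.train_supervised(
        input="fasttext_samples.train.txt",
        lr=0.1,
        epoch=5,
        wordNgrams=2,
    )

    # Evaluate on a held-out validation file: returns (N, precision@1, recall@1).
    n, precision, recall = model.test("fasttext_samples.valid.txt")
    print(f"validation samples={n} precision={precision:.3f} recall={recall:.3f}")

    # Score a single document: returns the predicted label and its probability.
    labels, probs = model.predict("Example document text to score.")
    print(labels[0], probs[0])

    model.save_model("quality_classifier.bin")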
Finally, with the model trained and able to provide quality scores, it can be used for quality filtering. Similar to how
filter_documents
performs language identification with the fastText model lid.176.bin
, we provide a default config that can
be used for classifier-based quality filtering with a fastText model. Additionally, this filter implements the Pareto-based sampling approach
described in Brown et al., 2020.
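The idea behind Pareto-based sampling is that, rather than applying a hard threshold to the classifier score, each document is kept with a probability that grows with its quality score, so some lower-scoring documents are still retained. A minimal sketch, assuming the keep rule reported in Brown et al., 2020 (the alpha value used by the default config may differ):

.. code-block:: python

    import numpy as np

    def keep_document(quality_score: float, alpha: float = 9.0) -> bool:
        """Pareto-based sampling in the spirit of Brown et al., 2020.

        quality_score is the classifier's probability that the document is
        high quality. High-scoring documents are almost always kept, while
        low-scoring documents are kept only occasionally, preserving some
        diversity in the filtered corpus. alpha=9 is the value reported in
        the paper; treat it as an assumption here.
        """
        return np.random.pareto(alpha) > 1.0 - quality_score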
For a complete example of classifier-based filtering and to try it out on your own datasets, please see the example provided in the script ./examples/classifier_filtering.sh.
As with other filtering steps, the heuristic-based filtering in NeMo Data Curator can be carried out using the filter_documents
utility. The filter config file config/heuristic_filter.yaml
provides a generic cascaded heuristic filter that has been tested
and shown to provide documents that, when used for training, lead to improvements in language model downstream task performance.
The cascaded filter is general enough that users should feel free to remove certain filters within the cascade and experiment
with the results of different filter configurations and parameters.
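To make the cascade idea concrete, the sketch below chains two toy heuristics and keeps a document only if it passes every check. The checks and thresholds are illustrative placeholders, not the actual filters or values defined in config/heuristic_filter.yaml.

.. code-block:: python

    # An illustrative (not exhaustive) cascade of simple heuristics.
    def word_count_ok(text: str, min_words: int = 50, max_words: int = 100000) -> bool:
        n = len(text.split())
        return min_words <= n <= max_words

    def symbol_ratio_ok(text: str, max_ratio: float = 0.1) -> bool:
        symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
        return symbols / max(len(text), 1) <= max_ratio

    CASCADE = [word_count_ok, symbol_ratio_ok]

    def passes_cascade(text: str) -> bool:
        # A document is kept only if it passes every filter in the cascade;
        # dropping an entry from CASCADE is analogous to removing a filter
        # from the YAML config when experimenting with configurations.
        return all(check(text) for check in CASCADE)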
Additionally, these filters have been used for curating high-quality non-English documents. However, it is advised that, when applying
the filters to non-English data, users write out the document scores by specifying the --document-score-dir
argument. This will allow users to
examine whether a particular filter is responsible for undesirably removing many documents from a corpus.
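As one way to inspect those scores, the sketch below tallies how many documents each filter rejects. It assumes a hypothetical layout of one .jsonl file per filter containing a "score" field; adjust the parsing to match the files actually written to your --document-score-dir.

.. code-block:: python

    import glob
    import json
    import os
    from collections import Counter

    # Hypothetical score directory and record format -- inspect the files
    # written by --document-score-dir and adapt the parsing as needed.
    rejections = Counter()
    for path in glob.glob("doc_scores/*.jsonl"):
        filter_name = os.path.splitext(os.path.basename(path))[0]
        with open(path) as f:
            for line in f:
                record = json.loads(line)
                if not record.get("score"):  # treat a falsy score as a rejection
                    rejections[filter_name] += 1

    for name, count in rejections.most_common():
        print(f"{name}: {count} documents rejected")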