Classifier and Heuristic Quality Filtering

Document-level quality filtering involves computing a metric for a given document and removing the document if it fails to meet a threshold defined on that metric. Classifier-based filtering trains a small text classifier to label a document as either high quality or low quality. Heuristic-based filtering uses simple rules (e.g., does the document contain too many punctuation marks?) to score a document. Through the filter_documents utility, the NeMo Data Curator offers both classifier-based and heuristic-based quality filtering of documents. It also offers the tools to train your own custom quality classifier.
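For illustration, a heuristic filter of this kind reduces to computing a metric and comparing it against a threshold. The following minimal Python sketch scores a document by its punctuation ratio; the threshold of 0.25 is hypothetical, and this is not the NeMo Data Curator's implementation:

import string

def punctuation_ratio(text: str) -> float:
    # Fraction of characters in the document that are punctuation.
    if not text:
        return 0.0
    return sum(ch in string.punctuation for ch in text) / len(text)

def keep_document(text: str, max_ratio: float = 0.25) -> bool:
    # Retain the document only if the metric stays below the threshold.
    return punctuation_ratio(text) <= max_ratio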

While the following steps can be run manually using the commands given, we also provide a SLURM script in the examples folder that follows the same procedure. It must be filled in with the necessary parameters described below before running.

The classifier-based filtering approach we have implemented closely follows the approach used in Brown et al., 2020, and trains a binary skip-gram classifier that can be used to distinguish between low- and high-quality documents. To implement this, we use the functions provided by fastText. Following the examples provided in the fastText documentation, we first create a file consisting of high- and low-quality training documents. This can be achieved using the prepare_fasttext_training_data script, which randomly samples documents from an input dataset and prepares them to be used to train a fastText skip-gram classifier. For the high-quality dataset we recommend sampling from OpenWebText2, Wikipedia, or Books3; an unfiltered version of Common Crawl can be used for the low-quality dataset.


prepare_fasttext_training_data \
  --input-data-dir=<Specify the path to common-crawl/low-quality data> \
  --output-num-samples=<Specify the number of low-quality documents to be used for training> \
  --label='__label__cc' \
  --output-train-file=${res_dir}/cc_samples.txt \
  --output-file-sizes=${res_dir}/cc_files_sizes.json \
  --log-dir=${log_dir}/prepare_filter_data_cc

prepare_fasttext_training_data \
  --input-data-dir=<Specify the path to high-quality data> \
  --output-num-samples=<Specify the number of high-quality documents to be used for training> \
  --label='__label__hq' \
  --output-train-file=${res_dir}/hq_samples.txt \
  --output-file-sizes=${res_dir}/hq_files_sizes.json \
  --log-dir=${log_dir}/prepare_filter_data_hq
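Each output .txt file contains one sample per line in fastText's standard supervised-learning format: the label token followed by the document text. For example, with purely illustrative document text:

__label__hq The Industrial Revolution began in Great Britain in the late eighteenth century and ...
__label__cc FREE $$$ click HERE to claim ur prize!!! download now ...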

Once the samples have been prepared and written to .txt files, users can run the train_fasttext script, which reads the samples from the .txt files in order to train a quality classifier. train_fasttext reads all of the samples within the .txt files, splits the data into training and validation sets, and trains the binary skip-gram classifier. After training, it evaluates the model on the validation samples, writes the predictions to a .jsonl file, and prints the confusion matrix to stdout.


train_fasttext \
  --fasttext-files-dir=${res_dir} \
  --output-train-file=${res_dir}/fasttext_samples.train \
  --output-validation-file=${res_dir}/fasttext_samples.valid \
  --output-model=${res_dir}/cc_filter_test.bin \
  --output-predictions=${res_dir}/preds.jsonl
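Under the hood, this step corresponds roughly to the following fastText Python API calls. This is a minimal sketch: the hyperparameters and file names are illustrative and are not necessarily the settings train_fasttext uses:

import fasttext

# Train a supervised (binary) fastText classifier on the combined samples.
model = fasttext.train_supervised(input="fasttext_samples.train", wordNgrams=2)

# test() returns (number of samples, precision@1, recall@1) on the validation set.
n, precision, recall = model.test("fasttext_samples.valid")
print(f"samples={n} precision={precision:.3f} recall={recall:.3f}")

# predict() returns the top label (e.g. '__label__hq') and its probability,
# which can serve as a per-document quality score.
labels, probs = model.predict("Some document text to score", k=1)

model.save_model("cc_filter_test.bin")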

Finally, with the model trained and able to provide quality scores, it can be used for quality filtering. Similar to how filter_documents performs language identification with the fastText model lid.176.bin, we provide a default config that can be used for classifier-based quality filtering with a fastText model. Additionally, this filter implements the Pareto-based sampling approach described in Brown et al., 2020.


filter_documents \
  --input-data-dir=<Specify the path to common-crawl/uncurated data> \
  --filter-config-file=./config/fasttext_quality_filter.yaml \
  --output-retained-document-dir=<Output directory to which high-quality documents will be written> \
  --output-removed-document-dir=<Output directory to which low-quality documents will be written> \
  --log-dir=${log_dir}/fasttext_classifier
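The Pareto-based sampling rule from Brown et al., 2020 keeps a document when a draw from a Pareto distribution (they use shape parameter α = 9) exceeds one minus the document's classifier score, rather than applying a hard threshold. A minimal sketch of that rule follows; the parameters actually configured in fasttext_quality_filter.yaml may differ:

import numpy as np

def keep_by_pareto(quality_score: float, alpha: float = 9.0) -> bool:
    # Brown et al., 2020 retain a document when a Pareto-distributed draw
    # exceeds 1 - score: high-scoring documents are almost always kept,
    # while a small fraction of low-scoring documents still survives,
    # which reduces the classifier's selection bias.
    return np.random.pareto(alpha) > 1.0 - quality_score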

As with classifier-based filtering, the following heuristic-filtering steps can be run manually using the commands given, and we also provide a SLURM script in the examples folder that follows the same procedure. It must be filled in with the necessary parameters described below before running.

As with other filtering steps, the heuristic-based filtering in the NeMo Data Curator can be carried out using the filter_documents utility. The filter config file config/heuristic_filter_en.yaml provides a generic cascaded heuristic filter that has been tested and shown to yield documents that, when used for training, improve language model downstream task performance. The cascaded filter is general enough that users should feel free to remove certain filters within the cascade and experiment with different filter configurations and parameters.
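Conceptually, a cascaded filter applies a sequence of heuristics in order and drops a document at the first one it fails, so removing a filter from the cascade simply shortens that sequence. The following sketch of the control flow uses hypothetical filter names and thresholds, not the contents of heuristic_filter_en.yaml:

from typing import Callable, Optional

def word_count(text: str) -> float:
    return float(len(text.split()))

def mean_word_length(text: str) -> float:
    words = text.split() or [""]
    return sum(len(w) for w in words) / len(words)

# A hypothetical two-stage cascade: (filter name, metric, accepted range).
CASCADE: list[tuple[str, Callable[[str], float], float, float]] = [
    ("word_count", word_count, 50.0, 100_000.0),
    ("mean_word_length", mean_word_length, 3.0, 10.0),
]

def first_failed_filter(text: str) -> Optional[str]:
    # A document is dropped at the first filter whose metric falls outside
    # its accepted range; it is retained only if it passes the whole cascade.
    for name, metric, lo, hi in CASCADE:
        if not (lo <= metric(text) <= hi):
            return name
    return None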

Additionally, these filters have been used for curating high-quality non-English documents. However, when applying them to non-English data, it is advised that users write out the document scores by specifying the --output-document-score-dir argument. This allows users to examine whether a particular filter is responsible for undesirably removing many documents from a corpus.


filter_documents \
  --input-data-dir=<Specify path to input dataset> \
  --filter-config-file=./config/heuristic_filter_en.yaml \
  --output-retained-document-dir=<Output directory to which high-quality documents will be written> \
  --output-removed-document-dir=<Output directory to which low-quality documents will be written> \
  --output-document-score-dir=<Output directory to which document scores will be written> \
  --log-dir=${log_dir}/heuristic_filter
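To spot a filter that is removing an unexpectedly large share of a corpus, the written scores can be aggregated per filter. The sketch below assumes, purely for illustration, a .jsonl schema in which each record carries a "removed_by" field naming the filter that rejected the document; inspect the actual files written by --output-document-score-dir for the real schema:

import json
from collections import Counter
from pathlib import Path

# Tally how many documents each filter removed (hypothetical schema).
removed_by = Counter()
for path in Path("<document score dir>").glob("*.jsonl"):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("removed_by"):
                removed_by[record["removed_by"]] += 1

for filter_name, count in removed_by.most_common():
    print(f"{filter_name}: {count} documents removed")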
