Text cleaning and language separation

After the documents have been downloaded and extracted from the WARC records into jsonl format, the text_cleaning and separate_by_language modules enable users to perform a secondary pass of language identification with fastText language identification models, separate the documents by language, and then, within each target language, fix documents with improperly decoded Unicode.

To perform the secondary pass of language identification, we can use the config file provided in the config directory and supply the path to a local copy of the lid.176.bin fastText language identification model. Then, with the filter_documents utility, we can compute a language score and code for each document in the corpus as follows:


filter_documents \
  --input-data-dir=<Path to directory containing jsonl files> \
  --filter-config-file=./config/fasttext_langid.yaml \
  --log-scores \
  --log-dir=./log_dir/lang_id

This will apply the fastText model to each document, compute the score, obtain the language class, and then write this information as additional keys within each JSON document. For more information on the filter_documents utility, please see the doc 1_document_filtering.rst.
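The augmentation step described above can be sketched as follows. This is a simplified stand-in, not the utility's actual implementation: the key names `language` and `language_score` are assumptions, and the stub predictor below takes the place of the real lid.176.bin fastText model, which returns a label such as `__label__en` together with a confidence score.

```python
import json

def predict_language(text):
    # Stand-in for the fastText model's prediction; the real pipeline
    # loads lid.176.bin and returns a "__label__xx" label plus a score.
    return "__label__en", 0.98

def add_language_keys(jsonl_lines):
    """Attach a language code and score as extra keys on each JSON document."""
    augmented = []
    for line in jsonl_lines:
        doc = json.loads(line)
        label, score = predict_language(doc.get("text", ""))
        doc["language"] = label.replace("__label__", "").upper()
        doc["language_score"] = score
        augmented.append(json.dumps(doc))
    return augmented

print(add_language_keys(['{"text": "Hello world"}']))
```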

With the language information present within the keys of each JSON document, the separate_by_language module will first construct a count of the documents by language within the corpus, and then use that information to split each file across all the languages present within that file. Below is an example run command for separate_by_language:


separate_by_language \
  --input-data-dir=<Path to the input directory containing jsonl files> \
  --output-data-dir=<Output directory containing language sub-directories> \
  --output-language-distribution=./data/lang_distro.json \
  --log-dir=./log/language_separation
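The split performed by this command amounts to a group-by on the language key of each document. The sketch below is a simplified stand-in for the actual module; the key name `language` is an assumption, and the returned distribution corresponds to what the command writes to the --output-language-distribution file.

```python
import json
from collections import defaultdict

def separate_by_language(jsonl_lines, lang_key="language"):
    """Group JSON documents by language code and count documents per language."""
    by_language = defaultdict(list)
    for line in jsonl_lines:
        doc = json.loads(line)
        by_language[doc[lang_key]].append(line)
    # Per-language document counts, analogous to lang_distro.json
    distribution = {lang: len(docs) for lang, docs in by_language.items()}
    return by_language, distribution

lines = [
    '{"text": "Hello", "language": "EN"}',
    '{"text": "Bonjour", "language": "FR"}',
    '{"text": "World", "language": "EN"}',
]
groups, distro = separate_by_language(lines)
print(distro)
```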

After running this module, the output directory will contain one sub-directory per language present in the corpus, and all documents within a given sub-directory will contain text in the same language. Finally, the text within a specific language can have its Unicode fixed using the text_cleaning module:


text_cleaning \
  --input-data-dir=<Output directory containing sub-directories>/EN \
  --output-clean-dir=<Output directory to which cleaned English documents will be written> \
  --log-dir=./log/text_cleaning

The text_cleaning module above uses the heuristics defined within the ftfy package, which is commonly used for fixing improperly decoded Unicode.
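ftfy detects and reverses many mojibake patterns; the most common one, UTF-8 bytes mis-decoded as Windows-1252, can be illustrated with the standard library alone. This is a simplified stand-in for ftfy.fix_text, not the package's actual logic, which handles many more encodings and edge cases.

```python
def fix_common_mojibake(text):
    """Reverse the common UTF-8-read-as-cp1252 mis-decoding, if present."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave the text unchanged
    return repaired

# "âœ”" is the UTF-8 encoding of a check mark (U+2714) mis-read as cp1252
print(fix_common_mojibake("âœ”"))  # prints "✔"
```

Text that round-trips cleanly (plain ASCII, for example) is returned unchanged, and text that cannot be encoded as cp1252 at all is assumed to already be correct.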

© Copyright 2023, NVIDIA. Last updated on Nov 14, 2023.