Text cleaning and language separation
After the documents have been downloaded and extracted from the WARC records to jsonl format,
the filter_documents utility and the separate_by_language and text_cleaning modules enable users to perform a secondary
pass of language identification with fastText language identification models, separate the documents by language,
and then, within the target languages, fix documents with improperly decoded Unicode.
To perform the secondary pass of language identification, we can use the config file provided in the config directory
and provide the path to a local copy of the lid.176.bin fastText language identification model. Then, with the
filter_documents utility, we can compute language scores and codes for each document in the corpus as follows:
filter_documents \
  --input-data-dir=<Path to directory containing jsonl files> \
  --filter-config-file=./config/fasttext_langid.yaml \
  --log-scores \
  --log-dir=./log_dir/lang_id
This will apply the fastText model, compute the score and obtain the language class, and then write this
information as additional keys within each json document. For more information on the
filter_documents utility, please see the doc 1_document_filtering.rst.
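To make the effect of this step concrete, the following is a minimal Python sketch of annotating jsonl documents with a language code and score. It is not the filter_documents implementation. The key names text, language and language_score, the helper names, and the toy_predict stand-in classifier are all assumptions made for illustration. The only fastText-specific detail relied on is that its labels carry a __label__ prefix (for example __label__en).

```python
import json


def label_to_code(label: str) -> str:
    """Turn a fastText-style label such as '__label__en' into 'EN'."""
    return label.replace("__label__", "").upper()


def annotate(jsonl_lines, predict, text_key="text"):
    """Yield each JSON document with a language code and score added.

    `predict` is any callable mapping text to (label, score). The output
    key names used here are hypothetical, not those of filter_documents.
    """
    for line in jsonl_lines:
        doc = json.loads(line)
        label, score = predict(doc[text_key])
        doc["language"] = label_to_code(label)
        doc["language_score"] = round(float(score), 4)
        yield json.dumps(doc, ensure_ascii=False)


def toy_predict(text):
    # Stand-in for a real classifier; it only looks for one French word.
    if "bonjour" in text.lower():
        return "__label__fr", 0.91
    return "__label__en", 0.87


docs = ['{"text": "Bonjour tout le monde"}', '{"text": "Hello world"}']
for out in annotate(docs, toy_predict):
    print(out)
```

In a real run, toy_predict would be replaced by a wrapper around the loaded lid.176.bin model. The rest of the loop would stay the same under these assumptions.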
With the language information present within the keys of each json document,
separate_by_language will first construct
a count of the documents by language within the corpus and then, from that information, split each file across all the languages
within that file. Below is an example run command for separate_by_language:
separate_by_language \
  --input-data-dir=<Path to the input directory containing jsonl files> \
  --output-data-dir=<Output directory containing language sub-directories> \
  --output-language-distribution=./data/lang_distro.json \
  --log-dir=./log/language_separation
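The count-then-split behavior described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's actual code. The language key name, the function names, and the one-output-file-per-input-file layout inside each language directory are hypothetical.

```python
import json
import tempfile
from collections import Counter
from contextlib import ExitStack
from pathlib import Path

LANG_KEY = "language"  # hypothetical key name holding the language code


def count_languages(input_dir):
    """First pass: count documents per language across all jsonl files."""
    counts = Counter()
    for src in sorted(Path(input_dir).glob("*.jsonl")):
        with src.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    counts[json.loads(line)[LANG_KEY]] += 1
    return counts


def split_by_language(input_dir, output_dir, languages):
    """Second pass: write each document to <output_dir>/<LANG>/<file name>."""
    for src in sorted(Path(input_dir).glob("*.jsonl")):
        with ExitStack() as stack, src.open(encoding="utf-8") as fh:
            writers = {}  # one open output file per language for this input
            for line in fh:
                if not line.strip():
                    continue
                lang = json.loads(line)[LANG_KEY]
                if lang not in languages:
                    continue
                if lang not in writers:
                    dest = Path(output_dir) / lang / src.name
                    dest.parent.mkdir(parents=True, exist_ok=True)
                    writers[lang] = stack.enter_context(
                        dest.open("w", encoding="utf-8")
                    )
                writers[lang].write(line.rstrip("\n") + "\n")


# Small demonstration on a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    in_dir, out_dir = Path(tmp) / "in", Path(tmp) / "out"
    in_dir.mkdir()
    docs = [
        {"text": "Hello", "language": "EN"},
        {"text": "Bonjour", "language": "FR"},
        {"text": "Goodbye", "language": "EN"},
    ]
    (in_dir / "part0.jsonl").write_text(
        "\n".join(json.dumps(d) for d in docs) + "\n", encoding="utf-8"
    )
    counts = count_languages(in_dir)
    split_by_language(in_dir, out_dir, counts)
    print(dict(counts))
    print(sorted(p.relative_to(out_dir).as_posix() for p in out_dir.rglob("*.jsonl")))
```

Counting first makes the set of languages known before any output is written, which is one plausible reason for the two-pass design. The counts could also be dumped to a JSON file, though the exact format of lang_distro.json is not specified here.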
After running this module, the output directory will consist of one directory per language present within the corpus, and all documents
within those directories will contain text that originates from the same language. Finally, the text within a specific language can have
its Unicode fixed using the text_cleaning module:
text_cleaning \
  --input-data-dir=<Output directory containing sub-directories>/EN \
  --output-clean-dir=<Output directory to which cleaned English documents will be written> \
  --log-dir=./log/text_cleaning
The text_cleaning module uses the heuristics defined within the
ftfy package, which is commonly used for fixing
improperly decoded Unicode.
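As an illustration of what "improperly decoded Unicode" means, the standard-library sketch below reproduces and reverses one common failure: UTF-8 bytes that were decoded as Windows-1252. This is a deliberately simplified stand-in, not ftfy's heuristics and not the text_cleaning implementation. The function name fix_mojibake is hypothetical. The ftfy package exposes a much more thorough fixer as ftfy.fix_text.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 bytes wrongly decoded as Windows-1252.

    If the round trip fails, the text is assumed to be fine and is
    returned unchanged.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Either the text contains characters outside cp1252, or the
        # resulting bytes are not valid UTF-8.
        return text


# Simulate the damage: encode correctly, then decode with the wrong codec.
broken = "café".encode("utf-8").decode("cp1252")
print(broken)                # the garbled form
print(fix_mojibake(broken))  # the repaired form
print(fix_mojibake("café"))  # already-correct text passes through unchanged
```

A single fixed codec pair like this will miss many real-world cases (other codecs, repeated mis-decoding, mixed content), which is why a heuristic library is used in practice.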