Language Identification and Unicode Fixing

Large unlabeled text corpora often contain a variety of languages. However, data curation usually includes steps that are language specific (e.g. using language-tuned heuristics for quality filtering), and many curators are only interested in curating a monolingual dataset. Datasets may also contain improperly decoded unicode characters (e.g. “The Mona Lisa doesn’t have eyebrows.” decoding as “The Mona Lisa doesnâ€™t have eyebrows.”).

The NeMo Data Curator provides utilities to identify languages and fix improperly decoded unicode characters. Language identification is performed using fastText, and unicode fixing is performed using ftfy. Even though a preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline using pyCLD2), fastText is more accurate, so it can be used for a second pass.
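
For reference, the underlying fastText call is straightforward. The sketch below is a standalone illustration outside of the Data Curator, assuming a local copy of lid.176.bin in the working directory:

import fasttext

# Load a local copy of the lid.176.bin language identification model
# (the path is an assumption; point it at wherever the model was downloaded).
model = fasttext.load_model("lid.176.bin")

# predict() returns a tuple of labels and a parallel array of confidences,
# e.g. ('__label__en',) with a score close to 1.0 for clearly English text.
labels, scores = model.predict("The Mona Lisa doesn't have eyebrows.", k=1)
print(labels[0], float(scores[0]))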

While the following steps can be run manually using the commands given below, we also provide a SLURM script in the examples folder that follows the same procedure. It must be filled in with the necessary parameters described below before it is run.

To perform the language identification, we can use the config file provided in the config directory along with the path to a local copy of the lid.176.bin fastText language identification model. Then, with the general-purpose filter_documents tool, we can compute language scores and codes for each document in the corpus as follows:

filter_documents \
  --input-data-dir=<Path to directory containing jsonl files> \
  --filter-config-file=./config/fasttext_langid.yaml \
  --log-scores \
  --log-dir=./log/lang_id

This will apply the fastText model, compute the score, obtain the language class, and then write this information as additional keys within each json document. For more information on the filter_documents utility, please see the doc 1_document_filtering.rst.
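
For illustration, a record might look roughly like the following after this step. The key names shown here ("language" and "language_score") are assumptions for the sake of the example; the actual field names are determined by the fasttext_langid.yaml configuration:

import json

# Hypothetical post-identification record; field names and values are illustrative only.
line = '{"text": "The Mona Lisa doesn\'t have eyebrows.", "language": "EN", "language_score": 0.96}'
record = json.loads(line)
print(record["language"], record["language_score"])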

With the language information present within the keys of each json, the separate_by_language module will first construct a count of the documents by language within the corpus and then, using that information, split each file across all the languages present within that file. Below is an example run command for separate_by_language:

separate_by_language \
  --input-data-dir=<Path to the input directory containing jsonl files> \
  --output-data-dir=<Output directory containing language sub-directories> \
  --output-language-distribution=./data/lang_distro.json \
  --log-dir=./log/language_separation
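
Conceptually, the split performed by separate_by_language resembles the sketch below. It assumes a "language" key on each record (the actual key name depends on the configuration used in the previous step) and is only meant to illustrate the bookkeeping, not the tool's implementation:

import json
import os
from collections import Counter
from glob import glob

def split_by_language(input_dir, output_dir, lang_key="language"):
    """Bucket each JSONL record into a per-language sub-directory and
    return a per-language document count (illustrative sketch only)."""
    counts = Counter()
    for path in glob(os.path.join(input_dir, "*.jsonl")):
        buckets = {}
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                lang = json.loads(line).get(lang_key, "UNKNOWN")
                counts[lang] += 1
                buckets.setdefault(lang, []).append(line)
        # Write each language's share of this file under its own sub-directory.
        for lang, lines in buckets.items():
            lang_dir = os.path.join(output_dir, lang)
            os.makedirs(lang_dir, exist_ok=True)
            out_path = os.path.join(lang_dir, os.path.basename(path))
            with open(out_path, "a", encoding="utf-8") as out:
                out.writelines(lines)
    return counts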

After running this module, the output directory will consist of one subdirectory per language present within the corpus, and all documents within a given subdirectory will contain text in that language. Finally, the text for a specific language can have its unicode fixed using the text_cleaning module:

text_cleaning \
  --input-data-dir=<Output directory containing sub-directories>/EN \
  --output-clean-dir=<Output directory to which cleaned english documents will be written> \
  --log-dir=./log/text_cleaning

The above text_cleaning module uses the heuristics defined within the ftfy package, which is commonly used for fixing improperly decoded unicode.
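
For example, ftfy repairs the classic mojibake produced when UTF-8 bytes are mis-decoded as Latin-1/Windows-1252:

import ftfy

broken = "The Mona Lisa doesnâ€™t have eyebrows."
print(ftfy.fix_text(broken))
# -> The Mona Lisa doesn’t have eyebrows.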
