Language Identification and Unicode Fixing

Background

Large unlabeled text corpora often contain a variety of languages. However, data curation usually includes language-specific steps (e.g., language-tuned heuristics for quality filtering), and many curators are only interested in curating a monolingual dataset.

NeMo Curator provides utilities to identify languages using fastText. Even though a preliminary language identification may already have been performed on the unextracted text (as is the case in our Common Crawl pipeline, which uses pyCLD2), fastText is more accurate, so it can be used for a second pass.

Usage

We provide an example of how to use the language identification and Unicode reformatting utility at examples/identify_languages.py. At a high level, the module first identifies the language of each document and removes any document whose language it cannot identify with sufficient confidence.
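
The identify-then-filter step described above can be sketched roughly as follows. This is a minimal illustration, not the code in examples/identify_languages.py: the function names, the Predictor type, and the 0.3 confidence threshold are all assumptions made for this sketch. The only fastText calls used are fasttext.load_model and model.predict, and the model path passed in is expected to point at a language identification model you have downloaded yourself.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type for this sketch: maps a document's text to
# (language_code, confidence in [0, 1]).
Predictor = Callable[[str], Tuple[str, float]]

FASTTEXT_LABEL_PREFIX = "__label__"


def parse_fasttext_label(label: str) -> str:
    """Turn a fastText label such as '__label__en' into 'EN'."""
    if label.startswith(FASTTEXT_LABEL_PREFIX):
        label = label[len(FASTTEXT_LABEL_PREFIX):]
    return label.upper()


def filter_by_language_confidence(
    docs: Iterable[str],
    predict: Predictor,
    min_confidence: float = 0.3,  # assumed threshold, tune for your data
) -> List[Dict[str, object]]:
    """Keep documents whose top language prediction is confident enough."""
    kept = []
    for text in docs:
        language, score = predict(text)
        if score >= min_confidence:
            kept.append({"text": text, "language": language, "score": score})
    return kept


def make_fasttext_predictor(model_path: str) -> Predictor:
    """Wrap a fastText language identification model as a Predictor."""
    import fasttext  # imported lazily so the rest of the sketch runs without it

    model = fasttext.load_model(model_path)

    def predict(text: str) -> Tuple[str, float]:
        # fastText predicts on a single line, so newlines are replaced first.
        labels, probs = model.predict(text.replace("\n", " "), k=1)
        return parse_fasttext_label(labels[0]), float(probs[0])

    return predict
```

Keeping the predictor as a plain callable means the filtering logic can be exercised with a stub before a fastText model is wired in, and the threshold can be tuned separately from the model.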