Language Identification and Unicode Fixing

Background

Large unlabeled text corpora often contain a variety of languages. However, data curation usually includes language-specific steps (e.g., using language-tuned heuristics for quality filtering), and many curators are only interested in curating a monolingual dataset. Datasets may also contain improperly decoded Unicode characters (e.g., “The Mona Lisa doesn’t have eyebrows.” decoding as “The Mona Lisa doesnâ€™t have eyebrows.”).

NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters. Language identification is performed with fastText, and Unicode fixing is performed with ftfy. Even though preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline, which uses pyCLD2), fastText is more accurate, so it can be used for a second pass.
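
As a quick illustration of the Unicode fixing, ftfy's fix_text can repair the mojibake from the example above (a minimal sketch; the sample string is ours, not taken from the example script):

import ftfy

# "â€™" is the mojibake left behind when UTF-8 bytes are decoded as Windows-1252
print(ftfy.fix_text("The Mona Lisa doesnâ€™t have eyebrows."))
# Output: The Mona Lisa doesn’t have eyebrows.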

Usage

We provide an example of how to use the language identification and Unicode reformatting utilities at examples/identify_languages_and_fix_unicode.py. At a high level, the module first identifies the languages of the documents and removes any documents for which the language cannot be determined with high confidence. It then fixes the Unicode in the remaining documents. Notably, the following lines use one of the DocumentModifiers that NeMo Curator provides:

import nemo_curator as nc
from nemo_curator.modifiers import UnicodeReformatter

cleaner = nc.Modify(UnicodeReformatter())
cleaned_data = cleaner(lang_data)
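
For context, the language identification step that produces lang_data looks roughly like the following sketch (not the script verbatim; the input path and model path are assumptions, with lid.176.bin referring to fastText's pretrained language identification model):

import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId

# Score each document's language with fastText and drop low-confidence documents
dataset = DocumentDataset.read_json("input_data.jsonl")  # hypothetical input path
lang_id = nc.ScoreFilter(
    FastTextLangId("lid.176.bin"),  # assumes the model was downloaded to this path
    score_field="language",
    score_type="object",
)
lang_data = lang_id(dataset)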

DocumentModifiers like UnicodeReformatter are very similar to DocumentFilters. They implement a single modify_document function that takes in a document and outputs a modified document. Here is the implementation of the UnicodeReformatter modifier:

import ftfy

from nemo_curator.modifiers import DocumentModifier


class UnicodeReformatter(DocumentModifier):
    def __init__(self):
        super().__init__()

    def modify_document(self, text: str) -> str:
        # Repair mojibake and other encoding errors in the document text
        return ftfy.fix_text(text)

Also like the DocumentFilter functions, modify_document can be annotated with the batched decorator to take in a pandas Series of documents instead of a single document.
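
For example, a batched version of the reformatter could be written as follows (a sketch with a hypothetical class name, assuming the batched decorator from nemo_curator.utils.decorators):

import ftfy
import pandas as pd

from nemo_curator.modifiers import DocumentModifier
from nemo_curator.utils.decorators import batched


class BatchedUnicodeReformatter(DocumentModifier):  # hypothetical name
    @batched
    def modify_document(self, text: pd.Series) -> pd.Series:
        # Receives a pandas Series of documents and returns a Series of equal length
        return text.apply(ftfy.fix_text)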