Language Management
Handle multilingual content and language-specific processing requirements using NeMo Curator’s tools and utilities.
NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, and specialized handling for non-spaced languages. These tools are essential for creating high-quality monolingual datasets and applying language-specific processing.
Before You Start
- The FastTextLangId filter (used with the ScoreFilter stage) requires a FastText language identification model file. Download lid.176.bin (or lid.176.ftz) from the FastText "Language identification" page.
- On a cluster, ensure the FastText model file is accessible to all workers (for example, a shared filesystem or object storage path).
- Provide newline-delimited JSON (.jsonl) with a text field, or set text_field in ScoreFilter(...).
- For HTML extraction workflows (for example, Common Crawl), Curator uses CLD2 to provide language hints.
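To show what the model file is used for, here is a minimal sketch that calls the fasttext Python package directly rather than going through Curator. The helper names (parse_label, detect_language, load_model) and the default path lid.176.bin are assumptions made for illustration. Curator's FastTextLangId filter wraps the model for you and may format labels differently.

```python
from typing import Tuple


def parse_label(label: str) -> str:
    """Turn a FastText label such as '__label__en' into 'en'."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def detect_language(model, text: str) -> Tuple[str, float]:
    """Return (language_code, confidence) for a single document.

    `model` is any object with a FastText-style predict(text, k=1) method.
    """
    # FastText's predict() rejects input containing newlines, so collapse
    # all whitespace runs (including newlines) into single spaces first.
    single_line = " ".join(text.split())
    labels, scores = model.predict(single_line, k=1)
    return parse_label(labels[0]), float(scores[0])


def load_model(model_path: str = "lid.176.bin"):
    """Load the language identification model (the path is an assumption)."""
    import fasttext  # imported lazily so the helpers above work without it

    return fasttext.load_model(model_path)
```

With the model downloaded, detect_language(load_model(), "some text") would return a language code and a confidence score. On a cluster, model_path would point at the shared location that every worker can read.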
How it Works
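Language management in NeMo Curator typically follows this pattern using the Pipeline API: read the JSONL input, run the FastTextLangId filter through a ScoreFilter stage to identify each document's language, and then apply language-specific processing to the results.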
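The read, score, and filter flow can be sketched in plain Python without Curator itself. This is an illustration of the pattern only, not Curator's Pipeline API: read_jsonl, score_and_filter, and toy_detector are hypothetical names, the 0.3 default threshold is an arbitrary choice, and toy_detector is a crude stand-in for a real language identification model.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, Tuple

# A detector maps text to (language_code, confidence).
Detector = Callable[[str], Tuple[str, float]]


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def score_and_filter(
    records: Iterable[Dict],
    detect: Detector,
    min_score: float = 0.3,  # arbitrary illustrative threshold
    text_field: str = "text",
    score_field: str = "language",
) -> Iterator[Dict]:
    """Annotate each record with its language; drop low-confidence ones."""
    for record in records:
        lang, score = detect(record[text_field])
        if score >= min_score:
            yield {**record, score_field: lang, score_field + "_score": score}


def toy_detector(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a real model: guesses from character ranges."""
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "zh", 0.9
    if any(ch.isascii() and ch.isalpha() for ch in text):
        return "en", 0.9
    return "unknown", 0.0


if __name__ == "__main__":
    sample = ['{"text": "Hello world"}', '{"text": "你好世界"}', '{"text": "12345"}']
    for row in score_and_filter(read_jsonl(sample), toy_detector):
        print(row["language"], row["text"])
```

Passing the detector in as a function keeps the filtering logic independent of the model, which mirrors how a filter is handed to a scoring stage. A monolingual dataset then falls out of one more comparison on the language field.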
Language Processing Capabilities
- Language detection using FastText (176 languages) and CLD2 (used in HTML extraction pipelines)
- Stop word management with built-in lists and customizable thresholds
- Special handling for non-spaced languages (Chinese, Japanese, Thai, Korean)
- Language-specific text processing and quality filtering
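To make the stop word threshold and the non-spaced language handling concrete, here is a small sketch of a stop word ratio check. It is not Curator's implementation: the function names, the tiny stop word sets in the usage note, and the 0.1 default threshold are assumptions. Splitting non-spaced text into single characters is also a simplification of real word segmentation.

```python
from typing import List, Set

# ISO 639-1 codes for the non-spaced languages listed above.
NON_SPACED = {"zh", "ja", "th", "ko"}


def tokenize(text: str, lang: str) -> List[str]:
    """Split on whitespace, or per character for non-spaced languages."""
    if lang in NON_SPACED:
        return [ch for ch in text if not ch.isspace()]
    return text.lower().split()


def stop_word_ratio(text: str, lang: str, stop_words: Set[str]) -> float:
    """Fraction of tokens that are stop words (0.0 for empty text)."""
    tokens = tokenize(text, lang)
    if not tokens:
        return 0.0
    return sum(token in stop_words for token in tokens) / len(tokens)


def passes_stop_word_filter(
    text: str, lang: str, stop_words: Set[str], min_ratio: float = 0.1
) -> bool:
    """Keep text whose stop word ratio meets the (assumed) threshold."""
    return stop_word_ratio(text, lang, stop_words) >= min_ratio
```

For example, with the illustrative set {"the", "is", "a", "of"}, the sentence "The cat is on the mat" has three stop words among six tokens. A whitespace split would treat a Chinese sentence as a single token, which is why the per-character branch exists.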