Handle multilingual content and language-specific processing requirements using NeMo Curator’s tools and utilities.
NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, and specialized handling for non-spaced languages. These tools are essential for creating high-quality monolingual datasets and applying language-specific processing.
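As a rough illustration of how language detection leads to monolingual datasets, the sketch below filters records by a language-ID confidence score and groups the survivors by language. It is plain Python and does not use NeMo Curator's API. The record layout (text, label, score), the helper names, and the 0.5 threshold are assumptions made for this example. Only the `__label__` prefix reflects how FastText language identification models name their predicted classes.

```python
from collections import defaultdict

# FastText prefixes predicted classes with this string, e.g. "__label__en".
LABEL_PREFIX = "__label__"


def parse_label(label: str) -> str:
    """Strip the FastText prefix to get a bare language code."""
    if label.startswith(LABEL_PREFIX):
        return label[len(LABEL_PREFIX):]
    return label


def split_by_language(records, min_score=0.3):
    """Drop low-confidence records, then group texts by language code.

    `records` is a hypothetical list of dicts holding a text, a predicted
    label, and a confidence score from some language-ID model.
    """
    buckets = defaultdict(list)
    for rec in records:
        if rec["score"] < min_score:
            continue  # prediction too uncertain to trust
        buckets[parse_label(rec["label"])].append(rec["text"])
    return dict(buckets)


# Made-up predictions, for illustration only.
records = [
    {"text": "The quick brown fox.", "label": "__label__en", "score": 0.98},
    {"text": "吾輩は猫である。", "label": "__label__ja", "score": 0.95},
    {"text": "ok lol 123", "label": "__label__en", "score": 0.21},
    {"text": "Another English sentence.", "label": "__label__en", "score": 0.91},
]

monolingual = split_by_language(records, min_score=0.5)
for lang, texts in monolingual.items():
    print(lang, len(texts))
```

Each resulting bucket can then go through its own language-specific processing, such as a stop word list for that language.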
The FastTextLangId filter (used with the ScoreFilter stage) requires a FastText language identification model file. Download lid.176.bin (or lid.176.ftz) from FastText: Language identification. Input data (for example, .jsonl) should include a text field, or set text_field in ScoreFilter(...).

Language management in NeMo Curator typically follows this pattern using the Pipeline API: