Overview | NeMo Curator

Handle multilingual content and language-specific processing requirements using NeMo Curator’s tools and utilities.

NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, and specialized handling for non-spaced languages. These tools are essential for creating high-quality monolingual datasets and applying language-specific processing.

Before You Start

The FastTextLangId filter (used with the ScoreFilter stage) requires a FastText language identification model file. Download lid.176.bin (or the compressed lid.176.ftz) from the FastText "Language identification" page.
On a cluster, ensure the FastText model file is accessible to all workers (for example, a shared filesystem or object storage path).
Provide input as newline-delimited JSON (.jsonl) in which each record has a text field. If your documents store their content under a different field name, set text_field in ScoreFilter(...).
For HTML extraction workflows (for example, Common Crawl), Curator uses CLD2 to provide language hints.
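As a quick pre-flight check on the input format described above, the following minimal sketch writes a small .jsonl file and reports records that lack a usable text field. It uses only the Python standard library; the helper names write_jsonl and missing_text_field are hypothetical and are not part of NeMo Curator.

```python
import json
from typing import Iterable, List


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-Latin scripts readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def missing_text_field(path: str, text_field: str = "text") -> List[int]:
    """Return 1-based line numbers whose record has no non-empty string in text_field."""
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            value = json.loads(line).get(text_field)
            if not isinstance(value, str) or not value:
                bad_lines.append(lineno)
    return bad_lines
```

Running missing_text_field on each input file before building the pipeline is one way to catch a mismatched field name early, in which case text_field can be set accordingly.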

How it Works

Language management in NeMo Curator typically follows this pattern using the Pipeline API:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.fasttext import FastTextLangId

# 1) Build the pipeline
pipeline = Pipeline(name="language_management")

# Read JSONL files into document batches
pipeline.add_stage(
    JsonlReader(file_paths="input_data/*.jsonl", files_per_partition=2)
)

# Identify languages and keep docs above a confidence threshold
pipeline.add_stage(
    ScoreFilter(
        FastTextLangId(model_path="/path/to/lid.176.bin", min_langid_score=0.3),
        score_field="language",
    )
)

# 2) Execute
results = pipeline.run()
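To make the confidence threshold concrete, here is a minimal sketch of the kind of decision a language-ID filter makes. It is not NeMo Curator's internal implementation. It assumes predictions in the shape returned by FastText's predict method (labels prefixed with __label__, paired with probabilities) and assumes a document is kept when its top score is at or above the threshold. The function name top_language is hypothetical.

```python
from typing import Optional, Sequence, Tuple

LABEL_PREFIX = "__label__"  # prefix FastText attaches to predicted labels


def top_language(
    labels: Sequence[str],
    scores: Sequence[float],
    min_score: float = 0.3,
) -> Optional[Tuple[str, float]]:
    """Return (language_code, score) for the best label, or None if it falls below min_score."""
    if not labels:
        return None
    # Pick the index of the highest-scoring label
    best = max(range(len(labels)), key=lambda i: scores[i])
    score = float(scores[best])
    if score < min_score:
        return None  # assumption: low-confidence documents are dropped
    label = labels[best]
    if label.startswith(LABEL_PREFIX):
        label = label[len(LABEL_PREFIX):]
    return label, score
```

With the default threshold of 0.3, a prediction such as ("__label__en",) with probability 0.98 would be kept as ("en", 0.98), while a top score of 0.1 would cause the document to be discarded.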

Language Processing Capabilities

Language detection using FastText (176 languages) and CLD2 (used in HTML extraction pipelines)
Stop word management with built-in lists and customizable thresholds
Special handling for non-spaced languages (Chinese, Japanese, Thai, Korean)
Language-specific text processing and quality filtering

Available Tools

Language Identification

Identify document languages and separate multilingual datasets.
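Once each document carries a language label, separating a multilingual dataset amounts to grouping records by that label. The sketch below illustrates the idea in plain Python. It assumes each record stores a plain language code string under a language field, which may differ from the format the pipeline actually writes; split_by_language is a hypothetical helper, not a NeMo Curator function.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_by_language(
    records: Iterable[dict],
    lang_field: str = "language",
) -> Dict[str, List[dict]]:
    """Group records into per-language buckets keyed by language code."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        # Records without a label go into an "unknown" bucket rather than being dropped
        buckets[record.get(lang_field, "unknown")].append(record)
    return dict(buckets)
```

Each bucket can then be written to its own output location to form a monolingual dataset.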

Stop Words

Manage high-frequency words to enhance text extraction and content detection.
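A common way to use stop words for quality filtering is to require that natural-language text contain at least a minimum share of them. The following minimal sketch shows that idea for whitespace-delimited languages only. The tiny word list, the 0.2 threshold, and the function names are illustrative assumptions rather than NeMo Curator's built-in lists or defaults, and non-spaced languages would need a different tokenization strategy.

```python
import re
from typing import Collection

# Tiny illustrative list; real stop word lists are much longer
EN_STOP_WORDS = frozenset({"the", "a", "of", "and", "to", "in", "is"})


def stop_word_ratio(text: str, stop_words: Collection[str]) -> float:
    """Return the fraction of word tokens in text that are stop words."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)


def has_enough_stop_words(
    text: str,
    stop_words: Collection[str],
    min_ratio: float = 0.2,  # hypothetical threshold for illustration
) -> bool:
    """Keep text whose stop word ratio is at or above min_ratio."""
    return stop_word_ratio(text, stop_words) >= min_ratio
```

Ordinary prose tends to pass such a check easily, whereas keyword lists, navigation menus, and similar boilerplate contain few stop words and tend to fail it.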