
Language Management


Handle multilingual content and language-specific processing requirements using NeMo Curator’s tools and utilities.

NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, and specialized handling for non-spaced languages. These tools are essential for creating high-quality monolingual datasets and applying language-specific processing.

Before You Start

  • The FastTextLangId filter (used with the ScoreFilter stage) requires a FastText language identification model file. Download lid.176.bin (or the smaller, compressed lid.176.ftz) from the FastText language identification page.
  • On a cluster, ensure the FastText model file is accessible to all workers (for example, a shared filesystem or object storage path).
  • Provide input as newline-delimited JSON (.jsonl) in which each record has a text field. If your documents store text under a different field name, set text_field in ScoreFilter(...).
  • For HTML extraction workflows (for example, Common Crawl), Curator uses CLD2 to provide language hints.
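
Before running a pipeline, it can help to confirm that every input record actually carries the expected text field. The following is a minimal standard-library sketch, not part of NeMo Curator: the helper name find_records_missing_field and the file name used in the usage note are hypothetical.

```python
import json
from pathlib import Path


def find_records_missing_field(path, field="text"):
    """Return 1-based line numbers whose record lacks a non-empty string `field`.

    Hypothetical pre-flight helper; lines that are not valid JSON are also reported.
    """
    bad_lines = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines between records
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)
                continue
            value = record.get(field) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value:
                bad_lines.append(line_number)
    return bad_lines
```

For example, find_records_missing_field("input_data/part-0.jsonl") returns an empty list when every record is usable. Pass field="content" (or whichever name you also give to text_field) if your data uses a different key.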

How it Works

Language management in NeMo Curator typically follows this pattern using the Pipeline API:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import FastTextLangId

# 1) Build the pipeline
pipeline = Pipeline(name="language_management")

# Read JSONL files into document batches
pipeline.add_stage(
    JsonlReader(file_paths="input_data/*.jsonl", files_per_partition=2)
)

# Identify languages and keep docs above a confidence threshold
pipeline.add_stage(
    ScoreFilter(
        FastTextLangId(model_path="/path/to/lid.176.bin", min_langid_score=0.3),
        score_field="language",
    )
)

# 2) Execute
results = pipeline.run()
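
To make the confidence threshold concrete, the sketch below reproduces the idea behind min_langid_score in plain Python. It is an illustration under stated assumptions, not Curator's implementation: the helper names are hypothetical, the hard-coded predictions stand in for real model output, and the exact comparison and the value Curator writes to the score field may differ. The only external fact relied on is that FastText labels carry a "__label__" prefix (for example, "__label__en").

```python
def parse_fasttext_label(label):
    """Strip FastText's "__label__" prefix, e.g. "__label__en" -> "en"."""
    prefix = "__label__"
    return label[len(prefix):] if label.startswith(prefix) else label


def keep_confident(docs, predictions, min_score=0.3):
    """Keep documents whose top prediction reaches min_score.

    `predictions` is a list of (label, score) pairs, one per document.
    The ">=" comparison is an assumption of this sketch.
    """
    kept = []
    for doc, (label, score) in zip(docs, predictions):
        if score >= min_score:
            kept.append(
                {**doc, "language": parse_fasttext_label(label), "language_score": score}
            )
    return kept


# Hypothetical documents and stand-in predictions (no model is loaded here).
sample_docs = [{"text": "Hello world"}, {"text": "asdf qwer"}, {"text": "Bonjour le monde"}]
sample_predictions = [("__label__en", 0.92), ("__label__en", 0.12), ("__label__fr", 0.30)]
confident_docs = keep_confident(sample_docs, sample_predictions, min_score=0.3)
```

Raising min_score trades recall for precision: fewer short or noisy documents pass, but borderline documents in the target language are dropped as well.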

Language Processing Capabilities

  • Language detection using FastText (176 languages) and CLD2 (used in HTML extraction pipelines)
  • Stop word management with built-in lists and customizable thresholds
  • Special handling for non-spaced languages (Chinese, Japanese, Thai, Korean)
  • Language-specific text processing and quality filtering
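
The reason non-spaced languages need special handling can be seen with a small stop word ratio calculation. This is a rough sketch, not Curator's implementation: the stop_word_ratio helper and the tiny stop word sets are hypothetical, and per-character matching is only a crude stand-in for proper word segmentation.

```python
def stop_word_ratio(text, stop_words, spaced=True):
    """Return the fraction of tokens that are stop words.

    spaced=True  -> tokens are whitespace-separated words.
    spaced=False -> tokens are individual non-whitespace characters
                    (an assumption of this sketch, not a real segmenter).
    """
    if spaced:
        tokens = text.lower().split()
    else:
        tokens = [ch for ch in text if not ch.isspace()]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in stop_words)
    return hits / len(tokens)


# Hypothetical, deliberately tiny stop word sets.
english_ratio = stop_word_ratio("the cat sat on the mat", {"the", "on"})
chinese_as_spaced = stop_word_ratio("我的猫在家", {"的", "在"}, spaced=True)
chinese_by_char = stop_word_ratio("我的猫在家", {"的", "在"}, spaced=False)
```

With whitespace tokenization the Chinese sentence collapses into a single token, so no stop words are found and a ratio-based filter would wrongly reject it. Character-level matching recovers a usable ratio.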

Available Tools