Copyright © 2026, NVIDIA Corporation.

Language Management

Handle multilingual content and language-specific processing requirements using NeMo Curator’s tools and utilities.

NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, and specialized handling for non-spaced languages. These tools are essential for creating high-quality monolingual datasets and applying language-specific processing.

Before You Start

  • The FastTextLangId filter (used with the ScoreFilter stage) requires a FastText language identification model file. Download lid.176.bin (or the smaller compressed lid.176.ftz) from the FastText "Language identification" page.
  • On a cluster, ensure the FastText model file is accessible to all workers (for example, a shared filesystem or object storage path).
  • Provide input as newline-delimited JSON (.jsonl) in which each record has a text field. If the document text is stored under a different key, set text_field in ScoreFilter(...).
  • For HTML extraction workflows (for example, Common Crawl), Curator uses CLD2 to provide language hints.
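As a quick pre-flight check for the input requirement above, the following sketch writes a small JSONL file and reports records that lack usable text. It uses only the Python standard library. The helper names `write_jsonl` and `find_missing_text` are hypothetical and are not part of NeMo Curator.

```python
import json
from pathlib import Path


def write_jsonl(records, path):
    """Write an iterable of dicts as newline-delimited JSON, one object per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-Latin scripts readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def find_missing_text(path, text_field="text"):
    """Return 1-based line numbers whose record has no non-empty string in text_field."""
    bad_lines = []
    with Path(path).open(encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines rather than failing on them
            value = json.loads(line).get(text_field)
            if not isinstance(value, str) or not value.strip():
                bad_lines.append(line_no)
    return bad_lines
```

Running `find_missing_text` on each input file before building the pipeline is one way to catch records that would otherwise reach the language filter without any text to score.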

How it Works

Language management in NeMo Curator typically follows this pattern using the Pipeline API:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.fasttext import FastTextLangId

# 1) Build the pipeline
pipeline = Pipeline(name="language_management")

# Read JSONL files into document batches
pipeline.add_stage(
    JsonlReader(file_paths="input_data/*.jsonl", files_per_partition=2)
)

# Identify languages and keep docs above a confidence threshold
pipeline.add_stage(
    ScoreFilter(
        FastTextLangId(model_path="/path/to/lid.176.bin", min_langid_score=0.3),
        score_field="language",
    )
)

# 2) Execute
results = pipeline.run()
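To make the thresholding step concrete, here is a minimal pure-Python sketch of a score-then-filter pass. It is an illustration of the idea, not NeMo Curator's implementation. The `predict` callable is a stand-in for a language identification model, the `>=` comparison is an assumption, and the `language_score` field name and the output layout are choices made for this sketch only.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

LABEL_PREFIX = "__label__"  # fastText prefixes every predicted label with this


def parse_label(label: str) -> str:
    """Turn a fastText-style label such as '__label__en' into 'en'."""
    return label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label


def filter_by_langid(
    docs: Iterable[Dict],
    predict: Callable[[str], Tuple[str, float]],
    min_langid_score: float = 0.3,
    text_field: str = "text",
    score_field: str = "language",
) -> Iterator[Dict]:
    """Yield documents whose top language prediction meets the threshold."""
    for doc in docs:
        # fastText predicts one line at a time, so collapse newlines first.
        text = doc[text_field].replace("\n", " ")
        label, score = predict(text)
        if score >= min_langid_score:  # assumption: the threshold is inclusive
            # Hypothetical output layout: language code plus a separate score field.
            yield {**doc, score_field: parse_label(label), "language_score": score}
```

With the `fasttext` Python package, `fasttext.load_model(path).predict(text)` returns a tuple of labels and an array of probabilities, so a thin wrapper that takes the first element of each would fit the `predict` argument here.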

Language Processing Capabilities

  • Language detection using FastText (176 languages) and CLD2 (used in HTML extraction pipelines)
  • Stop word management with built-in lists and customizable thresholds
  • Special handling for non-spaced languages (Chinese, Japanese, Thai, Korean)
  • Language-specific text processing and quality filtering

Available Tools

Language Identification

Identify document languages and separate multilingual datasets.
Tags: fasttext, 176-languages, detection, classification

Stop Words

Manage high-frequency words to enhance text extraction and content detection.
Tags: preprocessing, filtering, language-specific, nlp