> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/curator/llms.txt.
> For full documentation content, see https://docs.nvidia.com/nemo/curator/llms-full.txt.

> Handle multilingual content and language-specific processing including language identification, stop word management, and translation

# Language Management

Handle multilingual content and language-specific processing requirements using NeMo Curator's tools and utilities.

NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, experimental translation, and specialized handling for non-spaced languages. These tools are essential for creating high-quality monolingual datasets and applying language-specific processing.

## Before You Start

* The `FastTextLangId` filter (used with the `ScoreFilter` stage) requires a FastText language identification model file. Download `lid.176.bin` (or `lid.176.ftz`) from FastText: [Language identification](https://fasttext.cc/docs/en/language-identification.html).
* On a cluster, ensure the FastText model file is accessible to all workers (for example, a shared filesystem or object storage path).
* Provide newline-delimited JSON (`.jsonl`) with a `text` field, or set `text_field` in `ScoreFilter(...)`.
* For HTML extraction workflows (for example, Common Crawl), Curator uses CLD2 to provide language hints.

***

## How it Works

Language management in NeMo Curator typically follows this pattern using the Pipeline API:

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.filters import ScoreFilter
from nemo_curator.stages.text.filters.fasttext import FastTextLangId

# 1) Build the pipeline
pipeline = Pipeline(name="language_management")

# Read JSONL files into document batches
pipeline.add_stage(
    JsonlReader(file_paths="input_data/*.jsonl", files_per_partition=2)
)

# Identify languages and keep docs above a confidence threshold
pipeline.add_stage(
    ScoreFilter(
        FastTextLangId(model_path="/path/to/lid.176.bin", min_langid_score=0.3),
        score_field="language",
    )
)

# 2) Execute
results = pipeline.run()
```

***

## Language Processing Capabilities

* **Language detection** using FastText (176 languages) and CLD2 (used in HTML extraction pipelines)
* **Stop word management** with built-in lists and customizable thresholds
* **Experimental translation pipelines** for flat and structured fields, including wildcard paths such as `messages.*.content`
* **Special handling** for non-spaced languages (Chinese, Japanese, Thai, Korean)
* **Language-specific** text processing and quality filtering

## Available Tools

<Cards>
  <Card title="Language Identification" href="/curate-text/process-data/language-management/language">
    Identify document languages and separate multilingual datasets
    fasttext
    176-languages
    detection
    classification
  </Card>

  <Card title="Stop Words" href="/curate-text/process-data/language-management/stopwords">
    Manage high-frequency words to enhance text extraction and content detection
    preprocessing
    filtering
    language-specific
    nlp
  </Card>

  <Card title="Translation" href="/curate-text/process-data/language-management/translation">
    Translate flat or structured text fields with optional FAITH and round-trip evaluation
    translation
    experimental
    wildcard-fields
    faith
    round-trip-metrics
  </Card>
</Cards>