---
description: >-
  Identify document languages accurately using FastText models supporting 176
  languages for multilingual text processing
categories:
  - how-to-guides
tags:
  - language-identification
  - fasttext
  - multilingual
  - 176-languages
  - detection
  - classification
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---

# Language Identification

Large unlabeled text corpora often contain a variety of languages. NVIDIA NeMo Curator provides tools to accurately identify the language of each document, which is essential for language-specific curation tasks and building high-quality monolingual datasets.

## How it Works

NeMo Curator's language identification system works through a three-step process:

1. **Text Preprocessing**: For FastText classification, normalize input text by stripping whitespace and converting newlines to spaces.
2. **FastText Language Detection**: The pre-trained FastText language identification model ([`lid.176.bin`](https://fasttext.cc/docs/en/language-identification.html)) analyzes the preprocessed text and returns:
   * A confidence score (0.0 to 1.0) indicating certainty of the prediction
   * A language code (for example, "EN", "ES", "FR") in FastText's two-letter uppercase format
3. **Filtering and Scoring**: The pipeline filters documents based on a configurable confidence threshold (`min_langid_score`) and stores both the confidence score and language code as metadata.
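The preprocessing and label handling in steps 1 and 2 can be sketched in plain Python. The helper names below are illustrative, not Curator's internal API; the `__label__xx` prefix is FastText's documented label convention:

```python
def preprocess_text(text: str) -> str:
    """Normalize text for FastText: strip whitespace, convert newlines to spaces."""
    return text.strip().replace("\n", " ")


def parse_fasttext_label(label: str) -> str:
    """Convert a FastText label such as '__label__en' to the uppercase code 'EN'."""
    return label.removeprefix("__label__").upper()


raw = "  Bonjour tout le monde.\nComment allez-vous ?  "
print(preprocess_text(raw))  # "Bonjour tout le monde. Comment allez-vous ?"
print(parse_fasttext_label("__label__fr"))  # "FR"
```

FastText expects single-line input, which is why newlines are replaced rather than kept; the actual model call is discussed in the next section.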
### Language Detection Process

The `FastTextLangId` filter implements this workflow by:

* Loading the FastText language identification model on worker initialization
* Processing text through `model.predict()` with `k=1` to get the top language prediction
* Extracting the language code from FastText labels (for example, `__label__en` becomes "EN")
* Comparing confidence scores against the threshold to determine document retention
* Returning results as `[confidence_score, language_code]` for downstream processing

This approach supports **176 languages** with high accuracy, making it suitable for large-scale multilingual dataset curation where language-specific processing and monolingual dataset creation are critical.

## Usage

The following example demonstrates how to create a language identification pipeline using Curator with distributed processing.

```python
"""Language identification using Curator."""

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.filters import FastTextLangId
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import ScoreFilter


def create_language_identification_pipeline(data_dir: str) -> Pipeline:
    """Create a pipeline for language identification."""
    # Define pipeline
    pipeline = Pipeline(
        name="language_identification",
        description="Identify document languages using FastText",
    )

    # Add stages
    # 1. Reader stage - creates tasks from JSONL files
    pipeline.add_stage(
        JsonlReader(
            file_paths=data_dir,
            files_per_partition=2,  # Each task processes 2 files
        )
    )

    # 2. Language identification with filtering
    # IMPORTANT: Download lid.176.bin or lid.176.ftz from
    # https://fasttext.cc/docs/en/language-identification.html
    fasttext_model_path = "/path/to/lid.176.bin"  # or lid.176.ftz (compressed)
    pipeline.add_stage(
        ScoreFilter(
            FastTextLangId(model_path=fasttext_model_path, min_langid_score=0.3),
            score_field="language",
        )
    )

    return pipeline


def main():
    # Create pipeline
    pipeline = create_language_identification_pipeline("./data")

    # Print pipeline description
    print(pipeline.describe())

    # Create executor and run
    results = pipeline.run()

    # Process results
    total_documents = sum(task.num_items for task in results) if results else 0
    print(f"Total documents processed: {total_documents}")

    # Access language scores
    for i, batch in enumerate(results):
        if batch.num_items > 0:
            df = batch.to_pandas()
            print(f"Batch {i} columns: {list(df.columns)}")
            # Language scores are now in the 'language' field


if __name__ == "__main__":
    main()
```

## Understanding Results

The language identification process adds a score field to each document batch:

1. **`language` field**: Contains the FastText language identification results as a string representation of a list with two elements (for backend compatibility):
   * Element 0: The confidence score (between 0 and 1)
   * Element 1: The language code in FastText format (for example, "EN" for English, "ES" for Spanish)
2. **Task-based processing**: Curator processes documents in batches (tasks), and results are available through the task's Pandas DataFrame:

```python
# Access results from pipeline execution
for batch in results:
    df = batch.to_pandas()
    # Language scores are in the 'language' column
    print(df[['text', 'language']].head())
```

For quick exploratory inspection, converting a `DocumentBatch` to a Pandas DataFrame is fine. For performance and scalability, write transformations as `ProcessingStage`s (or with the `@processing_stage` decorator) and run them inside a `Pipeline` with an executor.
Curator's parallelism and resource scheduling apply when code runs as pipeline stages; ad-hoc Pandas code executes on the driver and will not scale.
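Because the `language` field is stored as the string form of a `[score, code]` list, exploratory Pandas code can recover both values with `ast.literal_eval` and split a batch into monolingual subsets. This is a minimal sketch for driver-side inspection; the sample rows and derived column names (`lang_score`, `lang_code`) are illustrative assumptions, and real frames come from `batch.to_pandas()`:

```python
import ast

import pandas as pd

# Illustrative batch contents; real DataFrames come from batch.to_pandas()
df = pd.DataFrame({
    "text": ["Hello world", "Hola mundo", "Bonjour le monde"],
    "language": ["[0.97, 'EN']", "[0.92, 'ES']", "[0.88, 'FR']"],
})

# Recover the [score, code] pair from its string representation
parsed = df["language"].apply(ast.literal_eval)
df["lang_score"] = parsed.str[0]
df["lang_code"] = parsed.str[1]

# Keep confident predictions, then split into per-language subsets
confident = df[df["lang_score"] >= 0.3]
monolingual = {code: group for code, group in confident.groupby("lang_code")}
print(sorted(monolingual))  # ['EN', 'ES', 'FR']
```

For production-scale curation, the same parsing logic belongs inside a pipeline stage rather than in driver-side Pandas code, for the scheduling reasons noted above.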