---
description: >-
  Identify document languages accurately using FastText models supporting 176
  languages for multilingual text processing
categories:
  - how-to-guides
tags:
  - language-identification
  - fasttext
  - multilingual
  - 176-languages
  - detection
  - classification
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: text-only
---
# Language Identification
Large unlabeled text corpora often contain a variety of languages. NVIDIA NeMo Curator provides tools to accurately identify the language of each document, which is essential for language-specific curation tasks and building high-quality monolingual datasets.
## How it Works
NeMo Curator's language identification system works through a three-step process:
1. **Text Preprocessing**: For FastText classification, normalize input text by stripping whitespace and converting newlines to spaces.
2. **FastText Language Detection**: The pre-trained FastText language identification model ([`lid.176.bin`](https://fasttext.cc/docs/en/language-identification.html)) analyzes the preprocessed text and returns:
   * A confidence score (0.0 to 1.0) indicating certainty of the prediction
   * A language code (for example, "EN", "ES", "FR"), taken from the FastText label (such as `__label__en`) and converted to uppercase
3. **Filtering and Scoring**: The pipeline filters documents based on a configurable confidence threshold (`min_langid_score`) and stores both the confidence score and language code as metadata.
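The preprocessing step can be sketched in plain Python. This is a minimal illustration, not Curator's implementation: `normalize_for_fasttext` is a hypothetical helper name. It matters because fastText's Python `predict` call works on one line at a time and expects input without newline characters.

```python
def normalize_for_fasttext(text: str) -> str:
    """Strip surrounding whitespace and replace newlines with spaces."""
    return text.strip().replace("\n", " ")


raw = "  Bonjour le monde.\nComment allez-vous ?\n"
clean = normalize_for_fasttext(raw)
print(repr(clean))  # 'Bonjour le monde. Comment allez-vous ?'
```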
### Language Detection Process
The `FastTextLangId` filter implements this workflow by:
* Loading the FastText language identification model on worker initialization
* Processing text through `model.predict()` with `k=1` to get the top language prediction
* Extracting the language code from FastText labels (for example, `__label__en` becomes "EN")
* Comparing confidence scores against the threshold to determine document retention
* Returning results as `[confidence_score, language_code]` for downstream processing
This approach supports **176 languages** with high accuracy, making it suitable for large-scale multilingual dataset curation where language-specific processing and monolingual dataset creation are critical.
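The workflow above can be sketched as follows. This is an illustration under stated assumptions rather than the `FastTextLangId` source: `StubModel`, `score_document`, and `keep_document` are hypothetical names, and the stub only mimics the `(labels, probabilities)` shape that fastText's `predict()` returns so the example runs without downloading a model.

```python
LABEL_PREFIX = "__label__"


class StubModel:
    """Hypothetical stand-in that mimics the return shape of fastText's predict()."""

    def predict(self, text, k=1):
        # Toy rule for illustration only; a real model scores all 176 languages.
        if "hola" in text.lower():
            return ("__label__es",), [0.97]
        return ("__label__en",), [0.25]


def score_document(model, text):
    """Return [confidence_score, language_code] for one document."""
    text = text.strip().replace("\n", " ")           # preprocessing
    labels, probs = model.predict(text, k=1)         # top prediction only
    lang_code = labels[0][len(LABEL_PREFIX):].upper()  # "__label__es" -> "ES"
    return [float(probs[0]), lang_code]


def keep_document(score, min_langid_score=0.3):
    """Keep the document when its confidence meets the threshold."""
    return score[0] >= min_langid_score


model = StubModel()
for doc in ["Hola, ¿cómo estás?\n", "ok"]:
    score = score_document(model, doc)
    print(score, keep_document(score))
```

With the real model, the stub would be replaced by something like `fasttext.load_model("/path/to/lid.176.bin")`, which exposes the same `predict(text, k=1)` call.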
## Usage
The following example demonstrates how to create a language identification pipeline using Curator with distributed processing.
```python
"""Language identification using Curator."""
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.filters import FastTextLangId
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import ScoreFilter


def create_language_identification_pipeline(data_dir: str) -> Pipeline:
    """Create a pipeline for language identification."""
    # Define pipeline
    pipeline = Pipeline(
        name="language_identification",
        description="Identify document languages using FastText",
    )

    # Add stages
    # 1. Reader stage - creates tasks from JSONL files
    pipeline.add_stage(
        JsonlReader(
            file_paths=data_dir,
            files_per_partition=2,  # Each task processes 2 files
        )
    )

    # 2. Language identification with filtering
    # IMPORTANT: Download lid.176.bin or lid.176.ftz from https://fasttext.cc/docs/en/language-identification.html
    fasttext_model_path = "/path/to/lid.176.bin"  # or lid.176.ftz (compressed)
    pipeline.add_stage(
        ScoreFilter(
            FastTextLangId(model_path=fasttext_model_path, min_langid_score=0.3),
            score_field="language",
        )
    )

    return pipeline


def main():
    # Create pipeline
    pipeline = create_language_identification_pipeline("./data")

    # Print pipeline description
    print(pipeline.describe())

    # Run the pipeline (no executor is passed explicitly here)
    results = pipeline.run()

    # Process results; treat a missing result as an empty list
    results = results or []
    total_documents = sum(task.num_items for task in results)
    print(f"Total documents processed: {total_documents}")

    # Access language scores
    for i, batch in enumerate(results):
        if batch.num_items > 0:
            df = batch.to_pandas()
            print(f"Batch {i} columns: {list(df.columns)}")
            # Language scores are now in the 'language' field


if __name__ == "__main__":
    main()
```
## Understanding Results
The language identification process adds a score field to each document batch:
1. **`language` field**: Contains the FastText language identification results as a string representation of a list with two elements (for backend compatibility):
   * Element 0: The confidence score (between 0 and 1)
   * Element 1: The uppercase language code derived from the FastText label (for example, "EN" for English, "ES" for Spanish)
2. **Task-based processing**: Curator processes documents in batches (tasks), and results are available through the task's Pandas DataFrame:
```python
# Access results from pipeline execution
for batch in results:
    df = batch.to_pandas()
    # Language scores are in the 'language' column
    print(df[['text', 'language']].head())
```
For quick exploratory inspection, converting a `DocumentBatch` to a Pandas DataFrame is fine. For performance and scalability, write transformations as `ProcessingStage`s (or with the `@processing_stage` decorator) and run them inside a `Pipeline` with an executor. Curator’s parallelism and resource scheduling apply when code runs as pipeline stages; ad‑hoc Pandas code executes on the driver and will not scale.
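For exploratory inspection of that kind, the string-encoded `language` values can be split back into a score and a code. The sketch below assumes each value looks like `[0.98, 'EN']`, that is, a Python-literal list; the exact encoding is an assumption, so inspect one value from your own output first and adjust the parser if it differs. `parse_language_field` is a hypothetical helper, and the sample rows are made up for illustration.

```python
import ast
from collections import Counter


def parse_language_field(value: str) -> tuple[float, str]:
    """Turn a string such as "[0.98, 'EN']" into (0.98, "EN")."""
    score, code = ast.literal_eval(value)
    return float(score), str(code)


# Illustrative values, standing in for df["language"].tolist()
rows = ["[0.98, 'EN']", "[0.91, 'ES']", "[0.45, 'EN']"]

parsed = [parse_language_field(v) for v in rows]

# Language distribution across the inspected documents
counts = Counter(code for _, code in parsed)
print(counts)

# Candidate rows for an English-only subset with a stricter threshold
english = [(score, code) for score, code in parsed if code == "EN" and score >= 0.5]
print(english)
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals and does not execute arbitrary code.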