
Language Identification


Large unlabeled text corpora often contain a variety of languages. NVIDIA NeMo Curator provides tools to accurately identify the language of each document, which is essential for language-specific curation tasks and building high-quality monolingual datasets.

How it Works

NeMo Curator’s language identification system works through a three-step process:

  1. Text Preprocessing: The input text is normalized for FastText classification by stripping leading and trailing whitespace and converting newlines to spaces.

  2. FastText Language Detection: The pre-trained FastText language identification model (lid.176.bin) analyzes the preprocessed text and returns:

    • A confidence score (0.0 to 1.0) indicating certainty of the prediction
    • A language code (for example, “EN”, “ES”, “FR”) in FastText’s two-letter uppercase format
  3. Filtering and Scoring: The pipeline filters documents based on a configurable confidence threshold (min_langid_score) and stores both the confidence score and language code as metadata.
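The preprocessing and threshold check above can be sketched in a few lines. The helper names here are illustrative, not Curator's actual API:

```python
def preprocess_for_fasttext(text: str) -> str:
    """Step 1: strip surrounding whitespace and flatten newlines to spaces."""
    return text.strip().replace("\n", " ")


def keep_document(confidence: float, min_langid_score: float = 0.3) -> bool:
    """Step 3: retain the document only if the prediction is confident enough."""
    return confidence >= min_langid_score


print(preprocess_for_fasttext("  Hello\nworld  "))  # "Hello world"
print(keep_document(0.92))  # True
print(keep_document(0.10))  # False
```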

Language Detection Process

The FastTextLangId filter implements this workflow by:

  • Loading the FastText language identification model on worker initialization
  • Processing text through model.predict() with k=1 to get the top language prediction
  • Extracting the language code from FastText labels (for example, __label__en becomes “EN”)
  • Comparing confidence scores against the threshold to determine document retention
  • Returning results as [confidence_score, language_code] for downstream processing
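The label-extraction step can be sketched as follows. This is a simplified stand-in for the filter's internals, assuming the standard `fasttext` package's `predict()` return shape (a tuple of labels and a parallel array of probabilities); the model file must be downloaded separately:

```python
def extract_language(labels, scores) -> list:
    """Convert a fasttext prediction like (("__label__en",), [0.98])
    into the [confidence_score, language_code] pair described above."""
    code = labels[0].removeprefix("__label__").upper()
    return [float(scores[0]), code]


# With a loaded model (download lid.176.bin first):
#   import fasttext
#   model = fasttext.load_model("/path/to/lid.176.bin")
#   labels, scores = model.predict(text.strip().replace("\n", " "), k=1)
#   result = extract_language(labels, scores)

print(extract_language(("__label__en",), [0.98]))  # [0.98, "EN"]
```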

This approach supports 176 languages with high accuracy, making it suitable for large-scale multilingual dataset curation where language-specific processing and monolingual dataset creation are critical.

Usage

The following example demonstrates how to create a language identification pipeline using Curator with distributed processing.

```python
"""Language identification using Curator."""

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.filters import FastTextLangId
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import ScoreFilter


def create_language_identification_pipeline(data_dir: str) -> Pipeline:
    """Create a pipeline for language identification."""

    # Define pipeline
    pipeline = Pipeline(
        name="language_identification",
        description="Identify document languages using FastText",
    )

    # Add stages
    # 1. Reader stage - creates tasks from JSONL files
    pipeline.add_stage(
        JsonlReader(
            file_paths=data_dir,
            files_per_partition=2,  # Each task processes 2 files
        )
    )

    # 2. Language identification with filtering
    # IMPORTANT: Download lid.176.bin or lid.176.ftz from
    # https://fasttext.cc/docs/en/language-identification.html
    fasttext_model_path = "/path/to/lid.176.bin"  # or lid.176.ftz (compressed)
    pipeline.add_stage(
        ScoreFilter(
            FastTextLangId(model_path=fasttext_model_path, min_langid_score=0.3),
            score_field="language",
        )
    )

    return pipeline


def main():
    # Create pipeline
    pipeline = create_language_identification_pipeline("./data")

    # Print pipeline description
    print(pipeline.describe())

    # Run the pipeline (the default executor is used when none is specified)
    results = pipeline.run()

    # Process results
    total_documents = sum(task.num_items for task in results) if results else 0
    print(f"Total documents processed: {total_documents}")

    # Access language scores
    for i, batch in enumerate(results):
        if batch.num_items > 0:
            df = batch.to_pandas()
            print(f"Batch {i} columns: {list(df.columns)}")
            # Language scores are now in the 'language' field


if __name__ == "__main__":
    main()
```

Understanding Results

The language identification process adds a score field to each document batch:

  1. language field: Contains the FastText language identification results as a string representation of a list with two elements (for backend compatibility):

    • Element 0: The confidence score (between 0 and 1)
    • Element 1: The language code in FastText format (for example, “EN” for English, “ES” for Spanish)
  2. Task-based processing: Curator processes documents in batches (tasks), and results are available through the task’s Pandas DataFrame:

```python
# Access results from pipeline execution
for batch in results:
    df = batch.to_pandas()
    # Language scores are in the 'language' column
    print(df[['text', 'language']].head())
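Because the `language` field is stored as the string representation of a list, you may want to split it back into separate columns for analysis. A minimal sketch, assuming the field looks like `"[0.98, 'EN']"`:

```python
import ast


def split_language_field(value: str) -> tuple:
    """Parse a stored value such as "[0.98, 'EN']" into (score, code)."""
    score, code = ast.literal_eval(value)
    return float(score), code


score, code = split_language_field("[0.98, 'EN']")
print(score, code)  # 0.98 EN
```

With Pandas, this can be applied column-wise, for example `df['language'].apply(split_language_field)`.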

For quick exploratory inspection, converting a DocumentBatch to a Pandas DataFrame is fine. For performance and scalability, write transformations as ProcessingStages (or with the @processing_stage decorator) and run them inside a Pipeline with an executor. Curator’s parallelism and resource scheduling apply when code runs as pipeline stages; ad‑hoc Pandas code executes on the driver and will not scale.