Language Identification#
Large unlabeled text corpora often contain a variety of languages. NVIDIA NeMo Curator provides tools to accurately identify the language of each document, which is essential for language-specific curation tasks and building high-quality monolingual datasets.
How it Works#
NeMo Curator’s language identification system works through a three-step process:
1. Text Preprocessing: For FastText classification, normalize input text by stripping whitespace and converting newlines to spaces.
2. FastText Language Detection: The pre-trained FastText language identification model (lid.176.bin) analyzes the preprocessed text and returns:
   - A confidence score (0.0 to 1.0) indicating the certainty of the prediction
   - A language code (for example, "EN", "ES", "FR") in FastText's two-letter uppercase format
3. Filtering and Scoring: The pipeline filters documents based on a configurable confidence threshold (min_langid_score) and stores both the confidence score and the language code as metadata.
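The preprocessing and thresholding steps above can be sketched in plain Python (a minimal illustration; the actual detection step uses the FastText model, which is omitted here):

```python
def preprocess(text: str) -> str:
    # Step 1: strip whitespace and flatten newlines so FastText
    # receives a single line of text
    return text.strip().replace("\n", " ")


def keep_document(confidence: float, min_langid_score: float = 0.3) -> bool:
    # Step 3: retain only documents whose top prediction meets the threshold
    return confidence >= min_langid_score


print(preprocess("  Hello\nworld  "))  # "Hello world"
print(keep_document(0.85))             # True
print(keep_document(0.2))              # False
```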
Language Detection Process#
The FastTextLangId filter implements this workflow by:
- Loading the FastText language identification model on worker initialization
- Processing text through model.predict() with k=1 to get the top language prediction
- Extracting the language code from FastText labels (for example, __label__en becomes "EN")
- Comparing confidence scores against the threshold to determine document retention
- Returning results as [confidence_score, language_code] for downstream processing
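The label-extraction and thresholding steps of this workflow can be sketched as follows. The `parse_fasttext_prediction` helper is hypothetical; its inputs mimic the `(labels, scores)` tuple shape returned by FastText's `model.predict(text, k=1)`:

```python
def parse_fasttext_prediction(labels, scores, min_langid_score: float = 0.3):
    """Convert raw FastText k=1 output into [confidence, language_code].

    Hypothetical helper: `labels` is a sequence of label strings and
    `scores` a parallel sequence of probabilities, mirroring the shape
    of fasttext's model.predict() return value.
    """
    label, score = labels[0], float(scores[0])
    # "__label__en" -> "EN" (FastText's two-letter code, uppercased)
    lang_code = label.replace("__label__", "").upper()
    # Compare against the threshold to decide document retention
    keep = score >= min_langid_score
    return [score, lang_code], keep


result, keep = parse_fasttext_prediction(("__label__en",), [0.97])
print(result)  # [0.97, 'EN']
print(keep)    # True
```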
This approach supports 176 languages with high accuracy, making it suitable for large-scale multilingual dataset curation where language-specific processing and monolingual dataset creation are critical.
Usage#
The following example demonstrates how to create a language identification pipeline using Curator with distributed processing.
"""Language identification using Curator."""
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.filters import FastTextLangId
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.modules import ScoreFilter
def create_language_identification_pipeline(data_dir: str) -> Pipeline:
"""Create a pipeline for language identification."""
# Define pipeline
pipeline = Pipeline(
name="language_identification",
description="Identify document languages using FastText"
)
# Add stages
# 1. Reader stage - creates tasks from JSONL files
pipeline.add_stage(
JsonlReader(
file_paths=data_dir,
files_per_partition=2, # Each task processes 2 files
)
)
# 2. Language identification with filtering
# IMPORTANT: Download lid.176.bin or lid.176.ftz from https://fasttext.cc/docs/en/language-identification.html
fasttext_model_path = "/path/to/lid.176.bin" # or lid.176.ftz (compressed)
pipeline.add_stage(
ScoreFilter(
FastTextLangId(model_path=fasttext_model_path, min_langid_score=0.3),
score_field="language"
)
)
return pipeline
def main():
# Create pipeline
pipeline = create_language_identification_pipeline("./data")
# Print pipeline description
print(pipeline.describe())
# Create executor and run
results = pipeline.run()
# Process results
total_documents = sum(task.num_items for task in results) if results else 0
print(f"Total documents processed: {total_documents}")
# Access language scores
for i, batch in enumerate(results):
if batch.num_items > 0:
df = batch.to_pandas()
print(f"Batch {i} columns: {list(df.columns)}")
# Language scores are now in the 'language' field
if __name__ == "__main__":
main()
Understanding Results#
The language identification process adds a score field to each document batch:
- language field: Contains the FastText language identification results as a string representation of a two-element list (for backend compatibility):
  - Element 0: The confidence score (between 0 and 1)
  - Element 1: The language code in FastText format (for example, "EN" for English, "ES" for Spanish)
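Because the field is stored as a string, downstream code needs to parse it back into a score and a code. A minimal sketch using the standard library's `ast.literal_eval` (assuming the stringified form shown above, for example "[0.97, 'EN']"):

```python
import ast


def parse_language_field(value: str):
    # The 'language' field stores a stringified two-element list,
    # e.g. "[0.97, 'EN']"; literal_eval safely recovers score and code.
    score, code = ast.literal_eval(value)
    return float(score), code


score, code = parse_language_field("[0.97, 'EN']")
print(score, code)  # 0.97 EN
```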
Task-based processing: Curator processes documents in batches (tasks), and results are available through the task’s Pandas DataFrame:
# Access results from pipeline execution
for batch in results:
    df = batch.to_pandas()
    # Language scores are in the 'language' column
    print(df[['text', 'language']].head())
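For building monolingual datasets, documents can then be bucketed by their detected language code. A pandas-free sketch (the `split_by_language` helper and the `rows` structure are illustrative, not part of the Curator API):

```python
import ast
from collections import defaultdict


def split_by_language(rows):
    """Group documents into per-language buckets (hypothetical helper).

    `rows` is a list of dicts with 'text' and 'language' keys, where
    'language' holds the stringified "[score, code]" pair Curator stores.
    """
    buckets = defaultdict(list)
    for row in rows:
        _score, code = ast.literal_eval(row["language"])
        buckets[code].append(row["text"])
    return dict(buckets)


rows = [
    {"text": "Hello", "language": "[0.98, 'EN']"},
    {"text": "Hola", "language": "[0.95, 'ES']"},
    {"text": "World", "language": "[0.91, 'EN']"},
]
print(split_by_language(rows))  # {'EN': ['Hello', 'World'], 'ES': ['Hola']}
```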
Tip
For quick exploratory inspection, converting a DocumentBatch to a Pandas DataFrame is fine. For performance and scalability, write transformations as ProcessingStages (or with the @processing_stage decorator) and run them inside a Pipeline with an executor. Curator’s parallelism and resource scheduling apply when code runs as pipeline stages; ad‑hoc Pandas code executes on the driver and will not scale.