Language Identification
Large unlabeled text corpora often contain a variety of languages. NVIDIA NeMo Curator provides tools to accurately identify the language of each document, which is essential for language-specific curation tasks and building high-quality monolingual datasets.
How it Works
NeMo Curator’s language identification system works through a three-step process:
- Text Preprocessing: For FastText classification, normalize input text by stripping whitespace and converting newlines to spaces.
- FastText Language Detection: The pre-trained FastText language identification model (lid.176.bin) analyzes the preprocessed text and returns:
  - A confidence score (0.0 to 1.0) indicating certainty of the prediction
  - An uppercase language code (for example, “EN”, “ES”, “FR”) derived from the FastText label
- Filtering and Scoring: The pipeline filters documents based on a configurable confidence threshold (min_langid_score) and stores both the confidence score and language code as metadata.
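The preprocessing step matters because fastText's predict call handles one line at a time and rejects input containing newline characters. A minimal sketch of that normalization follows; the function name normalize_for_fasttext is hypothetical and is not part of Curator's API.

```python
def normalize_for_fasttext(text: str) -> str:
    """Strip surrounding whitespace and replace newlines with spaces.

    Hypothetical helper for illustration; it mirrors the preprocessing
    described above, not Curator's actual implementation.
    """
    return text.strip().replace("\n", " ")


# A multi-line document collapses into a single line.
doc = "  Bonjour le monde.\nComment allez-vous ?\n"
print(normalize_for_fasttext(doc))
```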
Language Detection Process
The FastTextLangId filter implements this workflow by:
- Loading the FastText language identification model on worker initialization
- Processing text through model.predict() with k=1 to get the top language prediction
- Extracting the language code from FastText labels (for example, __label__en becomes “EN”)
- Comparing confidence scores against the threshold to determine document retention
- Returning results as [confidence_score, language_code] for downstream processing
This approach supports 176 languages with high accuracy, making it suitable for large-scale multilingual dataset curation where language-specific processing and monolingual dataset creation are critical.
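The workflow above can be sketched in plain Python. This is an illustration under stated assumptions, not Curator's FastTextLangId source: StubLangIdModel is a hypothetical stand-in that only mimics the (labels, scores) shape returned by fastText's predict, the class and method names are invented for the example, and the use of >= for the threshold comparison is an assumption.

```python
from typing import List, Sequence, Tuple

LABEL_PREFIX = "__label__"


class StubLangIdModel:
    """Hypothetical stand-in for a loaded lid.176.bin model (not real fastText)."""

    def predict(self, text: str, k: int = 1) -> Tuple[Tuple[str, ...], Sequence[float]]:
        # Mirror fastText's refusal to process text containing newlines.
        if "\n" in text:
            raise ValueError("predict processes one line at a time")
        # Canned predictions, purely for demonstration.
        if "hola" in text.lower():
            return ("__label__es",), [0.97]
        return ("__label__en",), [0.42]


class LangIdSketch:
    """Illustrative filter: score a document, then decide whether to keep it."""

    def __init__(self, model, min_langid_score: float):
        self.model = model
        self.min_langid_score = min_langid_score

    def score_document(self, text: str) -> List:
        # Preprocess: strip whitespace and turn newlines into spaces.
        cleaned = text.strip().replace("\n", " ")
        # Ask only for the single most likely language.
        labels, scores = self.model.predict(cleaned, k=1)
        # "__label__en" -> "EN"
        lang_code = labels[0][len(LABEL_PREFIX):].upper()
        return [float(scores[0]), lang_code]

    def keep_document(self, score: List) -> bool:
        # Assumption: documents at or above the threshold are retained.
        return score[0] >= self.min_langid_score


lang_filter = LangIdSketch(StubLangIdModel(), min_langid_score=0.5)
for text in ["  Hola\nmundo ", "hello world"]:
    result = lang_filter.score_document(text)
    print(result, "keep" if lang_filter.keep_document(result) else "drop")
```

Splitting scoring from the keep decision means the [confidence_score, language_code] pair can be stored as metadata even for documents that are later filtered out.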
Usage
The following example demonstrates how to create a language identification pipeline using Curator with distributed processing.
Python
Understanding Results
The language identification process adds a language field to the documents in each batch:
- language field: Contains the FastText language identification results as a string representation of a list with two elements (for backend compatibility):
  - Element 0: The confidence score (between 0 and 1)
  - Element 1: The language code in FastText format (for example, “EN” for English, “ES” for Spanish)
- Task-based processing: Curator processes documents in batches (tasks), and results are available through the task’s Pandas DataFrame.
For quick exploratory inspection, converting a DocumentBatch to a Pandas DataFrame is fine. For performance and scalability, write transformations as ProcessingStages (or with the @processing_stage decorator) and run them inside a Pipeline with an executor. Curator’s parallelism and resource scheduling apply when code runs as pipeline stages; ad‑hoc Pandas code executes on the driver and will not scale.
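Because the language field holds a string representation of a list rather than a list, it has to be parsed before the score or code can be used. The sketch below does this with Python's standard ast.literal_eval on plain dictionaries. The exact string layout (for example "[0.98, 'EN']"), the sample rows, and the helper name parse_language_field are assumptions for illustration; check a real output record before relying on them.

```python
import ast
from typing import Tuple


def parse_language_field(value: str) -> Tuple[float, str]:
    """Parse a stringified [score, code] pair into (score, code).

    Hypothetical helper; assumes the field looks like "[0.98, 'EN']".
    """
    score, code = ast.literal_eval(value)
    return float(score), str(code)


# Made-up rows standing in for records read back from a processed batch.
rows = [
    {"id": "doc-1", "language": "[0.98, 'EN']"},
    {"id": "doc-2", "language": "[0.91, 'ES']"},
    {"id": "doc-3", "language": "[0.35, 'EN']"},
]

# Select the documents identified as English.
english_ids = [
    row["id"]
    for row in rows
    if parse_language_field(row["language"])[1] == "EN"
]
print(english_ids)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for parsing values read from data files.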