Language Management#
Handle multilingual content and language-specific processing requirements using NeMo Curator’s tools and utilities.
NeMo Curator provides robust tools for managing multilingual text datasets through language detection, stop word management, and specialized handling for non-spaced languages. These tools are essential for creating high-quality monolingual datasets and applying language-specific processing.
How it Works#
Language management in NeMo Curator typically follows this pattern:
import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId
from nemo_curator.filters.heuristic_filter import HistogramFilter
from nemo_curator.modules.filter import ScoreFilter
from nemo_curator.utils.text_utils import get_word_splitter
# Load your dataset
dataset = DocumentDataset.read_json("input_data/*.jsonl")
# Identify languages using FastText
lang_filter = ScoreFilter(
FastTextLangId(
model_path="lid.176.bin",
min_langid_score=0.8
),
text_field="text",
score_field="language"
)
# Apply language identification
dataset = lang_filter(dataset)
# Apply language-specific processing
for lang, subset in dataset.groupby("language"):
if lang in ["zh", "ja", "th", "ko"]:
# Special handling for non-spaced languages
processor = get_word_splitter(lang)
# Note: Word splitter returns a function for processing text
# Apply as needed for your specific use case
# Apply language-specific stop words
stop_filter = ScoreFilter(
HistogramFilter(lang=lang),
text_field="text",
score_field="quality"
)
subset = stop_filter(subset)
Language Processing Capabilities#
Language detection using FastText and CLD2 (176+ languages)
Stop word management with built-in lists and customizable thresholds
Special handling for non-spaced languages (Chinese, Japanese, Thai, Korean)
Language-specific text processing and quality filtering
Available Tools#
Identify document languages and separate multilingual datasets
Manage high-frequency words to enhance text extraction and content detection
Usage#
Quick Start Example#
from nemo_curator import ScoreFilter
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId
# Load multilingual dataset
dataset = DocumentDataset.read_json("multilingual_data/*.jsonl")
# Identify languages
langid_filter = ScoreFilter(
FastTextLangId(
model_path="/path/to/lid.176.bin", # Download from fasttext.cc
min_langid_score=0.3
),
text_field="text",
score_field="language",
score_type="object"
)
# Apply language identification
identified_dataset = langid_filter(dataset)
# Extract language codes
identified_dataset.df["language"] = identified_dataset.df["language"].apply(
lambda score: score[1] # Extract language code from [score, lang_code]
)
# Filter for specific languages
english_docs = identified_dataset[identified_dataset.df.language == "EN"]
spanish_docs = identified_dataset[identified_dataset.df.language == "ES"]
# Save by language
english_docs.to_json("output/english/", write_to_filename=True)
spanish_docs.to_json("output/spanish/", write_to_filename=True)
Supported Languages#
NeMo Curator supports language identification for 176 languages through FastText, including:
Major languages: English, Spanish, French, German, Chinese, Japanese, Arabic, Russian
Regional languages: Many local and regional languages worldwide
Special handling: Non-spaced languages (Chinese, Japanese, Thai, Korean)
Best Practices#
Download the FastText model: Get
lid.176.bin
from fasttext.ccSet appropriate thresholds: Balance precision vs. recall based on your needs
Handle non-spaced languages: Use special processing for Chinese, Japanese, Thai, Korean
Validate on your domain: Test language detection accuracy on your specific data