# Content Processing & Cleaning
Use NeMo Curator's tools and utilities to clean, normalize, and transform text content so it meets the requirements of language model training.

Content processing transforms your text data while preserving its essential information. This includes fixing encoding issues and standardizing text formats to ensure high-quality input for model training.
## How it Works
Content processing transformations typically modify documents in place or create new versions with specific changes. Most processing tools follow this pattern:
1. Load your dataset using pipeline readers (`JsonlReader`, `ParquetReader`)
2. Configure and apply the appropriate processor
3. Save the transformed dataset for further processing
You can combine processing tools in sequence or use them alongside other curation steps like filtering and language management.
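For example, a cleaning stage can feed directly into a filtering stage. In the sketch below, the reader, `Modify`, and writer stages follow the layout used on this page, while the `ScoreFilter` and `WordCountFilter` import paths are assumptions modeled on that layout; confirm them against your installed NeMo Curator version.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modifiers import UnicodeReformatter
from nemo_curator.stages.text.modules import Modify

# Assumed import paths for the filtering pieces; verify in your version.
from nemo_curator.stages.text.filters import WordCountFilter
from nemo_curator.stages.text.modules import ScoreFilter

pipeline = Pipeline(name="clean_then_filter")
pipeline.add_stage(JsonlReader(file_paths="input_data/*.jsonl"))

# Clean first so the filter scores normalized text.
pipeline.add_stage(Modify(modifier=UnicodeReformatter(), text_field="text"))

# Drop documents shorter than 50 words.
pipeline.add_stage(ScoreFilter(WordCountFilter(min_words=50), text_field="text"))

pipeline.add_stage(JsonlWriter(path="clean_filtered/"))
pipeline.run()
```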
## Available Processing Tools
- Add unique identifiers to documents for tracking and deduplication (see the sketch after this list)
- Fix Unicode issues, standardize spacing, and remove URLs
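As a sketch of the ID generation tool, assuming an `AddId` stage that lives alongside `Modify` and takes the `id_field`/`id_prefix` arguments of the classic NeMo Curator `AddId` module; verify both against your installed version.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter

# Assumed import path and signature; check your NeMo Curator version.
from nemo_curator.stages.text.modules import AddId

pipeline = Pipeline(name="add_ids")
pipeline.add_stage(JsonlReader(file_paths="input_data/*.jsonl"))

# Write a stable identifier such as "doc-0" into each document's "id" field,
# which downstream deduplication stages can use to report duplicates.
pipeline.add_stage(AddId(id_field="id", id_prefix="doc"))

pipeline.add_stage(JsonlWriter(path="with_ids/"))
pipeline.run()
```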
## Usage
Here’s an example of a typical content processing pipeline:
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modifiers import NewlineNormalizer, UnicodeReformatter, UrlRemover
from nemo_curator.stages.text.modules import Modify

# Create a comprehensive cleaning pipeline
processing_pipeline = Pipeline(
    name="content_processing_pipeline",
    description="Comprehensive text cleaning and processing",
)

# Load dataset
reader = JsonlReader(file_paths="input_data/*.jsonl")
processing_pipeline.add_stage(reader)

# Fix Unicode encoding issues
processing_pipeline.add_stage(
    Modify(modifier=UnicodeReformatter(), text_field="text")
)

# Standardize newlines
processing_pipeline.add_stage(
    Modify(modifier=NewlineNormalizer(), text_field="text")
)

# Remove URLs
processing_pipeline.add_stage(
    Modify(modifier=UrlRemover(), text_field="text")
)

# Save the processed dataset
writer = JsonlWriter(path="processed_output/")
processing_pipeline.add_stage(writer)

# Execute pipeline
results = processing_pipeline.run()
```
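Stage order matters: Unicode repair runs first so the newline and URL stages operate on already-fixed text, and the writer is added last so it receives the fully transformed documents.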
## Common Processing Tasks

### Text Normalization

- Fix broken Unicode characters (mojibake)
- Standardize whitespace and newlines
- Remove or normalize special characters
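A minimal, self-contained sketch of these three steps using ftfy (the mojibake-repair library that `UnicodeReformatter` builds on) and the standard library; the `normalize_text` helper is illustrative, not a NeMo Curator API.

```python
import re

import ftfy

def normalize_text(text: str) -> str:
    # Repair broken Unicode such as "schÃ¶n" -> "schön".
    text = ftfy.fix_text(text)
    # Collapse runs of spaces and tabs within each line.
    text = "\n".join(" ".join(line.split()) for line in text.splitlines())
    # Squeeze three or more consecutive newlines down to two.
    return re.sub(r"\n{3,}", "\n\n", text)

print(normalize_text("schÃ¶n   gemacht\n\n\n\ndone"))
# schön gemacht
#
# done
```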
### Content Sanitization

- Strip unwanted URLs or links
- Remove boilerplate text or headers
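`UrlRemover` in the pipeline above handles the first task; boilerplate usually needs a custom modifier. This sketch assumes `DocumentModifier` is importable alongside the other modifiers on this page and exposes the `modify_document` hook used by classic NeMo Curator; the class name and regex are hypothetical placeholders.

```python
import re

# Assumed location and interface; verify against your NeMo Curator version.
from nemo_curator.stages.text.modifiers import DocumentModifier

# Hypothetical boilerplate patterns; tailor these to your corpus.
BOILERPLATE_RE = re.compile(
    r"^(subscribe to our newsletter|all rights reserved).*$",
    re.IGNORECASE | re.MULTILINE,
)

class BoilerplateStripper(DocumentModifier):
    """Illustrative modifier that deletes known boilerplate lines."""

    def modify_document(self, text: str) -> str:
        return BOILERPLATE_RE.sub("", text).strip()
```

Apply it like any other modifier: `Modify(modifier=BoilerplateStripper(), text_field="text")`.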
### Format Standardization

- Ensure consistent text encoding
- Normalize punctuation and spacing
- Standardize document structure
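These steps need nothing beyond Python's standard library; the `standardize` helper below is illustrative, not a NeMo Curator API.

```python
import re
import unicodedata

def standardize(text: str) -> str:
    # NFKC folds compatibility characters: full-width letters, ligatures,
    # and typographic variants collapse to canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # Map curly quotes and dashes to plain ASCII punctuation.
    text = text.translate(str.maketrans({
        "\u2018": "'", "\u2019": "'",
        "\u201c": '"', "\u201d": '"',
        "\u2013": "-", "\u2014": "-",
    }))
    # Collapse runs of spaces and tabs to single spaces.
    return re.sub(r"[ \t]+", " ", text).strip()

print(standardize("\u201cHello\u2014world\u201d"))  # "Hello-world" in straight quotes
```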