---
description: >-
  Clean, normalize, and transform text content to meet specific requirements
  including text cleaning and normalization
categories:
  - workflows
tags:
  - content-processing
  - text-cleaning
  - unicode
  - normalization
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: workflow
modality: text-only
---

# Content Processing & Cleaning

Clean, normalize, and transform text content to meet specific requirements for training language models using NeMo Curator's tools and utilities.

Content processing involves transforming your text data while preserving essential information. This includes fixing encoding issues and standardizing text format to ensure high-quality input for model training.

## How it Works

Content processing transformations typically modify documents in place or create new versions with specific changes. Most processing tools follow this pattern:

1. Load your dataset using pipeline readers (`JsonlReader`, `ParquetReader`)
2. Configure and apply the appropriate processor
3. Save the transformed dataset for further processing

You can combine processing tools in sequence or use them alongside other curation steps such as filtering and language management.
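Conceptually, the pattern is read → transform → write. The library-free sketch below mimics that flow over JSONL records; `run_pipeline` and the two modifiers are hypothetical illustrations, not NeMo Curator APIs:

```python
import json
import re

def run_pipeline(lines, modifiers, field="text"):
    """Apply each modifier, in order, to one field of every JSONL record.

    A toy stand-in for reader -> processing stages -> writer;
    not NeMo Curator's actual implementation.
    """
    out = []
    for line in lines:
        doc = json.loads(line)           # 1. load the record
        for mod in modifiers:            # 2. apply each processor in sequence
            doc[field] = mod(doc[field])
        out.append(json.dumps(doc, ensure_ascii=False))  # 3. save the result
    return out

def strip_urls(text):
    # Drop http(s) URLs, mirroring what a URL-removal modifier does.
    return re.sub(r"https?://\S+", "", text)

def squeeze_whitespace(text):
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

records = ['{"text": "see  https://example.com   for details"}']
print(run_pipeline(records, [strip_urls, squeeze_whitespace]))
# -> ['{"text": "see for details"}']
```

In NeMo Curator the same three steps are expressed as pipeline stages, as shown in the usage example below.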
## Available Processing Tools

* Add unique identifiers to documents for tracking and deduplication
* Fix Unicode issues, standardize spacing, and remove URLs

## Usage

Here's an example of a typical content processing pipeline:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modifiers import UnicodeReformatter, UrlRemover, NewlineNormalizer
from nemo_curator.stages.text.modules import Modify

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create a comprehensive cleaning pipeline
processing_pipeline = Pipeline(
    name="content_processing_pipeline",
    description="Comprehensive text cleaning and processing",
)

# Load dataset
reader = JsonlReader(file_paths="input_data/")
processing_pipeline.add_stage(reader)

# Fix Unicode encoding issues
processing_pipeline.add_stage(
    Modify(modifier_fn=UnicodeReformatter(), input_fields="text")
)

# Standardize newlines
processing_pipeline.add_stage(
    Modify(modifier_fn=NewlineNormalizer(), input_fields="text")
)

# Remove URLs
processing_pipeline.add_stage(
    Modify(modifier_fn=UrlRemover(), input_fields="text")
)

# Save the processed dataset
writer = JsonlWriter(path="processed_output/")
processing_pipeline.add_stage(writer)

# Execute pipeline
results = processing_pipeline.run()

# Stop Ray client
ray_client.stop()
```

## Common Processing Tasks

### Text Normalization

* Fix broken Unicode characters (mojibake)
* Standardize whitespace and newlines
* Remove or normalize special characters

### Content Sanitization

* Strip unwanted URLs or links
* Remove boilerplate text or headers

### Format Standardization

* Ensure consistent text encoding
* Normalize punctuation and spacing
* Standardize document structure
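To make the normalization tasks above concrete, here is a small, library-free sketch; `clean_text` is a hypothetical helper, not NeMo Curator's `UnicodeReformatter` implementation. It repairs one common form of mojibake (UTF-8 bytes mistakenly decoded as Latin-1), applies NFKC normalization, and collapses whitespace:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Illustrative cleaner: fix one mojibake pattern, normalize Unicode,
    and standardize whitespace. (Hypothetical helper for illustration.)"""
    # Repair UTF-8 text that was mistakenly decoded as Latin-1,
    # e.g. "cafÃ©" -> "café". Leave the text alone if the round-trip fails.
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        repaired = text
    # NFKC normalization folds compatibility characters,
    # e.g. the ligature "ﬁ" becomes the two letters "fi".
    normalized = unicodedata.normalize("NFKC", repaired)
    # Collapse runs of whitespace (tabs, newlines, repeated spaces).
    return re.sub(r"\s+", " ", normalized).strip()

print(clean_text("cafÃ©  menu"))   # -> "café menu"
print(clean_text("ﬁle\tname"))     # -> "file name"
```

Production pipelines handle many more corruption patterns than this; the point is only to show what "fixing encoding issues" and "standardizing whitespace" mean at the character level.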