---
description: >-
  Text processing workflows including quality filtering, fuzzy deduplication,
  content cleaning, and pipeline design
categories:
  - concepts-architecture
tags:
  - data-processing
  - quality-filtering
  - deduplication
  - pipeline
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: text-only
---

# Text Processing Concepts

This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.

## Most Common Workflows

The majority of NeMo Curator users follow these core workflows, typically in this order:

### 1. Quality Filtering

Most users start with basic quality filtering, using heuristic filters to remove low-quality content:

**Essential Quality Filters:**

* `WordCountFilter` - Remove documents that are too short or too long
* `NonAlphaNumericFilter` - Remove symbol-heavy content
* `RepeatedLinesFilter` - Remove documents with highly repetitive content
* `PunctuationFilter` - Ensure proper sentence structure
* `BoilerPlateStringFilter` - Remove documents dominated by template or boilerplate text

### 2. Content Cleaning and Modification

Basic text normalization and cleaning operations:

**Common Cleaning Steps:**

* `UnicodeReformatter` - Normalize Unicode characters
* `NewlineNormalizer` - Standardize line breaks
* Basic HTML/markup removal

### 3. Deduplication

Remove duplicate and near-duplicate content. For comprehensive coverage of all deduplication approaches, refer to Curator's [Deduplication Concepts](/about/concepts/deduplication).

#### Exact Deduplication

Remove identical documents, especially useful for smaller datasets.

**Implementation:** MD5 or SHA-256 hashing for document identification

#### Fuzzy Deduplication

For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources.

**Key Components:**

* Ray distributed computing framework for scalability
* Connected components clustering for duplicate identification

#### Semantic Deduplication

Remove semantically similar content using embeddings for more sophisticated duplicate detection.
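At small scale, the idea can be illustrated with a library-agnostic sketch: embed each document and treat pairs whose cosine similarity exceeds a threshold as near-duplicates. This is only a conceptual example, not NeMo Curator's implementation; the `sentence-transformers` model name and the `0.9` threshold are arbitrary choices for illustration.

```python
# Conceptual sketch of semantic deduplication (not the NeMo Curator API):
# embed documents, then treat highly similar pairs as near-duplicates.
from sentence_transformers import SentenceTransformer  # assumed available

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "A quick brown fox jumped over a lazy dog.",
    "Stock markets closed higher on Friday.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice
embeddings = model.encode(docs, normalize_embeddings=True)

# With normalized embeddings, cosine similarity is just the dot product.
similarity = embeddings @ embeddings.T

threshold = 0.9  # example threshold; tune for your data
duplicates = set()
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if similarity[i, j] >= threshold:
            duplicates.add(j)  # keep the first occurrence, drop later ones

kept = [doc for idx, doc in enumerate(docs) if idx not in duplicates]
print(kept)
```

For the supported, scalable workflow, refer to the Deduplication Concepts page linked above.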
## Core Processing Architecture

NeMo Curator uses these fundamental building blocks that users combine into pipelines:

| Component | Purpose | Usage Pattern |
| --- | --- | --- |
| **`Pipeline`** | Orchestrate processing stages | Add processing stages, typically starting with a read and completing with a write |
| **`ScoreFilter`** | Apply filters with optional scoring | Chain multiple quality filters |
| **`Modify`** | Transform document content | Clean and normalize text |
| **Reader/Writer Stages** | Load and save text data | Input/output for pipelines |
| **Processing Stages** | Transform DocumentBatch tasks | Core processing components |

## Implementation Examples

### Complete Quality Filtering Pipeline

This is the most common starting workflow, used in 90% of production pipelines:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import (
    WordCountFilter,
    NonAlphaNumericFilter,
    RepeatedLinesFilter,
    PunctuationFilter,
    BoilerPlateStringFilter,
)

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Create processing pipeline
pipeline = Pipeline(name="quality_filtering")

# Load dataset - the starting point for all workflows
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)

# Standard quality filtering pipeline (most common)

# Remove too short/long documents (essential)
# and save the word_count field
word_count_filter = ScoreFilter(
    filter_obj=WordCountFilter(min_words=50, max_words=100000),
    text_field="text",
    score_field="word_count",
)
pipeline.add_stage(word_count_filter)

# Remove symbol-heavy content
alpha_numeric_filter = ScoreFilter(
    filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
    text_field="text",
)
pipeline.add_stage(alpha_numeric_filter)

# Remove repetitive content
repeated_lines_filter = ScoreFilter(
    filter_obj=RepeatedLinesFilter(max_repeated_line_fraction=0.7),
    text_field="text",
)
pipeline.add_stage(repeated_lines_filter)

# Ensure proper sentence structure
punctuation_filter = ScoreFilter(
    filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
    text_field="text",
)
pipeline.add_stage(punctuation_filter)

# Remove template/boilerplate text
boilerplate_filter = ScoreFilter(
    filter_obj=BoilerPlateStringFilter(),
    text_field="text",
)
pipeline.add_stage(boilerplate_filter)

# Add writer stage
writer = JsonlWriter(path="filtered_data/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

# Cleanup Ray when done
ray_client.stop()
```
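Because the first `ScoreFilter` above saves its score via `score_field="word_count"`, you can inspect the surviving documents from the written output. A minimal sketch, assuming `JsonlWriter` produced `.jsonl` shards under `filtered_data/` (the file extension and pattern are assumptions):

```python
import glob

import pandas as pd

# Hypothetical inspection step: load the filtered JSONL shards written above.
# The "*.jsonl" pattern assumes the writer uses that file extension.
files = glob.glob("filtered_data/*.jsonl")
frames = [pd.read_json(f, lines=True) for f in files]
df = pd.concat(frames, ignore_index=True)

# word_count was saved by the first ScoreFilter via score_field="word_count".
print(df["word_count"].describe())
```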
### Content Cleaning Pipeline

Basic text normalization:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import Modify
from nemo_curator.stages.text.modifiers import UnicodeReformatter

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Create cleaning pipeline
pipeline = Pipeline(name="content_cleaning")

# Read input data
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)

# Essential cleaning steps

# Normalize Unicode characters (very common)
unicode_modifier = Modify(
    modifier_fn=UnicodeReformatter(),
    input_fields="text",
)
pipeline.add_stage(unicode_modifier)

# Additional processing steps can be added as needed

# Write cleaned data
writer = JsonlWriter(path="cleaned_data/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

# Cleanup Ray when done
ray_client.stop()
```

### Exact Deduplication Workflow

Exact deduplication for any dataset size (requires Ray and at least 1 GPU):

```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Initialize Ray cluster with GPU support (required for exact deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()

# Configure exact deduplication workflow
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    assign_id=True,  # Automatically assign unique IDs
    input_filetype="parquet",
)

# Run exact deduplication workflow
exact_workflow.run()

# Cleanup Ray when done
ray_client.stop()
```

### Fuzzy Deduplication Workflow

Critical for production datasets (requires Ray and at least 1 GPU). The band and hash counts below target roughly an 80% similarity threshold (see the calculation at the end of this page):

```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Initialize Ray cluster with GPU support (required for fuzzy deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()

# Configure fuzzy deduplication workflow (production settings)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    input_filetype="parquet",
    input_blocksize="1.5GiB",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    # LSH parameters for ~80% similarity threshold
    num_bands=20,  # Number of LSH bands
    minhashes_per_band=13,  # Hashes per band
    char_ngrams=24,  # Character n-gram size
    seed=42,
)

# Run fuzzy deduplication workflow
fuzzy_workflow.run()

# Cleanup Ray when done
ray_client.stop()
```

### Removing Identified Duplicates

The identified duplicates can be removed using a separate workflow:

```python
from nemo_curator.core.client import RayClient
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Configure workflow with input dataset and output duplicate IDs
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated/output",
    input_filetype="parquet",  # Same as identification workflow
    input_blocksize="1.5GiB",  # Same as identification workflow
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",
)

# Run removal workflow
removal_workflow.run()

# Cleanup Ray when done
ray_client.stop()
```
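The `num_bands` and `minhashes_per_band` values in the fuzzy deduplication example above can be sanity-checked against the standard MinHash-LSH approximation for the Jaccard similarity threshold, t ≈ (1/b)^(1/r), where b is the number of bands and r is the number of hashes per band:

```python
# Sanity check: MinHash-LSH with b bands of r hashes each has an
# approximate Jaccard similarity threshold of t ≈ (1/b) ** (1/r).
num_bands, minhashes_per_band = 20, 13  # values from the fuzzy workflow above
threshold = (1 / num_bands) ** (1 / minhashes_per_band)
print(f"approximate similarity threshold: {threshold:.2f}")  # ~0.79, i.e. roughly 80%
```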