---
description: >-
  Text processing workflows including quality filtering, fuzzy deduplication,
  content cleaning, and pipeline design
categories:
  - concepts-architecture
tags:
  - data-processing
  - quality-filtering
  - deduplication
  - pipeline
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: text-only
---

# Text Processing Concepts

This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.

## Most Common Workflows

The majority of NeMo Curator users follow these core workflows, typically in this order:

### 1. Quality Filtering

Most users start with basic quality filtering using heuristic filters to remove low-quality content:

**Essential Quality Filters:**

* `WordCountFilter` - Remove documents that are too short or too long
* `NonAlphaNumericFilter` - Remove symbol-heavy content
* `RepeatedLinesFilter` - Remove documents with too many repeated lines
* `PunctuationFilter` - Remove documents in which too many sentences lack end punctuation
* `BoilerPlateStringFilter` - Remove documents that contain too much template or boilerplate text
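
To make these heuristics concrete, the sketch below re-implements two of them in plain Python. It is an illustration only, not NeMo Curator's implementation: the function names `word_count_ok` and `symbol_ratio_ok` are hypothetical, whitespace tokenization and ignoring whitespace in the symbol ratio are assumptions, and the thresholds are borrowed from the pipeline example later in this guide.

```python
def word_count_ok(text: str, min_words: int = 50, max_words: int = 100_000) -> bool:
    """Keep documents whose whitespace-delimited word count is within bounds."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words


def symbol_ratio_ok(text: str, max_ratio: float = 0.25) -> bool:
    """Keep documents where symbols make up at most max_ratio of all characters.

    Assumption: whitespace is not counted as a symbol.
    """
    if not text:
        return False
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text) <= max_ratio


# Made-up documents: too short, acceptable, and symbol-heavy.
docs = ["short text", "word " * 60, "#$%^&* " * 60]
kept = [d for d in docs if word_count_ok(d) and symbol_ratio_ok(d)]
print(len(kept))
```

Only the second document passes both checks. Chaining cheap checks like these first keeps later, more expensive stages from processing documents that would be discarded anyway.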

### 2. Content Cleaning and Modification

Basic text normalization and cleaning operations:

**Common Cleaning Steps:**

* `UnicodeReformatter` - Normalize Unicode characters
* `NewlineNormalizer` - Standardize line breaks
* Basic HTML/markup removal
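
As a rough picture of what these cleaning steps do, the following sketch uses only the Python standard library. It is a stand-in under stated assumptions: NFKC normalization via `unicodedata` and collapsing runs of three or more newlines are choices made for this example, and the library's `UnicodeReformatter` and `NewlineNormalizer` may apply different or additional repairs. The function names are hypothetical.

```python
import re
import unicodedata


def normalize_unicode(text: str) -> str:
    """Apply NFKC normalization so compatibility characters map to canonical forms."""
    return unicodedata.normalize("NFKC", text)


def normalize_newlines(text: str, max_consecutive: int = 2) -> str:
    """Convert CRLF/CR line endings to LF and collapse long runs of newlines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    pattern = r"\n{%d,}" % (max_consecutive + 1)
    return re.sub(pattern, "\n" * max_consecutive, text)


# Input with a ligature (U+FB01), a no-break space (U+00A0), and Windows line endings.
raw = "\ufb01ne\u00a0text\r\n\r\n\r\n\r\nnext line"
cleaned = normalize_newlines(normalize_unicode(raw))
print(repr(cleaned))
```

Running cleaning before deduplication is a common design choice, because documents that differ only in encoding or line endings then hash to the same value.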

### 3. Deduplication

Remove duplicate and near-duplicate content. For comprehensive coverage of all deduplication approaches, refer to Curator's [Deduplication Concepts](/about/concepts/deduplication).

#### Exact Deduplication

Remove identical documents, especially useful for smaller datasets:

**Implementation:** MD5 or SHA-256 hashing for document identification
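
The hashing idea can be sketched in a few lines of standard-library Python. This is a single-machine illustration, not the distributed workflow shown later in this guide; the function name `find_exact_duplicates` and the sample documents are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Group document IDs whose text has an identical MD5 digest."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    # Only groups with more than one member contain duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]


docs = {"a": "hello world", "b": "hello world", "c": "Hello world"}
duplicate_groups = find_exact_duplicates(docs)
print(duplicate_groups)
```

Documents `a` and `b` are grouped together, while `c` differs by one character and is treated as unique. That sensitivity to any byte-level difference is why near-duplicates need the fuzzy approach described next.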

#### Fuzzy Deduplication

For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources:

**Key Components:**

* Ray distributed computing framework for scalability
* MinHash signatures with locality-sensitive hashing (LSH) to find candidate near-duplicate pairs
* Connected components clustering for duplicate identification
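
Connected components clustering turns pairwise near-duplicate matches into duplicate groups: if document A matches B and B matches C, all three land in one cluster even when A and C were never compared directly. The sketch below shows the idea with a small union-find structure. It assumes the candidate pairs were already produced by an upstream similarity step; the pairs and the function name are made up for illustration, and the library's distributed implementation will differ.

```python
def connected_components(pairs):
    """Cluster items linked by duplicate pairs using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical near-duplicate pairs from an upstream matching step.
pairs = [("d1", "d2"), ("d2", "d3"), ("d4", "d5")]
clusters = connected_components(pairs)
print(clusters)
```

Once clusters are known, one document per cluster is typically kept and the rest are marked for removal.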

#### Semantic Deduplication

Remove semantically similar content using embeddings for more sophisticated duplicate detection.
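
A minimal sketch of the idea, assuming each document already has an embedding vector: documents whose cosine similarity to an earlier document exceeds a threshold are flagged. The three-dimensional vectors, the `0.95` threshold, and the function names are hypothetical, and the brute-force all-pairs comparison is only practical for tiny inputs; large-scale implementations usually cluster embeddings first and compare within clusters.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_duplicates(embeddings, threshold=0.95):
    """Return IDs of documents that are too similar to an earlier, kept document."""
    ids = list(embeddings)
    to_remove = set()
    for i, keep_id in enumerate(ids):
        if keep_id in to_remove:
            continue
        for other_id in ids[i + 1:]:
            similarity = cosine_similarity(embeddings[keep_id], embeddings[other_id])
            if similarity >= threshold:
                to_remove.add(other_id)
    return to_remove


# Toy embeddings: doc2 points in almost the same direction as doc1.
embeddings = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.99, 0.05, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}
flagged = semantic_duplicates(embeddings)
print(flagged)
```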

## Core Processing Architecture

NeMo Curator uses these fundamental building blocks that users combine into pipelines:

| Component                | Purpose                             | Usage Pattern                                                                     |
| ------------------------ | ----------------------------------- | --------------------------------------------------------------------------------- |
| **`Pipeline`**           | Orchestrate processing stages       | Add processing stages, typically starting with a read and completing with a write |
| **`ScoreFilter`**        | Apply filters with optional scoring | Chain multiple quality filters                                                    |
| **`Modify`**             | Transform document content          | Clean and normalize text                                                          |
| **Reader/Writer Stages** | Load and save text data             | Input/output for pipelines                                                        |
| **Processing Stages**    | Transform DocumentBatch tasks       | Core processing components                                                        |
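
The composition pattern in the table can be pictured with a toy model in plain Python. Everything here is a hypothetical stand-in written for this guide (`ToyPipeline`, `score_filter`, `modify`, and the list-of-dicts batch format); it mirrors the shape of `Pipeline`, `ScoreFilter`, and `Modify` but is not their implementation and does not use Ray.

```python
from typing import Callable

Batch = list[dict]
Stage = Callable[[Batch], Batch]


def score_filter(score_fn, keep_fn, score_field=None) -> Stage:
    """Build a stage that scores each document and keeps those passing keep_fn."""
    def stage(batch: Batch) -> Batch:
        kept = []
        for doc in batch:
            score = score_fn(doc["text"])
            if keep_fn(score):
                # Optionally record the score alongside the document.
                kept.append({**doc, score_field: score} if score_field else doc)
        return kept
    return stage


def modify(modifier_fn) -> Stage:
    """Build a stage that rewrites the text of every document."""
    def stage(batch: Batch) -> Batch:
        return [{**doc, "text": modifier_fn(doc["text"])} for doc in batch]
    return stage


class ToyPipeline:
    """Run stages in the order they were added."""

    def __init__(self, name: str):
        self.name = name
        self.stages = []

    def add_stage(self, stage: Stage) -> None:
        self.stages.append(stage)

    def run(self, batch: Batch) -> Batch:
        for stage in self.stages:
            batch = stage(batch)
        return batch


pipeline = ToyPipeline(name="toy")
pipeline.add_stage(modify(str.strip))
pipeline.add_stage(
    score_filter(lambda t: len(t.split()), lambda n: n >= 3, score_field="word_count")
)
result = pipeline.run([{"text": "  one two three  "}, {"text": "too short"}])
print(result)
```

Because every stage takes a batch and returns a batch, stages can be reordered or swapped without changing the rest of the pipeline.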

## Implementation Examples

### Complete Quality Filtering Pipeline

This is the most common starting workflow, used in 90% of production pipelines:

<Accordion title="Quality Filtering Pipeline Code Example">
  ```python
  from nemo_curator.core.client import RayClient
  from nemo_curator.pipeline import Pipeline
  from nemo_curator.stages.text.io.reader import JsonlReader
  from nemo_curator.stages.text.io.writer import JsonlWriter
  from nemo_curator.stages.text.modules import ScoreFilter
  from nemo_curator.stages.text.filters import (
      WordCountFilter,
      NonAlphaNumericFilter,
      RepeatedLinesFilter,
      PunctuationFilter,
      BoilerPlateStringFilter
  )

  # Start Ray client
  ray_client = RayClient()
  ray_client.start()

  # Create processing pipeline
  pipeline = Pipeline(name="quality_filtering")

  # Load dataset - the starting point for all workflows
  reader = JsonlReader(file_paths="input_data/")
  pipeline.add_stage(reader)

  # Standard quality filtering stages
  # Remove documents that are too short or too long,
  # and store the computed score in the word_count field
  word_count_filter = ScoreFilter(
      filter_obj=WordCountFilter(min_words=50, max_words=100000),
      text_field="text",
      score_field="word_count"
  )
  pipeline.add_stage(word_count_filter)

  # Remove symbol-heavy content
  alpha_numeric_filter = ScoreFilter(
      filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
      text_field="text"
  )
  pipeline.add_stage(alpha_numeric_filter)

  # Remove repetitive content
  repeated_lines_filter = ScoreFilter(
      filter_obj=RepeatedLinesFilter(max_repeated_line_fraction=0.7),
      text_field="text"
  )
  pipeline.add_stage(repeated_lines_filter)

  # Ensure proper sentence structure
  punctuation_filter = ScoreFilter(
      filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
      text_field="text"
  )
  pipeline.add_stage(punctuation_filter)

  # Remove template/boilerplate text
  boilerplate_filter = ScoreFilter(
      filter_obj=BoilerPlateStringFilter(),
      text_field="text"
  )
  pipeline.add_stage(boilerplate_filter)

  # Add writer stage
  writer = JsonlWriter(path="filtered_data/")
  pipeline.add_stage(writer)

  # Execute pipeline
  results = pipeline.run()

  # Cleanup Ray when done
  ray_client.stop()
  ```
</Accordion>

### Content Cleaning Pipeline

Basic text normalization:

<Accordion title="Content Cleaning Pipeline Code Example">
  ```python
  from nemo_curator.core.client import RayClient
  from nemo_curator.pipeline import Pipeline
  from nemo_curator.stages.text.io.reader import JsonlReader
  from nemo_curator.stages.text.io.writer import JsonlWriter
  from nemo_curator.stages.text.modules import Modify
  from nemo_curator.stages.text.modifiers import UnicodeReformatter

  # Start Ray client
  ray_client = RayClient()
  ray_client.start()

  # Create cleaning pipeline
  pipeline = Pipeline(name="content_cleaning")

  # Read input data
  reader = JsonlReader(file_paths="input_data/")
  pipeline.add_stage(reader)

  # Essential cleaning steps
  # Normalize unicode characters (very common)
  unicode_modifier = Modify(
      modifier_fn=UnicodeReformatter(),
      input_fields="text"
  )
  pipeline.add_stage(unicode_modifier)

  # Additional processing steps can be added as needed

  # Write cleaned data
  writer = JsonlWriter(path="cleaned_data/")
  pipeline.add_stage(writer)

  # Execute pipeline
  results = pipeline.run()

  # Cleanup Ray when done
  ray_client.stop()
  ```
</Accordion>

### Exact Deduplication Workflow

Exact deduplication works for any dataset size. It requires Ray and at least one GPU:

<Accordion title="Exact Deduplication Code Example">
  ```python
  from nemo_curator.core.client import RayClient
  from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

  # Initialize Ray cluster with GPU support (required for exact deduplication)
  ray_client = RayClient(num_gpus=4)
  ray_client.start()

  # Configure exact deduplication workflow
  exact_workflow = ExactDeduplicationWorkflow(
      input_path="/path/to/input/data",
      output_path="/path/to/output",
      text_field="text",
      perform_removal=False,  # Currently only identification supported
      assign_id=True,         # Automatically assign unique IDs
      input_filetype="parquet",
  )

  # Run exact deduplication workflow
  exact_workflow.run()

  # Cleanup Ray when done
  ray_client.stop()
  ```
</Accordion>

### Fuzzy Deduplication Workflow

Fuzzy deduplication is critical for production datasets. It requires Ray and at least one GPU:

<Accordion title="Fuzzy Deduplication Code Example">
  ```python
  from nemo_curator.core.client import RayClient
  from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

  # Initialize Ray cluster with GPU support (required for fuzzy deduplication)
  ray_client = RayClient(num_gpus=4)
  ray_client.start()

  # Configure fuzzy deduplication workflow (production settings)
  fuzzy_workflow = FuzzyDeduplicationWorkflow(
      input_path="/path/to/input/data",
      cache_path="/path/to/cache",
      output_path="/path/to/output",
      input_filetype="parquet",
      input_blocksize="1.5GiB",
      text_field="text",
      perform_removal=False,  # Currently only identification supported
      # LSH parameters for ~80% similarity threshold
      num_bands=20,           # Number of LSH bands
      minhashes_per_band=13,  # Hashes per band
      char_ngrams=24,         # Character n-gram size
      seed=42
  )

  # Run fuzzy deduplication workflow
  fuzzy_workflow.run()

  # Cleanup Ray when done
  ray_client.stop()
  ```
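
The "~80% similarity threshold" comment in the example follows from the standard LSH banding approximation: with `b` bands of `r` hashes each, two documents with Jaccard similarity `s` share at least one band with probability `1 - (1 - s^r)^b`, and the curve rises most steeply near `(1/b)^(1/r)`. The sketch below evaluates both formulas for the settings above. The function names are hypothetical helpers, not part of the workflow API.

```python
def lsh_threshold(num_bands: int, minhashes_per_band: int) -> float:
    """Approximate similarity at which documents start to become candidates."""
    return (1.0 / num_bands) ** (1.0 / minhashes_per_band)


def candidate_probability(similarity: float, num_bands: int, minhashes_per_band: int) -> float:
    """Probability that two documents share at least one identical band."""
    return 1.0 - (1.0 - similarity ** minhashes_per_band) ** num_bands


threshold = lsh_threshold(num_bands=20, minhashes_per_band=13)
print(f"approximate threshold: {threshold:.3f}")  # roughly 0.79

# Probability of being flagged as a candidate pair at several similarity levels.
for s in (0.5, 0.8, 0.9):
    print(s, round(candidate_probability(s, 20, 13), 4))
```

Increasing `num_bands` lowers the threshold and catches looser matches, while increasing `minhashes_per_band` raises it and makes matching stricter.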

</Accordion>

### Removing Identified Duplicates

The identified duplicates can be removed using a separate workflow:

<Accordion title="Duplicate Removal Code Example">
  ```python
  from nemo_curator.core.client import RayClient
  from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

  # Start Ray client
  ray_client = RayClient()
  ray_client.start()

  # Configure the removal workflow with the input dataset and the
  # duplicate IDs written by the identification workflow
  removal_workflow = TextDuplicatesRemovalWorkflow(
      input_path="/path/to/input/data",
      ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
      output_path="/path/to/deduplicated/output",
      input_filetype="parquet",  # Same as identification workflow
      input_blocksize="1.5GiB",  # Same as identification workflow
      ids_to_remove_duplicate_id_field="_curator_dedup_id",
      id_generator_path="/path/to/output/fuzzy_id_generator.json",
  )

  # Run removal workflow
  removal_workflow.run()

  # Cleanup Ray when done
  ray_client.stop()
  ```
</Accordion>
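
Conceptually, removal is an anti-join: keep every record whose ID does not appear in the list of duplicate IDs. The sketch below shows that logic on in-memory records. It is only an illustration of the idea, with a hypothetical function name and made-up records; the field name `_curator_dedup_id` is taken from the example above.

```python
def remove_duplicates(records, ids_to_remove, id_field="_curator_dedup_id"):
    """Return only the records whose ID is not in ids_to_remove."""
    ids = set(ids_to_remove)  # set lookup keeps the filter linear in the record count
    return [record for record in records if record[id_field] not in ids]


# Made-up records and duplicate IDs.
records = [
    {"_curator_dedup_id": 0, "text": "original"},
    {"_curator_dedup_id": 1, "text": "near copy"},
    {"_curator_dedup_id": 2, "text": "unrelated"},
]
deduplicated = remove_duplicates(records, ids_to_remove=[1])
print([r["text"] for r in deduplicated])
```

Keeping identification and removal as separate steps means the list of duplicate IDs can be inspected, or reused against the same dataset, before any data is dropped.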
