
Text Processing Concepts


This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.

Most Common Workflows

The majority of NeMo Curator users follow these core workflows, typically in this order:

1. Quality Filtering

Most users start with basic quality filtering using heuristic filters to remove low-quality content:

Essential Quality Filters:

  • WordCountFilter - Remove documents that are too short or too long
  • NonAlphaNumericFilter - Remove symbol-heavy content
  • RepeatedLinesFilter - Remove highly repetitive content
  • PunctuationFilter - Ensure proper sentence structure
  • BoilerPlateStringFilter - Remove content dominated by template/boilerplate text

2. Content Cleaning and Modification

Basic text normalization and cleaning operations:

Common Cleaning Steps:

  • UnicodeReformatter - Normalize Unicode characters
  • NewlineNormalizer - Standardize line breaks
  • Basic HTML/markup removal
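Curator ships dedicated modifiers for the first two steps, but the docs leave "basic HTML/markup removal" open-ended. As a minimal sketch of what that step can look like, the snippet below strips tags (including `<script>`/`<style>` contents) using only Python's standard-library `html.parser`; it is an illustration of the cleaning step, not a Curator API.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, dropping tags plus <script>/<style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_markup(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace left behind by removed tags
    return " ".join(" ".join(parser.parts).split())

print(strip_markup("<p>Hello <b>world</b></p><script>x=1</script>"))  # Hello world
```

In a Curator pipeline, logic like this would typically be wrapped in a custom modifier and applied through the same `Modify` stage shown in the cleaning example below.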

3. Deduplication

Remove duplicate and near-duplicate content. For comprehensive coverage of all deduplication approaches, refer to Curator’s Deduplication Concepts.

Exact Deduplication

Remove identical documents, especially useful for smaller datasets:

Implementation: MD5 or SHA-256 hashing for document identification
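The principle is simple enough to sketch in a few lines: hash each document's (lightly normalized) text and keep only the first document per hash. This toy version uses MD5 from Python's standard library to illustrate the idea; Curator's actual workflow is the GPU-backed `ExactDeduplicationWorkflow` shown later.

```python
import hashlib

def content_hash(text: str) -> str:
    # Normalize whitespace so trivially different copies hash identically
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def exact_dedup(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        h = content_hash(doc)
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = ["Hello world.", "Hello   world.", "Goodbye."]
print(exact_dedup(docs))  # → ['Hello world.', 'Goodbye.']
```

Because only fixed-size digests need to be compared, this scales to large corpora: the identification step reduces to grouping documents by hash.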

Fuzzy Deduplication

For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources:

Key Components:

  • Ray distributed computing framework for scalability
  • Connected components clustering for duplicate identification

Semantic Deduplication

Remove semantically similar content using embeddings for more sophisticated duplicate detection.
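Conceptually, each document is mapped to an embedding vector and documents whose embeddings are too similar are treated as duplicates. The sketch below shows that idea with toy 2-D vectors and a greedy cosine-similarity pass; it is a conceptual illustration only — Curator's semantic deduplication uses real embedding models and clustering at scale.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy pass: keep a document only if no kept document is too similar."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_sim(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: docs 0 and 1 are nearly identical, doc 2 is distinct
embs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(semantic_dedup(embs))  # → [0, 2]
```

The greedy all-pairs comparison is O(n²) and shown only for clarity; production systems cluster embeddings first so that only nearby candidates are compared.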

Core Processing Architecture

NeMo Curator uses these fundamental building blocks that users combine into pipelines:

| Component | Purpose | Usage Pattern |
| --- | --- | --- |
| Pipeline | Orchestrate processing stages | Add processing stages, typically starting with a read and completing with a write |
| ScoreFilter | Apply filters with optional scoring | Chain multiple quality filters |
| Modify | Transform document content | Clean and normalize text |
| Reader/Writer Stages | Load and save text data | Input/output for pipelines |
| Processing Stages | Transform DocumentBatch tasks | Core processing components |

Implementation Examples

Complete Quality Filtering Pipeline

This is the most common starting workflow for production pipelines:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import (
    WordCountFilter,
    NonAlphaNumericFilter,
    RepeatedLinesFilter,
    PunctuationFilter,
    BoilerPlateStringFilter,
)

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Create processing pipeline
pipeline = Pipeline(name="quality_filtering")

# Load dataset - the starting point for all workflows
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)

# Standard quality filtering pipeline (most common)
# Remove too short/long documents (essential)
# and save the word_count field
word_count_filter = ScoreFilter(
    filter_obj=WordCountFilter(min_words=50, max_words=100000),
    text_field="text",
    score_field="word_count",
)
pipeline.add_stage(word_count_filter)

# Remove symbol-heavy content
alpha_numeric_filter = ScoreFilter(
    filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
    text_field="text",
)
pipeline.add_stage(alpha_numeric_filter)

# Remove repetitive content
repeated_lines_filter = ScoreFilter(
    filter_obj=RepeatedLinesFilter(max_repeated_line_fraction=0.7),
    text_field="text",
)
pipeline.add_stage(repeated_lines_filter)

# Ensure proper sentence structure
punctuation_filter = ScoreFilter(
    filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
    text_field="text",
)
pipeline.add_stage(punctuation_filter)

# Remove template/boilerplate text
boilerplate_filter = ScoreFilter(
    filter_obj=BoilerPlateStringFilter(),
    text_field="text",
)
pipeline.add_stage(boilerplate_filter)

# Add writer stage
writer = JsonlWriter(path="filtered_data/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

# Cleanup Ray when done
ray_client.stop()

Content Cleaning Pipeline

Basic text normalization:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import Modify
from nemo_curator.stages.text.modifiers import UnicodeReformatter

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Create cleaning pipeline
pipeline = Pipeline(name="content_cleaning")

# Read input data
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)

# Essential cleaning steps
# Normalize unicode characters (very common)
unicode_modifier = Modify(
    modifier_fn=UnicodeReformatter(),
    input_fields="text",
)
pipeline.add_stage(unicode_modifier)

# Additional processing steps can be added as needed

# Write cleaned data
writer = JsonlWriter(path="cleaned_data/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

# Cleanup Ray when done
ray_client.stop()

Exact Deduplication Workflow

Exact deduplication for any dataset size (requires Ray and at least 1 GPU):

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Initialize Ray cluster with GPU support (required for exact deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()

# Configure exact deduplication workflow
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    assign_id=True,  # Automatically assign unique IDs
    input_filetype="parquet",
)

# Run exact deduplication workflow
exact_workflow.run()

# Cleanup Ray when done
ray_client.stop()

Fuzzy Deduplication Workflow

Critical for production datasets (requires Ray and at least 1 GPU):

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Initialize Ray cluster with GPU support (required for fuzzy deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()

# Configure fuzzy deduplication workflow (production settings)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    input_filetype="parquet",
    input_blocksize="1.5GiB",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    # LSH parameters for ~80% similarity threshold
    num_bands=20,  # Number of LSH bands
    minhashes_per_band=13,  # Hashes per band
    char_ngrams=24,  # Character n-gram size
    seed=42,
)

# Run fuzzy deduplication workflow
fuzzy_workflow.run()

# Cleanup Ray when done
ray_client.stop()
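The band/row settings control the similarity level at which near-duplicates collide. A standard rule of thumb for MinHash LSH puts the effective Jaccard threshold at roughly t ≈ (1/b)^(1/r), where b is the number of bands and r the minhashes per band. A quick sanity check of the values used above:

```python
# Approximate MinHash LSH similarity threshold: t = (1/b) ** (1/r)
num_bands = 20           # b: number of LSH bands
minhashes_per_band = 13  # r: minhashes per band

threshold = (1 / num_bands) ** (1 / minhashes_per_band)
print(f"{threshold:.3f}")  # → 0.794, i.e. roughly the 80% target
```

Raising the number of bands lowers the threshold (more pairs flagged); raising the minhashes per band sharpens it upward, so these two knobs are tuned together.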

Removing Identified Duplicates

The identified duplicates can be removed using a separate workflow:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Configure workflow with input dataset and output duplicate IDs
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated/output",
    input_filetype="parquet",  # Same as identification workflow
    input_blocksize="1.5GiB",  # Same as identification workflow
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",
)

# Run removal workflow
removal_workflow.run()

# Cleanup Ray when done
ray_client.stop()