
Text Processing Concepts


This guide covers the most common text processing workflows in NVIDIA NeMo Curator, based on real-world usage patterns from production data curation pipelines.

Most Common Workflows

The majority of NeMo Curator users follow these core workflows, typically in this order:

1. Quality Filtering

Most users start with basic quality filtering using heuristic filters to remove low-quality content:

Essential Quality Filters:

  • WordCountFilter - Remove documents that are too short or too long
  • NonAlphaNumericFilter - Remove symbol-heavy content
  • RepeatedLinesFilter - Remove highly repetitive content
  • PunctuationFilter - Ensure proper sentence structure
  • BoilerPlateStringFilter - Remove content dominated by template/boilerplate text

2. Content Cleaning and Modification

Basic text normalization and cleaning operations:

Common Cleaning Steps:

  • UnicodeReformatter - Normalize Unicode characters
  • NewlineNormalizer - Standardize line breaks
  • Basic HTML/markup removal
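Curator ships dedicated modifiers for the first two steps, but the docs leave "basic HTML/markup removal" open-ended. As a minimal sketch of what that step can look like, the snippet below strips tags (including `<script>`/`<style>` contents) using only Python's standard-library `html.parser`; it is an illustration of the cleaning step, not a Curator API.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, dropping tags plus <script>/<style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_markup(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    # Collapse runs of whitespace left behind by removed tags
    return " ".join(" ".join(parser.parts).split())

print(strip_markup("<p>Hello <b>world</b></p><script>x=1</script>"))  # Hello world
```

In a Curator pipeline, logic like this would typically be wrapped in a custom modifier and applied through the same `Modify` stage shown in the cleaning example below.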

3. Deduplication

Remove duplicate and near-duplicate content. For comprehensive coverage of all deduplication approaches, refer to Curator’s Deduplication Concepts.

Exact Deduplication

Remove identical documents, especially useful for smaller datasets:

Implementation: MD5 or SHA-256 hashing for document identification
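The principle is simple enough to sketch in a few lines: hash each document's (lightly normalized) text and keep only the first document per hash. This toy version uses MD5 from Python's standard library to illustrate the idea; Curator's actual workflow is the GPU-backed `ExactDeduplicationWorkflow` shown later.

```python
import hashlib

def content_hash(text: str) -> str:
    # Normalize whitespace so trivially different copies hash identically
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def exact_dedup(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        h = content_hash(doc)
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique

docs = ["Hello world.", "Hello   world.", "Goodbye."]
print(exact_dedup(docs))  # → ['Hello world.', 'Goodbye.']
```

Because only fixed-size digests need to be compared, this scales to large corpora: the identification step reduces to grouping documents by hash.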

Fuzzy Deduplication

For production datasets, fuzzy deduplication is essential to remove near-duplicate content across sources:

Key Components:

  • Ray distributed computing framework for scalability
  • Connected components clustering for duplicate identification

Semantic Deduplication

Remove semantically similar content using embeddings for more sophisticated duplicate detection.
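Conceptually, each document is mapped to an embedding vector and documents whose embeddings are too similar are treated as duplicates. The sketch below shows that idea with toy 2-D vectors and a greedy cosine-similarity pass; it is a conceptual illustration only — Curator's semantic deduplication uses real embedding models and clustering at scale.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Greedy pass: keep a document only if no kept document is too similar."""
    kept: list[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine_sim(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

# Toy embeddings: docs 0 and 1 are nearly identical, doc 2 is distinct
embs = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(semantic_dedup(embs))  # → [0, 2]
```

The greedy all-pairs comparison is O(n²) and shown only for clarity; production systems cluster embeddings first so that only nearby candidates are compared.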

Core Processing Architecture

NeMo Curator uses these fundamental building blocks that users combine into pipelines:

| Component | Purpose | Usage Pattern |
| --- | --- | --- |
| Pipeline | Orchestrate processing stages | Add processing stages, typically starting with a read and completing with a write |
| ScoreFilter | Apply filters with optional scoring | Chain multiple quality filters |
| Modify | Transform document content | Clean and normalize text |
| Reader/Writer Stages | Load and save text data | Input/output for pipelines |
| Processing Stages | Transform DocumentBatch tasks | Core processing components |

Implementation Examples

Complete Quality Filtering Pipeline

This is the most common starting workflow for production pipelines:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import ScoreFilter
from nemo_curator.stages.text.filters import (
    WordCountFilter,
    NonAlphaNumericFilter,
    RepeatedLinesFilter,
    PunctuationFilter,
    BoilerPlateStringFilter,
)

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Create processing pipeline
pipeline = Pipeline(name="quality_filtering")

# Load dataset - the starting point for all workflows
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)

# Standard quality filtering pipeline (most common)
# Remove too short/long documents (essential)
# and save the word_count field
word_count_filter = ScoreFilter(
    filter_obj=WordCountFilter(min_words=50, max_words=100000),
    text_field="text",
    score_field="word_count",
)
pipeline.add_stage(word_count_filter)

# Remove symbol-heavy content
alpha_numeric_filter = ScoreFilter(
    filter_obj=NonAlphaNumericFilter(max_non_alpha_numeric_to_text_ratio=0.25),
    text_field="text",
)
pipeline.add_stage(alpha_numeric_filter)

# Remove repetitive content
repeated_lines_filter = ScoreFilter(
    filter_obj=RepeatedLinesFilter(max_repeated_line_fraction=0.7),
    text_field="text",
)
pipeline.add_stage(repeated_lines_filter)

# Ensure proper sentence structure
punctuation_filter = ScoreFilter(
    filter_obj=PunctuationFilter(max_num_sentences_without_endmark_ratio=0.85),
    text_field="text",
)
pipeline.add_stage(punctuation_filter)

# Remove template/boilerplate text
boilerplate_filter = ScoreFilter(
    filter_obj=BoilerPlateStringFilter(),
    text_field="text",
)
pipeline.add_stage(boilerplate_filter)

# Add writer stage
writer = JsonlWriter(path="filtered_data/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

# Cleanup Ray when done
ray_client.stop()

Content Cleaning Pipeline

Basic text normalization:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modules import Modify
from nemo_curator.stages.text.modifiers import UnicodeReformatter

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Create cleaning pipeline
pipeline = Pipeline(name="content_cleaning")

# Read input data
reader = JsonlReader(file_paths="input_data/")
pipeline.add_stage(reader)

# Essential cleaning steps
# Normalize unicode characters (very common)
unicode_modifier = Modify(
    modifier_fn=UnicodeReformatter(),
    input_fields="text",
)
pipeline.add_stage(unicode_modifier)

# Additional processing steps can be added as needed

# Write cleaned data
writer = JsonlWriter(path="cleaned_data/")
pipeline.add_stage(writer)

# Execute pipeline
results = pipeline.run()

# Cleanup Ray when done
ray_client.stop()

Exact Deduplication Workflow

Exact deduplication for any dataset size (requires Ray and at least 1 GPU):

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.exact.workflow import ExactDeduplicationWorkflow

# Initialize Ray cluster with GPU support (required for exact deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()

# Configure exact deduplication workflow
exact_workflow = ExactDeduplicationWorkflow(
    input_path="/path/to/input/data",
    output_path="/path/to/output",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    assign_id=True,  # Automatically assign unique IDs
    input_filetype="parquet",
)

# Run exact deduplication workflow
exact_workflow.run()

# Cleanup Ray when done
ray_client.stop()

Fuzzy Deduplication Workflow

Critical for production datasets (requires Ray and at least 1 GPU):

from nemo_curator.core.client import RayClient
from nemo_curator.stages.deduplication.fuzzy.workflow import FuzzyDeduplicationWorkflow

# Initialize Ray cluster with GPU support (required for fuzzy deduplication)
ray_client = RayClient(num_gpus=4)
ray_client.start()

# Configure fuzzy deduplication workflow (production settings)
fuzzy_workflow = FuzzyDeduplicationWorkflow(
    input_path="/path/to/input/data",
    cache_path="/path/to/cache",
    output_path="/path/to/output",
    input_filetype="parquet",
    input_blocksize="1.5GiB",
    text_field="text",
    perform_removal=False,  # Currently only identification supported
    # LSH parameters for ~80% similarity threshold
    num_bands=20,  # Number of LSH bands
    minhashes_per_band=13,  # Hashes per band
    char_ngrams=24,  # Character n-gram size
    seed=42,
)

# Run fuzzy deduplication workflow
fuzzy_workflow.run()

# Cleanup Ray when done
ray_client.stop()
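The band/row settings control the similarity level at which near-duplicates collide. A standard rule of thumb for MinHash LSH puts the effective Jaccard threshold at roughly t ≈ (1/b)^(1/r), where b is the number of bands and r the minhashes per band. A quick sanity check of the values used above:

```python
# Approximate MinHash LSH similarity threshold: t = (1/b) ** (1/r)
num_bands = 20           # b: number of LSH bands
minhashes_per_band = 13  # r: minhashes per band

threshold = (1 / num_bands) ** (1 / minhashes_per_band)
print(f"{threshold:.3f}")  # → 0.794, i.e. roughly the 80% target
```

Raising the number of bands lowers the threshold (more pairs flagged); raising the minhashes per band sharpens it upward, so these two knobs are tuned together.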

Removing Identified Duplicates

The identified duplicates can be removed using a separate workflow:

from nemo_curator.core.client import RayClient
from nemo_curator.stages.text.deduplication.removal_workflow import TextDuplicatesRemovalWorkflow

# Start Ray client
ray_client = RayClient()
ray_client.start()

# Configure workflow with input dataset and output duplicate IDs
removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path="/path/to/input/data",
    ids_to_remove_path="/path/to/output/FuzzyDuplicateIds",
    output_path="/path/to/deduplicated/output",
    input_filetype="parquet",  # Same as identification workflow
    input_blocksize="1.5GiB",  # Same as identification workflow
    ids_to_remove_duplicate_id_field="_curator_dedup_id",
    id_generator_path="/path/to/output/fuzzy_id_generator.json",
)

# Run removal workflow
removal_workflow.run()

# Cleanup Ray when done
ray_client.stop()