Copyright © 2026, NVIDIA Corporation.

Content Processing & Cleaning

Use NeMo Curator’s tools and utilities to clean, normalize, and transform text content so that it meets the requirements of language model training.

Content processing involves transforming your text data while preserving essential information. This includes fixing encoding issues and standardizing text format to ensure high-quality input for model training.

How it Works

Content processing transformations typically modify documents in place or create new versions with specific changes. Most processing tools follow this pattern:

  1. Load your dataset using pipeline readers (JsonlReader, ParquetReader)
  2. Configure and apply the appropriate processor
  3. Save the transformed dataset for further processing

You can combine processing tools in sequence or use them alongside other curation steps like filtering and language management.
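
The load, process, and save pattern above, with several processors chained in sequence, can be sketched in plain standard-library Python. This is only an illustration of the data flow, not NeMo Curator's implementation: `read_jsonl`, `apply_modifiers`, and `to_jsonl` are hypothetical helper names, and the `text` field name is an assumption.

```python
import json
from typing import Callable, Iterable, Iterator

# A modifier is any function that maps a string to a cleaned string.
Modifier = Callable[[str], str]


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Step 1: parse one JSON document per non-empty line."""
    for line in lines:
        if line.strip():
            yield json.loads(line)


def apply_modifiers(
    records: Iterable[dict], modifiers: list[Modifier], field: str = "text"
) -> Iterator[dict]:
    """Step 2: run each modifier in order on the chosen field."""
    for record in records:
        text = record[field]
        for modify in modifiers:
            text = modify(text)
        # Build a new record rather than mutating the input one.
        yield {**record, field: text}


def to_jsonl(records: Iterable[dict]) -> str:
    """Step 3: serialize the records back to JSON Lines."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical in-memory input standing in for a JSONL file.
raw = ['{"id": 1, "text": "  Hello   World  "}', '{"id": 2, "text": "Second\\tdoc"}']
modifiers = [str.strip, lambda t: " ".join(t.split())]
cleaned = list(apply_modifiers(read_jsonl(raw), modifiers))
output = to_jsonl(cleaned)
```

Because each modifier is an independent function, the order of the list determines the order of the transformations, which mirrors adding stages to a pipeline one after another.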


Available Processing Tools

Document IDs

Add unique identifiers to documents for tracking and deduplication.

Text Cleaning

Fix Unicode issues, standardize spacing, and remove URLs.
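
To make the idea of document IDs concrete, here is a rough standard-library sketch of two common ID schemes: sequential IDs with a prefix, and IDs derived from a hash of the content. The function names, the `id` field, and the ID format are assumptions chosen for illustration; they do not describe how NeMo Curator generates IDs.

```python
import hashlib
from itertools import count
from typing import Iterable, Iterator


def add_sequential_ids(
    records: Iterable[dict], prefix: str = "doc", start: int = 0, id_field: str = "id"
) -> Iterator[dict]:
    """Attach a zero-padded, monotonically increasing ID to each record."""
    counter = count(start)
    for record in records:
        yield {**record, id_field: f"{prefix}-{next(counter):010d}"}


def content_hash_id(text: str) -> str:
    """Derive a deterministic ID from the text; identical texts share an ID."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


docs = list(add_sequential_ids([{"text": "first"}, {"text": "second"}]))
```

Sequential IDs are convenient for tracking a document through later steps, while content-derived IDs make exact duplicates easy to spot because identical text always maps to the same value.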

Usage

Here’s an example of a typical content processing pipeline:

from nemo_curator.core.client import RayClient
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.text.io.reader import JsonlReader
from nemo_curator.stages.text.io.writer import JsonlWriter
from nemo_curator.stages.text.modifiers import Modify
from nemo_curator.stages.text.modifiers.string import UrlRemover, NewlineNormalizer
from nemo_curator.stages.text.modifiers.unicode import UnicodeReformatter

# Initialize Ray client
ray_client = RayClient()
ray_client.start()

# Create a comprehensive cleaning pipeline
processing_pipeline = Pipeline(
    name="content_processing_pipeline",
    description="Comprehensive text cleaning and processing"
)

# Load dataset
reader = JsonlReader(file_paths="input_data/")
processing_pipeline.add_stage(reader)

# Fix Unicode encoding issues
processing_pipeline.add_stage(
    Modify(modifier_fn=UnicodeReformatter(), input_fields="text")
)

# Standardize newlines
processing_pipeline.add_stage(
    Modify(modifier_fn=NewlineNormalizer(), input_fields="text")
)

# Remove URLs
processing_pipeline.add_stage(
    Modify(modifier_fn=UrlRemover(), input_fields="text")
)

# Save the processed dataset
writer = JsonlWriter(path="processed_output/")
processing_pipeline.add_stage(writer)

# Execute pipeline
results = processing_pipeline.run()

# Stop Ray client
ray_client.stop()

Common Processing Tasks

Text Normalization

  • Fix broken Unicode characters (mojibake)
  • Standardize whitespace and newlines
  • Remove or normalize special characters
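
As a rough illustration of what these normalization tasks involve, the sketch below uses only the Python standard library. It assumes the mojibake came from UTF-8 bytes that were mistakenly decoded as Windows-1252, which is a common case but not the only one. The function names are hypothetical, and NeMo Curator's own modifiers may behave differently.

```python
import re
import unicodedata


def fix_mojibake(text: str) -> str:
    """Undo one round of UTF-8 text mis-decoded as Windows-1252.

    If the round trip fails, the text is assumed to be fine and returned as is.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def normalize_whitespace(text: str) -> str:
    """Unify line endings, collapse spaces and tabs, and cap blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs become one space
    text = re.sub(r" ?\n ?", "\n", text)     # drop spaces hugging a newline
    text = re.sub(r"\n{3,}", "\n\n", text)   # at most one blank line in a row
    return text.strip()


def normalize_text(text: str) -> str:
    """Repair mojibake, apply Unicode NFC, then standardize whitespace."""
    repaired = fix_mojibake(text)
    composed = unicodedata.normalize("NFC", repaired)
    return normalize_whitespace(composed)
```

The round-trip heuristic can misfire on clean text that happens to survive the re-encoding, so a production pipeline would normally rely on a dedicated repair library rather than this shortcut.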

Content Sanitization

  • Strip unwanted URLs or links
  • Remove boilerplate text or headers
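
The two sanitization tasks above can be approximated with a regular expression and a simple line filter. This is a minimal sketch: the URL pattern is deliberately loose, the marker list is a hypothetical example, and neither reflects the exact rules used by NeMo Curator's `UrlRemover`.

```python
import re

# Loose pattern: anything starting with http(s):// or www. up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_urls(text: str) -> str:
    """Delete every URL-like token, leaving the surrounding text untouched."""
    return URL_PATTERN.sub("", text)


def drop_boilerplate_lines(text: str, markers: list[str]) -> str:
    """Remove lines that contain any of the given lowercase marker phrases."""
    kept = [
        line
        for line in text.splitlines()
        if not any(marker in line.lower() for marker in markers)
    ]
    return "\n".join(kept)


page = "Privacy Policy | Terms of Service\nRead more at https://example.com/post today."
sanitized = strip_urls(drop_boilerplate_lines(page, ["privacy policy"]))
```

Removing a URL leaves the whitespace on both sides in place, so a whitespace normalization step is usually run afterwards.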

Format Standardization

  • Ensure consistent text encoding
  • Normalize punctuation and spacing
  • Standardize document structure