---
description: >-
  Comprehensive text curation capabilities for preparing high-quality data for
  large language model training with loading, filtering, and quality assessment
categories:
  - workflows
tags:
  - text-curation
  - data-loading
  - filtering
  - deduplication
  - gpu-accelerated
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: workflow
modality: text-only
---

# About Text Curation

NeMo Curator provides comprehensive text curation capabilities to prepare high-quality data for large language model (LLM) training. The toolkit includes a collection of processors for loading, filtering, formatting, and analyzing text data from various sources using a [pipeline-based architecture](/about/concepts/text/data/data-curation-pipeline).

## Use Cases

* Clean and prepare web-scraped data from sources like Common Crawl, Wikipedia, and arXiv
* Create custom text curation pipelines for specific domain needs
* Scale text processing across CPU and GPU clusters efficiently

## Architecture

The following diagram provides a high-level outline of NeMo Curator's text curation architecture.

```mermaid
flowchart LR
A["Data Sources
(Cloud, Local,
Common Crawl, arXiv,
Wikipedia)"] --> B["Data Acquisition
& Loading"] B --> C["Content Processing
& Cleaning"] C --> D["Quality Assessment
& Filtering"] D --> E["Deduplication
(Exact, Fuzzy,
Semantic)"] E --> F["Curated Dataset
(JSONL/Parquet)"]

G["Ray + RAPIDS
(GPU-accelerated)"] -.->|"Distributed Execution"| B
G -.->|"Distributed Execution"| C
G -.->|"GPU Acceleration"| D
G -.->|"GPU Acceleration"| E

classDef stage fill:#e3f2fd,stroke:#1976d2,stroke-width:2px,color:#000
classDef infra fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,color:#000
classDef output fill:#e8f5e9,stroke:#2e7d32,stroke-width:3px,color:#000

class A,B,C,D,E stage
class F output
class G infra
```

***

## Introduction

Master the fundamentals of NeMo Curator and set up your text processing environment.

* Learn about pipeline architecture and core processing stages for efficient text curation (`data-structures`, `distributed`, `architecture`)
* Learn prerequisites, setup instructions, and initial configuration for text curation (`setup`, `configuration`, `quickstart`)

## Curation Tasks

### Download Data

Download text data from remote sources and import existing datasets into NeMo Curator's processing pipeline.

* Read existing JSONL and Parquet datasets using Curator's reader stages (`jsonl`, `parquet`)
* Download and extract scientific papers from arXiv (`academic`, `pdf`, `latex`)
* Download and extract web archive data from Common Crawl (`web-data`, `warc`, `distributed`)
* Download and extract Wikipedia articles from Wikipedia dumps (`articles`, `multilingual`, `dumps`)
* Implement a download-and-extract pipeline for a custom data source (`jsonl`, `parquet`, `custom-formats`)

### Process Data

Transform and enhance your text data through comprehensive processing and curation steps.

* Handle multilingual content and language-specific processing (`language-detection`, `stopwords`, `multilingual`)
* Clean, normalize, and transform text content (`cleaning`, `normalization`, `formatting`)
* Remove duplicate and near-duplicate documents efficiently (`fuzzy-dedup`, `semantic-dedup`, `exact-dedup`)
* Score and remove low-quality content (`heuristics`, `classifiers`, `quality-scoring`)
* Apply domain-specific processing for code and advanced curation tasks (`code-processing`)
* Generate and augment training data using LLMs (`llm`, `augmentation`, `multilingual`, `nemotron-cc`)
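The reader stages above ingest JSONL and Parquet files. As a minimal plain-Python sketch — not Curator's actual reader API — of the JSONL layout those readers consume (one JSON object per line, with the document body under an assumed `"text"` field):

```python
import io
import json


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two documents in JSONL form; "id" and "text" are illustrative field names.
sample = io.StringIO('{"id": 1, "text": "hello"}\n{"id": 2, "text": "world"}\n')
docs = list(read_jsonl(sample))
```

In practice Curator's reader stages handle sharded files and distributed execution; this sketch only shows the on-disk record shape.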
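Exact deduplication, the simplest of the three dedup modes listed above, amounts to hashing each document's text and keeping only the first document seen for each hash. The single-process sketch below is illustrative only; Curator's dedup stages run distributed over Ray and RAPIDS:

```python
import hashlib


def exact_dedup(docs):
    """Keep the first document for each distinct content hash."""
    seen = set()
    unique = []
    for doc in docs:
        # Hash the raw text; identical strings collapse to one hash.
        digest = hashlib.md5(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [{"text": "to be"}, {"text": "not to be"}, {"text": "to be"}]
deduped = exact_dedup(docs)  # the repeated "to be" document is dropped
```

Fuzzy and semantic deduplication generalize this idea by replacing the exact hash with MinHash signatures and embedding similarity, respectively, so near-duplicates are caught as well.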
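Heuristic quality scoring can be as simple as thresholding document statistics. The filter below is a hypothetical example — its rules and thresholds are illustrative, not Curator's defaults — that drops documents that are too short or dominated by non-alphanumeric symbols:

```python
def passes_quality(text, min_words=5, max_symbol_ratio=0.3):
    """Return True if the document clears two simple heuristic checks."""
    words = text.split()
    # Check 1: require a minimum number of whitespace-separated words.
    if len(words) < min_words:
        return False
    # Check 2: reject symbol-heavy text (punctuation spam, markup debris).
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio


keep = passes_quality("a plain sentence with enough ordinary words")
drop = passes_quality("@@@ $$$ ### %%% &&&")
```

Classifier-based filtering replaces hand-written rules like these with a trained model score, at higher compute cost but better precision.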