NeMo Curator provides text curation capabilities for preparing high-quality data for large language model (LLM) training. The toolkit includes processors for loading, filtering, formatting, and analyzing text data from various sources, composed in a pipeline-based architecture.
The following diagram provides a high-level outline of NeMo Curator’s text curation architecture.
Master the fundamentals of NeMo Curator and set up your text processing environment.
Learn about pipeline architecture and core processing stages for efficient text curation
Learn prerequisites, setup instructions, and initial configuration for text curation
Download text data from remote sources and import existing datasets into NeMo Curator’s processing pipeline.
Read existing JSONL and Parquet datasets using Curator’s reader stages
Download and extract scientific papers from arXiv
Download and extract web archive data from Common Crawl
Download and extract Wikipedia articles from Wikipedia dumps
Implement a download and extract pipeline for a custom data source
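To make the JSONL loading step concrete, the sketch below reads JSONL records using only the Python standard library. It is not Curator's reader stage API; the helper name `read_jsonl` and the `id` and `text` field names are assumptions chosen for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def read_jsonl(handle: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# In-memory stand-in for a JSONL file (hypothetical sample data).
sample = io.StringIO(
    '{"id": 1, "text": "Hello world"}\n'
    "\n"
    '{"id": 2, "text": "Second doc"}\n'
)
records = list(read_jsonl(sample))
```

The same function accepts a real file opened with `open(path, encoding="utf-8")`, since it only relies on iterating over lines.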
Transform and enhance your text data through processing and curation steps such as language handling, cleaning, deduplication, and quality filtering.
Handle multilingual content and language-specific processing
Clean, normalize, and transform text content
Remove duplicate and near-duplicate documents efficiently
Score and remove low-quality content
Domain-specific processing for code and advanced curation tasks
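The processing steps above can be pictured as stages chained in a pipeline. The following is a minimal plain-Python sketch of that idea, assuming documents are dictionaries with a `text` field. It does not use NeMo Curator's classes; the names `normalize`, `min_words`, `exact_dedup`, and `run_pipeline` are hypothetical, and the word-count threshold is an arbitrary stand-in for a real quality filter.

```python
import hashlib
import unicodedata
from typing import Callable, Iterable, Iterator

Doc = dict
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def normalize(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    for doc in docs:
        text = unicodedata.normalize("NFC", doc["text"])
        yield {**doc, "text": " ".join(text.split())}


def min_words(threshold: int) -> Stage:
    """Build a stage that keeps documents with at least `threshold` words."""
    def stage(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"].split()) >= threshold:
                yield doc
    return stage


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Keep the first document for each distinct text, compared by SHA-256."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: list) -> list:
    """Feed the output of each stage into the next, then collect the result."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


# Hypothetical sample documents.
docs = [
    {"id": 1, "text": "The  quick brown fox jumps."},
    {"id": 2, "text": "The quick brown fox jumps."},
    {"id": 3, "text": "Too short"},
]
curated = run_pipeline(docs, [normalize, min_words(3), exact_dedup])
```

Running normalization before deduplication lets documents that differ only in whitespace hash to the same value. This sketch catches exact duplicates only; near-duplicate detection needs a similarity-based method rather than a plain hash comparison.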