Text Curation Concepts
This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.
Core Concept Areas
Text curation in NeMo Curator focuses on these key areas:
Text Curation Pipeline
Comprehensive overview of the end-to-end text curation architecture and workflow
Data Loading
Core concepts for loading and managing text datasets from local files
Data Acquisition
Components for downloading and extracting data from remote sources
Data Processing
Concepts for filtering, deduplication, and classification
Infrastructure Components
The text curation concepts build on NVIDIA NeMo Curator’s core infrastructure components, which are shared across all modalities. These components include: