---
description: Essential concepts for text data curation including loading and processing.
categories:
  - concepts-architecture
tags:
  - concepts
  - text-curation
  - data-processing
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: text-only
---

# Text Curation Concepts

This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. These concepts assume basic familiarity with data science and machine learning principles.

## Core Concept Areas

Text curation in NeMo Curator focuses on these key areas:

* Comprehensive overview of the end-to-end text curation architecture and workflow (tags: `overview`, `architecture`)
* Core concepts for loading and managing text datasets from local files (tags: `local-files`, `formats`)
* Components for downloading and extracting data from remote sources (tags: `remote-sources`, `download`)
* Concepts for filtering, deduplication, and classification (tags: `filtering`, `quality`)

## Infrastructure Components

The text curation concepts build on NVIDIA NeMo Curator's core infrastructure components, which are shared across all modalities. These components include:

* Optimize memory usage when processing large datasets (tags: `partitioning`, `batching`, `monitoring`)
* Leverage NVIDIA GPUs for faster data processing (tags: `cuda`, `rmm`, `performance`)
* Continue interrupted operations across large datasets (tags: `checkpoints`, `recovery`, `batching`)
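
The idea of continuing interrupted operations, listed under Infrastructure Components, can be illustrated with a minimal, library-agnostic sketch. It does not use NeMo Curator's API. The function names (`load_done`, `process_resumable`), the JSON checkpoint file, and the input file names are all hypothetical and chosen only for illustration. The sketch processes inputs in small batches and records finished items after each batch, so a rerun skips work that is already done.

```python
import json
import tempfile
from pathlib import Path


def load_done(checkpoint: Path) -> set:
    """Return the names of inputs already processed, or an empty set."""
    if checkpoint.exists():
        return set(json.loads(checkpoint.read_text()))
    return set()


def process_resumable(inputs, checkpoint: Path, process, batch_size: int = 2):
    """Process inputs in batches, skipping any recorded in the checkpoint.

    Hypothetical helper for illustration; not part of any library.
    Returns the list of inputs processed during this call.
    """
    done = load_done(checkpoint)
    pending = [name for name in inputs if name not in done]
    processed_now = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for name in batch:
            process(name)
            processed_now.append(name)
        # Persist progress after each batch so a crash loses at most one batch.
        done.update(batch)
        checkpoint.write_text(json.dumps(sorted(done)))
    return processed_now


# Usage: the second call simulates a rerun after an interruption.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "done.json"
    files = ["a.jsonl", "b.jsonl", "c.jsonl"]  # placeholder file names
    first = process_resumable(files[:2], ckpt, process=lambda name: None)
    second = process_resumable(files, ckpt, process=lambda name: None)
```

Writing the checkpoint once per batch rather than once per item trades a small amount of repeated work after a failure for fewer writes. A production system would typically also write the checkpoint atomically, which this sketch omits.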