Text Curation Concepts
This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.
Core Concept Areas
Text curation in NVIDIA NeMo Curator focuses on these key areas:
Comprehensive overview of the end-to-end text curation architecture and workflow
Core concepts for loading and managing text datasets from local files
Components for downloading and extracting data from remote sources
Concepts for filtering, deduplication, and classification
Concepts for generating high-quality synthetic text
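To make the filtering and deduplication area more concrete, the following is a minimal sketch in plain Python. It does not use the NeMo Curator API: the function names (`normalize`, `word_count_filter`, `exact_dedup`) and the `min_words` threshold are hypothetical and chosen only for illustration. It shows a heuristic length filter followed by exact deduplication on a hash of the normalized text.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def word_count_filter(text: str, min_words: int = 3) -> bool:
    """Keep a document only if it has at least `min_words` words (illustrative threshold)."""
    return len(text.split()) >= min_words


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing hashes of normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Too short",                                      # removed by the length filter
    "A second, distinct document about data curation.",
]

# Filter first, then deduplicate what remains.
filtered = [d for d in docs if word_count_filter(d)]
curated = exact_dedup(filtered)
print(curated)
```

Filtering before deduplication keeps the hash set smaller, since documents that would be discarded anyway are never hashed. Production pipelines typically add fuzzy or semantic deduplication on top of this exact-match step.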
Infrastructure Components
The text curation concepts build on NVIDIA NeMo Curator’s core infrastructure components, which are shared across all modalities. These components include:
Configure and manage distributed processing across multiple machines
Optimize memory usage when processing large datasets
Leverage NVIDIA GPUs for faster data processing
Continue interrupted operations across large datasets