---
description: Essential concepts for text data curation including loading and processing.
categories:
  - concepts-architecture
tags:
  - concepts
  - text-curation
  - data-processing
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: text-only
---
# Text Curation Concepts
This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.
## Core Concept Areas
Text curation in NeMo Curator focuses on these key areas:
- Comprehensive overview of the end-to-end text curation architecture and workflow (tags: `overview`, `architecture`)
- Core concepts for loading and managing text datasets from local files (tags: `local-files`, `formats`)
- Components for downloading and extracting data from remote sources (tags: `remote-sources`, `download`)
- Concepts for filtering, deduplication, and classification (tags: `filtering`, `quality`)
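
To make the loading, filtering, and deduplication areas above concrete, here is a minimal sketch in plain Python. It does not use the NeMo Curator API: the helper names (`load_jsonl`, `word_count_filter`, `exact_dedup`), the `text` and `id` fields, and the `min_words` threshold are all assumptions chosen for illustration.

```python
import hashlib
import json
from typing import Iterable, Iterator


def load_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines input, skipping blank lines (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def word_count_filter(docs: Iterable[dict], min_words: int = 3) -> Iterator[dict]:
    """Keep documents with at least `min_words` whitespace-separated tokens."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def exact_dedup(docs: Iterable[dict]) -> Iterator[dict]:
    """Drop documents whose normalized text has already been seen."""
    seen = set()
    for doc in docs:
        # Normalize case and whitespace before hashing so trivial variants collide.
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


# Illustrative in-memory input; a real pipeline would read files from disk.
raw = [
    '{"id": 1, "text": "The quick brown fox jumps."}',
    '{"id": 2, "text": "Too short"}',
    '{"id": 3, "text": "the quick  brown fox jumps."}',
    '{"id": 4, "text": "A different document entirely here."}',
]

curated = list(exact_dedup(word_count_filter(load_jsonl(raw))))
curated_ids = [doc["id"] for doc in curated]
```

Each stage is a generator, so documents stream through the pipeline one at a time instead of being held in memory together. Here document 2 is dropped by the filter and document 3 is dropped as a duplicate of document 1.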
## Infrastructure Components
The text curation concepts build on NVIDIA NeMo Curator's core infrastructure components, which are shared across all modalities. These components include:
- Optimize memory usage when processing large datasets (tags: `partitioning`, `batching`, `monitoring`)
- Leverage NVIDIA GPUs for faster data processing (tags: `cuda`, `rmm`, `performance`)
- Continue interrupted operations across large datasets (tags: `checkpoints`, `recovery`, `batching`)
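
The batching and checkpoint-recovery ideas above can be sketched with the standard library alone. This is an illustrative assumption of how such a mechanism might work, not NeMo Curator's implementation: `iter_batches`, `process_with_checkpoint`, and the `next_batch` checkpoint field are hypothetical names.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size slices so only one batch is handled at a time."""
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def process_with_checkpoint(
    items: Sequence[str],
    batch_size: int,
    checkpoint_path: Path,
    process_batch: Callable[[List[str]], None],
) -> int:
    """Process batches in order, skipping those a previous run completed.

    Returns the number of batches processed during this call.
    """
    next_batch = 0
    if checkpoint_path.exists():
        next_batch = json.loads(checkpoint_path.read_text())["next_batch"]

    processed = 0
    for index, batch in enumerate(iter_batches(items, batch_size)):
        if index < next_batch:
            continue  # already completed before the interruption
        process_batch(batch)
        # Record progress only after the batch succeeds.
        checkpoint_path.write_text(json.dumps({"next_batch": index + 1}))
        processed += 1
    return processed


docs = [f"document {i}" for i in range(10)]
output: List[str] = []
state = {"fail_once": True}


def process_batch(batch: List[str]) -> None:
    # Simulate a one-time failure partway through the dataset.
    if state["fail_once"] and "document 6" in batch:
        state["fail_once"] = False
        raise RuntimeError("simulated interruption")
    output.extend(text.upper() for text in batch)


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    try:
        process_with_checkpoint(docs, 3, checkpoint, process_batch)
    except RuntimeError:
        pass  # the first run stops at the third batch
    first_run_count = len(output)
    resumed_batches = process_with_checkpoint(docs, 3, checkpoint, process_batch)
```

Because the checkpoint is written only after a batch completes, the second call resumes at the failed batch instead of reprocessing the first six documents. A production system would also need atomic checkpoint writes and idempotent batch outputs, which this sketch leaves out.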