> Essential concepts for text data curation including loading and processing.

# Text Curation Concepts

This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.

## Core Concept Areas

Text curation in NeMo Curator focuses on these key areas:

- Comprehensive overview of the end-to-end text curation architecture and workflow (topics: overview, architecture)
- Core concepts for loading and managing text datasets from local files (topics: local-files, formats)
- Components for downloading and extracting data from remote sources (topics: remote-sources, download)
- Concepts for filtering, deduplication, and classification (topics: filtering, quality)

## Infrastructure Components

The text curation concepts build on NVIDIA NeMo Curator's core infrastructure components, which are shared across all modalities. These components include:

- Optimize memory usage when processing large datasets (topics: partitioning, batching, monitoring)
- Leverage NVIDIA GPUs for faster data processing (topics: cuda, rmm, performance)
- Continue interrupted operations across large datasets (topics: checkpoints, recovery, batching)
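
To make the batching and checkpoint ideas concrete, here is a minimal sketch in plain Python. It does not use or represent NeMo Curator's API: the names `batched`, `process_with_checkpoint`, `handle_batch`, and the JSON checkpoint layout are all hypothetical, chosen only to illustrate how fixed-size batches bound memory use and how a small progress file lets an interrupted run pick up where it stopped.

```python
import json
import os
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of at most `size` items so only one batch is held in memory."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def process_with_checkpoint(
    docs: Iterable[str],
    batch_size: int,
    checkpoint_path: str,
    handle_batch: Callable[[List[str]], None],
) -> int:
    """Process `docs` in batches, skipping batches recorded as done.

    Returns the number of batches processed during this call.
    Assumes `docs` yields the same items in the same order on every run.
    """
    batches_done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            batches_done = json.load(f)["batches_done"]

    processed_now = 0
    for index, batch in enumerate(batched(docs, batch_size)):
        if index < batches_done:
            continue  # already handled in an earlier run
        handle_batch(batch)
        processed_now += 1
        # Write to a temporary file, then rename, so a crash mid-write
        # never leaves a truncated checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"batches_done": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)
    return processed_now
```

If `handle_batch` raises partway through, the checkpoint still records every batch that completed, so calling `process_with_checkpoint` again with the same inputs resumes at the first unfinished batch. The sketch checkpoints per batch for simplicity; a real pipeline might checkpoint per partition or per output file instead.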