---
description: Essential concepts for text data curation including loading and processing.
categories:
  - concepts-architecture
tags:
  - concepts
  - text-curation
  - data-processing
  - distributed
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: text-only
---

# Text Curation Concepts

This document covers the essential concepts for text data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.

## Core Concept Areas

Text curation in NeMo Curator focuses on these key areas:

<Cards>
  <Card title="Text Curation Pipeline" href="/about/concepts/text/data/data-curation-pipeline">
    Comprehensive overview of the end-to-end text curation architecture and workflow
    overview architecture
  </Card>

  <Card title="Data Loading" href="/about/concepts/text/data/loading">
    Core concepts for loading and managing text datasets from local files
    local-files formats
  </Card>

  <Card title="Data Acquisition" href="/about/concepts/text/data/acquisition">
    Components for downloading and extracting data from remote sources
    remote-sources download
  </Card>

  <Card title="Data Processing" href="/about/concepts/text/data/processing">
    Concepts for filtering, deduplication, and classification
    filtering quality
  </Card>
</Cards>

## Infrastructure Components

The text curation concepts build on NVIDIA NeMo Curator's core infrastructure components, which are shared across all modalities:

<Cards>
  <Card title="Memory Management" href="/reference/infra/memory-management">
    Optimize memory usage when processing large datasets
    partitioning
    batching
    monitoring
  </Card>

  <Card title="GPU Acceleration" href="/reference/infra/gpu-processing">
    Leverage NVIDIA GPUs for faster data processing
    cuda
    rmm
    performance
  </Card>

  <Card title="Resumable Processing" href="/reference/infra/resumable-processing">
    Continue interrupted operations across large datasets
    checkpoints
    recovery
    batching
  </Card>
</Cards>
