> Core concepts and terminology for NeMo Curator across text, image, video, and audio data curation modalities

# Concepts

Learn about the core components and concepts introduced by NeMo Curator.

## Modality Concepts

Learn about working with specific modalities using NeMo Curator.

<Cards>
  <Card title="Text Curation Concepts" href="/about/concepts/text">
    Learn about text data curation, covering data loading and processing (filtering, classification, deduplication).
  </Card>

  <Card title="Image Curation Concepts" href="/about/concepts/image">
    Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering), and dataset export.
  </Card>

  <Card title="Video Curation Concepts" href="/about/concepts/video">
    Discover video data curation concepts, such as distributed processing, pipeline stages, execution modes, and efficient data flow.
  </Card>

  <Card title="Audio Curation Concepts" href="/about/concepts/audio">
    Learn about speech data curation, ASR inference, quality assessment, and audio-text integration workflows.
  </Card>
</Cards>

## Universal Concepts

Core concepts that apply across all modalities in NeMo Curator.

<Cards>
  <Card title="Deduplication" href="/about/concepts/deduplication">
    Overview of deduplication techniques across text, image, and video modalities, including exact, fuzzy, and semantic approaches.
  </Card>

  <Card title="Resource Allocation" href="/about/concepts/scaling/resource-allocation">
    How NeMo Curator allocates CPUs, GPUs, and memory across pipeline stages for optimal hardware utilization.
  </Card>

  <Card title="Streaming" href="/about/concepts/scaling/streaming">
    How streaming execution processes data in batches for constant memory usage and higher GPU utilization.
  </Card>

  <Card title="Auto-Balancing" href="/about/concepts/scaling/auto-balancing">
    How the executor automatically balances resources across pipeline stages to eliminate bottlenecks.
  </Card>

  <Card title="Throughput" href="/about/concepts/scaling/throughput">
    How throughput scales near-linearly from a single GPU to multi-node clusters.
  </Card>
</Cards>