---
description: >-
  NeMo Curator is an open-source, scalable data curation platform for curating
  large datasets across text, image, video, and audio modalities to improve AI
  model training
categories:
  - documentation
  - home
tags:
  - data-curation
  - multimodal
  - scalable
  - gpu-accelerated
  - distributed
personas:
  - Data Scientists
  - Machine Learning Engineers
  - Cluster Administrators
  - DevOps Professionals
difficulty: beginner
content_type: index
modality: universal
---

# NeMo Curator Documentation

Welcome to the NeMo Curator documentation.

## Introduction to Curator

Learn what NeMo Curator is, how it works at a high level, and what its key features are.

<Cards>
  <Card title="About Curator" href="/about">
    Overview of NeMo Curator and its capabilities.
    target-users how-it-works
  </Card>

  <Card title="Key Features" href="/about/key-features">
    Discover the main features of NeMo Curator for data curation.
    features capabilities deployments
  </Card>

  <Card title="Concepts" href="/about/concepts">
    Explore the core concepts for each modality in NeMo Curator.
    data-loading data-processing data-generation
  </Card>
</Cards>

## Quickstarts

Install and run NeMo Curator for specific modalities.

<Cards>
  <Card title="Text Curation Quickstart" href="/get-started/text">
    Set up and run text curation workflows.
  </Card>

  <Card title="Image Curation Quickstart" href="/get-started/image">
    Set up and run image curation workflows.
  </Card>

  <Card title="Video Curation Quickstart" href="/get-started/video">
    Set up and run video curation workflows.
  </Card>

  <Card title="Audio Curation Quickstart" href="/get-started/audio">
    Set up and run audio curation workflows.
  </Card>
</Cards>

## Data Curation Workflows

### Workflow Modalities

Explore how you can use NeMo Curator across different content modalities.

<Cards>
  <Card title="Curate Text" href="/curate-text">
    Curate and prepare high-quality text datasets for LLM training.
    filtering formatting deduplication
  </Card>

  <Card title="Curate Images" href="/curate-images">
    Curate image-text datasets with embedding, classification, and deduplication.
    embedding classification semantic-deduplication
  </Card>

  <Card title="Curate Videos" href="/curate-video">
    Curate and process videos with GPU-accelerated pipelines and sharding.
    video-splitting video-sharding gpu-accelerated
  </Card>

  <Card title="Curate Audio" href="/curate-audio">
    Transcribe, filter, and curate speech and audio datasets with ASR models.
    asr transcription quality-filtering
  </Card>
</Cards>

## Tutorial Highlights

Work through these beginner tutorials to get started with the NeMo Curator library.

<Cards>
  <Card title="Text Beginner Tutorial" href="/get-started/text">
    Learn the basics of text data processing with NeMo Curator.
    beginner
    text-processing
    data-preparation
  </Card>

  <Card title="Image Beginner Tutorial" href="/get-started/image">
    Learn the basics of image data processing with NeMo Curator.
    beginner
    image-processing
    data-curation
  </Card>

  <Card title="Video Beginner Tutorial" href="/curate-video/tutorials/beginner">
    Learn the basics of video pipeline construction and execution.
    video-splitting
    video-sharding
    custom-pipelines
  </Card>

  <Card title="Audio Beginner Tutorial" href="/get-started/audio">
    Learn the basics of speech data processing with NeMo Curator.
    beginner
    asr-inference
    quality-assessment
  </Card>
</Cards>

***
