
Copyright © 2026, NVIDIA Corporation.


Overview of NeMo Curator


NeMo Curator is an open-source, enterprise-grade platform for scalable, privacy-aware data curation across text, image, video, and audio modalities.

NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, helps you prepare high-quality, compliant datasets for large language model (LLM) and generative artificial intelligence (AI) training. Whether you work in the cloud, on-premises, or in a hybrid environment, NeMo Curator supports your workflow.

Target Users

  • Data scientists and machine learning engineers: Build and curate datasets for LLMs, generative models, and multimodal AI.

  • Cluster administrators and DevOps professionals: Deploy and scale curation pipelines.

  • Researchers: Experiment with new data curation techniques and ablation studies.

  • Enterprises: Ensure data privacy, compliance, and quality for production AI workflows.

How It Works

NeMo Curator speeds up data curation by using modern hardware and distributed computing frameworks. You can process data efficiently on anything from a single laptop to a multi-node GPU cluster. Modular pipelines, advanced filtering, and easy integration with machine learning operations (MLOps) tools are among the reasons leading organizations trust NeMo Curator.

  • Text Curation: Uses a pipeline-based architecture with modular processing stages running on Ray. Data flows through download, extraction, language detection, rule-based quality filtering, deduplication (exact, fuzzy, and semantic), and model-based quality filtering.
  • Image Curation: Uses a pipeline-based architecture with modular stages for loading, embedding generation, classification (aesthetic, NSFW), filtering, and export. Supports distributed processing with optional GPU acceleration.
  • Video Curation: Uses Ray-based pipelines to split long videos into clips using fixed stride or scene-change detection, with optional encoding, filtering, embedding generation, and deduplication for large-scale video processing.
  • Audio Curation: Provides model-based automatic speech recognition (ASR) inference, quality assessment through Word Error Rate (WER) calculation, duration analysis, and integration with text curation workflows for speech data processing.
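
To make the WER-based quality assessment in the audio bullet concrete, here is a minimal sketch in plain Python. It is not NeMo Curator's implementation or API: the function names `word_error_rate` and `keep_sample` and the `max_wer` threshold are hypothetical, chosen only for illustration. It uses the standard definition of WER as the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words.

    Hypothetical helper for illustration; not part of NeMo Curator.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def keep_sample(reference: str, hypothesis: str, max_wer: float = 0.5) -> bool:
    """Keep a sample only if the ASR output is close enough to the reference.

    The 0.5 default is an assumed, illustrative threshold.
    """
    return word_error_rate(reference, hypothesis) <= max_wer
```

In a filtering step, `reference` would be the dataset's transcript and `hypothesis` the ASR model output, so samples whose WER exceeds the threshold can be dropped as likely mislabeled or low quality.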

Key Technologies

  • Graphics Processing Units (GPUs): Speed up data processing for large-scale workloads.
  • Distributed Computing: Supports frameworks like Dask, RAPIDS, and Ray for scalable, parallel processing.
  • Modular Pipelines: Build, customize, and scale curation workflows to fit your needs.
  • MLOps Integration: Seamlessly connects with modern MLOps environments for production-ready workflows.
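
The idea of modular pipelines can be illustrated with a small, framework-free sketch. This is plain Python under stated assumptions, not NeMo Curator's API: the names `min_word_filter`, `exact_dedup`, and `run_pipeline` are hypothetical, documents are assumed to be dictionaries with a `"text"` field, and everything runs in a single process rather than on Ray. Each stage takes an iterable of documents and yields the ones that survive, so stages can be added, removed, or reordered independently.

```python
import hashlib
from typing import Callable, Iterable, Iterator, List

Document = dict  # assumed shape: {"id": ..., "text": ...}
Stage = Callable[[Iterable[Document]], Iterator[Document]]


def min_word_filter(min_words: int) -> Stage:
    """Rule-based quality filter: keep documents with at least min_words words."""
    def stage(docs: Iterable[Document]) -> Iterator[Document]:
        for doc in docs:
            if len(doc["text"].split()) >= min_words:
                yield doc
    return stage


def exact_dedup() -> Stage:
    """Exact deduplication: keep the first document for each distinct text hash."""
    def stage(docs: Iterable[Document]) -> Iterator[Document]:
        seen = set()
        for doc in docs:
            digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                yield doc
    return stage


def run_pipeline(docs: Iterable[Document], stages: List[Stage]) -> List[Document]:
    """Chain the stages lazily, then materialize the surviving documents."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


if __name__ == "__main__":
    corpus = [
        {"id": 1, "text": "a reasonably long example document"},
        {"id": 2, "text": "a reasonably long example document"},  # exact duplicate
        {"id": 3, "text": "too short"},
    ]
    kept = run_pipeline(corpus, [min_word_filter(3), exact_dedup()])
    print([doc["id"] for doc in kept])
```

Because every stage shares the same input and output shape, swapping the word-count rule for a different filter only changes the list passed to `run_pipeline`.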

Concepts

Explore the foundational concepts and terminology used across NeMo Curator.

Text Curation Concepts

Learn about text data curation, covering data loading and processing (filtering, deduplication, classification).

Image Curation Concepts

Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering, deduplication), and dataset export.

Video Curation Concepts

Discover video data curation concepts, such as distributed processing, pipeline stages, execution modes, and efficient data flow.

Audio Curation Concepts

Learn about speech data curation, ASR inference, quality assessment, and audio-text integration workflows.