Key Features

NeMo Curator is a platform for scalable, privacy-aware data curation across text, image, and video. It helps teams prepare high-quality, compliant datasets for LLM and AI training, and supports distributed, cloud-native, and on-premises workflows. It provides modular pipelines, advanced filtering, and integration with modern MLOps environments.

Why NeMo Curator?

  • Trusted by leading organizations for LLM and generative AI data curation

  • Open source, NVIDIA-supported, and actively maintained

  • Seamless integration with enterprise MLOps and data platforms (Kubernetes, Slurm, Spark, Dask)

  • Proven at scale: from laptops to multi-node GPU clusters

Benchmarks & Results

  • Deduplicated 1.96 trillion tokens in 0.5 hours using 32 NVIDIA H100 GPUs (RedPajama V2 scale)

  • Up to 80% data reduction and significant improvements in downstream model performance (see ablation studies)

  • Efficient curation of Common Crawl: from 2.8 TB raw to 0.52 TB high-quality data in under 38 hours on 30 CPU nodes
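
To put the figures above in perspective, the short calculation below derives the implied reduction ratio and per-resource throughput. It uses only the numbers quoted in the bullets; the variable names are illustrative, and because the Common Crawl run is stated as "under 38 hours", the CPU throughput is a lower bound rather than a measured value.

```python
# Figures quoted in the benchmark bullets above.
raw_tb = 2.8        # Common Crawl input size in TB
curated_tb = 0.52   # curated output size in TB
cpu_hours = 38      # upper bound on wall-clock hours
cpu_nodes = 30

dedup_tokens = 1.96e12  # tokens deduplicated
dedup_hours = 0.5
gpus = 32

# Fraction of the raw Common Crawl data removed by curation.
reduction = 1 - curated_tb / raw_tb

# Lower bound on raw data processed per CPU node per hour, in GB.
gb_per_node_hour = raw_tb * 1000 / (cpu_hours * cpu_nodes)

# Tokens deduplicated per GPU per hour.
tokens_per_gpu_hour = dedup_tokens / (dedup_hours * gpus)

print(f"Common Crawl reduction: {reduction:.1%}")
print(f"CPU throughput (lower bound): {gb_per_node_hour:.2f} GB per node-hour")
print(f"Dedup throughput: {tokens_per_gpu_hour:.3e} tokens per GPU-hour")
```

The Common Crawl run therefore removes roughly 81% of the raw data, and the deduplication run works out to about 122.5 billion tokens per GPU-hour.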


Text Data Curation

NeMo Curator offers advanced tools for text data loading, cleaning, filtering, deduplication, classification, and synthetic data generation. Built-in modules support language identification, quality estimation, domain and safety classification, and both rule-based and LLM-based PII removal. Pipelines are fully modular and can be customized for diverse NLP and LLM training needs.
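
As a rough illustration of what "modular" means here, the sketch below chains cleaning, filtering, and exact deduplication stages in plain Python. It does not use the NeMo Curator API: the stage functions (`clean`, `min_words`, `exact_dedup`), the `run_pipeline` helper, and the record shape are all hypothetical, chosen only to show how independent stages can be composed and swapped.

```python
import hashlib
from typing import Callable, Iterable, Iterator

# Hypothetical record shape: {"id": ..., "text": ...}. Not a NeMo Curator type.
Doc = dict
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def clean(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Collapse runs of whitespace in each document."""
    for doc in docs:
        yield {**doc, "text": " ".join(doc["text"].split())}


def min_words(n: int) -> Stage:
    """Build a rule-based filter that keeps documents with at least n words."""
    def stage(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"].split()) >= n:
                yield doc
    return stage


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Drop documents whose lowercased text hash was already seen."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: list) -> list:
    """Apply each stage in order; stages are lazy generators."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


docs = [
    {"id": 1, "text": "NeMo   Curator prepares  text data."},
    {"id": 2, "text": "nemo curator prepares text data."},  # duplicate after cleaning
    {"id": 3, "text": "Too short"},                          # fails the word filter
]
curated = run_pipeline(docs, [clean, min_words(3), exact_dedup])
print([d["id"] for d in curated])
```

Because each stage takes and returns an iterable of records, a stage can be reordered, removed, or replaced (for example, with a classifier-based filter) without touching the others.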

Data Loading

Efficiently load and manage massive text datasets, with support for common formats and scalable streaming.

Data Loading Concepts

Data Processing

Advanced filtering, deduplication, classification, and pipeline design for high-quality text curation.

Text Processing Concepts

Synthetic Data & PII Removal

LLM-driven synthetic data generation, prompt engineering, and privacy-preserving PII removal for text datasets.

Data Generation Concepts

Text Curation Quickstart

Set up your environment and run your first text curation pipeline with NeMo Curator.

Get Started with Text Curation

Image Data Curation

NeMo Curator supports scalable image dataset loading, embedding, classification (aesthetic, NSFW, etc.), filtering, deduplication, and export. It leverages state-of-the-art vision models (for example, CLIP and models from the timm library) and DALI for efficient GPU-accelerated processing. Modular pipelines enable rapid experimentation and integration with text and multimodal workflows.
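
To make the embed, classify, filter, and deduplicate flow concrete, here is a minimal CPU-only sketch in plain Python. It is not NeMo Curator's implementation: the record fields (`embedding`, `aesthetic`, `nsfw`), the thresholds, and the `curate_images` helper are assumptions for illustration, and the toy three-dimensional vectors stand in for real model embeddings. The pairwise comparison is quadratic in the number of kept images, so it only suits small examples.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def curate_images(records, min_aesthetic=0.5, max_nsfw=0.2, dup_threshold=0.95):
    """Filter by classifier scores, then drop near-duplicates by embedding.

    All thresholds are hypothetical defaults, not values from NeMo Curator.
    """
    kept = []
    for rec in records:
        # Score-based filtering on precomputed classifier outputs.
        if rec["aesthetic"] < min_aesthetic or rec["nsfw"] > max_nsfw:
            continue
        # Near-duplicate check against everything kept so far.
        if any(cosine(rec["embedding"], k["embedding"]) >= dup_threshold for k in kept):
            continue
        kept.append(rec)
    return kept


records = [
    {"id": "a", "embedding": [1.0, 0.0, 0.0], "aesthetic": 0.9, "nsfw": 0.01},
    {"id": "b", "embedding": [0.99, 0.05, 0.0], "aesthetic": 0.8, "nsfw": 0.02},  # near-duplicate of "a"
    {"id": "c", "embedding": [0.0, 1.0, 0.0], "aesthetic": 0.2, "nsfw": 0.01},    # low aesthetic score
    {"id": "d", "embedding": [0.0, 0.0, 1.0], "aesthetic": 0.7, "nsfw": 0.90},    # high NSFW score
    {"id": "e", "embedding": [0.0, 1.0, 0.0], "aesthetic": 0.6, "nsfw": 0.05},
]
kept_ids = [r["id"] for r in curate_images(records)]
print(kept_ids)
```

In this toy run only "a" and "e" survive: "b" is dropped as a near-duplicate, while "c" and "d" fail the score filters.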

Data Loading

Load and manage large-scale image datasets for curation workflows.

Data Loading Concepts (Image)

Data Processing

Embedding generation, classification (aesthetic, NSFW), filtering, and deduplication for images.

Data Processing Concepts (Image)

Data Export

Export, save, and reshard curated image datasets for downstream use.

Data Export Concepts (Image)

Image Curation Quickstart

Set up your environment and install NeMo Curator’s image modules.

Get Started with Image Curation

Deployment and Integration

NeMo Curator is designed for distributed, cloud-native, and on-premises deployments. It supports Kubernetes, Slurm, and Spark, and integrates easily with your existing MLOps pipelines. Modular APIs and CLI tools enable flexible orchestration and automation.

Deployment Options

Deploy on Kubernetes, Slurm, or Spark. See the Admin Guide for full deployment and integration options.

About Setup & Deployment

Memory Management

Optimize memory usage and partitioning for large-scale curation workflows.

Memory Management Guide

GPU Acceleration

Leverage NVIDIA GPUs for faster data processing and pipeline acceleration.

GPU Processing Guide

Resumable Processing

Resume interrupted operations and recover large-scale dataset processing using checkpointing and batching.

Resumable Processing
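
The general idea behind checkpointed, batched processing can be sketched in a few lines of plain Python. This is a generic pattern, not NeMo Curator's mechanism: the `process_in_batches` function, the JSON checkpoint format, and the `next_batch` key are all hypothetical. The sketch records the index of the next unprocessed batch after each batch completes, so a rerun skips work that already finished.

```python
import json
import os
import tempfile


def process_in_batches(items, batch_size, checkpoint_path, handle_batch):
    """Process items in fixed-size batches, resuming from a checkpoint file."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_batch"]

    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for index in range(start, len(batches)):
        handle_batch(batches[index])
        # Write to a temp file, then rename, so the checkpoint is never half-written.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_batch": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)


# Demo: the handler fails on its second call to simulate an interruption.
processed = []
calls = {"n": 0}


def flaky_handler(batch):
    calls["n"] += 1
    if calls["n"] == 2:
        raise RuntimeError("simulated interruption")
    processed.extend(batch)


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    items = list(range(6))
    try:
        process_in_batches(items, 2, ckpt, flaky_handler)
    except RuntimeError:
        pass  # first run stops partway through
    process_in_batches(items, 2, ckpt, flaky_handler)  # second run resumes

print(processed)
```

A batch that fails midway is retried in full on the next run, so this pattern gives at-least-once semantics; the batch handler should be idempotent or write its output atomically.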