Overview of NeMo Curator

NeMo Curator is an open-source, enterprise-grade platform for scalable, privacy-aware data curation across text, image, and video modalities.

NeMo Curator helps you prepare high-quality, compliant datasets for large language model (LLM) and generative artificial intelligence (AI) training. Whether you work in the cloud, on-premises, or in a hybrid environment, NeMo Curator supports your workflow.

Target Users

  • Data scientists and machine learning engineers: Build and curate datasets for LLMs, generative models, and multimodal AI.

  • Cluster administrators and DevOps professionals: Deploy and scale curation pipelines on Kubernetes, Slurm, or Apache Spark clusters.

  • Researchers: Experiment with new data curation techniques, synthetic data generation, and ablation studies.

  • Enterprises: Ensure data privacy, compliance, and quality for production AI workflows.

How It Works

NeMo Curator speeds up data curation by using modern hardware and distributed computing frameworks. You can process data efficiently on anything from a single laptop to a multi-node GPU cluster. NeMo Curator provides modular pipelines, advanced filtering, and easy integration with machine learning operations (MLOps) tools, and is trusted by leading organizations.

  • Text Curation: Data flows through loaders, processors (cleaning, filtering, deduplication), and exporters, all built atop Dask for distributed execution.

  • Image Curation: Uses WebDataset sharding, NVIDIA Data Loading Library (DALI) for GPU-accelerated loading, and modular steps for embedding, classification, filtering, and export.
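The loader, processor, and exporter flow described for text curation can be illustrated with a minimal sketch in plain Python. This is not the NeMo Curator API: every name here (`load`, `clean`, `filter_short`, `deduplicate`, `export_jsonl`, `run_pipeline`) is hypothetical, the stages run on a single machine with generators rather than on Dask, and deduplication is simple exact matching on a hash of the lowercased text.

```python
import hashlib
import json
import re
from typing import Callable, Iterable, Iterator, List

Document = dict  # hypothetical record shape: {"id": int, "text": str}
Stage = Callable[[Iterable[Document]], Iterator[Document]]


def load(records: Iterable[str]) -> Iterator[Document]:
    """Loader: wrap raw strings as documents with sequential ids."""
    for i, text in enumerate(records):
        yield {"id": i, "text": text}


def clean(docs: Iterable[Document]) -> Iterator[Document]:
    """Processor: collapse runs of whitespace and trim the ends."""
    for doc in docs:
        yield {**doc, "text": re.sub(r"\s+", " ", doc["text"]).strip()}


def filter_short(docs: Iterable[Document], min_words: int = 3) -> Iterator[Document]:
    """Processor: drop documents with fewer than min_words words."""
    for doc in docs:
        if len(doc["text"].split()) >= min_words:
            yield doc


def deduplicate(docs: Iterable[Document]) -> Iterator[Document]:
    """Processor: keep the first document for each case-insensitive text hash."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def export_jsonl(docs: Iterable[Document]) -> str:
    """Exporter: serialize documents as one JSON object per line."""
    return "\n".join(json.dumps(doc) for doc in docs)


def run_pipeline(records: Iterable[str], stages: List[Stage]) -> List[Document]:
    """Chain the stages in order and materialize the result."""
    docs: Iterable[Document] = load(records)
    for stage in stages:
        docs = stage(docs)
    return list(docs)


raw = [
    "NeMo   Curator prepares  datasets.",
    "nemo curator prepares datasets.",  # duplicate after cleaning and lowercasing
    "Too short",                        # removed by the length filter
    "Deduplication removes repeated documents.\n",
]
curated = run_pipeline(raw, [clean, filter_short, deduplicate])
print(export_jsonl(curated))
```

Because each stage takes and returns an iterable of documents, stages can be reordered, removed, or replaced without touching the others, which is the property the modular design relies on. A distributed framework would apply the same stages per partition instead of over a single in-memory list.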

Key Technologies

  • Graphics Processing Units (GPUs): Accelerate data processing for large-scale workloads.

  • Distributed Computing: Supports frameworks like Dask, RAPIDS, and Ray for scalable, parallel processing.

  • Modular Pipelines: Build, customize, and scale curation workflows to fit your needs.

  • MLOps Integration: Seamlessly connects with modern MLOps environments for production-ready workflows.

Concepts

Explore the foundational concepts and terminology used across NeMo Curator.

Text Curation Concepts

Learn about text data curation, covering data loading, processing (filtering, deduplication, classification), and synthetic data generation.

Image Curation Concepts

Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering, deduplication), and dataset export.
