NeMo Curator is an open-source, enterprise-grade platform for scalable, privacy-aware data curation across text, image, video, and audio modalities.
NeMo Curator, part of the NVIDIA NeMo software suite for managing the AI agent lifecycle, helps you prepare high-quality, compliant datasets for training large language models (LLMs) and other generative AI models. Whether you work in the cloud, on-premises, or in a hybrid environment, NeMo Curator supports your workflow.
Data scientists and machine learning engineers: Build and curate datasets for LLMs, generative models, and multimodal AI.
Cluster administrators and DevOps professionals: Deploy and scale curation pipelines.
Researchers: Experiment with new data curation techniques and ablation studies.
Enterprises: Ensure data privacy, compliance, and quality for production AI workflows.
NeMo Curator speeds up data curation by using modern hardware and distributed computing frameworks. You can process data efficiently on anything from a single laptop to a multi-node GPU cluster. Leading organizations trust NeMo Curator for its modular pipelines, advanced filtering, and easy integration with machine learning operations (MLOps) tools.
Explore the foundational concepts and terminology used across NeMo Curator.
Learn about text data curation, covering data loading and processing (filtering, deduplication, classification).
Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering, deduplication), and dataset export.
Discover video data curation concepts, such as distributed processing, pipeline stages, execution modes, and efficient data flow.
Learn about speech data curation, including automatic speech recognition (ASR) inference, quality assessment, and audio-text integration workflows.
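To make the idea of a modular curation pipeline more concrete, here is a minimal sketch in plain Python of two of the text-processing steps named above: filtering followed by exact deduplication. It does not use the NeMo Curator API. The record shape and the names `min_word_filter`, `exact_dedup`, and `run_pipeline` are hypothetical and chosen only for illustration.

```python
import hashlib
from typing import Callable, Iterable, Iterator

# Hypothetical record shape for this sketch: {"id": int, "text": str}
Doc = dict
Stage = Callable[[Iterable[Doc]], Iterator[Doc]]


def min_word_filter(min_words: int) -> Stage:
    """Build a stage that drops documents with fewer than `min_words` words."""
    def stage(docs: Iterable[Doc]) -> Iterator[Doc]:
        for doc in docs:
            if len(doc["text"].split()) >= min_words:
                yield doc
    return stage


def exact_dedup(docs: Iterable[Doc]) -> Iterator[Doc]:
    """Keep the first document for each case- and whitespace-normalized text."""
    seen = set()
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def run_pipeline(docs: Iterable[Doc], stages: list) -> list:
    """Chain the stages lazily, then materialize the surviving documents."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


docs = [
    {"id": 1, "text": "NeMo Curator prepares datasets for training."},
    {"id": 2, "text": "nemo curator   prepares datasets for training."},
    {"id": 3, "text": "Too short."},
]
curated = run_pipeline(docs, [min_word_filter(4), exact_dedup])
print([d["id"] for d in curated])
```

Each stage here is a generator that consumes and yields documents, so stages can be reordered or swapped independently. A distributed framework would partition the same logic across workers rather than run it in a single process.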