About Image Curation
Use Cases
Architecture
Introduction
Master the fundamentals of NeMo Curator and set up your image processing environment.
Learn about DocumentDataset and other core data structures for efficient text curation
Learn prerequisites, setup instructions, and initial configuration for image curation
Curation Tasks
Load Data
Load and process sharded image-text datasets in the WebDataset format for scalable distributed curation.
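To make the sharded layout concrete, here is a minimal sketch of the WebDataset convention using only the Python standard library. It does not use NeMo Curator's own loaders. A shard is a tar archive in which files that share a base name (the key) form one sample, for example `000000.jpg`, `000000.txt`, and `000000.json`. The helper names `write_shard` and `read_shard`, the keys, and the placeholder payloads are all illustrative assumptions.

```python
import io
import json
import tarfile
from collections import defaultdict


def write_shard(samples):
    """Build an in-memory tar shard; each sample is (key, {extension: bytes})."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


def read_shard(fileobj):
    """Group tar members into samples by key (the name up to the first dot).

    Assumes member names contain no directory components.
    """
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


# Placeholder bytes stand in for real JPEG data.
shard = write_shard([
    ("000000", {"jpg": b"<jpeg bytes>", "txt": b"a red bicycle",
                "json": json.dumps({"width": 640}).encode()}),
    ("000001", {"jpg": b"<jpeg bytes>", "txt": b"a bowl of fruit",
                "json": json.dumps({"width": 512}).encode()}),
])
samples = read_shard(shard)
```

Because each shard is self-contained, separate workers can process different shards independently, which is what makes the format suitable for distributed curation.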
Process Data
Transform and enhance your image data through classification, embeddings, and filters.
Apply built-in classifiers such as Aesthetic and NSFW to score, filter, and curate large image datasets. These models help you assess image quality and remove or flag explicit content for downstream tasks like generative model training and quality control.
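The score, filter, and flag flow can be sketched in plain Python. This is an illustration of the idea, not the NeMo Curator classifier API: the field names `aesthetic_score` and `nsfw_score`, their assumed ranges, and the thresholds are hypothetical, and the scores are hard-coded where a real pipeline would take them from the classifiers.

```python
from dataclasses import dataclass


@dataclass
class ScoredImage:
    key: str
    aesthetic_score: float  # assumed: higher means more visually appealing
    nsfw_score: float       # assumed: probability of explicit content in [0, 1]


def curate(records, min_aesthetic=5.0, max_nsfw=0.5):
    """Split records into kept and flagged lists; low-quality records are dropped."""
    kept, flagged = [], []
    for record in records:
        if record.nsfw_score > max_nsfw:
            flagged.append(record)   # explicit content: flag for review or removal
        elif record.aesthetic_score >= min_aesthetic:
            kept.append(record)      # safe and above the quality threshold
    return kept, flagged


# Hard-coded scores stand in for classifier outputs.
records = [
    ScoredImage("000000", aesthetic_score=6.2, nsfw_score=0.01),
    ScoredImage("000001", aesthetic_score=3.1, nsfw_score=0.02),
    ScoredImage("000002", aesthetic_score=7.0, nsfw_score=0.93),
]
kept, flagged = curate(records)
```

Suitable thresholds depend on the downstream task, so it is worth inspecting a sample of images around each cutoff before filtering a full dataset.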
Generate image embeddings for your dataset using state-of-the-art models from the timm library or custom embedders. Embeddings enable downstream tasks such as classification, filtering, duplicate removal, and similarity search.
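As an illustration of how embeddings support duplicate removal, the sketch below compares vectors by cosine similarity and keeps an image only if it is not too close to one already kept. The three-dimensional vectors are hand-written stand-ins for real model outputs, and the function names and the 0.95 threshold are assumptions. The greedy pairwise loop is quadratic, so it is meant for explanation rather than for large datasets.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(embeddings, threshold=0.95):
    """Greedy pass: keep a key only if no already-kept key is >= threshold similar."""
    kept = []
    for key, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(key)
    return kept


# Toy vectors standing in for embeddings produced by an image model.
embeddings = {
    "cat_1": [0.90, 0.10, 0.00],
    "cat_1_resized": [0.89, 0.11, 0.01],
    "dog_1": [0.10, 0.90, 0.20],
}
unique_keys = drop_near_duplicates(embeddings)
```

The same similarity function also covers similarity search: rank all stored embeddings by their cosine similarity to a query embedding and return the top results.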
Save & Export
Save metadata to Parquet, export filtered datasets, and reshard WebDatasets for downstream use. Learn how to efficiently store and prepare your curated image data for training or analysis.
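Resharding can be pictured as writing the surviving samples back into tar files with a fixed number of samples each. The following is a minimal standard-library sketch, not the NeMo Curator export API: the `reshard` helper, the five-digit shard naming, and the placeholder payloads are assumptions made for illustration.

```python
import io
import os
import tarfile
import tempfile


def reshard(samples, output_dir, samples_per_shard):
    """Write (key, {extension: bytes}) samples into numbered tar shards."""
    os.makedirs(output_dir, exist_ok=True)
    shard_paths = []
    for start in range(0, len(samples), samples_per_shard):
        path = os.path.join(output_dir, f"{len(shard_paths):05d}.tar")
        with tarfile.open(path, "w") as tar:
            for key, files in samples[start:start + samples_per_shard]:
                for ext, payload in files.items():
                    info = tarfile.TarInfo(name=f"{key}.{ext}")
                    info.size = len(payload)
                    tar.addfile(info, io.BytesIO(payload))
        shard_paths.append(path)
    return shard_paths


# Five filtered samples with placeholder image bytes, two samples per shard.
kept = [
    (f"{i:06d}", {"jpg": b"<jpeg bytes>", "txt": f"caption {i}".encode()})
    for i in range(5)
]
output_dir = tempfile.mkdtemp()
shard_paths = reshard(kept, output_dir, samples_per_shard=2)
```

Filtering usually leaves shards with uneven sample counts, so rewriting them to a uniform size helps keep work balanced across data-loading workers during training.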