About Image Curation

Use Cases

Architecture

Introduction

Master the fundamentals of NeMo Curator and set up your image curation environment.

Concepts

Learn about DocumentDataset and other core data structures for efficient text curation

Image Curation Concepts

Get Started

Learn prerequisites, setup instructions, and initial configuration for image curation

Get Started with Image Curation

Curation Tasks

Load Data

WebDataset

Load and process sharded image-text datasets in the WebDataset format for scalable distributed curation.

WebDataset
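
As a rough illustration of what a WebDataset shard looks like on disk, the sketch below reads one with only the Python standard library. It relies on the general WebDataset convention that a shard is a tar archive in which files sharing a basename form one sample. It does not use NeMo Curator's own loader; `build_demo_shard` and `read_samples` are hypothetical helper names, and the image bytes are placeholders.

```python
import io
import tarfile
from collections import defaultdict


def build_demo_shard() -> bytes:
    """Create a tiny in-memory tar shard holding two image-text samples."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key in ("000000", "000001"):
            files = (("jpg", b"fake-image-bytes"), ("txt", f"caption {key}".encode()))
            for ext, payload in files:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_samples(shard_bytes: bytes) -> dict:
    """Group tar members by key so each sample maps extension -> raw bytes."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split at the first dot: the left part is the sample key,
            # the right part is the extension (assumes no dots in directories).
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


samples = read_samples(build_demo_shard())
for key, files in samples.items():
    print(key, sorted(files))
```

In a distributed setting each worker would typically process its own subset of shards in this way, which is what makes the sharded layout scale.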

Process Data

Transform and enhance your image data through classification, embeddings, and filters.

Classifiers

Apply built-in classifiers such as Aesthetic and NSFW to score, filter, and curate large image datasets. These models help you assess image quality and remove or flag explicit content for downstream tasks like generative model training and quality control.

Image Classifiers

Embeddings

Generate image embeddings for your dataset using state-of-the-art models from the timm library or custom embedders. Embeddings enable downstream tasks such as classification, filtering, duplicate removal, and similarity search.

Image Embedding
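
To make the duplicate-removal use of embeddings concrete, here is a minimal sketch that drops near-duplicates by cosine similarity. It is not NeMo Curator's implementation: the three-dimensional vectors stand in for real model embeddings, the `0.95` threshold is an arbitrary assumption, and `drop_near_duplicates` is a hypothetical helper. The pairwise comparison is quadratic, so it only suits small examples.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def drop_near_duplicates(embeddings, threshold=0.95):
    """Keep an item only if it is below the threshold against every kept item."""
    kept = []
    for key, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(key)
    return kept


# Toy embeddings: img_b points almost the same way as img_a.
embeddings = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [0.99, 0.05, 0.0],
    "img_c": [0.0, 1.0, 0.0],
}
unique_keys = drop_near_duplicates(embeddings)
print(unique_keys)
```

Here `img_b` is discarded because its similarity to `img_a` exceeds the threshold, while `img_c` is orthogonal to `img_a` and is kept.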

Save & Export

Save & Export

Save metadata to Parquet, export filtered datasets, and reshard WebDatasets for downstream use. Learn how to efficiently store and prepare your curated image data for training or analysis.

Saving and Exporting Image Datasets
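
The resharding step can be pictured as regrouping the surviving samples into new tar archives of a fixed size. The sketch below shows that idea with the standard library only and leaves out the Parquet metadata step. It is an assumption-laden illustration, not NeMo Curator's export code: `write_shard`, `reshard`, the `samples_per_shard` parameter, and the five-digit shard names are all hypothetical.

```python
import io
import tarfile


def write_shard(samples):
    """Serialize a list of (key, {ext: bytes}) samples into one tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, files in samples:
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def reshard(samples, samples_per_shard):
    """Split samples into consecutive shards of at most samples_per_shard each."""
    items = list(samples)
    return [
        write_shard(items[start:start + samples_per_shard])
        for start in range(0, len(items), samples_per_shard)
    ]


# Five placeholder samples standing in for a filtered dataset.
filtered = [
    (f"{i:06d}", {"jpg": b"fake-image-bytes", "txt": f"caption {i}".encode()})
    for i in range(5)
]
shards = reshard(filtered, samples_per_shard=2)
shard_names = [f"{index:05d}.tar" for index in range(len(shards))]
print(shard_names)
```

Keeping shards at a uniform sample count after filtering helps downstream loaders balance work evenly across workers.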