Process Data for Image Curation#
Process image data you’ve loaded into a WebDataset using NeMo Curator’s suite of tools. These tools help you generate embeddings, classify images, and filter your dataset to prepare high-quality data for downstream AI tasks such as generative model training, dataset analysis, or quality control.
How it Works#
Image processing in NeMo Curator typically follows these steps:
Load your dataset using
ImageTextPairDataset
Generate image embeddings using a built-in or custom embedder
Apply classifiers (such as aesthetic or NSFW) to score or filter images
Filter images based on classifier scores or metadata
Save or export your curated dataset for downstream use
You can use NeMo Curator’s built-in tools or implement your own for advanced use cases.
Classifier Options#
Assess the subjective quality of images using a model trained on human aesthetic preferences. Useful for filtering or ranking images by visual appeal.
Detect not-safe-for-work (NSFW) content in images using a CLIP-based classifier. Helps remove or flag explicit material from your datasets.
Embedding Options#
Use state-of-the-art models from the PyTorch Image Models (timm) library for embedding generation. Highly recommended for most users.
Implement your own image embedding logic by subclassing the base class. Useful for research models or custom pipelines.
Filtering Images#
Filter images in your dataset by applying thresholds to classifier scores (such as aesthetic or NSFW) or by using metadata fields. Unlike text curation, NeMo Curator does not currently provide built-in heuristic or content-based filters for images. Filtering is typically performed as a post-processing step after classification and embedding.
Common filtering strategies:
Remove images with low aesthetic scores
Remove or flag images with high NSFW scores
Filter by metadata (e.g., resolution, aspect ratio)
Example: Filtering by classifier score in Python
import dask_cudf
# Assume df is a Dask-cuDF DataFrame with 'aesthetic_score' and 'nsfw_score' columns
filtered = df[(df['aesthetic_score'] > 0.5) & (df['nsfw_score'] < 0.2)]
You can also implement custom filtering logic based on your project needs.