Process Data for Image Curation#
Process image data you’ve loaded from tar archives using NeMo Curator’s suite of tools. These tools help you generate embeddings, filter images, and prepare your dataset to produce high-quality data for downstream AI tasks such as generative model training, dataset analysis, or quality control.
How it Works#
Image processing in NeMo Curator follows a pipeline-based approach with these stages:
Partition files using FilePartitioningStage to distribute tar files
Read images using ImageReaderStage with DALI acceleration
Generate embeddings using ImageEmbeddingStage with CLIP models
Apply filters using ImageAestheticFilterStage and ImageNSFWFilterStage
Save results using ImageWriterStage to export curated datasets
Each stage processes ImageBatch objects containing images, metadata, and processing results. You can use built-in stages or create custom stages for advanced use cases.
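The stage-by-stage flow can be pictured with a small, library-free sketch. Everything in it (ToyImage, ToyBatch, run_pipeline, and the two toy stages) is a hypothetical stand-in invented for illustration, not part of NeMo Curator; it only shows the idea of a batch object being handed from one stage to the next.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToyImage:
    """Hypothetical stand-in for one image record (not a NeMo Curator class)."""
    path: str
    metadata: dict = field(default_factory=dict)


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a batch passed between stages."""
    images: list


# A stage is modeled here as any callable that takes a batch and returns a batch.
Stage = Callable[[ToyBatch], ToyBatch]


def run_pipeline(batch: ToyBatch, stages: list) -> ToyBatch:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        batch = stage(batch)
    return batch


def tag_source(batch: ToyBatch) -> ToyBatch:
    """Toy annotation stage: record where each image came from."""
    for img in batch.images:
        img.metadata["source"] = "tar"
    return batch


def drop_non_jpeg(batch: ToyBatch) -> ToyBatch:
    """Toy filter stage: keep only paths ending in .jpg."""
    return ToyBatch([img for img in batch.images if img.path.endswith(".jpg")])


result = run_pipeline(
    ToyBatch([ToyImage("a.jpg"), ToyImage("b.txt")]),
    [tag_source, drop_non_jpeg],
)
```

Stages that annotate (like embedding generation) return the same images with extra results attached, while filter stages return a smaller batch, so the order of stages matters.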
Filter Options#
Aesthetic filter: Assess the subjective quality of images using a model trained on human aesthetic preferences, and filter images by aesthetic score threshold.
NSFW filter: Detect not-safe-for-work (NSFW) content in images using a CLIP-based filter, and remove explicit material from your datasets.
Embedding Options#
Generate image embeddings using CLIP models with GPU acceleration. Supports various CLIP architectures and automatic model downloading.
Filtering Images#
Both filtering stages, ImageAestheticFilterStage and ImageNSFWFilterStage, remove images as part of processing: any image that doesn't meet the specified threshold is dropped automatically, with no separate filtering step required.
Built-in filtering capabilities:
Aesthetic filtering: Remove images with low aesthetic scores using ImageAestheticFilterStage
NSFW filtering: Remove inappropriate content using ImageNSFWFilterStage
Automatic processing: Filtering happens during pipeline execution
Pipeline with filtering#
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage

# `pipeline` is assumed to be an existing pipeline object to which the
# partitioning, reader, and embedding stages have already been added.

# Filter by aesthetic quality (keep images with score >= 0.5)
pipeline.add_stage(ImageAestheticFilterStage(
    model_dir="/models",
    score_threshold=0.5,  # Minimum aesthetic score
    num_gpus_per_worker=0.25,
))

# Filter NSFW content (keep images with score < 0.5)
pipeline.add_stage(ImageNSFWFilterStage(
    model_dir="/models",
    score_threshold=0.5,  # Maximum NSFW score (images below this are kept)
    num_gpus_per_worker=0.25,
))
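The two thresholds work in opposite directions, which is easy to mix up. The sketch below restates the keep rules from the comments above in plain Python; the helper names and the score values are made up for illustration, and the real stages compute scores with their own models.

```python
def keep_by_aesthetic(scores: dict, threshold: float = 0.5) -> dict:
    """Keep images whose aesthetic score is at or above the threshold."""
    return {name: s for name, s in scores.items() if s >= threshold}


def keep_by_nsfw(scores: dict, threshold: float = 0.5) -> dict:
    """Keep images whose NSFW score is strictly below the threshold."""
    return {name: s for name, s in scores.items() if s < threshold}


# Hypothetical scores, chosen only to exercise the boundary at 0.5.
aesthetic_scores = {"a.jpg": 0.8, "b.jpg": 0.5, "c.jpg": 0.2}
nsfw_scores = {"a.jpg": 0.1, "b.jpg": 0.5, "c.jpg": 0.9}

kept_aesthetic = keep_by_aesthetic(aesthetic_scores)
kept_nsfw = keep_by_nsfw(nsfw_scores)
```

With these values, an aesthetic score of exactly 0.5 is kept, while an NSFW score of exactly 0.5 is removed. Raising the aesthetic threshold or lowering the NSFW threshold both make the filter stricter.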
For custom filtering logic, you can create your own stage by extending ProcessingStage[ImageBatch, ImageBatch].
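As a rough picture of what such a custom stage might look like, here is a self-contained sketch of a minimum-resolution filter. It deliberately does not import NeMo Curator: the ProcessingStage and ImageBatch classes below are simplified local stand-ins, and the process method, the data attribute, ImageRecord, and MinResolutionFilterStage are all assumptions made for illustration. Check the library's base class for the methods and attributes it actually requires.

```python
from dataclasses import dataclass, field
from typing import Generic, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class ProcessingStage(Generic[In, Out]):
    """Local stand-in for the library base class (assumed shape, not the real one)."""

    def process(self, task: In) -> Out:
        raise NotImplementedError


@dataclass
class ImageRecord:
    """Hypothetical per-image record holding only what this filter needs."""
    image_id: str
    width: int
    height: int


@dataclass
class ImageBatch:
    """Local stand-in for the library's batch type (assumed shape)."""
    data: list = field(default_factory=list)


class MinResolutionFilterStage(ProcessingStage[ImageBatch, ImageBatch]):
    """Hypothetical custom filter: drop images whose shorter side is too small."""

    def __init__(self, min_side: int = 256) -> None:
        self.min_side = min_side

    def process(self, task: ImageBatch) -> ImageBatch:
        # Return a new batch rather than mutating the input batch.
        kept = [r for r in task.data if min(r.width, r.height) >= self.min_side]
        return ImageBatch(data=kept)


stage = MinResolutionFilterStage(min_side=256)
out = stage.process(ImageBatch([
    ImageRecord("a", 512, 512),
    ImageRecord("b", 128, 600),
]))
```

The pattern mirrors the built-in filters: the stage takes a batch, applies a per-image rule, and returns a batch containing only the images that pass.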