Process Data for Image Curation#
Process image data you’ve loaded from tar archives using NeMo Curator’s suite of tools. These tools help you generate embeddings, filter images, and prepare high-quality datasets for downstream AI tasks such as generative model training, dataset analysis, or quality control.
How it Works#
Image processing in NeMo Curator follows a pipeline-based approach with these stages:
1. Partition files using FilePartitioningStage to distribute tar files
2. Read images using ImageReaderStage with DALI acceleration
3. Generate embeddings using ImageEmbeddingStage with CLIP models
4. Apply filters using ImageAestheticFilterStage and ImageNSFWFilterStage
5. Save results using ImageWriterStage to export curated datasets
Each stage processes ImageBatch objects containing images, metadata, and processing results. You can use built-in stages or create custom stages for advanced use cases.
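To make the stage-by-stage flow concrete, here is a minimal, library-free sketch of the same idea: a file list is partitioned, and each partition is passed as a batch through a sequence of stages. Everything in it (ToyBatch, partition, annotate, keep_long_names, run, and the files_per_partition parameter) is a hypothetical stand-in for illustration and is not the NeMo Curator API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyBatch:
    """Hypothetical stand-in for a batch: items plus accumulated results."""
    items: List[str]
    results: Dict[str, float] = field(default_factory=dict)


Stage = Callable[[ToyBatch], ToyBatch]


def partition(paths: List[str], files_per_partition: int) -> List[List[str]]:
    # Split the file list into fixed-size groups that can be processed independently.
    return [paths[i:i + files_per_partition] for i in range(0, len(paths), files_per_partition)]


def annotate(batch: ToyBatch) -> ToyBatch:
    # Placeholder for a model stage: attach one number per item (here, the name length).
    return ToyBatch(batch.items, {item: float(len(item)) for item in batch.items})


def keep_long_names(batch: ToyBatch) -> ToyBatch:
    # Placeholder for a filter stage: drop items whose value is below 10.
    kept = [item for item in batch.items if batch.results[item] >= 10]
    return ToyBatch(kept, {item: batch.results[item] for item in kept})


def run(batch: ToyBatch, stages: List[Stage]) -> ToyBatch:
    # Each stage receives the batch produced by the previous stage.
    for stage in stages:
        batch = stage(batch)
    return batch


tar_files = ["a.tar", "shard-0001.tar", "b.tar", "shard-0002.tar", "c.tar"]
outputs = [run(ToyBatch(group), [annotate, keep_long_names]) for group in partition(tar_files, 2)]
print([out.items for out in outputs])
```

The point of the sketch is the shape of the data flow: a stage never sees the whole dataset, only one batch at a time, and a filter stage simply returns a smaller batch.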
Filter Options#
Aesthetic filter: Assess the subjective quality of images using a model trained on human aesthetic preferences, and filter images based on an aesthetic score threshold.
NSFW filter: Detect not-safe-for-work (NSFW) content in images using a CLIP-based filter, and remove explicit material from your datasets.
Embedding Options#
Generate image embeddings using CLIP models with GPU acceleration. The embedding stage supports various CLIP architectures and downloads models automatically.
Filtering Images#
The filtering stages (ImageAestheticFilterStage, ImageNSFWFilterStage) apply their thresholds as part of processing. Images that don’t meet the specified thresholds are removed automatically, so no separate filtering step is needed.
Built-in filtering capabilities:
- Aesthetic filtering: Remove images with low aesthetic scores using ImageAestheticFilterStage
- NSFW filtering: Remove inappropriate content using ImageNSFWFilterStage
- Automatic processing: Filtering happens during the pipeline execution
Pipeline with filtering#
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage

# `pipeline` is assumed to be an existing pipeline with the earlier stages
# (partitioning, reading, embedding) already added.

# Filter by aesthetic quality (keep images with score >= 0.5)
pipeline.add_stage(ImageAestheticFilterStage(
    model_dir="/models",
    score_threshold=0.5,  # Minimum aesthetic score
    num_gpus_per_worker=0.25,
))

# Filter NSFW content (keep images with score < 0.5)
pipeline.add_stage(ImageNSFWFilterStage(
    model_dir="/models",
    score_threshold=0.5,  # Maximum NSFW score (images below this are kept)
    num_gpus_per_worker=0.25,
))
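The two thresholds point in opposite directions, which is easy to misread. The following sketch restates the keep rules from the comments above (aesthetic: score >= threshold, NSFW: score < threshold) as plain Python. The helper names and the sample scores are hypothetical and exist only to show the boundary cases; the actual comparison inside the stages may differ in detail.

```python
def keep_by_aesthetic(score: float, threshold: float = 0.5) -> bool:
    # Aesthetic direction: keep scores at or above the threshold.
    return score >= threshold


def keep_by_nsfw(score: float, threshold: float = 0.5) -> bool:
    # NSFW direction: keep scores strictly below the threshold.
    return score < threshold


# Hypothetical (aesthetic_score, nsfw_score) pairs for four images.
scores = {
    "a.jpg": (0.72, 0.10),  # passes both filters
    "b.jpg": (0.50, 0.50),  # passes aesthetic, fails NSFW (0.50 is not < 0.50)
    "c.jpg": (0.31, 0.05),  # fails aesthetic
    "d.jpg": (0.90, 0.49),  # passes both filters
}

kept = [
    name
    for name, (aesthetic, nsfw) in scores.items()
    if keep_by_aesthetic(aesthetic) and keep_by_nsfw(nsfw)
]
print(kept)
```

Raising the aesthetic threshold or lowering the NSFW threshold both make the pipeline stricter, so it helps to tune the two values independently.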
For custom filtering logic, you can create your own stage by extending ProcessingStage[ImageBatch, ImageBatch].
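As a rough illustration of what such a custom stage might look like, the sketch below uses stand-in classes so it runs without NeMo Curator installed. StubProcessingStage, StubImageBatch, ImageRecord, MinResolutionFilterStage, the process method name, and the min_side parameter are all assumptions made for this example; check the library's ProcessingStage and ImageBatch definitions for the real method names and required attributes before adapting it.

```python
from dataclasses import dataclass
from typing import Generic, List, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


class StubProcessingStage(Generic[In, Out]):
    """Stand-in for the library's base class (assumed shape, not the real API)."""

    def process(self, task: In) -> Out:
        raise NotImplementedError


@dataclass
class ImageRecord:
    """Hypothetical per-image record used only in this sketch."""
    path: str
    width: int
    height: int


@dataclass
class StubImageBatch:
    """Stand-in for ImageBatch; the real class carries more fields."""
    data: List[ImageRecord]


class MinResolutionFilterStage(StubProcessingStage[StubImageBatch, StubImageBatch]):
    """Custom filter: keep images whose shorter side is at least min_side pixels."""

    def __init__(self, min_side: int) -> None:
        self.min_side = min_side

    def process(self, task: StubImageBatch) -> StubImageBatch:
        # Return a new, smaller batch rather than mutating the input.
        kept = [img for img in task.data if min(img.width, img.height) >= self.min_side]
        return StubImageBatch(data=kept)


batch = StubImageBatch(data=[
    ImageRecord("small.jpg", 120, 90),
    ImageRecord("wide.jpg", 1920, 200),
    ImageRecord("large.jpg", 1024, 768),
])
result = MinResolutionFilterStage(min_side=256).process(batch)
print([img.path for img in result.data])
```

The stage takes a batch and returns a batch of the same type, which is what lets it slot in next to the built-in filters.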