Learn the basics of creating an image curation pipeline in Curator by following a complete workflow that filters images by aesthetic quality and NSFW content.
Use this overview to understand how stages pass data through the pipeline.
For more information, refer to the Image Concepts section.
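To make the data flow concrete, the following is a minimal sketch in plain Python, not Curator's API. It assumes each stage is simply a function that takes a list of tasks and returns a list of tasks. The names `run_stages`, `score_stage`, and `filter_stage` are hypothetical and exist only for illustration.

```python
from typing import Callable

# A "task" here is just a dict; this is a toy stand-in, not Curator's task type.
Task = dict
Stage = Callable[[list[Task]], list[Task]]


def score_stage(tasks: list[Task]) -> list[Task]:
    # Attach a placeholder score derived from the file name length.
    return [{**t, "score": len(t["name"]) / 10} for t in tasks]


def filter_stage(tasks: list[Task]) -> list[Task]:
    # Keep only tasks whose placeholder score passes a fixed threshold.
    return [t for t in tasks if t["score"] >= 0.6]


def run_stages(tasks: list[Task], stages: list[Stage]) -> list[Task]:
    # The output of each stage becomes the input of the next one.
    for stage in stages:
        tasks = stage(tasks)
    return tasks


result = run_stages(
    [{"name": "a.jpg"}, {"name": "cat.jpg"}],
    [score_stage, filter_stage],
)
```

In this toy run, `a.jpg` scores 0.5 and is dropped, while `cat.jpg` scores 0.7 and reaches the end of the chain.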
Import required classes and define paths used throughout the example.
Instantiate a named pipeline to orchestrate the stages.
Add modular stages to partition, read, embed, filter, and write images.
Divide tar files across workers for parallel processing.
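The idea behind partitioning can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the partitioning stage's implementation: the helper `partition_files`, its `files_per_partition` argument, and the shard names are hypothetical.

```python
def partition_files(paths: list[str], files_per_partition: int) -> list[list[str]]:
    """Split tar paths into consecutive groups of at most files_per_partition."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    ordered = sorted(paths)  # sort so the grouping is deterministic
    return [
        ordered[i:i + files_per_partition]
        for i in range(0, len(ordered), files_per_partition)
    ]


# Hypothetical shard names, for illustration only.
shards = [f"shard-{i:05d}.tar" for i in range(5)]
partitions = partition_files(shards, files_per_partition=2)
```

Each resulting group can then be handed to a separate worker, so that archives are read in parallel rather than one after another.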
Load images from tar archives and extract metadata.
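As a rough illustration of what reading a tar archive involves, the sketch below uses only Python's standard `tarfile` module. It is not the reader stage's code: it builds a tiny in-memory archive with fake payloads, and it returns raw bytes without decoding them into images. The helper names `build_demo_tar` and `read_members` are hypothetical.

```python
import io
import tarfile


def build_demo_tar() -> bytes:
    """Create a tiny in-memory tar holding two fake 'image' payloads."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [("000.jpg", b"fake-jpeg-bytes"), ("001.jpg", b"more-bytes")]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_members(tar_bytes: bytes, suffixes: tuple[str, ...] = (".jpg", ".png")) -> list[dict]:
    """Return one record per matching member: name, size in bytes, and raw bytes."""
    records = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.lower().endswith(suffixes):
                data = tar.extractfile(member).read()
                records.append({"name": member.name, "size": member.size, "data": data})
    return records


records = read_members(build_demo_tar())
```

A real reader would also decode the bytes into pixel arrays and carry the source archive path along as metadata; those steps are omitted here.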
Create semantic embeddings for each image using CLIP ViT-L/14.
Score and filter images based on aesthetic quality using a trained predictor.
Detect and filter out NSFW (Not Safe For Work) content.
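Both filtering steps come down to comparing a model score against a threshold. The sketch below shows that logic in isolation, assuming higher aesthetic scores are better and higher NSFW scores are worse. The scores are made up, and `keep_image`, `min_aesthetic`, and `max_nsfw` are hypothetical names with arbitrary default values, not parameters of the filter stages.

```python
def keep_image(
    aesthetic_score: float,
    nsfw_score: float,
    min_aesthetic: float = 0.5,  # assumed threshold, for illustration
    max_nsfw: float = 0.5,       # assumed threshold, for illustration
) -> bool:
    """Keep an image only if it is aesthetic enough and not flagged as NSFW."""
    return aesthetic_score >= min_aesthetic and nsfw_score < max_nsfw


# Made-up scores standing in for model outputs.
scored = [
    {"id": "img-1", "aesthetic": 0.82, "nsfw": 0.03},
    {"id": "img-2", "aesthetic": 0.31, "nsfw": 0.01},  # fails the aesthetic check
    {"id": "img-3", "aesthetic": 0.90, "nsfw": 0.77},  # fails the NSFW check
]
kept = [img["id"] for img in scored if keep_image(img["aesthetic"], img["nsfw"])]
```

Only `img-1` passes both checks. Raising `min_aesthetic` or lowering `max_nsfw` makes the filter stricter.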
Save the curated images and metadata to the output directory.
Execute the configured pipeline. If you don’t specify an executor, the pipeline uses XennaExecutor by default.
Here’s the full pipeline code:
The model_inference_batch_size parameter controls how many images the model processes at once, which directly affects GPU memory usage and performance. Larger batch sizes improve throughput but require more GPU memory; smaller batch sizes reduce the risk of out-of-memory (OOM) errors.
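To illustrate what a batch size means in practice, here is a minimal sketch of splitting a sequence of images into fixed-size batches. It is plain Python and does not reflect how the stages batch data internally; `iter_batches` is a hypothetical helper.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


images = list(range(10))  # stand-ins for ten decoded images
batch_sizes = [len(batch) for batch in iter_batches(images, batch_size=4)]
```

With ten items and a batch size of 4, the model would see two full batches and one partial batch. Halving the batch size roughly halves the number of images held on the GPU at once, at the cost of more forward passes.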
For the embedding stage (which uses the most GPU memory), each actor is allocated 0.25 of a GPU, which corresponds to about 20 GB of memory on an 80 GB GPU (0.25 × 80 GB = 20 GB). When adjusting batch sizes, make sure each actor's memory usage stays below this 20 GB threshold.
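The arithmetic above generalizes to other GPU sizes. The following sketch computes the per-actor budget and checks an estimated usage against it. The helper names and the usage figures are hypothetical; measure real usage with your own profiling tools.

```python
def per_actor_memory_gb(gpu_memory_gb: float, num_gpus_per_worker: float) -> float:
    """Memory share of one actor, assuming memory is split in proportion to the GPU fraction."""
    return gpu_memory_gb * num_gpus_per_worker


def fits_budget(estimated_usage_gb: float, budget_gb: float) -> bool:
    """True when the estimated usage stays strictly below the per-actor budget."""
    return estimated_usage_gb < budget_gb


budget = per_actor_memory_gb(gpu_memory_gb=80, num_gpus_per_worker=0.25)  # 20.0
ok_small_batch = fits_budget(18.0, budget)  # hypothetical measured usage
ok_large_batch = fits_budget(24.0, budget)  # hypothetical measured usage
```

On a 40 GB GPU, the same 0.25 allocation leaves only 10 GB per actor, so a batch size tuned for an 80 GB GPU might need to be reduced.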
The default num_gpus_per_worker=0.25 works well for most scenarios and allows 4 workers per GPU. You typically don’t need to change this unless you have specific requirements:
- 0.25 (default): allows multiple workers per GPU for better utilization.
- 1.0: use this if you want exclusive GPU access per worker.
- 0: avoid this value, because the embedding and filtering stages require GPUs for optimal performance.

The system automatically calculates the optimal number of actors based on available resources and the num_gpus_per_worker setting.
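As a rough guide to how the GPU fraction bounds parallelism, the sketch below computes an upper limit on GPU workers from the fraction alone. This is a simplification under assumptions, not the executor's actual scheduling logic, which also accounts for other resources. `max_gpu_workers` is a hypothetical helper.

```python
import math


def max_gpu_workers(total_gpus: int, num_gpus_per_worker: float) -> int:
    """Upper bound on workers that fit when each one reserves a fraction of a GPU."""
    if num_gpus_per_worker <= 0:
        raise ValueError("num_gpus_per_worker must be positive")
    # The small epsilon guards against floating-point error in the division.
    return math.floor(total_gpus / num_gpus_per_worker + 1e-9)


shared = max_gpu_workers(total_gpus=8, num_gpus_per_worker=0.25)
exclusive = max_gpu_workers(total_gpus=8, num_gpus_per_worker=1.0)
```

On an 8-GPU node, the default fraction allows up to 32 workers, while exclusive access caps the count at 8.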
After running this basic curation pipeline, you can:
For more advanced workflows, refer to the Image Duplicate Removal Tutorial.