Aesthetic Filter | NeMo Curator

The Aesthetic Filter predicts the subjective visual quality of images using a model trained on human aesthetic preferences. It outputs an aesthetic score, where higher values indicate more aesthetically pleasing images, which makes it useful for filtering or ranking images in generative pipelines and dataset curation.

Model Details

Architecture: Multilayer perceptron (MLP) trained on OpenAI CLIP ViT-L/14 image embeddings
Source: Improved Aesthetic Predictor
Output Field: aesthetic_score
Score Range: Continuous values (higher is more aesthetic)
Embedding Input: CLIP ViT-L/14 embeddings (see Image embeddings)

How It Works

The filter takes pre-computed CLIP ViT-L/14 image embeddings from a previous pipeline stage and predicts an aesthetic score. The lightweight model processes batches of embeddings efficiently on the GPU.
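To make the data flow concrete, here is a minimal, framework-free sketch of scoring embeddings with a small MLP head. The layer sizes, the random placeholder weights, the ReLU activation, and the `score_embedding` helper are all illustrative assumptions. This is not the actual Improved Aesthetic Predictor architecture or its trained weights, and the real stage runs its model in batches on the GPU.

```python
import random


def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def make_layer(n_in, n_out, rng):
    """Random placeholder weights; a real model would load trained ones."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def score_embedding(embedding, layers):
    """Run one embedding through the MLP and return a single scalar score."""
    x = embedding
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:  # no activation after the output layer
            x = relu(x)
    return x[0]


rng = random.Random(0)
EMBED_DIM = 768  # CLIP ViT-L/14 embedding size
layers = [make_layer(EMBED_DIM, 16, rng), make_layer(16, 1, rng)]  # hypothetical sizes

# Stand-in embeddings; in the pipeline these come from ImageEmbeddingStage.
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(4)]
scores = [score_embedding(e, layers) for e in embeddings]
```

The point of the sketch is that the input is a fixed-size vector and the output is one number per image, which is why the scoring step is cheap relative to computing the embeddings.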

Prerequisites

Before using the ImageAestheticFilterStage, make sure the following are in place:

Model Setup

The aesthetic predictor model weights are automatically downloaded from HuggingFace on first use. The stage will:

Download the improved aesthetic predictor model (~20MB) to the specified model_dir
Cache the model for subsequent runs
Load the model onto the GPU (or the CPU if no GPU is available)
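The download-then-cache behavior described in these steps follows a common pattern, sketched below. The `ensure_weights` helper, the `aesthetic_predictor.bin` filename, and the `fake_download` callback are hypothetical names used for illustration; they are not part of the NeMo Curator API, and the stage handles this step internally.

```python
import tempfile
from pathlib import Path


def ensure_weights(model_dir, filename, download):
    """Return the path to the weights file, calling `download` only if it is missing."""
    path = Path(model_dir) / filename
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        download(path)
    return path


download_calls = []


def fake_download(target):
    """Hypothetical downloader that writes placeholder bytes instead of fetching."""
    download_calls.append(target)
    target.write_bytes(b"placeholder weights")


with tempfile.TemporaryDirectory() as tmp:
    first = ensure_weights(tmp, "aesthetic_predictor.bin", fake_download)   # downloads
    second = ensure_weights(tmp, "aesthetic_predictor.bin", fake_download)  # cache hit
    same_path = first == second
```

Pointing every run at the same model_dir is what lets later runs skip the download.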

First-time setup: The initial model download is quick (under 1 minute on most connections). Subsequent runs will use the cached model.

Required Input

CLIP Embeddings: Images must have embeddings already generated by ImageEmbeddingStage
Embedding Format: CLIP ViT-L/14 768-dimensional vectors stored in ImageObject.embedding
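A quick sanity check on the input can catch missing or wrongly sized embeddings before the filter runs. The sketch below uses a hypothetical `ImageRecord` dataclass as a stand-in for ImageObject, modeling only an identifier and the `embedding` field; the `has_valid_embedding` helper is likewise an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

EXPECTED_DIM = 768  # CLIP ViT-L/14 embedding size


@dataclass
class ImageRecord:
    """Hypothetical stand-in for ImageObject; only the fields used here."""
    image_id: str
    embedding: Optional[Sequence[float]] = None


def has_valid_embedding(record, expected_dim=EXPECTED_DIM):
    """True if the record carries an embedding of the expected length."""
    return record.embedding is not None and len(record.embedding) == expected_dim


records = [
    ImageRecord("a", [0.0] * 768),  # valid
    ImageRecord("b", None),         # embedding stage was skipped
    ImageRecord("c", [0.0] * 512),  # produced by a different CLIP variant
]
missing = [r.image_id for r in records if not has_valid_embedding(r)]
```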

Usage

Python

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage

# Create pipeline
pipeline = Pipeline(name="aesthetic_filtering", description="Filter images by aesthetic quality")

# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
    dali_batch_size=100,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
))

# Stage 4: Apply aesthetic filtering
pipeline.add_stage(ImageAestheticFilterStage(
    model_dir="/path/to/models",
    score_threshold=0.5,
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()

Parameters

Parameter	Type	Default	Description
`model_dir`	str	None	Path to directory containing model weights
`score_threshold`	float	0.5	Aesthetic score threshold for filtering (filters out images below this threshold)
`model_inference_batch_size`	int	32	Batch size for model inference
`num_gpus_per_worker`	float	0.25	GPU allocation per worker (0.25 = 1/4 GPU)
`verbose`	bool	False	Enable verbose logging for debugging
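The effect of score_threshold can be illustrated with plain Python. The `filter_by_score` helper and the sample scores below are hypothetical, and keeping an image whose score exactly equals the threshold is an assumption drawn from the "below this threshold" wording rather than confirmed stage behavior.

```python
def filter_by_score(scored_images, score_threshold=0.5):
    """Keep (image_id, score) pairs whose score is not below the threshold."""
    return [(image_id, score) for image_id, score in scored_images
            if score >= score_threshold]


# Hypothetical scores for illustration only.
scored = [("img_001", 0.82), ("img_002", 0.31), ("img_003", 0.50), ("img_004", 0.47)]
kept = filter_by_score(scored, score_threshold=0.5)
kept_ids = [image_id for image_id, _ in kept]
```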

Performance Notes

The model is small and processes pre-computed embeddings efficiently on the GPU.
Increase batch size for faster throughput if memory allows.
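As a rough illustration of how the batch size controls the number of model calls, the sketch below splits a list of items into fixed-size batches. `iter_batches` is a hypothetical helper, not a NeMo Curator function; the stage does its own batching based on model_inference_batch_size.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


embedding_ids = list(range(70))  # stand-ins for 70 pre-computed embeddings
small_batches = [len(b) for b in iter_batches(embedding_ids, 32)]
large_batches = [len(b) for b in iter_batches(embedding_ids, 64)]
```

With 70 items, a batch size of 32 needs three model calls while 64 needs two. Larger batches mean fewer calls, at the cost of more memory per call.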

Best Practices

Use CLIP ViT-L/14 embeddings generated by ImageEmbeddingStage for best results.
Run the aesthetic filter after embedding generation in the same pipeline to avoid extra I/O.
The filter requires pre-computed embeddings and cannot extract embeddings from raw images.
Review a sample of scores to calibrate thresholds for your use case.
Adjust model_inference_batch_size based on available GPU memory.
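One way to calibrate a threshold from a sample of scores is to pick the value that keeps a target fraction of images. The `threshold_for_keep_fraction` helper and the sample scores below are illustrative assumptions. In practice the scores would come from the aesthetic_score field of a representative sample.

```python
import statistics


def threshold_for_keep_fraction(scores, keep_fraction):
    """Return the score of the lowest-ranked image still inside the kept fraction."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[n_keep - 1]


# Hypothetical sample of aesthetic scores.
sample_scores = [0.12, 0.35, 0.48, 0.51, 0.63, 0.70, 0.77, 0.81, 0.90, 0.95]
median_score = statistics.median(sample_scores)
threshold = threshold_for_keep_fraction(sample_scores, keep_fraction=0.3)
```

Comparing the chosen threshold against the median gives a quick sense of how aggressive the filter is. Visually inspecting a few images just above and below the threshold is still worthwhile before committing to a value.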