The ImageEmbeddingStage generates CLIP embeddings for images using OpenAI’s ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.
Output: embedding (stored in ImageObject.embedding)

The stage processes ImageBatch objects containing ImageObject instances with loaded image data (for example, as produced by ImageReaderStage). It applies CLIP preprocessing, generates embeddings in batches, and stores each result in the corresponding ImageObject.embedding attribute.
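The batching flow described above can be sketched in plain Python. This is a minimal illustration, not the stage's actual implementation: ImageObjectSketch, toy_embed, and embed_in_batches are hypothetical stand-ins, and toy_embed replaces real CLIP preprocessing and inference with a trivial two-number summary so the example runs without a GPU or model download.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class ImageObjectSketch:
    """Hypothetical stand-in for ImageObject, with only the fields discussed here."""
    image_data: Optional[List[float]]        # loaded pixels, flattened for simplicity
    embedding: Optional[List[float]] = None  # filled in by the embedding step


def toy_embed(batch: Sequence[List[float]]) -> List[List[float]]:
    """Placeholder for CLIP preprocessing + inference: returns [mean, max] per image."""
    return [[sum(px) / len(px), max(px)] for px in batch]


def embed_in_batches(
    images: List[ImageObjectSketch],
    embed_fn: Callable[[Sequence[List[float]]], List[List[float]]] = toy_embed,
    batch_size: int = 32,
    remove_image_data: bool = False,
) -> List[ImageObjectSketch]:
    """Embed images chunk by chunk and store each vector on its object."""
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        vectors = embed_fn([img.image_data for img in chunk])
        for img, vec in zip(chunk, vectors):
            img.embedding = vec
            if remove_image_data:
                # Drop raw pixels once the embedding exists to save memory.
                img.image_data = None
    return images


images = [ImageObjectSketch(image_data=[float(i), float(i + 1)]) for i in range(5)]
embed_in_batches(images, batch_size=2, remove_image_data=True)
```

Here batch_size plays the role of model_inference_batch_size: a larger value means fewer calls to the model, at the cost of more memory per call. The remove_image_data flag mirrors the option of the same name by discarding the raw pixels after each embedding is stored.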
Before using the ImageEmbeddingStage, ensure you have:
The CLIP model weights are automatically downloaded from HuggingFace on first use (see the model_dir setting for the storage location).
First-time setup: The initial model download may take several minutes depending on your internet connection. Subsequent runs use the cached model.
- Increase model_inference_batch_size for better throughput if GPU memory allows.
- Set remove_image_data=True if you don’t need the raw image data for downstream stages.
- Tune model_inference_batch_size based on available GPU memory (start with 32, increase if memory allows).
- Use remove_image_data=True for memory efficiency if downstream stages only need embeddings.
- Set num_gpus_per_worker according to the GPUs available to each worker.
After processing, each ImageObject will have: