CLIP ImageEmbeddingStage
The ImageEmbeddingStage generates CLIP embeddings for images using OpenAI’s ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.
Model Details
- Architecture: OpenAI CLIP ViT-L/14
- Output Field: `embedding` (stored in `ImageObject.embedding`)
- Embedding Dimension: 768 (the ViT-L/14 image embedding size)
- Input Requirements: RGB images loaded by `ImageReaderStage`
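Downstream tasks such as semantic deduplication compare these embeddings, typically by cosine similarity. A minimal sketch with NumPy (the random 768-dimensional vectors below are stand-ins for real CLIP outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors with the ViT-L/14 embedding dimension (768).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(768)
emb_b = rng.standard_normal(768)

print(cosine_similarity(emb_a, emb_a))  # identical vectors -> 1.0
print(cosine_similarity(emb_a, emb_b))  # unrelated vectors -> near 0
```

Near-duplicate images produce similarities close to 1.0, which is what the deduplication stages exploit.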
How It Works
The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the result in each `ImageObject.embedding` attribute.
Prerequisites
Before using the ImageEmbeddingStage, complete the model setup below and ensure your system meets the requirements.
Model Setup
The CLIP model weights are automatically downloaded from Hugging Face on first use. The stage will:
- Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified `model_dir`
- Cache the model for subsequent runs
- Load the model onto the GPU (or the CPU if no GPU is available)
First-time setup: The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.
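The GPU-or-CPU fallback described above amounts to a device check like the following (a sketch, not the stage's actual code):

```python
def pick_device() -> str:
    """Prefer CUDA when PyTorch reports an available GPU; else use CPU."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # PyTorch is not installed in this environment; CPU is the only option.
        return "cpu"

print(pick_device())
```

On a CUDA-capable machine with PyTorch installed this returns `"cuda"`; everywhere else it falls back to `"cpu"`.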
System Requirements
- GPU: NVIDIA GPU with CUDA support (recommended for performance)
- Memory: At least 8GB GPU memory for batch processing
- Disk Space: ~4GB for model weights
- Python Dependencies: PyTorch, transformers (installed with NeMo Curator)
Usage
Parameters
Performance Notes
- The CLIP model requires GPU acceleration for reasonable performance.
- Increase `model_inference_batch_size` for better throughput if GPU memory allows.
- Set `remove_image_data=True` if you don't need the raw image data in downstream stages.
- The stage automatically handles different image sizes by preprocessing them to 224x224.
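The resize-to-224x224 step can be illustrated with Pillow. This is a simplified sketch: the real CLIP preprocessor also normalizes pixel values, and its exact resize/crop behavior may differ.

```python
from PIL import Image

def preprocess(img: Image.Image, size: int = 224) -> Image.Image:
    """Resize the shorter side to `size`, then center-crop to size x size."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left = (img.width - size) // 2
    top = (img.height - size) // 2
    return img.crop((left, top, left + size, top + size))

out = preprocess(Image.new("RGB", (640, 480)))
print(out.size)  # (224, 224)
```

Because the stage does this internally, input images of any resolution can be fed in without manual resizing.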
Best Practices
- Use GPU-enabled environments for best performance.
- Adjust `model_inference_batch_size` based on available GPU memory (start with 32 and increase if memory allows).
- Set `remove_image_data=True` for memory efficiency if downstream stages only need embeddings.
- Monitor GPU utilization and adjust `num_gpus_per_worker` accordingly.
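One way to pick a starting `model_inference_batch_size` is a simple memory-based heuristic. The per-image and reserve figures below are illustrative assumptions, not measured values for this model; tune them against your own GPU utilization:

```python
def suggest_batch_size(free_gpu_mem_gb: float,
                       mem_per_image_gb: float = 0.15,
                       reserve_gb: float = 2.0) -> int:
    """Estimate a batch size from free GPU memory.

    Assumes a fixed amount of activation memory per image and reserves
    headroom for the model weights; both figures are illustrative.
    """
    usable = max(free_gpu_mem_gb - reserve_gb, 0.0)
    return max(int(usable / mem_per_image_gb), 1)

print(suggest_batch_size(8.0))   # starting point with 8 GB free
print(suggest_batch_size(24.0))  # larger GPUs support larger batches
```

If the suggested batch size triggers out-of-memory errors in practice, halve it and retry.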
Output Format
After processing, each `ImageObject` will have its `embedding` attribute populated with the CLIP embedding vector for that image.