---
description: >-
  Generate CLIP embeddings for images using OpenAI's ViT-L/14 model for
  downstream classification and filtering tasks
categories:
  - how-to-guides
tags:
  - embedding
  - clip
  - vit
  - gpu-accelerated
  - pipeline-stage
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: image-only
---

# CLIP ImageEmbeddingStage

The `ImageEmbeddingStage` generates CLIP embeddings for images using OpenAI's ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.

## Model Details

* **Architecture:** [OpenAI CLIP ViT-L/14 model](https://huggingface.co/openai/clip-vit-large-patch14)
* **Output Field:** `embedding` (stored in `ImageObject.embedding`)
* **Embedding Dimension:** 768 (the ViT-L/14 projection dimension)
* **Input Requirements:** RGB images loaded by `ImageReaderStage`

## How It Works

The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each `ImageObject.embedding` attribute.

## Prerequisites

Before using the `ImageEmbeddingStage`, ensure the following are in place.

### Model Setup

The CLIP model weights are automatically downloaded from Hugging Face on first use. The stage will:

1. Download the OpenAI CLIP ViT-L/14 model (~3.5 GB) to the specified `model_dir`
2. Cache the model for subsequent runs
3. Load the model onto the GPU (or CPU if no GPU is available)

**First-time setup:** The initial model download may take several minutes depending on your internet connection. Subsequent runs use the cached model.
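The batched inference described above can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the actual NeMo Curator implementation; the `chunked` helper is hypothetical:

```python
import numpy as np

def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Illustrative stand-ins for loaded image data on ImageObject instances.
images = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(70)]

# The stage groups images like this before each model forward pass,
# so the last batch may be smaller than model_inference_batch_size.
batches = list(chunked(images, 32))
print([len(b) for b in batches])  # → [32, 32, 6]
```

The trailing partial batch is why throughput is highest when the total image count is a multiple of `model_inference_batch_size`, though correctness does not depend on it.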
### System Requirements

* **GPU:** NVIDIA GPU with CUDA support (recommended for performance)
* **Memory:** At least 8 GB of GPU memory for batch processing
* **Disk Space:** ~4 GB for model weights
* **Python Dependencies:** PyTorch and transformers (installed with NeMo Curator)

## Usage

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Create pipeline
pipeline = Pipeline(name="image_embedding", description="Generate CLIP embeddings for images")

# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
```

## Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_dir` | str | None | Path to the directory containing CLIP model weights |
| `model_inference_batch_size` | int | 32 | Batch size for model inference |
| `num_gpus_per_worker` | float | 0.25 | GPU allocation per worker (0.25 = 1/4 GPU) |
| `remove_image_data` | bool | False | Whether to remove image data after embedding generation (saves memory) |
| `verbose` | bool | False | Enable verbose logging for debugging |

## Performance Notes

* The CLIP model requires GPU acceleration for reasonable performance.
* Increase `model_inference_batch_size` for better throughput if GPU memory allows.
* Set `remove_image_data=True` if you don't need the raw image data in downstream stages.
* The stage automatically handles different image sizes by preprocessing them to 224x224.

## Best Practices

* Use GPU-enabled environments for best performance.
* Adjust `model_inference_batch_size` based on available GPU memory (start with 32 and increase if memory allows).
* Set `remove_image_data=True` for memory efficiency if downstream stages only need embeddings.
* Monitor GPU utilization and adjust `num_gpus_per_worker` accordingly.

## Output Format

After processing, each `ImageObject` will have:

```python
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Raw image data (if remove_image_data=False)
    embedding=np.array(...),   # CLIP embedding vector
    metadata={},
)
```

## Additional Resources

* [Complete Pipeline Example](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py)
* [OpenAI CLIP Paper](https://arxiv.org/abs/2103.00020)
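The embeddings this stage produces are typically consumed by downstream stages via vector comparison, for example cosine similarity in semantic deduplication. A minimal NumPy sketch of that comparison, using illustrative random 768-dimensional vectors in place of real CLIP embeddings (the `cosine_similarity` helper is not part of NeMo Curator):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 768-dimensional CLIP-style embeddings.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)
emb_b = emb_a + rng.normal(scale=0.01, size=768)  # near-duplicate of emb_a
emb_c = rng.normal(size=768)                      # unrelated image

# Near-duplicates score close to 1; unrelated vectors score near 0.
print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

In a deduplication pass, pairs whose similarity exceeds a chosen threshold would be treated as duplicates; the threshold itself is dataset-dependent.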