---
description: >-
  Generate CLIP embeddings for images using OpenAI's ViT-L/14 model for
  downstream classification and filtering tasks
categories:
  - how-to-guides
tags:
  - embedding
  - clip
  - vit
  - gpu-accelerated
  - pipeline-stage
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: how-to
modality: image-only
---

# CLIP ImageEmbeddingStage

The `ImageEmbeddingStage` generates CLIP embeddings for images using OpenAI's ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.

## Model Details

* **Architecture:** [OpenAI CLIP ViT-L/14 model](https://huggingface.co/openai/clip-vit-large-patch14)
* **Output Field:** `embedding` (stored in `ImageObject.embedding`)
* **Embedding Dimension:** 768 (the ViT-L/14 image embedding size)
* **Input Requirements:** RGB images loaded by `ImageReaderStage`

## How It Works

The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each `ImageObject.embedding` attribute.
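
Conceptually, the per-batch work resembles the following standalone sketch built on the Hugging Face `transformers` CLIP classes. It is illustrative only; the stage's internal implementation may differ (for example in preprocessing details and whether embeddings are normalized), and the model ID is taken from the Model Details section above.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-large-patch14"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).to("cuda").eval()

def embed_batch(images):
    """Embed a list of RGB images (NumPy arrays) with CLIP ViT-L/14."""
    inputs = processor(images=images, return_tensors="pt").to("cuda")
    with torch.no_grad():
        features = model.get_image_features(**inputs)  # shape: (batch, 768)
    # L2 normalization is common before similarity comparisons; the stage may or may not do this.
    features = features / features.norm(dim=-1, keepdim=True)
    return features.cpu().numpy()
```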

## Prerequisites

Before using the `ImageEmbeddingStage`, make sure the following prerequisites are in place.

### Model Setup

The CLIP model weights are automatically downloaded from HuggingFace on first use. The stage will:

1. Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified `model_dir`
2. Cache the model for subsequent runs
3. Load the model onto GPU (or CPU if GPU unavailable)

**First-time setup:** The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.
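
If you prefer to warm the cache ahead of time (for example, on a node without internet access at run time), you can fetch the weights yourself with `huggingface_hub`. This is a sketch, not part of the stage's API; the target path is illustrative, and the exact layout the stage expects under `model_dir` may differ.

```python
from huggingface_hub import snapshot_download

# Illustrative pre-download; the stage also downloads the model automatically on first use.
snapshot_download(
    repo_id="openai/clip-vit-large-patch14",
    local_dir="/path/to/models/clip-vit-large-patch14",  # hypothetical target under model_dir
)
```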

### System Requirements

* **GPU:** NVIDIA GPU with CUDA support (recommended for performance)
* **Memory:** At least 8GB GPU memory for batch processing
* **Disk Space:** ~4GB for model weights
* **Python Dependencies:** PyTorch, transformers (installed with NeMo Curator)

## Usage

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Create pipeline
pipeline = Pipeline(name="image_embedding", description="Generate CLIP embeddings for images")

# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
```
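
For a quick sanity check of the output, you can iterate over the returned tasks. The attribute names below (`data` on the batch, `image_id` and `embedding` on each image) follow the `ImageObject` layout shown in Output Format; treat the exact return type of `pipeline.run()` as an assumption to verify against your NeMo Curator version.

```python
# Assumption: `results` is a list of ImageBatch tasks whose `data` attribute
# holds the processed ImageObject instances.
for image_batch in results:
    for image in image_batch.data:
        print(image.image_id, image.embedding.shape)  # expected shape: (768,)
```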

## Parameters

| Parameter                    | Type  | Default | Description                                                            |
| ---------------------------- | ----- | ------- | ---------------------------------------------------------------------- |
| `model_dir`                  | str   | None    | Path to directory containing CLIP model weights                        |
| `model_inference_batch_size` | int   | 32      | Batch size for model inference                                         |
| `num_gpus_per_worker`        | float | 0.25    | GPU allocation per worker (0.25 = 1/4 GPU)                             |
| `remove_image_data`          | bool  | False   | Whether to remove image data after embedding generation (saves memory) |
| `verbose`                    | bool  | False   | Enable verbose logging for debugging                                   |

## Performance Notes

* The CLIP model requires GPU acceleration for reasonable performance.
* Increase `model_inference_batch_size` for better throughput if GPU memory allows.
* Set `remove_image_data=True` if you don't need the raw image data for downstream stages.
* The stage automatically handles different image sizes by preprocessing them to 224x224.

## Best Practices

* Use GPU-enabled environments for best performance.
* Adjust `model_inference_batch_size` based on available GPU memory (start with 32, increase if memory allows); see the sketch after this list for one way to benchmark this.
* Set `remove_image_data=True` for memory efficiency if downstream stages only need embeddings.
* Monitor GPU utilization and adjust `num_gpus_per_worker` accordingly.
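
One way to pick a workable `model_inference_batch_size` is to benchmark the bare CLIP model outside the pipeline and watch peak GPU memory. The snippet below is a standalone sketch with synthetic inputs; it does not use the stage itself.

```python
import torch
from transformers import CLIPModel

# Standalone sketch: measure peak GPU memory for candidate batch sizes.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to("cuda").eval()

for batch_size in (32, 64, 128):
    torch.cuda.reset_peak_memory_stats()
    # CLIP ViT-L/14 expects 224x224 RGB inputs after preprocessing.
    pixel_values = torch.zeros(batch_size, 3, 224, 224, device="cuda")
    with torch.no_grad():
        model.get_image_features(pixel_values=pixel_values)
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"batch_size={batch_size}: peak GPU memory {peak_gb:.2f} GB")
```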

## Output Format

After processing, each `ImageObject` will have:

```python
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Raw image data (if remove_image_data=False)
    embedding=np.array(...),   # CLIP embedding vector
    metadata={}
)
```
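
Downstream stages such as semantic deduplication and classification compare these embedding vectors directly. As a minimal illustration (not part of the stage's API), cosine similarity between two embeddings can be computed with NumPy:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two CLIP embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. similarity = cosine_similarity(image_a.embedding, image_b.embedding)
```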

## Additional Resources

* [Complete Pipeline Example](https://github.com/NVIDIA-NeMo/Curator/blob/main/tutorials/image/getting-started/image_curation_example.py)
* [OpenAI CLIP Paper](https://arxiv.org/abs/2103.00020)
