CLIP ImageEmbeddingStage
The ImageEmbeddingStage generates CLIP embeddings for images using OpenAI’s ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.
Model Details
- Architecture: OpenAI CLIP ViT-L/14
- Output Field: `embedding` (stored in `ImageObject.embedding`)
- Embedding Dimension: 768 (the ViT-L/14 image embedding size)
- Input Requirements: RGB images loaded by `ImageReaderStage`
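Downstream tasks such as semantic deduplication compare these embeddings, typically by cosine similarity. A minimal sketch with NumPy (the random 768-dimensional vectors below are stand-ins for real CLIP outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in vectors with the ViT-L/14 embedding dimension (768).
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(768)
emb_b = rng.standard_normal(768)

print(cosine_similarity(emb_a, emb_a))  # identical vectors -> 1.0
print(cosine_similarity(emb_a, emb_b))  # unrelated vectors -> near 0
```

Near-duplicate images produce similarities close to 1.0, which is what the deduplication stages exploit.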
How It Works
The stage processes `ImageBatch` objects containing `ImageObject` instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the result in each `ImageObject.embedding` attribute.
Prerequisites
Before using the ImageEmbeddingStage, complete the model setup below and ensure your system meets the requirements.
Model Setup
The CLIP model weights are automatically downloaded from Hugging Face on first use. The stage will:
- Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified `model_dir`
- Cache the model for subsequent runs
- Load the model onto the GPU (or the CPU if no GPU is available)
First-time setup: The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.
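The GPU-or-CPU fallback described above amounts to a device check like the following (a sketch, not the stage's actual code):

```python
def pick_device() -> str:
    """Prefer CUDA when PyTorch reports an available GPU; else use CPU."""
    try:
        import torch
        return "cuda" if torch.cuda.is_available() else "cpu"
    except ImportError:
        # PyTorch is not installed in this environment; CPU is the only option.
        return "cpu"

print(pick_device())
```

On a CUDA-capable machine with PyTorch installed this returns `"cuda"`; everywhere else it falls back to `"cpu"`.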
System Requirements
- GPU: NVIDIA GPU with CUDA support (recommended for performance)
- Memory: At least 8GB GPU memory for batch processing
- Disk Space: ~4GB for model weights
- Python Dependencies: PyTorch, transformers (installed with NeMo Curator)
Usage
Parameters
Performance Notes
- The CLIP model requires GPU acceleration for reasonable performance.
- Increase `model_inference_batch_size` for better throughput if GPU memory allows.
- Set `remove_image_data=True` if you don't need the raw image data in downstream stages.
- The stage automatically handles different image sizes by preprocessing them to 224x224.
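The resize-to-224x224 step can be illustrated with Pillow. This is a simplified sketch: the real CLIP preprocessor also normalizes pixel values, and its exact resize/crop behavior may differ.

```python
from PIL import Image

def preprocess(img: Image.Image, size: int = 224) -> Image.Image:
    """Resize the shorter side to `size`, then center-crop to size x size."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left = (img.width - size) // 2
    top = (img.height - size) // 2
    return img.crop((left, top, left + size, top + size))

out = preprocess(Image.new("RGB", (640, 480)))
print(out.size)  # (224, 224)
```

Because the stage does this internally, input images of any resolution can be fed in without manual resizing.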
Best Practices
- Use GPU-enabled environments for best performance.
- Adjust `model_inference_batch_size` based on available GPU memory (start with 32 and increase if memory allows).
- Set `remove_image_data=True` for memory efficiency if downstream stages only need embeddings.
- Monitor GPU utilization and adjust `num_gpus_per_worker` accordingly.
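One way to pick a starting `model_inference_batch_size` is a simple memory-based heuristic. The per-image and reserve figures below are illustrative assumptions, not measured values for this model; tune them against your own GPU utilization:

```python
def suggest_batch_size(free_gpu_mem_gb: float,
                       mem_per_image_gb: float = 0.15,
                       reserve_gb: float = 2.0) -> int:
    """Estimate a batch size from free GPU memory.

    Assumes a fixed amount of activation memory per image and reserves
    headroom for the model weights; both figures are illustrative.
    """
    usable = max(free_gpu_mem_gb - reserve_gb, 0.0)
    return max(int(usable / mem_per_image_gb), 1)

print(suggest_batch_size(8.0))   # starting point with 8 GB free
print(suggest_batch_size(24.0))  # larger GPUs support larger batches
```

If the suggested batch size triggers out-of-memory errors in practice, halve it and retry.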
Output Format
After processing, each `ImageObject` will have its `embedding` attribute populated with the CLIP embedding vector for that image.