
CLIP ImageEmbeddingStage


The ImageEmbeddingStage generates CLIP embeddings for images using OpenAI’s ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.

Model Details

  • Architecture: OpenAI CLIP ViT-L/14 model
  • Output Field: embedding (stored in ImageObject.embedding)
  • Embedding Dimension: 768 (the ViT-L/14 image embedding size)
  • Input Requirements: RGB images loaded by ImageReaderStage

How It Works

The stage processes ImageBatch objects containing ImageObject instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each ImageObject.embedding attribute.
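The per-batch flow described above can be sketched as follows. This is an illustrative simplification, not the stage's actual code: `ImageObject` is a pared-down stand-in for NeMo Curator's class, and the embedding step is mocked with a 768-dimensional placeholder vector rather than a real CLIP forward pass.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class ImageObject:
    """Simplified stand-in for NeMo Curator's ImageObject."""
    image_path: str
    image_data: np.ndarray
    embedding: Optional[np.ndarray] = None


def embed_batch(images: list, batch_size: int = 32) -> None:
    """Illustrative inference loop: process images in fixed-size batches
    and store one vector per ImageObject.embedding."""
    for start in range(0, len(images), batch_size):
        chunk = images[start:start + batch_size]
        # Stand-in for CLIP preprocessing + forward pass: the real stage
        # runs ViT-L/14; here we fabricate a 768-dim vector per image
        # (3 channel means tiled 256 times = 768 values).
        channel_means = np.stack(
            [img.image_data.mean(axis=(0, 1)) for img in chunk]
        )
        mock_embeddings = np.tile(channel_means, (1, 256))
        for img, vec in zip(chunk, mock_embeddings):
            img.embedding = vec
```

The real stage does the same bookkeeping, but with GPU-batched CLIP inference in place of the mock step.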

Prerequisites

Before using the ImageEmbeddingStage, complete the model setup below and confirm your system meets the requirements.

Model Setup

The CLIP model weights are automatically downloaded from HuggingFace on first use. The stage will:

  1. Download the OpenAI CLIP ViT-L/14 model (~3.5GB) to the specified model_dir
  2. Cache the model for subsequent runs
  3. Load the model onto GPU (or CPU if GPU unavailable)

First-time setup: The initial model download may take several minutes depending on your internet connection. Subsequent runs will use the cached model.

System Requirements

  • GPU: NVIDIA GPU with CUDA support (recommended for performance)
  • Memory: At least 8GB GPU memory for batch processing
  • Disk Space: ~4GB for model weights
  • Python Dependencies: PyTorch, transformers (installed with NeMo Curator)

Usage

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Create pipeline
pipeline = Pipeline(name="image_embedding", description="Generate CLIP embeddings for images")

# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
    batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
```

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_dir` | str | None | Path to directory containing CLIP model weights |
| `model_inference_batch_size` | int | 32 | Batch size for model inference |
| `num_gpus_per_worker` | float | 0.25 | GPU allocation per worker (0.25 = 1/4 GPU) |
| `remove_image_data` | bool | False | Whether to remove image data after embedding generation (saves memory) |
| `verbose` | bool | False | Enable verbose logging for debugging |

Performance Notes

  • The CLIP model requires GPU acceleration for reasonable performance.
  • Increase model_inference_batch_size for better throughput if GPU memory allows.
  • Set remove_image_data=True if you don’t need the raw image data for downstream stages.
  • The stage automatically handles different image sizes by preprocessing them to 224x224.
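The 224x224 preprocessing mentioned above can be sketched in plain NumPy. This is a simplified illustration, not the stage's actual preprocessor: it uses nearest-neighbor resizing, whereas CLIP's own pipeline does bicubic resizing and center cropping. The normalization constants are CLIP's published per-channel mean and standard deviation.

```python
import numpy as np

# CLIP's published per-channel normalization constants (RGB).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])


def preprocess(image: np.ndarray, size: int = 224) -> np.ndarray:
    """Sketch of CLIP-style preprocessing: resize an HxWx3 uint8 image
    to size x size (nearest-neighbor), scale pixels to [0, 1], then
    normalize each channel."""
    h, w, _ = image.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    resized = image[rows][:, cols].astype(np.float32) / 255.0
    return (resized - CLIP_MEAN) / CLIP_STD
```

Because the stage performs this step internally, images of any input size can be fed in without manual resizing.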

Best Practices

  • Use GPU-enabled environments for best performance.
  • Adjust model_inference_batch_size based on available GPU memory (start with 32, increase if memory allows).
  • Set remove_image_data=True for memory efficiency if downstream stages only need embeddings.
  • Monitor GPU utilization and adjust num_gpus_per_worker accordingly.

Output Format

After processing, each ImageObject will have:

```python
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Raw image data (if remove_image_data=False)
    embedding=np.array(...),   # CLIP embedding vector
    metadata={},
)
```
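Embeddings stored this way can be compared directly for downstream tasks such as the semantic deduplication mentioned at the top of this page. A minimal sketch, assuming NumPy vectors like those in `ImageObject.embedding`:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors: the typical
    comparison used for semantic duplicate detection, where values
    near 1.0 indicate semantically similar images."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Downstream stages in NeMo Curator compute comparisons like this for you; the sketch just shows what the stored embeddings are good for.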

Additional Resources