CLIP ImageEmbeddingStage#

The ImageEmbeddingStage generates CLIP embeddings for images using OpenAI’s ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.

Model Details#

  • Architecture: OpenAI CLIP ViT-L/14 model

  • Output Field: embedding (stored in ImageObject.embedding)

  • Embedding Dimension: 768 (ViT-L/14 output; see the quick check below)

  • Input Requirements: RGB images loaded by ImageReaderStage
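
The stage manages its model internally, but as a quick sanity check outside the pipeline you can confirm the 768-dimensional output of ViT-L/14 using the public openai/clip-vit-large-patch14 checkpoint from Hugging Face Transformers. This is an illustration only, not how ImageEmbeddingStage loads its weights from model_dir.

import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Public ViT-L/14 checkpoint (illustration only; the stage loads its own
# weights from model_dir).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# A dummy RGB image stands in for one decoded by ImageReaderStage.
image = Image.fromarray(np.zeros((256, 256, 3), dtype=np.uint8))

inputs = processor(images=image, return_tensors="pt")
features = model.get_image_features(**inputs)
print(features.shape)  # torch.Size([1, 768])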

How It Works#

The stage processes ImageBatch objects containing ImageObject instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the results in each ImageObject.embedding attribute.
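The sketch below illustrates that flow in simplified form. It is not the stage's actual implementation: the model and preprocess objects (an open_clip-style encode_image API), the .data attribute on ImageBatch, and the GPU handling are all assumptions made for illustration.

import torch

def embed_batch(image_batch, model, preprocess, batch_size=32):
    """Simplified sketch of the per-batch embedding flow (not the real code)."""
    objs = image_batch.data  # assumed: list of ImageObject with image_data loaded
    for start in range(0, len(objs), batch_size):
        chunk = objs[start:start + batch_size]
        # CLIP preprocessing resizes/normalizes each image to 224x224.
        pixels = torch.stack([preprocess(obj.image_data) for obj in chunk]).cuda()
        with torch.no_grad():
            embeddings = model.encode_image(pixels)  # open_clip-style API assumed
        for obj, emb in zip(chunk, embeddings):
            obj.embedding = emb.cpu().numpy()  # stored on ImageObject.embedding
    return image_batch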

Usage#

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Create pipeline
pipeline = Pipeline(name="image_embedding", description="Generate CLIP embeddings for images")

# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
    file_paths="/path/to/tar_dataset",
    files_per_partition=1,
    file_extensions=[".tar"],
))

# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
    task_batch_size=100,
    num_gpus_per_worker=0.25,
))

# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
    verbose=True,
))

# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
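
After the run, you can spot-check the generated embeddings. The snippet below assumes pipeline.run() returns the final ImageBatch tasks and that each task exposes its ImageObject instances via a .data attribute; verify the task API in your NeMo Curator version.

# Inspect a few embeddings from the pipeline output (assumes each result
# task is an ImageBatch whose .data is a list of ImageObject).
for task in results:
    for obj in task.data[:3]:
        print(obj.image_id, obj.embedding.shape)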

Parameters#

| Parameter                   | Type  | Default | Description                                                             |
|-----------------------------|-------|---------|-------------------------------------------------------------------------|
| model_dir                   | str   | None    | Path to directory containing CLIP model weights                         |
| model_inference_batch_size  | int   | 32      | Batch size for model inference                                          |
| num_gpus_per_worker         | float | 0.25    | GPU allocation per worker (0.25 = 1/4 GPU)                              |
| remove_image_data           | bool  | False   | Whether to remove image data after embedding generation (saves memory)  |
| verbose                     | bool  | False   | Enable verbose logging for debugging                                    |
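
Assuming the defaults above, a minimal construction only needs a value for model_dir:

from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Minimal configuration relying on the defaults listed above.
embedding_stage = ImageEmbeddingStage(model_dir="/path/to/models")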

Performance Notes#

  • The CLIP model requires GPU acceleration for reasonable performance.

  • Increase model_inference_batch_size for better throughput if GPU memory allows.

  • Set remove_image_data=True if you don’t need the raw image data for downstream stages.

  • The stage automatically handles different image sizes by preprocessing them to 224x224 (see the sketch below).
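
The resizing behavior mentioned above mirrors standard CLIP preprocessing. The torchvision sketch below shows an equivalent transform; it is an illustration only, as the stage applies its own bundled preprocessing.

from torchvision import transforms

# Approximation of CLIP ViT-L/14 preprocessing: resize, center-crop to
# 224x224, convert to tensor, and normalize with CLIP's statistics.
clip_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(
        mean=(0.48145466, 0.4578275, 0.40821073),
        std=(0.26862954, 0.26130258, 0.27577711),
    ),
])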

Best Practices#

  • Use GPU-enabled environments for best performance.

  • Adjust model_inference_batch_size based on available GPU memory (start with 32, increase if memory allows; see the memory check below).

  • Set remove_image_data=True for memory efficiency if downstream stages only need embeddings.

  • Monitor GPU utilization and adjust num_gpus_per_worker accordingly.
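
To size model_inference_batch_size against available memory, a quick check with PyTorch can help; the exact headroom you need depends on your environment and image resolution.

import torch

# Rough view of GPU memory headroom before picking a batch size.
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"Free GPU memory: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")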

Output Format#

After processing, each ImageObject will have:

ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Raw image data (if remove_image_data=False)
    embedding=np.array(...),   # CLIP embedding vector
    metadata={}
)
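
The stored embedding can be consumed directly by downstream stages. For example, semantic deduplication compares embeddings by cosine similarity, which you can reproduce with NumPy; obj_a and obj_b below are hypothetical ImageObject instances with populated embeddings.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# obj_a and obj_b are hypothetical ImageObject instances with embeddings set.
similarity = cosine_similarity(obj_a.embedding, obj_b.embedding)
print(f"Cosine similarity: {similarity:.3f}")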

Additional Resources#