CLIP ImageEmbeddingStage#
The ImageEmbeddingStage generates CLIP embeddings for images using OpenAI's ViT-L/14 model. These embeddings are essential for downstream tasks such as aesthetic filtering, NSFW detection, and semantic deduplication.
Model Details#
Architecture: OpenAI CLIP ViT-L/14 model
Output Field: embedding (stored in ImageObject.embedding)
Embedding Dimension: 768, as produced by the ViT-L/14 model
Input Requirements: RGB images loaded by ImageReaderStage
How It Works#
The stage processes ImageBatch objects containing ImageObject instances with loaded image data. It applies CLIP preprocessing, generates embeddings in batches, and stores the result in each ImageObject.embedding attribute.
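Conceptually, each image goes through CLIP preprocessing and one forward pass of the image encoder. The sketch below reproduces that computation with the Hugging Face transformers CLIP implementation purely as an illustration; it is not the stage's internal code, and the openai/clip-vit-large-patch14 checkpoint name is an assumption.

# Illustrative only: equivalent embedding computation with Hugging Face transformers,
# not the internal implementation of ImageEmbeddingStage.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval().to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

images = [Image.open("000000031.jpg").convert("RGB")]  # hypothetical input image
inputs = processor(images=images, return_tensors="pt").to(device)

with torch.no_grad():
    features = model.get_image_features(**inputs)  # shape: (batch, 768) for ViT-L/14
    # L2-normalization is typical before cosine-similarity use; whether the stage
    # normalizes its stored embeddings is not documented here.
    features = features / features.norm(dim=-1, keepdim=True)

embedding = features[0].cpu().numpy()  # analogous to what lands in ImageObject.embedding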
Usage#
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
# Create pipeline
pipeline = Pipeline(name="image_embedding", description="Generate CLIP embeddings for images")
# Stage 1: Partition tar files
pipeline.add_stage(FilePartitioningStage(
file_paths="/path/to/tar_dataset",
files_per_partition=1,
file_extensions=[".tar"],
))
# Stage 2: Read images
pipeline.add_stage(ImageReaderStage(
task_batch_size=100,
num_gpus_per_worker=0.25,
))
# Stage 3: Generate CLIP embeddings
pipeline.add_stage(ImageEmbeddingStage(
model_dir="/path/to/models",
model_inference_batch_size=32,
num_gpus_per_worker=0.25,
remove_image_data=False,
verbose=True,
))
# Run the pipeline (uses XennaExecutor by default)
results = pipeline.run()
Parameters#
Parameter | Type | Default | Description
---|---|---|---
model_dir | str | None | Path to the directory containing CLIP model weights
model_inference_batch_size | int | 32 | Batch size for model inference
num_gpus_per_worker | float | 0.25 | GPU allocation per worker (0.25 = 1/4 GPU)
remove_image_data | bool | False | Whether to remove raw image data after embedding generation (saves memory)
verbose | bool | False | Enable verbose logging for debugging
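Only model_dir has no usable default, so a minimal construction can rely on the defaults listed above:

# Minimal construction: every other parameter falls back to the defaults in the table
stage = ImageEmbeddingStage(model_dir="/path/to/models")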
Performance Notes#
The CLIP model requires GPU acceleration for reasonable performance.
Increase model_inference_batch_size for better throughput if GPU memory allows.
Set remove_image_data=True if you don't need the raw image data for downstream stages.
The stage automatically handles different image sizes by preprocessing them to 224x224.
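For example, a throughput-oriented configuration might combine a larger inference batch with remove_image_data=True. The batch size of 128 below is an illustrative value, not a recommendation; tune it to your GPU memory.

# Assumed throughput-oriented settings: larger inference batches, raw image data
# dropped after embedding generation. 128 is an illustrative batch size.
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=128,
    remove_image_data=True,
))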
Best Practices#
Use GPU-enabled environments for best performance.
Adjust model_inference_batch_size based on available GPU memory (start with 32, increase if memory allows).
Set remove_image_data=True for memory efficiency if downstream stages only need embeddings.
Monitor GPU utilization and adjust num_gpus_per_worker accordingly.
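A quick way to gauge headroom before raising the batch size is to check free GPU memory. This snippet uses standard PyTorch CUDA utilities and is independent of NeMo Curator.

# Check free GPU memory before raising model_inference_batch_size.
# Uses standard PyTorch utilities; not part of the NeMo Curator API.
import torch

free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"GPU memory free: {free_bytes / 1e9:.1f} GB of {total_bytes / 1e9:.1f} GB")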
Output Format#
After processing, each ImageObject will have:
ImageObject(
    image_path="00000.tar/000000031.jpg",
    image_id="000000031",
    image_data=np.array(...),  # Raw image data (if remove_image_data=False)
    embedding=np.array(...),   # CLIP embedding vector
    metadata={}
)
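Downstream code can read the embedding directly from each ImageObject. As a rough sketch, assuming you have collected the processed ImageObject instances into a Python list (how you obtain them depends on your executor), the embeddings can be stacked into a single matrix:

# Hypothetical post-processing: stack per-image CLIP embeddings into one matrix.
# `image_objects` is assumed to be a list of processed ImageObject instances.
import numpy as np

embeddings = np.stack([obj.embedding for obj in image_objects])
print(embeddings.shape)  # (num_images, embedding_dim)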