Data Processing Concepts (Image)#
This page covers the core concepts for processing image data in NeMo Curator.
Embedding Generation#
Image embeddings are vector representations of images, used for downstream tasks like classification, filtering, and deduplication.
ImageEmbeddingStage: Uses CLIP ViT-L/14 model for high-quality embedding generation. Supports GPU acceleration, batching, and automatic CPU fallback.
CLIP Integration: Built-in CLIP model provides robust embeddings for aesthetic and NSFW classification.
Pipeline Integration: Embedding generation integrates seamlessly into NeMo Curator’s pipeline architecture.
Example:
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
# Add to pipeline
pipeline.add_stage(ImageEmbeddingStage(
model_dir="/path/to/models",
model_inference_batch_size=32,
num_gpus_per_worker=0.25,
remove_image_data=False,
))
Filtering#
Image filtering stages score and filter images based on their embeddings in a single operation. These stages must always be run after embedding generation, as they require pre-computed embeddings as input.
ImageAestheticFilterStage: Predicts aesthetic scores (0–1) and automatically filters out images below the threshold
ImageNSFWFilterStage: Predicts NSFW probability (0–1) and automatically filters out images above the threshold
Pipeline Integration: Filtering stages must be run after embedding generation in the same pipeline
Note
Filtering stages combine scoring and filtering in one operation. They take embeddings as input, generate scores using specialized models, and automatically remove images that don’t meet the configured thresholds. Embeddings must be generated by a separate embedding stage first.
Example:
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
# Add to pipeline
pipeline.add_stage(ImageAestheticFilterStage(
model_dir="/path/to/models",
score_threshold=0.5,
model_inference_batch_size=32,
))
pipeline.add_stage(ImageNSFWFilterStage(
model_dir="/path/to/models",
score_threshold=0.5,
model_inference_batch_size=32,
))
How Filtering Works:
Aesthetic Filtering: Images with scores below
score_threshold
are automatically filtered outNSFW Filtering: Images with scores above
score_threshold
are automatically filtered outSeamless Processing: Filtering happens automatically within the stages—they remove images that don’t meet criteria from the
ImageBatch
before passing to the next stage
Deduplication#
Image deduplication identifies and removes duplicate images using semantic similarity based on embeddings. NeMo Curator provides a complete workflow from embedding-based duplicate detection to image removal.
Semantic Duplicate Detection#
SemanticDeduplicationWorkflow: Uses embeddings to identify duplicates through clustering and pairwise similarity
Embedding-based: Leverages CLIP embeddings generated in previous pipeline stages
Configurable Similarity: Control duplicate detection strictness with similarity thresholds
Scalable Processing: GPU-accelerated clustering and similarity computation
Example:
from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow
# Run semantic deduplication on embeddings
dedup_workflow = SemanticDeduplicationWorkflow(
input_path="/path/to/embeddings",
output_path="/path/to/removal_ids",
id_field="image_id",
embedding_field="embedding",
n_clusters=100,
eps=0.01, # Similarity threshold (lower = more strict)
)
dedup_workflow.run()
Duplicate Removal#
ImageDuplicatesRemovalStage: Filters out images based on duplicate IDs identified by semantic deduplication
ID-based Removal: Uses image identifiers to remove duplicates efficiently
Pipeline Integration: Runs after embedding and classification stages using duplicate IDs from semantic workflow
Example:
from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage
# Add to pipeline after running semantic deduplication
pipeline.add_stage(ImageDuplicatesRemovalStage(
removal_parquets_dir="/path/to/removal_ids/duplicates",
duplicate_id_field="id",
))
See also
For a complete end-to-end workflow, refer to the Image Duplicate Removal Tutorial.
Pipeline Flow#
A typical image curation pipeline using NeMo Curator’s stage-based architecture:
Partition tar files (
FilePartitioningStage
)Load images from tar archives (
ImageReaderStage
)Generate embeddings (
ImageEmbeddingStage
) - Required before filtering stagesFilter by aesthetics (
ImageAestheticFilterStage
) - Requires embeddings from step 3Filter NSFW content (
ImageNSFWFilterStage
) - Requires embeddings from step 3Remove duplicates (
ImageDuplicatesRemovalStage
) - requires runningSemanticDeduplicationWorkflow
first
Example:
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage
# Build pipeline
pipeline.add_stage(FilePartitioningStage(file_paths="/path/to/tars"))
pipeline.add_stage(ImageReaderStage())
pipeline.add_stage(ImageEmbeddingStage(model_dir="/path/to/models"))
pipeline.add_stage(ImageAestheticFilterStage(model_dir="/path/to/models", score_threshold=0.5))
pipeline.add_stage(ImageNSFWFilterStage(model_dir="/path/to/models", score_threshold=0.5))
# Optional: Remove duplicates (run semantic deduplication workflow first)
pipeline.add_stage(ImageDuplicatesRemovalStage(
removal_parquets_dir="/path/to/removal_ids/duplicates",
duplicate_id_field="id",
))
This modular pipeline approach allows you to customize or skip stages based on your workflow needs. Filtering stages (aesthetic and NSFW filtering) must always follow embedding generation, as they require pre-computed embeddings as input.
Note
For duplicate removal, you’ll need to run the semantic deduplication workflow separately between embedding generation and the removal stage. See the Image Duplicate Removal Tutorial for the complete three-step process: embedding generation → semantic deduplication → duplicate removal.