Data Processing Concepts (Image)
This page covers the core concepts for processing image data in NeMo Curator.
Embedding Generation
Image embeddings are vector representations of images, used for downstream tasks like classification, filtering, and deduplication.
- ImageEmbeddingStage: Uses CLIP ViT-L/14 model for high-quality embedding generation. Supports GPU acceleration, batching, and automatic CPU fallback.
- CLIP Integration: Built-in CLIP model provides robust embeddings for aesthetic and NSFW classification.
- Pipeline Integration: Embedding generation integrates seamlessly into NeMo Curator’s pipeline architecture.
Example:
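As a library-independent illustration of batched embedding generation, the sketch below encodes images in fixed-size batches and L2-normalizes each vector. It does not use NeMo Curator's API: `fake_encoder` is a hypothetical stand-in for a CLIP image encoder, and `embed_in_batches`, the batch size, and the embedding dimension are assumptions chosen for the example.

```python
import numpy as np

def fake_encoder(batch: np.ndarray, dim: int = 8) -> np.ndarray:
    """Hypothetical stand-in for an image encoder: maps (B, H, W, C) to (B, dim)."""
    flat = batch.reshape(batch.shape[0], -1).astype(np.float32)
    # Fixed seed so every batch is projected with the same matrix.
    rng = np.random.default_rng(0)
    projection = rng.standard_normal((flat.shape[1], dim)).astype(np.float32)
    return flat @ projection

def embed_in_batches(images: np.ndarray, batch_size: int = 4) -> np.ndarray:
    """Encode images batch by batch and L2-normalize each embedding."""
    chunks = []
    for start in range(0, len(images), batch_size):
        emb = fake_encoder(images[start:start + batch_size])
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        chunks.append(emb / np.clip(norms, 1e-12, None))
    return np.concatenate(chunks, axis=0)

# Ten random 16x16 RGB "images" stand in for decoded image data.
images = np.random.default_rng(1).random((10, 16, 16, 3))
embeddings = embed_in_batches(images)
```

Normalizing the vectors is an assumption made here so that a dot product between two embeddings equals their cosine similarity, which is convenient for the similarity-based steps later in the page.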
Filtering
Image filtering stages score and filter images based on their embeddings in a single operation. These stages must always be run after embedding generation, as they require pre-computed embeddings as input.
- ImageAestheticFilterStage: Predicts aesthetic scores (0–1) and automatically filters out images below the threshold
- ImageNSFWFilterStage: Predicts NSFW probability (0–1) and automatically filters out images above the threshold
- Pipeline Integration: Filtering stages must be run after embedding generation in the same pipeline
Each filtering stage takes embeddings as input, generates scores with a specialized model, and removes images that do not meet the configured threshold.
Example:
How Filtering Works:
- Aesthetic Filtering: Images with scores below score_threshold are automatically filtered out
- NSFW Filtering: Images with scores above score_threshold are automatically filtered out
- Seamless Processing: Filtering happens automatically within the stages; they remove images that don’t meet criteria from the ImageBatch before passing to the next stage
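The threshold rules above can be sketched with plain NumPy. This is a minimal illustration, not the stages' implementation: `keep_mask`, the `drop` argument, and the sample scores are hypothetical, and keeping images whose score equals the threshold is an assumption.

```python
import numpy as np

def keep_mask(scores: np.ndarray, score_threshold: float, drop: str) -> np.ndarray:
    """Return a boolean mask of images to keep.

    drop="below" mimics aesthetic filtering (low scores removed);
    drop="above" mimics NSFW filtering (high scores removed).
    """
    if drop == "below":
        return scores >= score_threshold
    if drop == "above":
        return scores <= score_threshold
    raise ValueError("drop must be 'below' or 'above'")

image_ids = np.array(["a", "b", "c", "d"])
aesthetic = np.array([0.9, 0.3, 0.6, 0.7])  # made-up aesthetic scores
nsfw = np.array([0.1, 0.2, 0.8, 0.4])       # made-up NSFW probabilities

# An image survives only if it passes both filters.
mask = keep_mask(aesthetic, 0.5, "below") & keep_mask(nsfw, 0.5, "above")
kept = image_ids[mask].tolist()
```

With these sample values, image "b" is dropped for its low aesthetic score and image "c" for its high NSFW probability.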
Deduplication
Image deduplication identifies and removes duplicate images using semantic similarity based on embeddings. NeMo Curator provides a complete workflow from embedding-based duplicate detection to image removal.
Semantic Duplicate Detection
- SemanticDeduplicationWorkflow: Uses embeddings to identify duplicates through clustering and pairwise similarity
- Embedding-based: Leverages CLIP embeddings generated in previous pipeline stages
- Configurable Similarity: Control duplicate detection strictness with similarity thresholds
- Scalable Processing: GPU-accelerated clustering and similarity computation
Example:
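As a library-independent illustration of similarity-based duplicate detection, the sketch below marks an image as a duplicate when its cosine similarity to an already kept image reaches a threshold. It is not the SemanticDeduplicationWorkflow API: `find_duplicate_ids`, `similarity_threshold`, and the toy embeddings are hypothetical, and the clustering step is omitted, so every image is compared against every kept image.

```python
import numpy as np

def find_duplicate_ids(ids, embeddings, similarity_threshold=0.95):
    """Greedy pass: an item is a duplicate if it is close to any earlier kept item."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarity
    kept_idx, duplicates = [], []
    for i in range(len(ids)):
        if any(sim[i, j] >= similarity_threshold for j in kept_idx):
            duplicates.append(ids[i])
        else:
            kept_idx.append(i)
    return duplicates

ids = ["img0", "img1", "img2", "img3"]
embeddings = np.array([
    [1.00, 0.00],
    [0.99, 0.05],  # nearly parallel to img0
    [0.00, 1.00],
    [0.02, 1.00],  # nearly parallel to img2
])
duplicate_ids = find_duplicate_ids(ids, embeddings)
```

Raising the threshold makes detection stricter (fewer images flagged); lowering it flags looser matches. At scale, restricting comparisons to images within the same cluster avoids the full pairwise matrix used here.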
Duplicate Removal
- ImageDuplicatesRemovalStage: Filters out images based on duplicate IDs identified by semantic deduplication
- ID-based Removal: Uses image identifiers to remove duplicates efficiently
- Pipeline Integration: Runs after the embedding and classification stages, using the duplicate IDs produced by the semantic deduplication workflow
Example:
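As a library-independent illustration of ID-based removal, the sketch below drops every record whose identifier appears in a list of duplicate IDs. It is not the ImageDuplicatesRemovalStage API: `remove_duplicates`, the `image_id` field name, and the sample records are hypothetical.

```python
def remove_duplicates(records, duplicate_ids, id_field="image_id"):
    """Return the records whose ID is not in duplicate_ids."""
    drop = set(duplicate_ids)  # set membership keeps each lookup O(1)
    return [record for record in records if record[id_field] not in drop]

records = [
    {"image_id": "img0", "path": "a.jpg"},
    {"image_id": "img1", "path": "b.jpg"},
    {"image_id": "img2", "path": "c.jpg"},
]
remaining = remove_duplicates(records, duplicate_ids=["img1"])
remaining_ids = [record["image_id"] for record in remaining]
```

Because removal depends only on identifiers, this step does not need the embeddings themselves, only the IDs reported by the detection step.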
For a complete end-to-end workflow, refer to the Image Duplicate Removal Tutorial.
Pipeline Flow
A typical image curation pipeline using NeMo Curator’s stage-based architecture:
1. Partition tar files (FilePartitioningStage)
2. Load images from tar archives (ImageReaderStage)
3. Generate embeddings (ImageEmbeddingStage) - Required before filtering stages
4. Filter by aesthetics (ImageAestheticFilterStage) - Requires embeddings from step 3
5. Filter NSFW content (ImageNSFWFilterStage) - Requires embeddings from step 3
6. Remove duplicates (ImageDuplicatesRemovalStage) - Requires running SemanticDeduplicationWorkflow first
Example:
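As a library-independent illustration of the ordering constraint, the sketch below models each stage by the data it requires and produces, then checks that every requirement is satisfied by an earlier stage. It does not use NeMo Curator's pipeline classes: `StageSpec`, `check_order`, and the field names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class StageSpec:
    name: str
    requires: set = field(default_factory=set)
    produces: set = field(default_factory=set)

def check_order(stages):
    """Return a list of problems; an empty list means the order is valid."""
    available, problems = set(), []
    for stage in stages:
        missing = stage.requires - available
        if missing:
            problems.append(f"{stage.name} is missing {sorted(missing)}")
        available |= stage.produces
    return problems

pipeline = [
    StageSpec("partition", produces={"file_groups"}),
    StageSpec("read", requires={"file_groups"}, produces={"images"}),
    StageSpec("embed", requires={"images"}, produces={"embeddings"}),
    StageSpec("aesthetic_filter", requires={"embeddings"}),
    StageSpec("nsfw_filter", requires={"embeddings"}),
]
problems_ok = check_order(pipeline)

# Moving a filter ahead of the embedding stage is reported as a problem.
misordered = [pipeline[0], pipeline[1], pipeline[3], pipeline[2], pipeline[4]]
problems_bad = check_order(misordered)
```

Skipping a filter stage leaves the order valid, while skipping the embedding stage makes both filters fail the check.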
This modular pipeline approach allows you to customize or skip stages based on your workflow needs. Filtering stages (aesthetic and NSFW filtering) must always follow embedding generation, as they require pre-computed embeddings as input.
For duplicate removal, you’ll need to run the semantic deduplication workflow separately between embedding generation and the removal stage. See the Image Duplicate Removal Tutorial for the complete three-step process: embedding generation → semantic deduplication → duplicate removal.