Image Duplicate Removal Workflow
Learn how to run a complete image duplicate removal workflow that generates embeddings, identifies semantic duplicates, and removes those duplicates from your dataset.
Before You Start
- Complete the Get Started guide.
- Understand basic pipeline concepts from the Image Beginner Tutorial.
1. Generate Image Embeddings
Create CLIP embeddings for all images in your dataset. This pipeline reads images, generates embeddings, and saves them as Parquet files for the duplicate removal step.
Define the Embedding Pipeline
Run Embedding Generation
Embedding Format Example
The pipeline writes embeddings to Parquet with two columns:
- image_id: String identifier for the image
- embedding: List of float values with length 768 (CLIP ViT-L/14 dimension)
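The two-column layout above can be checked with a small validator before running later steps. This is a minimal sketch in plain Python, not part of the pipeline itself: `validate_row`, `EMBEDDING_DIM`, and the sample row's ID and values are hypothetical names and data chosen for illustration.

```python
from typing import Any, Dict

# Length stated above for CLIP ViT-L/14 embeddings.
EMBEDDING_DIM = 768


def validate_row(row: Dict[str, Any], dim: int = EMBEDDING_DIM) -> None:
    """Raise ValueError if a row does not match the two-column layout."""
    if set(row) != {"image_id", "embedding"}:
        raise ValueError(f"unexpected columns: {sorted(row)}")
    if not isinstance(row["image_id"], str):
        raise ValueError("image_id must be a string")
    embedding = row["embedding"]
    if len(embedding) != dim:
        raise ValueError(f"expected {dim} values, got {len(embedding)}")
    if not all(isinstance(value, float) for value in embedding):
        raise ValueError("embedding values must be floats")


# Hypothetical sample row; the ID and the zero vector are made up.
sample_row = {"image_id": "img_000001", "embedding": [0.0] * EMBEDDING_DIM}
validate_row(sample_row)  # passes silently when the row is well formed
```

Running a check like this on a handful of rows catches a wrong model variant (and therefore a wrong embedding length) early.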
Directory layout
Schema
Sample row
Read example
2. Run Semantic Duplicate Removal
Use the semantic duplicate removal workflow to identify and mark duplicate images based on embedding similarity.
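To make "duplicate based on embedding similarity" concrete, the sketch below marks an image as a duplicate when its cosine similarity to an already kept image meets a threshold. It is a brute-force illustration of the general idea, not the workflow's actual implementation, which may cluster embeddings first or use different thresholds. The function names, the `threshold` value, and the toy vectors are assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(
    embeddings: Dict[str, Sequence[float]], threshold: float = 0.99
) -> List[str]:
    """Return IDs whose embedding is within `threshold` of an earlier kept ID."""
    kept: List[str] = []
    duplicates: List[str] = []
    for image_id, embedding in embeddings.items():
        if any(
            cosine_similarity(embedding, embeddings[other]) >= threshold
            for other in kept
        ):
            duplicates.append(image_id)
        else:
            kept.append(image_id)
    return duplicates


# Toy 2-dimensional vectors (real CLIP embeddings have 768 values).
toy = {"a": [1.0, 0.0], "b": [0.999, 0.01], "c": [0.0, 1.0]}
dups = find_duplicates(toy, threshold=0.99)
```

This comparison is quadratic in the number of images, so it only suits small samples. It is still useful for checking by hand what a given threshold treats as a duplicate.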
3. Remove Duplicate Images
After identifying duplicates, use ImageDuplicatesRemovalStage to filter them out of the original dataset and create the final deduplicated dataset.
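Conceptually, the removal step is a filter over the original records using the set of duplicate IDs from the previous step. The sketch below shows that logic in plain Python. It does not use or reproduce ImageDuplicatesRemovalStage, and the `remove_duplicates` helper, the record layout, and the file names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Set


def remove_duplicates(
    records: Iterable[Dict[str, str]], duplicate_ids: Set[str]
) -> Iterator[Dict[str, str]]:
    """Yield only the records whose image_id is not marked as a duplicate."""
    for record in records:
        if record["image_id"] not in duplicate_ids:
            yield record


# Hypothetical records; a real dataset would carry image data as well.
dataset = [
    {"image_id": "a", "path": "a.jpg"},
    {"image_id": "b", "path": "b.jpg"},
    {"image_id": "c", "path": "c.jpg"},
]
deduplicated = list(remove_duplicates(dataset, duplicate_ids={"b"}))
```

Holding the duplicate IDs in a set keeps each membership test constant time, which matters when the ID list is large.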
Run the Removal Pipeline
4. Inspect Results
After deduplication, examine the results to understand what was removed:
Check Removal Statistics
Compare Dataset Sizes
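As a rough illustration of the statistics and size comparison named above, the sketch below derives the removed count and removal rate from the dataset sizes before and after deduplication. The `removal_stats` helper and the example counts are hypothetical. In practice the counts would come from the original and deduplicated outputs.

```python
from typing import Dict, Union


def removal_stats(
    original_count: int, final_count: int
) -> Dict[str, Union[int, float]]:
    """Summarize how many images deduplication removed."""
    if original_count < 0 or final_count < 0:
        raise ValueError("counts must be non-negative")
    if final_count > original_count:
        raise ValueError("final dataset cannot be larger than the original")
    removed = original_count - final_count
    # Guard against an empty input dataset.
    rate = removed / original_count if original_count else 0.0
    return {
        "original": original_count,
        "final": final_count,
        "removed": removed,
        "removal_rate": rate,
    }


# Made-up counts for illustration only.
stats = removal_stats(original_count=1000, final_count=850)
print(f"Removed {stats['removed']} images ({stats['removal_rate']:.1%})")
```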
5. Complete Workflow Script
Here’s the complete workflow that combines all steps:
Next Steps
After running image deduplication:
- Quality assessment: Manually review a sample of removed duplicates
- Combine with filtering: Run aesthetic/NSFW filtering on deduplicated data
- Export for training: Prepare final curated dataset for ML training
- Monitor metrics: Track deduplication rates across different image types