Learn how to run a complete image duplicate removal workflow that generates embeddings, identifies semantic duplicates, and removes similar images from your dataset.
Create CLIP embeddings for all images in your dataset. This pipeline reads images, generates embeddings, and saves them to Parquet format for duplicate removal processing.
The pipeline writes embeddings to Parquet with two columns:
Use the semantic duplicate removal workflow to identify and mark duplicate images based on embedding similarity.
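To make the idea of "duplicates based on embedding similarity" concrete, here is a minimal standard-library sketch. It is not the workflow's actual implementation: it swaps in a simple greedy pairwise comparison with cosine similarity, and the names `cosine_similarity`, `find_duplicate_ids`, and the `threshold` value of 0.95 are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_ids(embeddings, threshold=0.95):
    """Return the IDs considered duplicates of an earlier image.

    `embeddings` maps an image ID to its embedding vector. The first image
    seen in each group of near-identical embeddings is kept; later ones are
    marked as duplicates. The threshold is an illustrative assumption.
    """
    kept = []
    duplicates = set()
    for image_id, vector in embeddings.items():
        is_duplicate = any(
            cosine_similarity(vector, embeddings[kept_id]) >= threshold
            for kept_id in kept
        )
        if is_duplicate:
            duplicates.add(image_id)
        else:
            kept.append(image_id)
    return duplicates


# Toy two-dimensional "embeddings"; real CLIP vectors are much longer.
example_embeddings = {
    "a.jpg": [1.0, 0.0],
    "b.jpg": [0.99, 0.05],  # nearly the same direction as a.jpg
    "c.jpg": [0.0, 1.0],    # orthogonal to a.jpg
}
duplicate_ids = find_duplicate_ids(example_embeddings, threshold=0.95)
```

This brute-force comparison is quadratic in the number of images, so it only suits small illustrations. Raising the threshold marks fewer images as duplicates, and lowering it marks more.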
After identifying duplicates, use ImageDuplicatesRemovalStage to filter them from your dataset.
Filter the original dataset to remove identified duplicates and create the final deduplicated dataset.
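Conceptually, the filtering step drops every record whose ID appears in the set of identified duplicates. The sketch below shows that logic on plain Python dictionaries. The `remove_duplicates` helper, the `image_id` key, and the sample records are hypothetical and do not reflect the stage's real interface.

```python
def remove_duplicates(records, duplicate_ids, id_key="image_id"):
    """Keep only records whose ID is not in `duplicate_ids`.

    `records` is a list of dicts; `id_key` names the field holding the
    image ID (an assumed field name for this illustration).
    """
    duplicate_ids = set(duplicate_ids)  # set gives O(1) membership checks
    return [record for record in records if record[id_key] not in duplicate_ids]


# Hypothetical dataset records and duplicate IDs.
records = [
    {"image_id": "a.jpg", "path": "images/a.jpg"},
    {"image_id": "b.jpg", "path": "images/b.jpg"},
    {"image_id": "c.jpg", "path": "images/c.jpg"},
]
deduplicated = remove_duplicates(records, {"b.jpg"})
```

The original order of the remaining records is preserved, which keeps the output easy to compare against the input.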
After deduplication, examine the results to understand what was removed:
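One simple way to examine the results is to compare the image IDs before and after deduplication. The following sketch assumes you can obtain both lists of IDs. The `summarize_removal` function and its output keys are hypothetical names chosen for this example.

```python
def summarize_removal(original_ids, remaining_ids):
    """Summarize how many images were removed and which ones."""
    original = set(original_ids)
    remaining = set(remaining_ids) & original  # ignore IDs not in the original
    removed = sorted(original - remaining)
    total = len(original)
    return {
        "original": total,
        "remaining": len(remaining),
        "removed": len(removed),
        "removed_fraction": len(removed) / total if total else 0.0,
        "removed_ids": removed,
    }


# Hypothetical IDs before and after deduplication.
summary = summarize_removal(
    ["a.jpg", "b.jpg", "c.jpg"],
    ["a.jpg", "c.jpg"],
)
```

An unexpectedly large removed fraction can indicate that the similarity threshold was too permissive, so spot-checking a few of the removed IDs is worthwhile.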
Here’s the complete workflow that combines all steps:
After running image deduplication: