Data Processing Concepts (Image)
This page covers the core concepts for processing image data in NeMo Curator.
Image embeddings are vector representations of images, used for downstream tasks like classification, filtering, and deduplication.
Example:
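As a minimal sketch of the idea (not NeMo Curator's API), the snippet below treats embeddings as plain lists of floats and compares them with cosine similarity. The vectors and the `cosine_similarity` helper are hypothetical and chosen for illustration; real embeddings come from a model and have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings, for illustration only.
cat_photo = [0.9, 0.1, 0.0, 0.2]
cat_photo_resized = [0.88, 0.12, 0.01, 0.21]
street_scene = [0.0, 0.8, 0.6, 0.1]

near_duplicate_score = cosine_similarity(cat_photo, cat_photo_resized)
unrelated_score = cosine_similarity(cat_photo, street_scene)
print(f"near duplicate: {near_duplicate_score:.3f}, unrelated: {unrelated_score:.3f}")
```

Similar images map to vectors that point in nearly the same direction, which is what makes embeddings usable for classification, filtering, and deduplication.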
Image filtering stages score and filter images in a single operation. They take pre-computed embeddings as input, generate scores using specialized models, and automatically remove images that don’t meet the configured thresholds. Because they require embeddings as input, these stages must always run after a separate embedding stage.
Example:
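The threshold logic can be sketched in plain Python. This is not NeMo Curator code: `ScoredImage`, `filter_by_threshold`, and the `keep_high` flag are hypothetical names, and the sketch assumes that higher aesthetic scores are better while higher NSFW scores are worse. The scores are made up.

```python
from dataclasses import dataclass


@dataclass
class ScoredImage:
    """Hypothetical record: an image ID plus a score from a scoring model."""

    image_id: str
    score: float


def filter_by_threshold(
    images: list[ScoredImage],
    score_threshold: float,
    keep_high: bool,
) -> list[ScoredImage]:
    """Keep only images on the passing side of score_threshold.

    keep_high=True keeps scores >= threshold (assumed for aesthetic filtering).
    keep_high=False keeps scores <= threshold (assumed for NSFW filtering).
    """
    if keep_high:
        return [img for img in images if img.score >= score_threshold]
    return [img for img in images if img.score <= score_threshold]


# Made-up scores, for illustration only.
aesthetic_scores = [
    ScoredImage("a.jpg", 0.72),
    ScoredImage("b.jpg", 0.31),
    ScoredImage("c.jpg", 0.55),
]
kept_aesthetic = filter_by_threshold(aesthetic_scores, score_threshold=0.5, keep_high=True)

nsfw_scores = [ScoredImage("a.jpg", 0.02), ScoredImage("c.jpg", 0.91)]
kept_safe = filter_by_threshold(nsfw_scores, score_threshold=0.5, keep_high=False)

print([img.image_id for img in kept_aesthetic])
print([img.image_id for img in kept_safe])
```

Scoring and removal happen in one pass here, mirroring the single-operation behavior of the filtering stages.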
How Filtering Works:
- Images whose scores do not meet score_threshold are automatically filtered out. This applies to each filtering stage.
- Filtered images are removed from the ImageBatch before passing to the next stage.
Image deduplication identifies and removes duplicate images using semantic similarity based on embeddings. NeMo Curator provides a complete workflow from embedding-based duplicate detection to image removal.
Example:
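The following is a simplified stand-in for embedding-based deduplication, not the algorithm or API that NeMo Curator uses. It assumes a greedy rule: keep the first image seen and flag any later image whose cosine similarity to a kept image reaches a threshold. All names (`find_duplicates`, `remove_duplicates`, `similarity_threshold`) and vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_duplicates(
    embeddings: dict[str, list[float]],
    similarity_threshold: float,
) -> list[str]:
    """Step 1: detect duplicates. Later images too similar to a kept one are flagged."""
    kept: list[str] = []
    duplicates: list[str] = []
    for image_id, vector in embeddings.items():
        is_duplicate = any(
            cosine_similarity(vector, embeddings[kept_id]) >= similarity_threshold
            for kept_id in kept
        )
        if is_duplicate:
            duplicates.append(image_id)
        else:
            kept.append(image_id)
    return duplicates


def remove_duplicates(image_ids: list[str], duplicate_ids: list[str]) -> list[str]:
    """Step 2: remove the flagged IDs from the dataset, preserving order."""
    drop = set(duplicate_ids)
    return [image_id for image_id in image_ids if image_id not in drop]


# Hypothetical 3-dimensional embeddings, for illustration only.
embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "cat_copy.jpg": [0.89, 0.11, 0.01],
    "dog.jpg": [0.1, 0.9, 0.2],
}
duplicate_ids = find_duplicates(embeddings, similarity_threshold=0.95)
remaining = remove_duplicates(list(embeddings), duplicate_ids)
print(duplicate_ids, remaining)
```

Detection and removal are kept as separate functions to mirror the split between identifying duplicates and removing them from the dataset.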
For a complete end-to-end workflow, refer to the Image Duplicate Removal Tutorial.
A typical image curation pipeline using NeMo Curator’s stage-based architecture:
1. File partitioning (FilePartitioningStage)
2. Image reading (ImageReaderStage)
3. Embedding generation (ImageEmbeddingStage) - Required before filtering stages
4. Aesthetic filtering (ImageAestheticFilterStage) - Requires embeddings from step 3
5. NSFW filtering (ImageNSFWFilterStage) - Requires embeddings from step 3
6. Duplicate removal (ImageDuplicatesRemovalStage) - Requires running SemanticDeduplicationWorkflow first
Example:
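The ordering constraint can be illustrated with a toy stage-based pipeline in plain Python. None of this is NeMo Curator's API: `embed_stage`, `make_filter_stage`, and `run_pipeline` are hypothetical stand-ins, and the "embedding" is a placeholder number derived from the file name rather than a model output.

```python
from typing import Callable

Batch = list[dict]
Stage = Callable[[Batch], Batch]


def embed_stage(batch: Batch) -> Batch:
    """Stand-in for embedding generation: attaches a fake one-number embedding."""
    return [{**item, "embedding": [float(len(item["path"]))]} for item in batch]


def make_filter_stage(score_threshold: float) -> Stage:
    """Build a stand-in filtering stage that needs pre-computed embeddings."""

    def filter_stage(batch: Batch) -> Batch:
        kept = []
        for item in batch:
            if "embedding" not in item:
                raise ValueError(
                    "filter stage requires embeddings; run the embedding stage first"
                )
            score = item["embedding"][0]  # placeholder for a real scoring model
            if score >= score_threshold:
                kept.append(item)
        return kept

    return filter_stage


def run_pipeline(stages: list[Stage], batch: Batch) -> Batch:
    """Run the stages in order, feeding each stage's output to the next."""
    for stage in stages:
        batch = stage(batch)
    return batch


images = [{"path": "a.jpg"}, {"path": "landscape.jpg"}]
result = run_pipeline([embed_stage, make_filter_stage(score_threshold=6.0)], images)
print([item["path"] for item in result])
```

Placing the filter stage before `embed_stage` in the list raises a `ValueError` in this sketch, which is the same dependency the steps above describe: filtering consumes embeddings, so embedding generation has to come first.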
This modular pipeline approach allows you to customize or skip stages based on your workflow needs. Filtering stages (aesthetic and NSFW filtering) must always follow embedding generation, as they require pre-computed embeddings as input.
For duplicate removal, you’ll need to run the semantic deduplication workflow separately between embedding generation and the removal stage. See the Image Duplicate Removal Tutorial for the complete three-step process: embedding generation → semantic deduplication → duplicate removal.