Data Processing Concepts (Image)

This page covers the core concepts for processing image data in NeMo Curator.

Embedding Generation

Image embeddings are vector representations of images, used for downstream tasks like classification, filtering, and deduplication.

  • ImageEmbeddingStage: Uses the CLIP ViT-L/14 model to generate embeddings. Supports GPU acceleration, batching, and automatic CPU fallback.
  • CLIP Integration: The built-in CLIP model provides the embeddings that the aesthetic and NSFW filtering stages consume.
  • Pipeline Integration: Embedding generation runs as a regular stage in NeMo Curator’s pipeline architecture.

Example:

from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage

# Add to pipeline
pipeline.add_stage(ImageEmbeddingStage(
    model_dir="/path/to/models",
    model_inference_batch_size=32,
    num_gpus_per_worker=0.25,
    remove_image_data=False,
))
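
To make the batching parameter more concrete, here is a minimal, library-independent sketch of splitting a list of inputs into batches of a fixed size before inference. The helper name `iter_batches` and the file names are hypothetical, and this is not a description of how ImageEmbeddingStage works internally.

```python
from __future__ import annotations

from typing import Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Hypothetical input: 70 image paths processed in batches of 32.
image_paths = [f"img_{i}.jpg" for i in range(70)]
batch_sizes = [len(batch) for batch in iter_batches(image_paths, 32)]
print(batch_sizes)  # prints [32, 32, 6]
```

The last batch is smaller than the others whenever the number of inputs is not a multiple of the batch size.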

Filtering

Image filtering stages score and filter images based on their embeddings in a single operation. These stages must always be run after embedding generation, as they require pre-computed embeddings as input.

  • ImageAestheticFilterStage: Predicts aesthetic scores (0–1) and automatically filters out images below the threshold
  • ImageNSFWFilterStage: Predicts NSFW probability (0–1) and automatically filters out images above the threshold
  • Pipeline Integration: Filtering stages must be run after embedding generation in the same pipeline

Each filtering stage takes embeddings as input, computes a score with a specialized model, and removes the images that do not meet the configured threshold.

Example:

from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage

# Add to pipeline
pipeline.add_stage(ImageAestheticFilterStage(
    model_dir="/path/to/models",
    score_threshold=0.5,
    model_inference_batch_size=32,
))

pipeline.add_stage(ImageNSFWFilterStage(
    model_dir="/path/to/models",
    score_threshold=0.5,
    model_inference_batch_size=32,
))

How Filtering Works:

  • Aesthetic Filtering: Images with scores below score_threshold are automatically filtered out
  • NSFW Filtering: Images with scores above score_threshold are automatically filtered out
  • In-stage Removal: Each stage removes images that do not meet its criteria from the ImageBatch before passing the batch to the next stage
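
The two threshold rules above can be sketched in plain Python. This is an illustration only: the `ScoredImage` class, its field names, and the two helper functions are hypothetical and not part of NeMo Curator. How a score exactly equal to the threshold is treated is an assumption here (such images are kept in both cases).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScoredImage:
    # Hypothetical container; the real stages operate on an ImageBatch.
    image_id: str
    aesthetic_score: float  # in [0, 1], higher is better
    nsfw_score: float       # in [0, 1], higher is more likely NSFW


def filter_aesthetic(images: list[ScoredImage], score_threshold: float) -> list[ScoredImage]:
    """Drop images whose aesthetic score is below the threshold."""
    return [img for img in images if img.aesthetic_score >= score_threshold]


def filter_nsfw(images: list[ScoredImage], score_threshold: float) -> list[ScoredImage]:
    """Drop images whose NSFW score is above the threshold."""
    return [img for img in images if img.nsfw_score <= score_threshold]


batch = [
    ScoredImage("a", aesthetic_score=0.8, nsfw_score=0.1),  # passes both filters
    ScoredImage("b", aesthetic_score=0.3, nsfw_score=0.1),  # fails the aesthetic filter
    ScoredImage("c", aesthetic_score=0.9, nsfw_score=0.7),  # fails the NSFW filter
]
kept = filter_nsfw(filter_aesthetic(batch, 0.5), 0.5)
print([img.image_id for img in kept])  # prints ['a']
```

Because each filter returns a smaller list, later stages only see the images that survived the earlier ones.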

Deduplication

Image deduplication identifies and removes duplicate images using semantic similarity based on embeddings. NeMo Curator provides a complete workflow from embedding-based duplicate detection to image removal.

Semantic Duplicate Detection

  • SemanticDeduplicationWorkflow: Uses embeddings to identify duplicates through clustering and pairwise similarity
  • Embedding-based: Leverages CLIP embeddings generated in previous pipeline stages
  • Configurable Similarity: Control duplicate detection strictness with similarity thresholds
  • Scalable Processing: GPU-accelerated clustering and similarity computation

Example:

from nemo_curator.stages.deduplication.semantic import SemanticDeduplicationWorkflow

# Run semantic deduplication on embeddings
dedup_workflow = SemanticDeduplicationWorkflow(
    input_path="/path/to/embeddings",
    output_path="/path/to/removal_ids",
    id_field="image_id",
    embedding_field="embedding",
    n_clusters=100,
    eps=0.01,  # Similarity threshold (lower = more strict)
)
dedup_workflow.run()
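
As a rough illustration of the pairwise-similarity step inside one cluster, the sketch below marks an image as a duplicate when its cosine similarity to an already kept image is at least `1 - eps`. Reading `eps` this way is an assumption, not something this page confirms, and the function names and toy vectors are hypothetical. The actual workflow is GPU-accelerated and may differ in detail.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(embeddings: dict[str, list[float]], eps: float) -> list[str]:
    """Return IDs whose embedding is within `eps` of an earlier kept embedding.

    Assumption: two items count as duplicates when similarity >= 1 - eps.
    """
    kept: list[str] = []
    removed: list[str] = []
    for image_id, vector in embeddings.items():
        if any(cosine_similarity(vector, embeddings[k]) >= 1.0 - eps for k in kept):
            removed.append(image_id)
        else:
            kept.append(image_id)
    return removed


# Toy cluster: "b" is nearly identical to "a"; "c" is orthogonal to both.
cluster = {"a": [1.0, 0.0], "b": [0.9999, 0.01], "c": [0.0, 1.0]}
print(find_duplicates(cluster, eps=0.01))     # prints ['b']
print(find_duplicates(cluster, eps=0.00001))  # prints []
```

Under this reading, a smaller `eps` demands a higher similarity before two images count as duplicates, so fewer images are removed.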

Duplicate Removal

  • ImageDuplicatesRemovalStage: Filters out images based on duplicate IDs identified by semantic deduplication
  • ID-based Removal: Uses image identifiers to remove duplicates efficiently
  • Pipeline Integration: Runs after the embedding and filtering stages, using the duplicate IDs produced by the semantic deduplication workflow

Example:

from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage

# Add to pipeline after running semantic deduplication
pipeline.add_stage(ImageDuplicatesRemovalStage(
    removal_parquets_dir="/path/to/removal_ids/duplicates",
    duplicate_id_field="id",
))
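
Conceptually, ID-based removal is a set-membership filter. The sketch below shows the idea with plain dictionaries. The record layout, the `image_id` key, and the `remove_by_id` helper are hypothetical; the real stage reads the removal IDs from Parquet files rather than from an in-memory set.

```python
from __future__ import annotations


def remove_by_id(records: list[dict], removal_ids: set[str], id_field: str = "image_id") -> list[dict]:
    """Keep only the records whose ID is not in the removal set."""
    return [record for record in records if record[id_field] not in removal_ids]


# Hypothetical records and removal list.
records = [
    {"image_id": "img_001", "path": "a.jpg"},
    {"image_id": "img_002", "path": "b.jpg"},
    {"image_id": "img_003", "path": "c.jpg"},
]
removal_ids = {"img_002"}

remaining = remove_by_id(records, removal_ids)
print([r["image_id"] for r in remaining])  # prints ['img_001', 'img_003']
```

Using a set for the removal IDs keeps each lookup at roughly constant time, which matters when the removal list is large.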

For a complete end-to-end workflow, refer to the Image Duplicate Removal Tutorial.

Pipeline Flow

A typical image curation pipeline using NeMo Curator’s stage-based architecture:

  1. Partition tar files (FilePartitioningStage)
  2. Load images from tar archives (ImageReaderStage)
  3. Generate embeddings (ImageEmbeddingStage) - Required before filtering stages
  4. Filter by aesthetics (ImageAestheticFilterStage) - Requires embeddings from step 3
  5. Filter NSFW content (ImageNSFWFilterStage) - Requires embeddings from step 3
  6. Remove duplicates (ImageDuplicatesRemovalStage) - Requires running SemanticDeduplicationWorkflow first

Example:

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.file_partitioning import FilePartitioningStage
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
from nemo_curator.stages.image.embedders.clip_embedder import ImageEmbeddingStage
from nemo_curator.stages.image.filters.aesthetic_filter import ImageAestheticFilterStage
from nemo_curator.stages.image.filters.nsfw_filter import ImageNSFWFilterStage
from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage

# Create the pipeline (the name is an arbitrary label)
pipeline = Pipeline(name="image_curation")

# Build pipeline
pipeline.add_stage(FilePartitioningStage(file_paths="/path/to/tars"))
pipeline.add_stage(ImageReaderStage())
pipeline.add_stage(ImageEmbeddingStage(model_dir="/path/to/models"))
pipeline.add_stage(ImageAestheticFilterStage(model_dir="/path/to/models", score_threshold=0.5))
pipeline.add_stage(ImageNSFWFilterStage(model_dir="/path/to/models", score_threshold=0.5))
# Optional: Remove duplicates (run semantic deduplication workflow first)
pipeline.add_stage(ImageDuplicatesRemovalStage(
    removal_parquets_dir="/path/to/removal_ids/duplicates",
    duplicate_id_field="id",
))

# Execute the pipeline
results = pipeline.run()

This modular pipeline approach allows you to customize or skip stages based on your workflow needs. Filtering stages (aesthetic and NSFW filtering) must always follow embedding generation, as they require pre-computed embeddings as input.
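
The ordering rule can be expressed as a small pre-flight check over stage names. This is a hypothetical helper for illustration; whether NeMo Curator performs a comparable validation itself is not stated here.

```python
from __future__ import annotations

# Stages that, per the rule above, need embeddings computed earlier in the pipeline.
NEEDS_EMBEDDINGS = {"ImageAestheticFilterStage", "ImageNSFWFilterStage"}


def validate_stage_order(stage_names: list[str]) -> list[str]:
    """Return a message for every filtering stage placed before embedding generation."""
    problems: list[str] = []
    seen_embedding = False
    for name in stage_names:
        if name == "ImageEmbeddingStage":
            seen_embedding = True
        elif name in NEEDS_EMBEDDINGS and not seen_embedding:
            problems.append(f"{name} appears before ImageEmbeddingStage")
    return problems


good_order = [
    "FilePartitioningStage",
    "ImageReaderStage",
    "ImageEmbeddingStage",
    "ImageAestheticFilterStage",
    "ImageNSFWFilterStage",
]
bad_order = ["ImageReaderStage", "ImageAestheticFilterStage", "ImageEmbeddingStage"]

print(validate_stage_order(good_order))  # prints []
print(validate_stage_order(bad_order))   # prints ['ImageAestheticFilterStage appears before ImageEmbeddingStage']
```

A check like this fails fast on a misordered pipeline instead of surfacing the problem partway through a long run.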

For duplicate removal, run the semantic deduplication workflow as a separate step, after embedding generation and before the removal stage. See the Image Duplicate Removal Tutorial for the complete three-step process: embedding generation → semantic deduplication → duplicate removal.