---
description: >-
  Overview of image data curation with NeMo Curator including loading,
  processing, filtering, and export workflows
categories:
  - workflows
tags:
  - image-curation
  - tar-archives
  - filtering
  - embedding
  - workflows
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: workflow
modality: image-only
---
# About Image Curation
Learn how to curate high-quality image datasets using NeMo Curator's image processing pipeline. NeMo Curator processes large-scale image-text datasets by applying quality filtering, content filtering, and semantic deduplication.
## Use Cases
* Prepare high-quality image datasets for training generative AI models such as large language models (LLMs), vision language models (VLMs), and world foundation models (WFMs)
* Curate datasets for text-to-image model training and fine-tuning
* Process large-scale image collections for multimodal foundation model pretraining
* Apply quality control and content filtering to remove inappropriate or low-quality images
* Generate embeddings and semantic features for image search and retrieval applications
* Remove duplicate images from large datasets using semantic deduplication
## Architecture
NeMo Curator's image curation follows a modular pipeline architecture where data flows through configurable stages. Each stage performs a specific operation and passes processed data to the next stage in the pipeline.
```mermaid
flowchart LR
A[Tar Archive Input] --> B[File Partitioning]
B --> C[Image Reader<br/>DALI GPU-accelerated]
C --> D[CLIP Embeddings<br/>ViT-L/14]
D --> E[Aesthetic Filtering<br/>Quality scoring]
E --> F[NSFW Filtering<br/>Content filtering]
F --> G[Duplicate Removal<br/>Semantic deduplication]
G --> H[Export & Sharding<br/>Tar + Parquet output]
classDef input fill:#e1f5fe,stroke:#0277bd,color:#000
classDef processing fill:#f3e5f5,stroke:#7b1fa2,color:#000
classDef output fill:#e8f5e8,stroke:#2e7d32,color:#000
class A input
class B,C,D,E,F,G processing
class H output
```
This pipeline architecture provides:
* **Modularity**: Add, remove, or reorder stages based on your workflow needs
* **Scalability**: Distributed processing across multiple GPUs and nodes using Ray
* **Flexibility**: Configure parameters for each stage independently
* **Efficiency**: GPU-accelerated processing with DALI and CLIP models
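To make the stage-by-stage flow concrete, here is a minimal pure-Python sketch of a modular pipeline. It does not use NeMo Curator's API: `ImageRecord`, `score_stage`, `threshold_stage`, and `run_pipeline` are hypothetical names invented for illustration, and the scores are hard-coded stand-ins for model outputs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ImageRecord:
    """Hypothetical stand-in for one image and its accumulated metadata."""
    image_id: str
    metadata: Dict[str, float] = field(default_factory=dict)


# A stage takes a batch of records and returns a (possibly smaller) batch.
Stage = Callable[[List[ImageRecord]], List[ImageRecord]]


def score_stage(key: str, scores: Dict[str, float]) -> Stage:
    """Build a stage that attaches a precomputed score under `key`."""
    def run(batch: List[ImageRecord]) -> List[ImageRecord]:
        for rec in batch:
            rec.metadata[key] = scores[rec.image_id]
        return batch
    return run


def threshold_stage(key: str, keep: Callable[[float], bool]) -> Stage:
    """Build a stage that keeps only records whose score passes `keep`."""
    def run(batch: List[ImageRecord]) -> List[ImageRecord]:
        return [rec for rec in batch if keep(rec.metadata[key])]
    return run


def run_pipeline(batch: List[ImageRecord], stages: List[Stage]) -> List[ImageRecord]:
    """Pass the batch through each stage in order."""
    for stage in stages:
        batch = stage(batch)
    return batch


batch = [ImageRecord("a"), ImageRecord("b"), ImageRecord("c")]
stages = [
    score_stage("aesthetic", {"a": 0.9, "b": 0.2, "c": 0.7}),
    threshold_stage("aesthetic", lambda s: s >= 0.5),  # drop low-quality images
    score_stage("nsfw", {"a": 0.1, "c": 0.8}),
    threshold_stage("nsfw", lambda s: s < 0.5),        # drop flagged images
]
kept = run_pipeline(batch, stages)
print([rec.image_id for rec in kept])
```

Because each stage only depends on the batch it receives, stages can be added, removed, or reordered by editing the `stages` list, which is the property the modularity bullet describes.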
## Introduction
Master the fundamentals of NeMo Curator's image curation pipeline and set up your processing environment.
* Learn about ImageBatch, ImageObject, and pipeline stages for efficient image curation. Tags: data-structures, distributed, architecture
* Learn prerequisites, setup instructions, and initial configuration for image curation. Tags: setup, configuration, quickstart
## Curation Tasks
### Load Data
Load and process large-scale image datasets from local storage using tar archives with GPU-accelerated DALI for efficient distributed processing.
* Load and process JPEG images from tar archives using DALI. Tags: tar-archives, dali, gpu-accelerated
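As a rough illustration of the tar-archive input layout, the sketch below iterates over JPEG members of an archive using only Python's standard `tarfile` module. It is not the DALI-based reader and performs no decoding or GPU work; `iter_jpegs` and `build_demo_tar` are hypothetical helpers, and the archive contents are placeholder bytes rather than real images.

```python
import io
import tarfile
from typing import Dict, Iterator, Tuple


def iter_jpegs(tar_bytes: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, raw bytes) for each JPEG file in a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if member.isfile() and member.name.lower().endswith((".jpg", ".jpeg")):
                yield member.name, tar.extractfile(member).read()


def build_demo_tar(files: Dict[str, bytes]) -> bytes:
    """Build a small in-memory tar archive so the example is self-contained."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Placeholder payloads: a shard that mixes images with a caption file.
demo = build_demo_tar({
    "000.jpg": b"\xff\xd8placeholder-0",
    "000.txt": b"a caption",
    "001.jpg": b"\xff\xd8placeholder-1",
})
names = [name for name, _ in iter_jpegs(demo)]
print(names)
```

Non-image members such as the caption file are skipped, so only the JPEG entries reach later stages.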
### Process Data
Transform and enhance your image data through embeddings, classification, and filters.
* Generate image embeddings using CLIP models. Tags: embeddings
* Apply built-in filters for aesthetic quality and NSFW content. Tags: aesthetic, NSFW, quality filtering
* Remove duplicate images using semantic similarity and clustering. Tags: deduplication, semantic, clustering
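The idea behind embedding-based duplicate removal can be sketched with a greedy cosine-similarity check. This is a simplified stand-in, not the clustering-based approach named above and not NeMo Curator's implementation: the function names, the 0.95 threshold, and the three-dimensional vectors are all assumptions chosen to keep the example small.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_dedup(embeddings: Dict[str, Sequence[float]], threshold: float = 0.95) -> List[str]:
    """Keep an image only if it is not too similar to any image already kept."""
    kept: List[str] = []
    for image_id, vec in embeddings.items():
        if all(cosine(vec, embeddings[other]) < threshold for other in kept):
            kept.append(image_id)
    return kept


# Toy embeddings: the second vector is nearly parallel to the first.
embeddings = {
    "cat_1": [1.0, 0.0, 0.0],
    "cat_1_resized": [0.99, 0.05, 0.0],
    "dog_1": [0.0, 1.0, 0.0],
}
unique_ids = greedy_dedup(embeddings)
print(unique_ids)
```

The pairwise comparison here is quadratic in the number of kept images; clustering first, then comparing only within clusters, is what makes this kind of deduplication tractable on large datasets.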
### Pipeline Management
Optimize and manage your image curation pipelines with advanced execution backends and resource management.
* Configure Ray-based executors for distributed processing and resource management. Tags: ray, distributed, resource-management
* Optimize performance with DALI GPU acceleration and efficient resource allocation. Tags: dali, gpu-acceleration, performance
### Save & Export
Export your curated image datasets with metadata preservation, custom resharding options, and support for downstream training pipelines.
* Save metadata to Parquet and export filtered datasets with custom resharding. Tags: parquet, tar-archives, resharding
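Resharding amounts to regrouping the surviving images into new archives of a chosen size. The sketch below shows that step with the standard library only. The helper names, the `images_per_shard` parameter, and the zero-padded `00000.tar` naming scheme are assumptions for illustration, and the Parquet metadata output is omitted because it would need a third-party library.

```python
import io
import tarfile
from typing import Dict, List


def plan_shards(image_ids: List[str], images_per_shard: int) -> Dict[str, List[str]]:
    """Group image IDs into consecutive shards of at most `images_per_shard`."""
    if images_per_shard < 1:
        raise ValueError("images_per_shard must be at least 1")
    return {
        f"{start // images_per_shard:05d}.tar": image_ids[start:start + images_per_shard]
        for start in range(0, len(image_ids), images_per_shard)
    }


def write_shard(image_ids: List[str], payloads: Dict[str, bytes]) -> bytes:
    """Pack the selected images into one in-memory tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for image_id in image_ids:
            data = payloads[image_id]
            info = tarfile.TarInfo(name=f"{image_id}.jpg")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Placeholder payloads standing in for five images that passed filtering.
payloads = {f"img{i}": b"\xff\xd8" + bytes([i]) for i in range(5)}
plan = plan_shards(sorted(payloads), images_per_shard=2)
shards = {name: write_shard(ids, payloads) for name, ids in plan.items()}
print(plan)
```

With five images and two images per shard, the last shard is smaller than the others; whether to pad, merge, or accept a short final shard is a choice for the downstream training loader.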