
Copyright © 2026, NVIDIA Corporation.


About Image Curation


Learn how to curate high-quality image datasets with NeMo Curator’s image processing pipeline. NeMo Curator processes large-scale image-text datasets by applying quality filtering, content filtering, and semantic deduplication.

Use Cases

  • Prepare high-quality image datasets for training generative AI models such as large language models (LLMs), vision-language models (VLMs), and world foundation models (WFMs)
  • Curate datasets for text-to-image model training and fine-tuning
  • Process large-scale image collections for multimodal foundation model pretraining
  • Apply quality control and content filtering to remove inappropriate or low-quality images
  • Generate embeddings and semantic features for image search and retrieval applications
  • Remove duplicate images from large datasets using semantic deduplication

Architecture

NeMo Curator’s image curation follows a modular pipeline architecture where data flows through configurable stages. Each stage performs a specific operation and passes processed data to the next stage in the pipeline.

This pipeline architecture provides:

  • Modularity: Add, remove, or reorder stages based on your workflow needs
  • Scalability: Distributed processing across multiple GPUs and nodes using Ray
  • Flexibility: Configure parameters for each stage independently
  • Efficiency: GPU-accelerated processing with DALI and CLIP models
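
The stage-by-stage flow described above can be sketched in plain Python. This is a minimal illustration of the modular pattern only, not NeMo Curator's API: `ImageRecord`, `run_pipeline`, `add_score`, and `keep_high_score` are hypothetical names, and the score is a stand-in for a real model output.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


# Hypothetical record type for illustration; not NeMo Curator's ImageObject.
@dataclass
class ImageRecord:
    image_id: str
    metadata: Dict[str, float] = field(default_factory=dict)


# A stage takes a batch of records and returns a (possibly smaller) batch.
Stage = Callable[[List[ImageRecord]], List[ImageRecord]]


def run_pipeline(batch: List[ImageRecord], stages: List[Stage]) -> List[ImageRecord]:
    """Pass the batch through each stage in order."""
    for stage in stages:
        batch = stage(batch)
    return batch


def add_score(batch: List[ImageRecord]) -> List[ImageRecord]:
    # Stand-in for a model stage: derive a fake score from the id length.
    for rec in batch:
        rec.metadata["score"] = len(rec.image_id) / 10.0
    return batch


def keep_high_score(batch: List[ImageRecord]) -> List[ImageRecord]:
    # Filter stage: keep records whose score is at least 1.0.
    return [rec for rec in batch if rec.metadata["score"] >= 1.0]


records = [ImageRecord("a.jpg"), ImageRecord("longer_name.jpg")]
kept = run_pipeline(records, [add_score, keep_high_score])
print([rec.image_id for rec in kept])  # ['longer_name.jpg']
```

Because every stage shares the same signature, stages can be added, removed, or reordered by editing the list passed to `run_pipeline`.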

Introduction

Master the fundamentals of NeMo Curator’s image curation pipeline and set up your processing environment.

Concepts

Learn about ImageBatch, ImageObject, and pipeline stages for efficient image curation. Tags: data-structures distributed architecture

Get Started

Learn prerequisites, setup instructions, and initial configuration for image curation. Tags: setup configuration quickstart

Curation Tasks

Load Data

Load and process large-scale image datasets from local storage using tar archives with GPU-accelerated DALI for efficient distributed processing.

Tar Archives

Load and process JPEG images from tar archives using DALI. Tags: tar-archives dali gpu-accelerated
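
To make the tar-archive layout concrete, the sketch below lists the JPEG members of an archive using only Python's standard `tarfile` module. It does not use DALI and does no GPU decoding. The helper names and the demo file names are assumptions for illustration, and the payloads are placeholders rather than real image data.

```python
import io
import tarfile
from typing import List


def list_jpeg_members(tar_bytes: bytes) -> List[str]:
    """Return names of .jpg/.jpeg files in a tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.lower().endswith((".jpg", ".jpeg"))
        ]


def make_demo_tar() -> bytes:
    """Build a tiny archive with placeholder payloads (not real images)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in ("000001.jpg", "000001.txt", "000002.jpg"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


print(list_jpeg_members(make_demo_tar()))  # ['000001.jpg', '000002.jpg']
```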

Process Data

Transform and enhance your image data through embeddings, classification, and filters.

Embeddings

Generate image embeddings using CLIP models. Tags: embeddings

Filters

Apply built-in filters for aesthetic quality and NSFW content. Tags: Aesthetic NSFW quality filtering
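
Score-based filtering usually reduces to comparing a per-image score against a threshold. The sketch below assumes each record already carries an aesthetic score and an NSFW score. The field names, the function name, and the 0.5 thresholds are hypothetical and are not taken from NeMo Curator.

```python
from typing import Dict, List


def filter_by_scores(
    records: List[Dict],
    min_aesthetic: float = 0.5,  # assumed threshold, for illustration only
    max_nsfw: float = 0.5,       # assumed threshold, for illustration only
) -> List[Dict]:
    """Keep records that look good enough and are unlikely to be NSFW."""
    return [
        rec
        for rec in records
        if rec["aesthetic_score"] >= min_aesthetic and rec["nsfw_score"] <= max_nsfw
    ]


demo = [
    {"id": "a.jpg", "aesthetic_score": 0.8, "nsfw_score": 0.1},
    {"id": "b.jpg", "aesthetic_score": 0.3, "nsfw_score": 0.1},  # low quality
    {"id": "c.jpg", "aesthetic_score": 0.9, "nsfw_score": 0.7},  # likely NSFW
]
print([rec["id"] for rec in filter_by_scores(demo)])  # ['a.jpg']
```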

Deduplication

Remove duplicate images using semantic similarity and clustering. Tags: deduplication semantic clustering
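
The idea behind semantic deduplication can be shown with a small sketch: two images count as duplicates when their embeddings are nearly parallel. Note that this example swaps in a greedy pairwise cosine-similarity check rather than the clustering-based approach named above, and that the function names, the toy 2-D embeddings, and the 0.95 threshold are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup(embeddings: Dict[str, Sequence[float]], threshold: float = 0.95) -> List[str]:
    """Greedily keep an image only if it is not too similar to any kept image."""
    kept: List[str] = []
    for image_id, vec in embeddings.items():
        if all(cosine(vec, embeddings[other]) < threshold for other in kept):
            kept.append(image_id)
    return kept


demo = {
    "cat_1": [1.0, 0.0],
    "cat_1_copy": [0.99, 0.05],  # nearly identical direction to cat_1
    "dog_1": [0.0, 1.0],
}
print(dedup(demo))  # ['cat_1', 'dog_1']
```

The pairwise check is quadratic in the number of kept images, which is one reason large-scale pipelines cluster embeddings first and compare only within clusters.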

Pipeline Management

Optimize and manage your image curation pipelines with advanced execution backends and resource management.

Execution Backends

Configure Ray-based executors for distributed processing and resource management. Tags: ray distributed resource-management

Performance Optimization

Optimize performance with DALI GPU acceleration and efficient resource allocation. Tags: dali gpu-acceleration performance

Save & Export

Export your curated image datasets with metadata preservation, custom resharding options, and support for downstream training pipelines.

Save & Export

Save metadata to Parquet and export filtered datasets with custom resharding. Tags: parquet tar-archives resharding
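
Resharding amounts to regrouping the surviving samples into archives of a chosen size. The sketch below shows only that grouping logic on a list of file names. It writes no tar or Parquet files, and `reshard` and `shard_size` are hypothetical names rather than NeMo Curator parameters.

```python
from typing import Iterable, Iterator, List


def reshard(items: Iterable[str], shard_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `shard_size` items."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    shard: List[str] = []
    for item in items:
        shard.append(item)
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:
        # Emit the final, possibly smaller, shard.
        yield shard


names = [f"{i:06d}.jpg" for i in range(5)]
shards = list(reshard(names, shard_size=2))
print(shards)
# [['000000.jpg', '000001.jpg'], ['000002.jpg', '000003.jpg'], ['000004.jpg']]
```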