Image Curation Concepts

This document covers the essential concepts for image data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.

Core Concept Areas

Image curation in NVIDIA NeMo Curator focuses on these key areas:

Data Loading

Core concepts for loading and managing image datasets

Data Loading Concepts (Image)

Data Processing

Concepts for embedding generation, classification, filtering, and deduplication

Data Processing Concepts (Image)

Data Export

Concepts for saving, exporting, and resharding curated image datasets

Data Export Concepts (Image)

Infrastructure Components

The image curation concepts build on NVIDIA NeMo Curator’s core infrastructure components, which are shared across all modalities (text, image, video). These components include:

Distributed Computing

Configure and manage distributed processing across multiple machines

Distributed Computing Reference

Memory Management

Optimize memory usage when processing large datasets

Memory Management Guide

GPU Acceleration

Leverage NVIDIA GPUs for faster data processing

GPU Processing Guide

Resumable Processing

Resume interrupted operations on large datasets

Resumable Processing