Image Curation Concepts
This document covers the essential concepts for image data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.
Core Concept Areas
Image curation in NVIDIA NeMo Curator focuses on these key areas:
Core concepts for loading and managing image datasets
Concepts for embedding generation, classification, filtering, and deduplication
Concepts for saving, exporting, and resharding curated image datasets
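To make the processing area more concrete, the following is a minimal, library-agnostic sketch of score-based filtering followed by embedding-based deduplication. It does not use the NeMo Curator API: the record layout (`id`, `embedding`, `score`), the helper names, and both thresholds are assumptions chosen for illustration, and the tiny two-dimensional embeddings stand in for vectors that an embedding model would normally produce.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def filter_by_score(records, min_score):
    """Keep records whose (hypothetical) quality score meets the threshold."""
    return [r for r in records if r["score"] >= min_score]


def deduplicate(records, threshold):
    """Greedy dedup: keep a record only if it is not too similar to any kept one."""
    kept = []
    for record in records:
        if all(cosine(record["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(record)
    return kept


# Toy dataset (illustrative values only).
records = [
    {"id": "a", "embedding": [1.0, 0.0], "score": 0.9},
    {"id": "b", "embedding": [0.99, 0.05], "score": 0.8},  # near-duplicate of "a"
    {"id": "c", "embedding": [0.0, 1.0], "score": 0.7},
    {"id": "d", "embedding": [0.5, 0.5], "score": 0.1},    # below the score threshold
]

curated = deduplicate(filter_by_score(records, min_score=0.5), threshold=0.95)
curated_ids = [r["id"] for r in curated]
```

The greedy pairwise comparison is quadratic in the number of records, so it only suits small examples; it is shown here for the idea, not as a scalable method.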
Infrastructure Components
The image curation concepts build on NVIDIA NeMo Curator’s core infrastructure components, which are shared across all modalities (text, image, video). These components let you:
Configure and manage distributed processing across multiple machines
Optimize memory usage when processing large datasets
Leverage NVIDIA GPUs for faster data processing
Continue interrupted operations across large datasets
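As an illustration of the last point, resuming interrupted work generally means recording which units of work have finished and skipping them on restart. The sketch below shows that idea with a JSON checkpoint file and Python's standard library only. It is not NeMo Curator's actual mechanism: the `run` function, the `stop_after` parameter (used here to simulate an interruption), the shard names, and the checkpoint format are all hypothetical.

```python
import json
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of shard names already recorded as finished."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return set(json.load(f))


def run(shards, checkpoint_path, process, stop_after=None):
    """Process shards not yet in the checkpoint; return those handled in this run."""
    done = load_done(checkpoint_path)
    processed_now = []
    for shard in shards:
        if shard in done:
            continue  # finished in an earlier run
        if stop_after is not None and len(processed_now) >= stop_after:
            break  # simulate an interruption
        process(shard)
        done.add(shard)
        processed_now.append(shard)
        # Persist progress after every shard so a crash loses at most one shard.
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)
    return processed_now


shards = ["shard-000", "shard-001", "shard-002"]  # hypothetical shard names
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "done.json")
    first_run = run(shards, checkpoint, process=lambda s: None, stop_after=2)
    second_run = run(shards, checkpoint, process=lambda s: None)
```

In this sketch the second call picks up only the shard that the first, interrupted call did not reach.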