This document covers the essential concepts for image data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.
Image curation in NVIDIA NeMo Curator focuses on these key areas:
Core concepts for loading and managing image datasets
Concepts for embedding generation, classification, filtering, and deduplication
Concepts for saving, exporting, and resharding curated image datasets
The image curation concepts build on NVIDIA NeMo Curator’s core infrastructure components, which are shared across all modalities (text, image, video). These components include: