> Core concepts and terminology for NeMo Curator across text, image, video, and audio data curation modalities.

# Concepts

Learn about the core components and concepts introduced by NeMo Curator.

## Modality Concepts

Learn about working with specific modalities using NeMo Curator.

- Learn about text data curation, covering data loading and processing (filtering, classification, deduplication).
- Explore key concepts for image data curation, including scalable loading, processing (embedding, classification, filtering), and dataset export.
- Discover video data curation concepts, such as distributed processing, pipeline stages, execution modes, and efficient data flow.
- Learn about speech data curation, ASR inference, quality assessment, and audio-text integration workflows.

## Universal Concepts

Core concepts that apply across all modalities in NeMo Curator.

- Comprehensive overview of deduplication techniques across text, image, and video modalities, including exact, fuzzy, and semantic approaches.
- How NeMo Curator allocates CPUs, GPUs, and memory across pipeline stages for optimal hardware utilization.
- How streaming execution processes data in batches for constant memory usage and higher GPU utilization.
- How the executor automatically balances resources across pipeline stages to eliminate bottlenecks.
- Scale from a single GPU to multi-node clusters with near-linear scaling.
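
The streaming-execution idea mentioned above (processing data in batches so that memory use stays roughly constant) can be illustrated with a minimal, library-agnostic sketch. This is plain Python, not NeMo Curator's API: the names `batched`, `keep_long_documents`, and `run_streaming` are hypothetical and exist only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Lazily yield fixed-size batches so only one batch is in memory at a time."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def keep_long_documents(batch: List[str], min_words: int = 3) -> List[str]:
    """Hypothetical filter stage: keep documents with at least min_words words."""
    return [doc for doc in batch if len(doc.split()) >= min_words]


def run_streaming(records: Iterable[str], batch_size: int = 2) -> Iterator[str]:
    """Pull batches through the stage one at a time instead of loading everything."""
    for batch in batched(records, batch_size):
        yield from keep_long_documents(batch)


docs = [
    "too short",
    "this document is long enough",
    "tiny",
    "another sufficiently long document",
]
kept = list(run_streaming(docs, batch_size=2))
print(kept)
```

Because each stage consumes and emits one batch at a time, peak memory in this sketch depends on `batch_size` rather than on the total dataset size. A real pipeline would additionally schedule stages across workers and devices, which this example does not attempt.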