---
description: >-
  Essential concepts for video data curation including distributed processing, pipeline stages, and execution modes
categories:
  - concepts-architecture
tags:
  - concepts
  - video-curation
  - distributed
  - pipeline
  - ray
  - autoscaling
personas:
  - data-scientist-focused
  - mle-focused
difficulty: beginner
content_type: concept
modality: video-only
---

# Video Curation Concepts

This document covers the essential concepts for video data curation in NVIDIA NeMo Curator. These concepts assume basic familiarity with data science and machine learning principles.

## Core Concept Areas

Video curation in NVIDIA NeMo Curator focuses on these key areas:

- Core concepts for distributed processing, Ray foundation, and auto-scaling
- Stages, pipelines, and execution modes in video curation workflows
- How data moves through the system from ingestion to output

## Notes on Modalities and Backends

Video pipelines in Curator run on Ray with the `XennaExecutor` integration for streaming and batch execution. Other modalities, such as text and image, also use RAPIDS and Curator’s distributed backends in parts of their workflows. Refer to the modality-specific guides for details.

## Infrastructure Components

The video curation concepts build on NVIDIA NeMo Curator's core infrastructure components. All modalities (text, image, video, and audio) use these components. These components include:

- Optimize memory usage for large datasets (partitioning, batching, monitoring)
- Leverage NVIDIA GPU acceleration for faster data processing (cuda, rmm, performance)
- Continue interrupted operations on large datasets (checkpoints, recovery, batching)