Overview | NeMo Curator

This document covers the essential concepts for video data curation in NVIDIA NeMo Curator. It assumes basic familiarity with data science and machine learning principles.

Core Concept Areas

Video curation in NVIDIA NeMo Curator focuses on these key areas:

Architecture

Core concepts for distributed processing, Ray foundation, and auto-scaling

Key Abstractions

Stages, pipelines, and execution modes in video curation workflows

Data Flow

How data moves through the system from ingestion to output
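To make the stage and pipeline idea concrete, here is a minimal plain-Python sketch of data flowing from ingestion to output through a chain of stages. It does not use the NeMo Curator API: the `Stage` and `Pipeline` classes, the `split_into_clips` and `drop_short_clips` functions, and the 10-second and 5-second thresholds are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Stage:
    """Hypothetical stage: a named function that maps one task to zero or more tasks."""
    name: str
    fn: Callable[[dict], Iterable[dict]]

    def process(self, tasks: Iterable[dict]) -> Iterator[dict]:
        for task in tasks:
            yield from self.fn(task)


@dataclass
class Pipeline:
    """Hypothetical pipeline: an ordered list of stages applied one after another."""
    stages: List[Stage] = field(default_factory=list)

    def add_stage(self, stage: Stage) -> "Pipeline":
        self.stages.append(stage)
        return self

    def run(self, tasks: Iterable[dict]) -> List[dict]:
        stream: Iterable[dict] = iter(tasks)
        # Chain the stages lazily so each task streams through every stage.
        for stage in self.stages:
            stream = stage.process(stream)
        return list(stream)


def split_into_clips(video: dict) -> Iterator[dict]:
    """Split a video record into fixed-length clip records (illustrative 10 s window)."""
    clip_len = 10
    for start in range(0, video["duration_s"], clip_len):
        end = min(start + clip_len, video["duration_s"])
        yield {"source": video["path"], "start_s": start, "end_s": end}


def drop_short_clips(clip: dict) -> Iterator[dict]:
    """Keep only clips that are at least 5 s long (illustrative threshold)."""
    if clip["end_s"] - clip["start_s"] >= 5:
        yield clip


pipeline = (
    Pipeline()
    .add_stage(Stage("split", split_into_clips))
    .add_stage(Stage("filter", drop_short_clips))
)
clips = pipeline.run([{"path": "a.mp4", "duration_s": 23}])
```

In this sketch each stage consumes the previous stage's output as a generator, so a task can pass through later stages before earlier stages have finished the whole input. A distributed executor would schedule the same kind of chain across workers rather than in a single process.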

Notes on Modalities and Backends

Video pipelines in Curator run on Ray with the XennaExecutor integration for streaming and batch execution. Other modalities, such as text and image, also use RAPIDS and Curator’s distributed backends in parts of their workflows. Refer to the modality-specific guides for details.

Infrastructure Components

The video curation concepts build on NVIDIA NeMo Curator’s core infrastructure components, which all modalities (text, image, video, and audio) share:

Memory Management

Optimize memory usage for large datasets. Topics: partitioning, batching, monitoring.

GPU Acceleration

Leverage NVIDIA GPU acceleration for faster data processing. Topics: CUDA, RMM, performance.

Resumable Processing

Continue interrupted operations on large datasets. Topics: checkpoints, recovery, batching.
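As a rough illustration of checkpoint-based resumption, the sketch below records each finished item in a JSON file and skips those items on the next run. This is a generic pattern, not NeMo Curator's own mechanism: the `process_resumable` function, its parameters, and the checkpoint file layout are assumptions made for this example.

```python
import json
import os
from pathlib import Path
from typing import Callable, Iterable, List


def process_resumable(
    items: Iterable[str],
    work_fn: Callable[[str], None],
    checkpoint_path: str,
) -> List[str]:
    """Run work_fn on each item, skipping items recorded in the checkpoint file.

    Hypothetical helper: the checkpoint is a JSON list of completed item names.
    Returns the items processed during this call.
    """
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    processed = []
    for item in items:
        if item in done:
            continue  # finished in an earlier run
        work_fn(item)  # may raise; the checkpoint then still holds earlier items
        done.add(item)
        processed.append(item)
        # Write to a temporary file and rename it, so an interruption
        # during the write cannot leave a truncated checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(done)))
        os.replace(tmp, path)
    return processed
```

Calling the function again with the same `checkpoint_path` after a failure processes only the items that did not complete. Writing the checkpoint after every item keeps the sketch simple; on large datasets it is common to write once per batch instead, trading some repeated work after a failure for fewer writes.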