Data Flow
Understanding how data moves through NeMo Curator’s video curation pipelines is key to optimizing performance and resource usage.
Data moves between stages via Ray’s distributed object store, enabling efficient, in-memory transfer between distributed actors.
In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs and keeps intermediate state in memory. This reduces I/O overhead and improves throughput.
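The streaming behavior described above can be pictured with plain Python generators. This is only a toy sketch, not NeMo Curator's executor or Ray's object store: the stage functions `split_into_clips` and `embed`, the driver `run_streaming`, and the file names are all hypothetical. It shows items being handed from one stage to the next in memory, with only the final-stage outputs returned to the caller.

```python
from typing import Iterable, Iterator, List, Tuple


def split_into_clips(videos: Iterable[str]) -> Iterator[str]:
    """Hypothetical stage 1: yield two clip names per input video."""
    for video in videos:
        for i in range(2):
            yield f"{video}-clip{i}"


def embed(clips: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Hypothetical stage 2: attach a toy 'embedding' (the name length)."""
    for clip in clips:
        yield clip, len(clip)


def run_streaming(videos: Iterable[str]) -> List[Tuple[str, int]]:
    # The stages are chained lazily: each clip passes through both stages
    # before the next one is produced, and nothing is written between stages.
    # Only the outputs of the final stage are collected and returned.
    return list(embed(split_into_clips(videos)))


results = run_streaming(["a.mp4", "b.mp4"])
for clip_name, toy_embedding in results:
    print(clip_name, toy_embedding)
```

In the real pipeline the handoff happens between distributed actors through Ray's object store rather than between generators in one process, but the caller sees the same thing: final-stage outputs only.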
The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
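One way to build intuition for this balancing is a greedy allocation that always gives the next worker to the slowest stage. The sketch below is an illustration under stated assumptions, not NeMo Curator's actual auto-scaling algorithm: the function `allocate_workers`, the stage names, and the per-item timings are all made up for the example.

```python
from typing import Dict


def allocate_workers(seconds_per_item: Dict[str, float], total_workers: int) -> Dict[str, int]:
    """Greedily assign workers so that stage throughputs stay balanced."""
    if total_workers < len(seconds_per_item):
        raise ValueError("need at least one worker per stage")

    # Start with one worker per stage so that no stage is starved.
    allocation = {stage: 1 for stage in seconds_per_item}

    for _ in range(total_workers - len(seconds_per_item)):
        # Stage throughput is workers / seconds-per-item (items per second).
        # The stage with the lowest throughput is the current bottleneck.
        bottleneck = min(
            allocation,
            key=lambda stage: allocation[stage] / seconds_per_item[stage],
        )
        allocation[bottleneck] += 1

    return allocation


# Hypothetical timings: embedding is four times slower per item than decoding.
plan = allocate_workers({"decode": 1.0, "embed": 4.0, "write": 0.5}, total_workers=11)
print(plan)
```

With these made-up timings the plan assigns 2 workers to `decode`, 8 to `embed`, and 1 to `write`, so every stage processes about 2 items per second and none of them holds the others back.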
Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.
Together, these components enable efficient processing of large-scale video datasets with little data movement and effective use of available hardware.
Writer Output Layout
Writer stages produce the following directories under the configured output path:
clips/: MP4 clip files
filtered_clips/: MP4 files for filtered clips
previews/: WebP preview images for windows
metas/v0/: Per-clip JSON metadata files
ce1_embd/: Per-clip embeddings (pickle) for Cosmos-Embed1
ce1_embd_parquet/: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
processed_videos/: Per-video JSON metadata files
processed_clip_chunks/: Per-clip-chunk JSON statistics
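After a run, it can be useful to check which of these directories were actually written and how many files each contains. The following is a minimal sketch using only the Python standard library. The helper `summarize_output` is hypothetical and not part of NeMo Curator, and it counts files without assuming particular file names or extensions. The temporary directory and `example.mp4` exist only to make the example self-contained.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional

# Directory names taken from the layout listed above.
EXPECTED_DIRS = [
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "ce1_embd",
    "ce1_embd_parquet",
    "processed_videos",
    "processed_clip_chunks",
]


def summarize_output(output_path: str) -> Dict[str, Optional[int]]:
    """Map each expected directory to its file count, or None if it is missing."""
    root = Path(output_path)
    summary: Dict[str, Optional[int]] = {}
    for name in EXPECTED_DIRS:
        directory = root / name
        if directory.is_dir():
            summary[name] = sum(1 for p in directory.iterdir() if p.is_file())
        else:
            summary[name] = None
    return summary


# Demo against a throwaway directory that mimics part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "clips").mkdir()
    (Path(tmp) / "clips" / "example.mp4").touch()  # placeholder file name
    (Path(tmp) / "metas" / "v0").mkdir(parents=True)
    summary = summarize_output(tmp)

for name, count in summary.items():
    print(f"{name}/: {'missing' if count is None else count}")
```

A missing directory is reported as `None` rather than `0`, so a stage that wrote nothing can be told apart from a stage that never ran.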