Data Flow | NeMo Curator

Understanding how data moves through NeMo Curator’s video curation pipelines is key to optimizing performance and resource usage.

Data moves between stages via Ray’s distributed object store, enabling efficient, in-memory transfer between distributed actors.
In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs and keeps intermediate state in memory. Because intermediate results are not written out between stages, I/O overhead drops and throughput improves.
The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.

Together, these components let the pipeline process large-scale video datasets with little data movement while keeping the available hardware busy.
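
The streaming behavior described above can be illustrated with a minimal single-process sketch. It is only an analogy: it uses plain Python generators, not Ray or the NeMo Curator API. The stage names (split_stage, embed_stage), the record fields, and run_streaming are hypothetical and chosen for illustration. The point is that each item flows through every stage in memory and the caller receives only what the final stage emits.

```python
from typing import Callable, Iterable, Iterator, List

# A stage consumes a stream of records and yields a stream of records.
Stage = Callable[[Iterator[dict]], Iterator[dict]]


def split_stage(videos: Iterator[dict]) -> Iterator[dict]:
    """Hypothetical stage: turn each video record into per-clip records."""
    for video in videos:
        for i in range(video["num_clips"]):
            yield {"video": video["name"], "clip_index": i}


def embed_stage(clips: Iterator[dict]) -> Iterator[dict]:
    """Hypothetical stage: attach a placeholder embedding to each clip."""
    for clip in clips:
        yield {**clip, "embedding": [float(clip["clip_index"])]}


def run_streaming(inputs: Iterable[dict], stages: List[Stage]) -> List[dict]:
    """Chain the stages lazily and return only the final stage's outputs.

    Intermediate records exist only in memory while they pass from one
    generator to the next; nothing is persisted between stages.
    """
    stream: Iterator[dict] = iter(inputs)
    for stage in stages:
        stream = stage(stream)
    return list(stream)


results = run_streaming(
    [{"name": "a.mp4", "num_clips": 2}, {"name": "b.mp4", "num_clips": 1}],
    [split_stage, embed_stage],
)
for record in results:
    print(record)
```

In the real pipeline the hand-off between stages goes through Ray's object store and stages run as distributed actors, so this sketch captures the data flow shape only, not the scheduling or auto-scaling.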

Writer Output Layout

Writer stages produce the following directories under the configured output path:

clips/: MP4 clip files
filtered_clips/: MP4 files for filtered clips
previews/: WebP preview images for clip windows
metas/v0/: Per-clip JSON metadata files
ce1_embd/: Per-clip embeddings (pickle) for Cosmos-Embed1
ce1_embd_parquet/: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
processed_videos/: Per-video JSON metadata files
processed_clip_chunks/: Per-clip-chunk JSON statistics
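
As a rough way to inspect a finished run, the directory layout listed above can be walked with the Python standard library. This is a sketch under stated assumptions: the helper names summarize_output and load_clip_metadata are hypothetical, and it assumes the per-clip metadata files under metas/v0/ carry a .json extension. File naming inside each directory is not specified here, so the code only counts files and parses JSON.

```python
import json
from pathlib import Path
from typing import Dict, List, Union

# Directory names taken from the writer output layout listed above.
WRITER_DIRS = [
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "ce1_embd",
    "ce1_embd_parquet",
    "processed_videos",
    "processed_clip_chunks",
]


def summarize_output(output_path: Union[str, Path]) -> Dict[str, int]:
    """Count the files found under each expected writer directory.

    Directories that do not exist are reported with a count of 0.
    """
    root = Path(output_path)
    counts: Dict[str, int] = {}
    for rel in WRITER_DIRS:
        directory = root / rel
        if directory.is_dir():
            counts[rel] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            counts[rel] = 0
    return counts


def load_clip_metadata(output_path: Union[str, Path]) -> List[dict]:
    """Load every JSON file under metas/v0/ (assumed extension: .json)."""
    meta_dir = Path(output_path) / "metas" / "v0"
    if not meta_dir.is_dir():
        return []
    return [
        json.loads(p.read_text(encoding="utf-8"))
        for p in sorted(meta_dir.rglob("*.json"))
    ]
```

Calling summarize_output("/path/to/output") returns a dictionary keyed by the directory names above, which makes it easy to spot a stage that wrote nothing (for example, an empty ce1_embd/ when the embedding stage was not enabled).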