Data Flow

Understanding how data moves through NeMo Curator’s video curation pipelines is key to optimizing performance and resource usage.

Data moves between stages via Ray’s distributed object store, enabling efficient, in-memory transfer between distributed actors.
In streaming mode, the executor returns final stage outputs while intermediate state stays in memory, reducing I/O overhead and improving throughput.
The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.
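The flow described above can be pictured with a minimal, standard-library sketch. It is not NeMo Curator's API: the stage functions, record fields, and file naming below are hypothetical, Python generators stand in for Ray actors and the object store, and only the pickle and JSON outputs are shown (a parquet variant would need a third-party library). The point is only that intermediate records stay in memory and the writer stage is the single step that touches disk.

```python
import json
import pickle
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def split_into_clips(videos: Iterable[str]) -> Iterator[Dict]:
    """Hypothetical stage: emit two clip records per input video."""
    for video in videos:
        for index in range(2):
            yield {"clip_id": f"{video}-clip{index}", "source": video}


def embed_clips(clips: Iterable[Dict]) -> Iterator[Dict]:
    """Hypothetical stage: attach a placeholder embedding to each clip."""
    for clip in clips:
        # A real stage would run a model; this vector is a stand-in.
        yield {**clip, "embedding": [float(len(clip["clip_id"])), 0.0]}


def write_outputs(clips: Iterable[Dict], output_dir: Path) -> List[str]:
    """Writer stage: the only step in this sketch that writes to disk."""
    written = []
    for clip in clips:
        clip_id = clip["clip_id"]
        # Embedding as a pickle file (assumed layout, for illustration only).
        (output_dir / f"{clip_id}.pickle").write_bytes(
            pickle.dumps(clip["embedding"])
        )
        # Metadata as a JSON file next to it.
        metadata = {"clip_id": clip_id, "source": clip["source"]}
        (output_dir / f"{clip_id}.json").write_text(json.dumps(metadata))
        written.append(clip_id)
    return written


def run_pipeline(videos: Iterable[str], output_dir: Path) -> List[str]:
    """Chain the stages lazily; records flow one at a time, in memory."""
    return write_outputs(embed_clips(split_into_clips(videos)), output_dir)
```

Calling `run_pipeline(["video_a", "video_b"], Path("some_output_dir"))` on an existing directory writes one pickle and one JSON file per clip and returns the clip IDs in processing order.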

This architecture keeps intermediate data in memory and defers disk writes to the final writer stages, which limits data movement when processing large-scale video datasets and helps keep available hardware busy.
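To make the idea of balancing workers across stages concrete, here is a toy allocation sketch. It is not the algorithm NeMo Curator's auto-scaler uses; the function name, the per-item cost inputs, and the greedy rule are all assumptions chosen for illustration. Each stage starts with one worker, and every remaining worker goes to whichever stage currently has the lowest throughput (workers divided by seconds per item).

```python
from typing import Dict


def allocate_workers(stage_costs: Dict[str, float], total_workers: int) -> Dict[str, int]:
    """Greedy toy allocator (hypothetical, not the library's auto-scaler).

    stage_costs maps a stage name to its assumed seconds per item.
    """
    if total_workers < len(stage_costs):
        raise ValueError("need at least one worker per stage")

    # Every stage gets one worker so the pipeline can make progress.
    allocation = {name: 1 for name in stage_costs}

    for _ in range(total_workers - len(stage_costs)):
        # The bottleneck is the stage with the lowest items-per-second rate.
        bottleneck = min(
            allocation, key=lambda name: allocation[name] / stage_costs[name]
        )
        allocation[bottleneck] += 1

    return allocation


# Example: the slow "embed" stage receives most of the eight workers.
example = allocate_workers({"decode": 1.0, "embed": 4.0, "write": 0.5}, 8)
# example == {"decode": 2, "embed": 5, "write": 1}
```

A real auto-scaler would measure stage throughput continuously and reallocate as it changes; this sketch computes a single static assignment from fixed costs.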

Writer Output Layout

Writer stages produce the following directories under the configured output path: