Data Flow
Understanding how data moves through NeMo Curator’s video curation pipelines is key to optimizing performance and resource usage.
Data moves between stages via Ray’s distributed object store, enabling efficient, in-memory transfer between distributed actors.
In streaming mode, the executor returns final stage outputs while intermediate state stays in memory, reducing I/O overhead and improving throughput.
The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.
Together, these mechanisms let the pipeline process large-scale video datasets with little data movement between stages and high utilization of available hardware.
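To make the auto-scaling idea concrete, here is a toy sketch of how workers could be balanced across stages to raise overall throughput. It is not NeMo Curator's actual algorithm or API: the stage names, the per-item costs, and the functions `allocate_workers` and `pipeline_throughput` are all hypothetical. The sketch assumes each stage has a fixed per-item cost and simply gives each spare worker to whichever stage is currently the bottleneck.

```python
def allocate_workers(stage_costs, total_workers):
    """Greedily assign workers to whichever stage is currently slowest.

    stage_costs: dict mapping stage name -> seconds one worker needs per item
                 (hypothetical numbers, for illustration only).
    Returns a dict mapping stage name -> number of workers.
    """
    if total_workers < len(stage_costs):
        raise ValueError("need at least one worker per stage")

    # Every stage needs one worker to make any progress at all.
    workers = {name: 1 for name in stage_costs}

    def throughput(name):
        # Items per second for a stage at its current worker count.
        return workers[name] / stage_costs[name]

    # Hand out the remaining workers one at a time to the bottleneck stage.
    for _ in range(total_workers - len(stage_costs)):
        bottleneck = min(stage_costs, key=throughput)
        workers[bottleneck] += 1
    return workers


def pipeline_throughput(stage_costs, workers):
    """A pipeline runs only as fast as its slowest stage."""
    return min(workers[name] / cost for name, cost in stage_costs.items())


# Hypothetical stages: transcoding is 4x as expensive as splitting.
costs = {"split": 1.0, "transcode": 4.0, "embed": 2.0}
plan = allocate_workers(costs, total_workers=7)
print(plan)                              # {'split': 1, 'transcode': 4, 'embed': 2}
print(pipeline_throughput(costs, plan))  # 1.0
```

In this example the most expensive stage ends up with the most workers, so all three stages process items at the same rate and no stage sits idle waiting on another. A real scheduler would also have to measure costs at runtime and account for GPU and memory constraints, which this sketch ignores.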
Writer Output Layout
Writer stages produce the following directories under the configured output path:
clips/
: MP4 clip files
filtered_clips/
: MP4 files for filtered clips
previews/
: WebP preview images for windows
metas/v0/
: Per-clip JSON metadata
iv2_embd/
: Per-clip embeddings (pickle) for InternVideo2
iv2_embd_parquet/
: Per-video embeddings (parquet) for InternVideo2
ce1_embd/
: Per-clip embeddings (pickle) for Cosmos-Embed1
ce1_embd_parquet/
: Per-video embeddings (parquet) for Cosmos-Embed1
processed_videos/
: Per-video JSON metadata
processed_clip_chunks/
: Per-clip-chunk JSON statistics
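After a run, it can help to check which of these directories were actually written. The following is a minimal sketch using only the Python standard library. The directory names come from the list above, but the helper `summarize_output` is hypothetical and not part of NeMo Curator. It assumes files may be nested inside each directory, so it counts them recursively.

```python
from pathlib import Path

# Directory names taken from the writer output layout listed above.
WRITER_DIRS = [
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "iv2_embd",
    "iv2_embd_parquet",
    "ce1_embd",
    "ce1_embd_parquet",
    "processed_videos",
    "processed_clip_chunks",
]


def summarize_output(output_path):
    """Return {directory: file count}, with None for directories that are absent."""
    root = Path(output_path)
    summary = {}
    for name in WRITER_DIRS:
        directory = root / name
        if directory.is_dir():
            # Count regular files at any depth below the directory.
            summary[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary
```

Calling `summarize_output` on the configured output path returns one entry per directory. A `None` value means the directory was not produced, for example because the stage that writes it was not part of the pipeline.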