---
description: >-
  Understanding data flow in video curation pipelines including Ray object
  store and streaming optimization
categories:
  * concepts-architecture
tags:
  * data-flow
  * distributed
  * ray
  * streaming
  * performance
  * video-curation
personas:
  * data-scientist-focused
  * mle-focused
difficulty: intermediate
content_type: concept
modality: video-only
---

# Data Flow

Understanding how data moves through NeMo Curator's video curation pipelines is key to optimizing performance and resource usage.

* Data moves between stages via Ray's distributed object store, enabling efficient, in-memory transfer between distributed actors.
* In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs while keeping intermediate state in memory. This reduces I/O overhead and significantly improves throughput.
* The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
* Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.

Together, these components enable efficient processing of large-scale video datasets with minimal data movement and optimal use of available hardware.

## Writer Output Layout

Writer stages produce the following directories under the configured output path:

* `clips/`: MP4 clip files
* `filtered_clips/`: MP4 files for filtered clips
* `previews/`: WebP preview images for windows
* `metas/v0/`: Per-clip JSON metadata files
* `ce1_embd/`: Per-clip embeddings (pickle) for Cosmos-Embed1
* `ce1_embd_parquet/`: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
* `processed_videos/`: Per-video JSON metadata files
* `processed_clip_chunks/`: Per-clip-chunk JSON statistics
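The streaming pattern described above can be sketched with plain Python generators, leaving Ray itself out of scope: each stage consumes items lazily, intermediate records stay in memory, and only the final writer stage touches disk (here mirroring the `metas/v0/` layout). The stage names and record fields (`decode_stage`, `clip_stage`, `frames`) are hypothetical, chosen for illustration only and not part of the NeMo Curator API.

```python
import json
import tempfile
from pathlib import Path


def decode_stage(video_ids):
    # Hypothetical first stage: yields one in-memory record per video.
    for vid in video_ids:
        yield {"video_id": vid, "frames": 300}


def clip_stage(records):
    # Hypothetical middle stage: splits each video into fixed-length
    # clips, still entirely in memory (no intermediate I/O).
    for rec in records:
        for i in range(rec["frames"] // 100):
            yield {"video_id": rec["video_id"], "clip_index": i}


def writer_stage(clips, output_root: Path):
    # Final stage: the only one that persists anything, writing
    # per-clip JSON metadata analogous to metas/v0/.
    metas = output_root / "metas" / "v0"
    metas.mkdir(parents=True, exist_ok=True)
    written = []
    for clip in clips:
        path = metas / f"{clip['video_id']}_{clip['clip_index']}.json"
        path.write_text(json.dumps(clip))
        written.append(path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    # Chaining generators: data flows stage to stage without
    # materializing intermediate results on disk.
    paths = writer_stage(clip_stage(decode_stage(["vid0", "vid1"])), Path(tmp))
    print(len(paths))  # 2 videos x 3 clips each = 6 metadata files
```

In the real pipeline, Ray's object store plays the role of the in-memory hand-off between distributed actors; the generator chain here only models the end-to-end shape of that flow, not its distribution.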