---
description: >-
  Understanding data flow in video curation pipelines including Ray object store
  and streaming optimization
categories:
  - concepts-architecture
tags:
  - data-flow
  - distributed
  - ray
  - streaming
  - performance
  - video-curation
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: concept
modality: video-only
---

# Data Flow

Understanding how data moves through NeMo Curator's video curation pipelines is key to optimizing performance and resource usage.

* Data moves between stages via Ray's distributed object store, enabling efficient, in-memory transfer between distributed actors.
* In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs while keeping intermediate state in memory. This reduces I/O overhead and significantly improves throughput.
* The auto-scaling component continuously rebalances resources, allocating workers to stages as needed to maximize pipeline throughput.
* Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.

Together, these components enable efficient processing of large-scale video datasets with minimal data movement and optimal use of available hardware.
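The streaming handoff described above can be pictured with a plain-Python sketch. This is a conceptual illustration, not NeMo Curator's actual API: each stage consumes items from the previous stage lazily, so intermediate results stay in memory (analogous to object references in Ray's object store) and only the final stage's outputs are materialized.

```python
# Conceptual sketch of streaming stage handoff (plain Python, not the
# NeMo Curator API): each stage is a generator, so intermediate results
# flow through memory and only final-stage outputs are collected.

def decode_stage(videos):
    for video in videos:
        # Stand-in for decoding; yields per-video frames.
        yield {"video": video, "frames": list(range(3))}

def embed_stage(decoded):
    for item in decoded:
        # Stand-in for embedding computation on the in-memory frames.
        item["embedding"] = [f * 0.5 for f in item["frames"]]
        yield item

def run_streaming(videos):
    # Chaining generators means no intermediate list is ever built;
    # only the final stage's outputs are returned to the caller.
    return list(embed_stage(decode_stage(videos)))

results = run_streaming(["a.mp4", "b.mp4"])
```

In the real pipeline the handoff is done with Ray object references rather than generators, but the effect is the same: upstream outputs are never written to disk between stages.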

## Writer Output Layout

Writer stages produce the following directories under the configured output path:

* `clips/`: MP4 clip files
* `filtered_clips/`: MP4 files for filtered clips
* `previews/`: WebP preview images for windows
* `metas/v0/`: Per-clip JSON metadata files
* `ce1_embd/`: Per-clip embeddings (pickle) for Cosmos-Embed1
* `ce1_embd_parquet/`: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
* `processed_videos/`: Per-video JSON metadata files
* `processed_clip_chunks/`: Per-clip-chunk JSON statistics
