Data Flow

Understanding how data moves through NeMo Curator’s video curation pipelines is key to optimizing performance and resource usage.

  • Data moves between stages via Ray’s distributed object store, enabling efficient, in-memory transfer between distributed actors.
  • In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs while keeping intermediate state in memory. Because intermediate results are not written out between stages, I/O overhead drops and throughput improves.
  • The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
  • Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.
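
The streaming behavior described above can be illustrated with a minimal, single-process sketch. It is not NeMo Curator's executor or API: the stage functions `split_into_clips` and `embed_clip` and the driver `run_streaming` are hypothetical names, and plain Python generators stand in for Ray actors and the distributed object store. The point is only that intermediate results (the clips) flow from one stage to the next in memory, and the caller sees final-stage outputs alone.

```python
from typing import Iterable, Iterator


# Hypothetical stages; in a real pipeline these would run as distributed actors.
def split_into_clips(video: str) -> list[str]:
    """Pretend to split one video into two clips."""
    return [f"{video}#clip{i}" for i in range(2)]


def embed_clip(clip: str) -> tuple[str, int]:
    """Pretend to embed a clip (the 'embedding' is just the name length)."""
    return clip, len(clip)


def run_streaming(videos: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Chain the stages lazily and yield only final-stage outputs.

    Intermediate clips are handed straight to the next stage and are
    never collected into a list or written to disk.
    """
    for video in videos:
        for clip in split_into_clips(video):
            yield embed_clip(clip)


results = list(run_streaming(["a.mp4", "b.mp4"]))
print(results[0])
```

A batch-style alternative would materialize every clip for every video before embedding any of them; the generator version keeps only one item per stage in flight.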

Together, these components let the pipeline process large-scale video datasets with minimal data movement while keeping the available hardware busy.
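
To make the idea of balancing workers across stages concrete, here is a toy allocation sketch. It is an assumption-laden illustration, not the algorithm NeMo Curator's auto-scaling component uses: the function `allocate_workers`, the stage names, and the per-item timings are all hypothetical. It models pipeline throughput as the throughput of the slowest stage (workers divided by seconds per item) and greedily gives each spare worker to the current bottleneck.

```python
def allocate_workers(stage_seconds: dict[str, float], total_workers: int) -> dict[str, int]:
    """Greedily assign workers so the slowest stage is as fast as possible.

    stage_seconds maps a stage name to its (assumed) seconds per item
    for a single worker.
    """
    if total_workers < len(stage_seconds):
        raise ValueError("need at least one worker per stage")

    # Every stage needs at least one worker to make progress.
    workers = {name: 1 for name in stage_seconds}

    for _ in range(total_workers - len(stage_seconds)):
        # Stage throughput = workers / seconds per item; the minimum is the bottleneck.
        bottleneck = min(workers, key=lambda name: workers[name] / stage_seconds[name])
        workers[bottleneck] += 1

    return workers


# Hypothetical timings: embedding is the most expensive stage.
allocation = allocate_workers({"decode": 1.0, "embed": 4.0, "write": 0.5}, total_workers=8)
print(allocation)
```

A real auto-scaler would measure stage timings while the pipeline runs and rebalance continuously; this sketch computes a single static allocation from fixed numbers.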

Writer Output Layout

Writer stages produce the following directories under the configured output path:

  • clips/: MP4 clip files
  • filtered_clips/: MP4 files for filtered clips
  • previews/: WebP preview images for clip windows
  • metas/v0/: Per-clip JSON metadata files
  • ce1_embd/: Per-clip embeddings (pickle) for Cosmos-Embed1
  • ce1_embd_parquet/: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
  • processed_videos/: Per-video JSON metadata files
  • processed_clip_chunks/: Per-clip-chunk JSON statistics
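
As a quick sanity check after a run, the layout above can be inspected with a few lines of standard-library Python. This is a sketch rather than a NeMo Curator utility: `summarize_output` is a hypothetical helper, and the only things taken from the documentation are the directory names themselves. File names and extensions inside each directory are not assumed; the helper just counts regular files.

```python
from pathlib import Path

# Directory names copied from the writer output layout listed above.
WRITER_SUBDIRS = [
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "ce1_embd",
    "ce1_embd_parquet",
    "processed_videos",
    "processed_clip_chunks",
]


def summarize_output(output_path: str) -> dict[str, int]:
    """Count regular files under each expected directory (0 if it is missing)."""
    root = Path(output_path)
    counts: dict[str, int] = {}
    for subdir in WRITER_SUBDIRS:
        directory = root / subdir
        if directory.is_dir():
            counts[subdir] = sum(1 for path in directory.rglob("*") if path.is_file())
        else:
            counts[subdir] = 0
    return counts
```

For example, `summarize_output("/path/to/output")` (a placeholder path) returns a dictionary keyed by directory name, which makes it easy to spot a stage that wrote nothing.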