Copyright © 2026, NVIDIA Corporation.

Data Flow

Understanding how data moves through NeMo Curator’s video curation pipelines is key to optimizing performance and resource usage.

  • Data moves between stages via Ray’s distributed object store, enabling efficient, in-memory transfer between distributed actors.
  • In streaming mode (where stages operate continuously rather than in batches), the executor returns only final-stage outputs and keeps intermediate state in memory. Because intermediate results are not written out between stages, I/O overhead drops and throughput improves.
  • The auto-scaling component continuously balances resources to maximize pipeline throughput, dynamically allocating workers to stages as needed.
  • Writer stages persist outputs at the end of the pipeline, including clip media, embeddings (pickle and parquet variants), and metadata JSON files.

Together, in-memory transfer, streaming execution, and auto-scaling let pipelines process large-scale video datasets with little data movement and high utilization of the available hardware.
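
The streaming behavior described above can be illustrated with a small, self-contained sketch. This is not NeMo Curator's API and does not use Ray: the stage functions (`split_clips`, `embed`) and the `run_streaming` helper are hypothetical names chosen for illustration, and plain Python generators stand in for distributed actors. The sketch only shows the general idea that items flow from stage to stage in memory and the caller receives the final-stage outputs alone.

```python
from typing import Callable, Iterable, Iterator, List

# A stage consumes a stream of items and yields a stream of items.
Stage = Callable[[Iterator[dict]], Iterator[dict]]


def split_clips(videos: Iterator[dict]) -> Iterator[dict]:
    """Hypothetical stage: emit one item per clip of each video."""
    for video in videos:
        for index in range(video["num_clips"]):
            yield {"video": video["name"], "clip": index}


def embed(clips: Iterator[dict]) -> Iterator[dict]:
    """Hypothetical stage: attach a placeholder embedding to each clip."""
    for clip in clips:
        yield {**clip, "embedding": [float(clip["clip"])]}


def run_streaming(source: Iterable[dict], stages: List[Stage]) -> List[dict]:
    """Chain stages lazily; only the final stage's outputs are materialized."""
    stream: Iterator[dict] = iter(source)
    for stage in stages:
        stream = stage(stream)  # intermediate items are never collected
    return list(stream)


videos = [
    {"name": "a.mp4", "num_clips": 2},
    {"name": "b.mp4", "num_clips": 1},
]
final_outputs = run_streaming(videos, [split_clips, embed])
```

Each clip passes through `embed` as soon as `split_clips` yields it, so no intermediate list of clips is ever built. In a real pipeline the hand-off would go through Ray's object store between actors rather than through generator calls in one process.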

Writer Output Layout

Writer stages produce the following directories under the configured output path:

  • clips/: MP4 clip files
  • filtered_clips/: MP4 files for filtered clips
  • previews/: WebP preview images for windows
  • metas/v0/: Per-clip JSON metadata files
  • ce1_embd/: Per-clip embeddings (pickle) for Cosmos-Embed1
  • ce1_embd_parquet/: Aggregated per-video embeddings (parquet) for Cosmos-Embed1
  • processed_videos/: Per-video JSON metadata files
  • processed_clip_chunks/: Per-clip-chunk JSON statistics
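
As a quick sanity check after a run, the layout above can be inspected with a short script. The following is a minimal sketch using only the Python standard library: the directory names come from the list above, while the `summarize_output` helper and the example path are hypothetical. It makes no assumption about file names or extensions inside each directory and simply counts files.

```python
from pathlib import Path
from typing import Dict, Optional, Union

# Directory names taken from the writer output layout listed above.
WRITER_DIRS = [
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "ce1_embd",
    "ce1_embd_parquet",
    "processed_videos",
    "processed_clip_chunks",
]


def summarize_output(root: Union[str, Path]) -> Dict[str, Optional[int]]:
    """Count files under each expected directory; None marks a missing one."""
    root_path = Path(root)
    summary: Dict[str, Optional[int]] = {}
    for name in WRITER_DIRS:
        directory = root_path / name
        if not directory.is_dir():
            summary[name] = None
            continue
        # Count regular files at any depth below the directory.
        summary[name] = sum(1 for p in directory.rglob("*") if p.is_file())
    return summary


if __name__ == "__main__":
    # Hypothetical output path; replace with the configured output path.
    for name, count in summarize_output("/path/to/output").items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"{name + '/':<24} {status}")
```

A directory reported as missing is not necessarily an error, since which directories appear may depend on the stages enabled in the pipeline.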