Save and Export#

NeMo Curator writes clips, metadata, previews, and embeddings to a structured output directory. Use this guide to add the writer to your pipeline, understand the directories it creates, and prepare artifacts for training.

Writer Stage#

Use ClipWriterStage as the final stage in your pipeline.

from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

pipeline.add_stage(
    ClipWriterStage(
        output_path=OUT_DIR,
        input_path=VIDEO_DIR,
        upload_clips=True,
        dry_run=False,
        generate_embeddings=True,
        generate_previews=False,
        generate_captions=False,
        embedding_algorithm="cosmos-embed1",  # or "internvideo2"
        caption_models=["qwen"],
        enhanced_caption_models=["qwen_lm"],
        verbose=True,
    )
)

Parameters#

  • output_path (str): Base directory or URI for outputs.

  • input_path (str): Root of the input videos; used to derive processed metadata paths. Must be a prefix of the input video paths.

  • upload_clips (bool): Write .mp4 clips to clips/ and filtered clips to filtered_clips/.

  • dry_run (bool): Skip writing clip bytes, preview images, embeddings, and per-clip metadata. The stage still writes video-level and chunk-level metadata.

  • generate_embeddings (bool): When true, the stage logs an error if embeddings for the selected algorithm are missing. When embeddings exist, the stage writes per-clip pickles and per-chunk Parquet files.

  • generate_previews (bool): When true, the stage logs an error for missing preview bytes and writes .webp images when present.

  • generate_captions (bool): The stage includes captions in metadata when upstream stages provide them.

  • embedding_algorithm (str): Accepted values: cosmos-embed1 or internvideo2. Default: cosmos-embed1.

  • caption_models (list[str] | None): Ordered caption models to emit. Use [] when not using captions.

  • enhanced_caption_models (list[str] | None): Ordered enhancement models to emit. Use [] when not using enhanced captions.

  • verbose (bool): Emit detailed logs.

  • max_workers (int): Thread pool size for writing.

  • log_stats (bool): Reserved for future detailed stats logging.

Output Directories#

The writer produces these directories under output_path:

  • clips/: Encoded clip media (.mp4).

  • filtered_clips/: Media for filtered-out clips.

  • previews/: Preview images (.webp).

  • metas/v0/: Per-clip metadata (.json).

  • iv2_embd/, ce1_embd/: Per-clip embeddings (.pickle).

  • iv2_embd_parquet/, ce1_embd_parquet/: Parquet batches with columns id and embedding.

  • processed_videos/, processed_clip_chunks/: Video-level metadata and per-chunk statistics.
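
As a quick sanity check after a run, you can list which of these directories exist. The following sketch uses only the standard library and assumes output_path is a local directory (adjust for object-store URIs); not every directory appears in every run, for example previews/ requires generate_previews=True and only one embedding directory pair is populated depending on embedding_algorithm.

from pathlib import Path

OUT = Path("/outputs")  # hypothetical local output_path

expected = [
    "clips", "filtered_clips", "previews", "metas/v0",
    "iv2_embd", "ce1_embd", "iv2_embd_parquet", "ce1_embd_parquet",
    "processed_videos", "processed_clip_chunks",
]

for name in expected:
    path = OUT / name
    status = "ok" if path.is_dir() else "missing"
    print(f"{status:8s} {path}")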

Per-Clip Metadata#

The writer emits one JSON file per clip under metas/v0/ with clip- and window-level fields:

{
  "span_uuid": "d2d0b3d1-...",
  "source_video": "/data/videos/vid.mp4",
  "duration_span": [0.0, 5.0],
  "width_source": 1920,
  "height_source": 1080,
  "framerate_source": 30.0,
  "clip_location": "/outputs/clips/d2/d2d0b3d1-....mp4",
  "motion_score": { "global_mean": 0.51, "per_patch_min_256": 0.29 },
  "aesthetic_score": 0.72,
  "windows": [
    {
      "start_frame": 0,
      "end_frame": 30,
      "qwen_caption": "A person walks across a room",
      "qwen_lm_enhanced_caption": "A person briskly crosses a bright modern room"
    }
  ],
  "valid": true
}
  • Caption keys follow <model>_caption and <model>_enhanced_caption, based on caption_models and enhanced_caption_models.

  • With dry_run=True, per-clip metadata is not written. Video- and chunk-level metadata are still written.

  • The stage writes video-level metadata and per-chunk stats to processed_videos/ and processed_clip_chunks/.
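
To prepare training manifests from these files, you can walk metas/v0/ and select clips by the fields shown above. The following is a minimal sketch, assuming a local output_path and the field names from the sample record; the 0.5 aesthetic threshold is purely illustrative.

import json
from pathlib import Path

metas_dir = Path("/outputs/metas/v0")  # hypothetical local output_path

selected = []
for meta_file in metas_dir.rglob("*.json"):
    record = json.loads(meta_file.read_text())
    # Keep valid clips above an illustrative aesthetic-score threshold.
    if record.get("valid") and record.get("aesthetic_score", 0.0) >= 0.5:
        captions = [w.get("qwen_caption", "") for w in record.get("windows", [])]
        selected.append({"clip": record["clip_location"], "captions": captions})

print(f"Selected {len(selected)} clips")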

Embeddings and Parquet Outputs#

  • When embeddings exist, the stage writes per-clip .pickle files under iv2_embd/ or ce1_embd/.

  • The stage also batches embeddings per clip chunk into Parquet files under iv2_embd_parquet/ or ce1_embd_parquet/ with columns id and embedding.
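
To load these embeddings for downstream indexing or training, you can read the Parquet batches directly. A sketch using pandas and NumPy, assuming a local cosmos-embed1 run (use iv2_embd_parquet/ for internvideo2); the file layout under the directory is an assumption here, so the glob pattern may need adjusting.

from pathlib import Path

import numpy as np
import pandas as pd

parquet_dir = Path("/outputs/ce1_embd_parquet")  # hypothetical local output_path

# Each file holds one chunk of clips with columns: id, embedding.
frames = [pd.read_parquet(p) for p in sorted(parquet_dir.rglob("*.parquet"))]
table = pd.concat(frames, ignore_index=True)

clip_ids = table["id"].tolist()
embeddings = np.stack(table["embedding"].tolist())  # shape: (num_clips, dim)
print(embeddings.shape)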

Helpers#

Resolve Paths Programmatically#

Use helpers to construct paths consistently:

from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

OUT = "/outputs"

clips_dir = ClipWriterStage.get_output_path_clips(OUT)
filtered_clips_dir = ClipWriterStage.get_output_path_clips(OUT, filtered=True)
previews_dir = ClipWriterStage.get_output_path_previews(OUT)
metas_dir = ClipWriterStage.get_output_path_metas(OUT, "v0")
iv2_parquet_dir = ClipWriterStage.get_output_path_iv2_embd_parquet(OUT)
ce1_parquet_dir = ClipWriterStage.get_output_path_ce1_embd_parquet(OUT)
processed_videos_dir = ClipWriterStage.get_output_path_processed_videos(OUT)
processed_chunks_dir = ClipWriterStage.get_output_path_processed_clip_chunks(OUT)