Save and Export

NeMo Curator writes clips, metadata, previews, and embeddings to a structured output directory. Use this guide to add the writer to your pipeline, understand the directories it creates, and prepare artifacts for training.

Writer Stage

Use ClipWriterStage as the final stage in your pipeline.

    from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

    # Assumes `pipeline`, OUT_DIR, and VIDEO_DIR are defined earlier in your script.
    pipeline.add_stage(
        ClipWriterStage(
            output_path=OUT_DIR,
            input_path=VIDEO_DIR,
            upload_clips=True,
            dry_run=False,
            generate_embeddings=True,
            generate_previews=False,
            generate_captions=False,
            embedding_algorithm="cosmos-embed1-224p",
            caption_models=["qwen"],
            enhanced_caption_models=["qwen_lm"],
            verbose=True,
        )
    )

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| output_path | str | Base directory or URI for outputs. |
| input_path | str | Root of input videos; used to derive processed metadata paths. Must be a prefix of input video paths. |
| upload_clips | bool | Write .mp4 clips to clips/ and filtered clips to filtered_clips/. |
| dry_run | bool | Skip writing clip bytes, preview images, embeddings, and per-clip metadata. The stage still writes video-level and chunk-level metadata. |
| generate_embeddings | bool | When true, the stage logs errors if embeddings for the selected algorithm are missing. When embeddings exist, the stage writes per-clip pickles and per-chunk Parquet files. |
| generate_previews | bool | When true, the stage logs errors for missing preview bytes and writes .webp images when present. |
| generate_captions | bool | The stage includes captions in metadata when upstream stages provide them. |
| embedding_algorithm | str | Accepted: cosmos-embed1-224p, cosmos-embed1-336p, or cosmos-embed1-448p. Default: cosmos-embed1-224p. |
| caption_models | list[str] \| None | Ordered caption models to emit. Use [] when not using captions. |
| enhanced_caption_models | list[str] \| None | Ordered enhancement models to emit. Use [] when not using enhanced captions. |
| verbose | bool | Emit detailed logs. |
| max_workers | int | Thread pool size for writing. |
| log_stats | bool | Reserved for future detailed stats logging. |
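
Two constraints in the table are easy to violate: the accepted embedding_algorithm values and the requirement that input_path be a prefix of every input video path. The sketch below is one possible pre-flight check using only the standard library. The helper name check_writer_args is hypothetical and is not part of NeMo Curator. It assumes POSIX-style local paths rather than URIs, and it compares whole path components, which may be stricter than the check the stage itself performs.

```python
from pathlib import PurePosixPath

# Accepted values as listed in the parameter table above.
ACCEPTED_EMBEDDING_ALGORITHMS = {
    "cosmos-embed1-224p",
    "cosmos-embed1-336p",
    "cosmos-embed1-448p",
}


def check_writer_args(input_path, video_paths, embedding_algorithm="cosmos-embed1-224p"):
    """Return a list of problems; an empty list means the arguments look consistent.

    Hypothetical helper for illustration only; not part of NeMo Curator.
    """
    problems = []
    if embedding_algorithm not in ACCEPTED_EMBEDDING_ALGORITHMS:
        problems.append(f"unsupported embedding_algorithm: {embedding_algorithm!r}")

    root = PurePosixPath(input_path)
    for video in video_paths:
        # is_relative_to (Python 3.9+) compares whole path components, so
        # "/data/videos2/a.mp4" is rejected for the root "/data/videos".
        if not PurePosixPath(video).is_relative_to(root):
            problems.append(f"{video} is not under input_path {input_path}")
    return problems
```

Running this check before constructing ClipWriterStage surfaces path mismatches early, rather than partway through a long pipeline run.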

Output Directories

The writer produces these directories under output_path:

  • clips/: Encoded clip media (.mp4).
  • filtered_clips/: Media for filtered-out clips.
  • previews/: Preview images (.webp).
  • metas/v0/: Per-clip metadata (.json).
  • ce1_embd/: Per-clip embeddings (.pickle).
  • ce1_embd_parquet/: Parquet batches with columns id and embedding.
  • processed_videos/, processed_clip_chunks/: Video-level metadata and per-chunk statistics.
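
After a run, a quick inventory of these directories can confirm that the flags you set produced the artifacts you expected. The following is a minimal sketch under stated assumptions: output_path is a local filesystem directory (not a remote URI), and the function name summarize_outputs is hypothetical. The subdirectory names are taken from the list above.

```python
from pathlib import Path

# Directory names as listed above, relative to output_path.
OUTPUT_SUBDIRS = [
    "clips",
    "filtered_clips",
    "previews",
    "metas/v0",
    "ce1_embd",
    "ce1_embd_parquet",
    "processed_videos",
    "processed_clip_chunks",
]


def summarize_outputs(output_path):
    """Count regular files under each expected subdirectory, recursively.

    Missing directories are reported with a count of 0.
    """
    base = Path(output_path)
    summary = {}
    for name in OUTPUT_SUBDIRS:
        directory = base / name
        if directory.is_dir():
            summary[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            summary[name] = 0
    return summary
```

For example, a count of 0 for previews is expected when generate_previews=False, while a count of 0 for clips with upload_clips=True suggests something went wrong upstream.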

Per-Clip Metadata

The stage writes one JSON file per clip under metas/v0/, containing clip-level and window-level fields:

    {
      "span_uuid": "d2d0b3d1-...",
      "source_video": "/data/videos/vid.mp4",
      "duration_span": [0.0, 5.0],
      "width_source": 1920,
      "height_source": 1080,
      "framerate_source": 30.0,
      "clip_location": "/outputs/clips/d2/d2d0b3d1-....mp4",
      "motion_score": { "global_mean": 0.51, "per_patch_min_256": 0.29 },
      "aesthetic_score": 0.72,
      "windows": [
        {
          "start_frame": 0,
          "end_frame": 30,
          "qwen_caption": "A person walks across a room",
          "qwen_lm_enhanced_caption": "A person briskly crosses a bright modern room"
        }
      ],
      "valid": true
    }

  • Caption keys follow <model>_caption and <model>_enhanced_caption, based on caption_models and enhanced_caption_models.
  • With dry_run=True, per-clip metadata is not written. The stage still writes video-level metadata and per-chunk stats to processed_videos/ and processed_clip_chunks/.
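
To illustrate the caption key convention, the sketch below walks a metadata directory and collects one record per window that carries a caption for a given model. It relies only on the field names shown in the example JSON above. The function name iter_window_captions is hypothetical, and the code assumes the metadata files are readable from the local filesystem.

```python
import json
from pathlib import Path


def iter_window_captions(metas_dir, caption_model="qwen"):
    """Yield one dict per window that has a caption for caption_model.

    Hypothetical helper; the key name follows the <model>_caption convention.
    """
    key = f"{caption_model}_caption"
    # rglob handles both flat and nested layouts under metas_dir.
    for meta_file in sorted(Path(metas_dir).rglob("*.json")):
        with meta_file.open(encoding="utf-8") as f:
            meta = json.load(f)
        for window in meta.get("windows", []):
            caption = window.get(key)
            if caption:
                yield {
                    "span_uuid": meta.get("span_uuid"),
                    "clip_location": meta.get("clip_location"),
                    "start_frame": window.get("start_frame"),
                    "end_frame": window.get("end_frame"),
                    "caption": caption,
                }
```

Windows without the requested key are skipped, so the same function works on runs where upstream captioning stages were disabled; it simply yields nothing.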

Embeddings and Parquet Outputs

  • When embeddings exist, the stage writes per-clip .pickle files under ce1_embd/.
  • The stage also batches the embeddings for each clip chunk into Parquet files under ce1_embd_parquet/ with columns id and embedding.
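
The per-clip pickles can be loaded back with the standard library. The sketch below makes two assumptions that this guide does not confirm: that each file's stem identifies the clip, and that each file holds a single array-like object. The function name load_clip_embeddings is hypothetical.

```python
import pickle
from pathlib import Path


def load_clip_embeddings(embd_dir):
    """Map each .pickle file's stem to the object stored in that file.

    Hypothetical helper; assumes one pickled object per file.
    """
    embeddings = {}
    for pickle_file in sorted(Path(embd_dir).rglob("*.pickle")):
        with pickle_file.open("rb") as f:
            # pickle.load can execute arbitrary code, so only load
            # files that come from a pipeline run you trust.
            embeddings[pickle_file.stem] = pickle.load(f)
    return embeddings
```

For bulk access, the Parquet batches are usually the more convenient format, since a general-purpose reader such as pandas.read_parquet can load the id and embedding columns for a whole chunk at once.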

Helpers

Resolve Paths Programmatically

Use helpers to construct paths consistently:

    from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

    OUT = "/outputs"

    clips_dir = ClipWriterStage.get_output_path_clips(OUT)
    filtered_clips_dir = ClipWriterStage.get_output_path_clips(OUT, filtered=True)
    previews_dir = ClipWriterStage.get_output_path_previews(OUT)
    metas_dir = ClipWriterStage.get_output_path_metas(OUT, "v0")
    ce1_parquet_dir = ClipWriterStage.get_output_path_ce1_embd_parquet(OUT)
    processed_videos_dir = ClipWriterStage.get_output_path_processed_videos(OUT)
    processed_chunks_dir = ClipWriterStage.get_output_path_processed_clip_chunks(OUT)