Save and Export | NeMo Curator

NeMo Curator writes clips, metadata, previews, and embeddings to a structured output directory. Use this guide to add the writer to your pipeline, understand the directories it creates, and prepare artifacts for training.

Writer Stage

Use ClipWriterStage as the final stage in your pipeline.

1 from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
2 
3 pipeline.add_stage(
4     ClipWriterStage(
5         output_path=OUT_DIR,
6         input_path=VIDEO_DIR,
7         upload_clips=True,
8         dry_run=False,
9         generate_embeddings=True,
10         generate_previews=False,
11         generate_captions=False,
12         embedding_algorithm="cosmos-embed1-224p",
13         caption_models=["qwen"],
14         enhanced_caption_models=["qwen_lm"],
15         verbose=True,
16     )
17 )

Parameters

Parameter	Type	Description
`output_path`	`str`	Base directory or URI for outputs.
`input_path`	`str`	Root of input videos; used to derive processed metadata paths. Must be a prefix of input video paths.
`upload_clips`	`bool`	Write `.mp4` clips to `clips/` and filtered clips to `filtered_clips/`.
`dry_run`	`bool`	Skip writing clip bytes, preview images, embeddings, and per-clip metadata. The stage still writes video-level and chunk-level metadata.
`generate_embeddings`	`bool`	When true, the stage logs errors if embeddings for the selected algorithm are missing. When embeddings exist, the stage writes per-clip pickles and per-chunk Parquet files.
`generate_previews`	`bool`	When true, the stage logs errors for missing preview bytes and writes `.webp` images when present.
`generate_captions`	`bool`	The stage includes captions in metadata when upstream stages provide them.
`embedding_algorithm`	`str`	Accepted: `cosmos-embed1-224p`, `cosmos-embed1-336p`, or `cosmos-embed1-448p`. Default: `cosmos-embed1-224p`.
`caption_models`	`list[str] \| None`	Ordered caption models to emit. Use `[]` when not using captions.
`enhanced_caption_models`	`list[str] \| None`	Ordered enhancement models to emit. Use `[]` when not using enhanced captions.
`verbose`	`bool`	Emit detailed logs.
`max_workers`	`int`	Thread pool size for writing.
`log_stats`	`bool`	Reserved for future detailed stats logging.

Output Directories

The writer produces these directories under output_path:

clips/: Encoded clip media (.mp4).
filtered_clips/: Media for filtered-out clips.
previews/: Preview images (.webp).
metas/v0/: Per-clip metadata (.json).
ce1_embd/: Per-clip embeddings (.pickle).
ce1_embd_parquet/: Parquet batches with columns id and embedding.
processed_videos/, processed_clip_chunks/: Video-level metadata and per-chunk statistics.

Per-Clip Metadata

Each clip writes a JSON file under metas/v0/ with clip- and window-level fields:

1 {
2   "span_uuid": "d2d0b3d1-...",
3   "source_video": "/data/videos/vid.mp4",
4   "duration_span": [0.0, 5.0],
5   "width_source": 1920,
6   "height_source": 1080,
7   "framerate_source": 30.0,
8   "clip_location": "/outputs/clips/d2/d2d0b3d1-....mp4",
9   "motion_score": { "global_mean": 0.51, "per_patch_min_256": 0.29 },
10   "aesthetic_score": 0.72,
11   "windows": [
12     {
13       "start_frame": 0,
14       "end_frame": 30,
15       "qwen_caption": "A person walks across a room",
16       "qwen_lm_enhanced_caption": "A person briskly crosses a bright modern room"
17     }
18   ],
19   "valid": true
20 }

Caption keys follow <model>_caption and <model>_enhanced_caption, based on caption_models and enhanced_caption_models.
With dry_run=True, per-clip metadata is not written. Video- and chunk-level metadata are still written.
The stage writes video-level metadata and per-chunk stats to processed_videos/ and processed_clip_chunks/.

Embeddings and Parquet outputs

When embeddings exist, the stage writes per-clip .pickle files under ce1_embd/.
The stage also batches embeddings per clip chunk into Parquet files under ce1_embd_parquet/ with columns id and embedding and writes those files to disk.

Helpers

Resolve Paths Programmatically

Use helpers to construct paths consistently:

1 from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
2 
3 OUT = "/outputs"
4 
5 clips_dir = ClipWriterStage.get_output_path_clips(OUT)
6 filtered_clips_dir = ClipWriterStage.get_output_path_clips(OUT, filtered=True)
7 previews_dir = ClipWriterStage.get_output_path_previews(OUT)
8 metas_dir = ClipWriterStage.get_output_path_metas(OUT, "v0")
9 ce1_parquet_dir = ClipWriterStage.get_output_path_ce1_embd_parquet(OUT)
10 processed_videos_dir = ClipWriterStage.get_output_path_processed_videos(OUT)
11 processed_chunks_dir = ClipWriterStage.get_output_path_processed_clip_chunks(OUT)