Save and Export#
NeMo Curator writes clips, metadata, previews, and embeddings to a structured output directory. Use this guide to add the writer to your pipeline, understand the directories it creates, and prepare artifacts for training.
Writer Stage#
Use ClipWriterStage as the final stage in your pipeline.
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

pipeline.add_stage(
    ClipWriterStage(
        output_path=OUT_DIR,
        input_path=VIDEO_DIR,
        upload_clips=True,
        dry_run=False,
        generate_embeddings=True,
        generate_previews=False,
        generate_captions=False,
        embedding_algorithm="cosmos-embed1",  # or "internvideo2"
        caption_models=["qwen"],
        enhanced_caption_models=["qwen_lm"],
        verbose=True,
    )
)
Parameters#
| Parameter | Type | Description |
|---|---|---|
| output_path | | Base directory or URI for outputs. |
| input_path | | Root of input videos; used to derive processed metadata paths. Must be a prefix of input video paths. |
| upload_clips | bool | Write encoded clip media (.mp4). |
| dry_run | bool | Skip writing clip bytes, preview images, embeddings, and per-clip metadata. The stage still writes video-level and chunk-level metadata. |
| generate_embeddings | bool | When true, the stage logs errors if embeddings for the selected algorithm are missing. When embeddings exist, the stage writes per-clip pickles and per-chunk Parquet files. |
| generate_previews | bool | When true, the stage logs errors for missing preview bytes and writes preview images (.webp) when preview bytes exist. |
| generate_captions | bool | The stage includes captions in metadata when upstream stages provide them. |
| embedding_algorithm | str | Accepted: cosmos-embed1, internvideo2. |
| caption_models | list[str] | Ordered caption models to emit. |
| enhanced_caption_models | list[str] | Ordered enhancement models to emit. |
| verbose | bool | Emit detailed logs. |
| | | Thread pool size for writing. |
| | | Reserved for future detailed stats logging. |
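The prefix requirement on input_path can be checked before the pipeline runs. The following is a minimal sketch using only the Python standard library. The helper name is_under_root and the example paths are hypothetical and not part of NeMo Curator, and the check covers POSIX-style filesystem paths only, not URIs.

```python
from pathlib import PurePosixPath


def is_under_root(video_path: str, input_root: str) -> bool:
    """Return True if input_root is a path prefix of video_path."""
    video = PurePosixPath(video_path)
    root = PurePosixPath(input_root)
    # is_relative_to compares whole path components, so "/data/videos2"
    # is not treated as being under "/data/videos".
    return video.is_relative_to(root)


# Hypothetical inputs: the second path is outside the chosen root.
videos = ["/data/videos/a/vid.mp4", "/data/videos2/other.mp4"]
outside = [v for v in videos if not is_under_root(v, "/data/videos")]
print(outside)
```

Comparing path components rather than raw strings avoids false matches on sibling directories that share a name prefix.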
Output Directories#
The writer produces these directories under output_path:
- clips/: Encoded clip media (.mp4).
- filtered_clips/: Media for filtered-out clips.
- previews/: Preview images (.webp).
- metas/v0/: Per-clip metadata (.json).
- iv2_embd/, ce1_embd/: Per-clip embeddings (.pickle).
- iv2_embd_parquet/, ce1_embd_parquet/: Parquet batches with columns id and embedding.
- processed_videos/, processed_clip_chunks/: Video-level metadata and per-chunk statistics.
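After a run, it can help to confirm which of these directories were actually populated. The sketch below counts files per directory with the standard library. The function name summarize_outputs is hypothetical, and the code assumes output_path is a local filesystem directory rather than a remote URI.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Directory names taken from the list above. Which embedding directories
# exist depends on the embedding algorithm used, so some may be absent.
EXPECTED_DIRS = [
    "clips", "filtered_clips", "previews", "metas/v0",
    "iv2_embd", "ce1_embd", "iv2_embd_parquet", "ce1_embd_parquet",
    "processed_videos", "processed_clip_chunks",
]


def summarize_outputs(output_path: str) -> dict[str, Optional[int]]:
    """Map each expected directory to its file count, or None if missing."""
    root = Path(output_path)
    summary: dict[str, Optional[int]] = {}
    for name in EXPECTED_DIRS:
        directory = root / name
        if directory.is_dir():
            # rglob also counts files in nested subdirectories.
            summary[name] = sum(1 for p in directory.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


# Demo on a throwaway directory that mimics a small part of the layout.
with tempfile.TemporaryDirectory() as tmp:
    clip = Path(tmp) / "clips" / "d2" / "example.mp4"
    clip.parent.mkdir(parents=True)
    clip.write_bytes(b"")
    demo_summary = summarize_outputs(tmp)

print(demo_summary["clips"], demo_summary["previews"])
```

A value of None distinguishes a directory that was never created from one that exists but is empty, which is useful when a generate_* flag was turned off.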
Per-Clip Metadata#
Each clip writes a JSON file under metas/v0/ with clip- and window-level fields:
{
  "span_uuid": "d2d0b3d1-...",
  "source_video": "/data/videos/vid.mp4",
  "duration_span": [0.0, 5.0],
  "width_source": 1920,
  "height_source": 1080,
  "framerate_source": 30.0,
  "clip_location": "/outputs/clips/d2/d2d0b3d1-....mp4",
  "motion_score": { "global_mean": 0.51, "per_patch_min_256": 0.29 },
  "aesthetic_score": 0.72,
  "windows": [
    {
      "start_frame": 0,
      "end_frame": 30,
      "qwen_caption": "A person walks across a room",
      "qwen_lm_enhanced_caption": "A person briskly crosses a bright modern room"
    }
  ],
  "valid": true
}
- Caption keys follow <model>_caption and <model>_enhanced_caption, based on caption_models and enhanced_caption_models.
- With dry_run=True, per-clip metadata is not written. Video- and chunk-level metadata are still written.
- The stage writes video-level metadata and per-chunk stats to processed_videos/ and processed_clip_chunks/.
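The key-naming rule for captions can be illustrated by reading one metadata record. This is a minimal sketch that parses a trimmed copy of the example record above. The helper names clip_duration and window_captions are hypothetical, and treating duration_span as a [start, end] pair in seconds is an assumption.

```python
import json

# Trimmed copy of the example per-clip record shown above.
record = json.loads("""
{
  "span_uuid": "d2d0b3d1-...",
  "duration_span": [0.0, 5.0],
  "windows": [
    {
      "start_frame": 0,
      "end_frame": 30,
      "qwen_caption": "A person walks across a room",
      "qwen_lm_enhanced_caption": "A person briskly crosses a bright modern room"
    }
  ],
  "valid": true
}
""")


def clip_duration(meta: dict) -> float:
    """Clip length, assuming duration_span is [start, end] in seconds."""
    start, end = meta["duration_span"]
    return end - start


def window_captions(meta: dict, model: str, enhanced: bool = False) -> list[str]:
    """Collect <model>_caption or <model>_enhanced_caption from each window."""
    suffix = "_enhanced_caption" if enhanced else "_caption"
    key = f"{model}{suffix}"
    # Windows without the requested key are skipped rather than failing.
    return [w[key] for w in meta.get("windows", []) if key in w]


print(clip_duration(record))
print(window_captions(record, "qwen"))
print(window_captions(record, "qwen_lm", enhanced=True))
```

Building the key from the model name keeps downstream code in step with whatever values were passed as caption_models and enhanced_caption_models.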
Embeddings and Parquet Outputs#
- When embeddings exist, the stage writes per-clip .pickle files under iv2_embd/ or ce1_embd/.
- The stage also batches embeddings per clip chunk and writes them as Parquet files under iv2_embd_parquet/ or ce1_embd_parquet/, with columns id and embedding.
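To consume these outputs, you first need the directories that match the embedding algorithm. The sketch below is a standard-library illustration under stated assumptions: the mapping from embedding_algorithm values to the iv2 and ce1 prefixes is inferred from the abbreviations and is not confirmed here, the helper names are hypothetical, and the content of each pickle is treated as an opaque Python object.

```python
import pickle
import tempfile
from pathlib import Path

# Assumed mapping from embedding_algorithm values to directory prefixes,
# inferred from the abbreviations; verify it against your own output tree.
ALGORITHM_PREFIX = {"internvideo2": "iv2", "cosmos-embed1": "ce1"}


def embedding_dirs(output_path: str, algorithm: str) -> tuple[Path, Path]:
    """Return (per-clip pickle directory, Parquet directory) for one algorithm."""
    prefix = ALGORITHM_PREFIX[algorithm]
    root = Path(output_path)
    return root / f"{prefix}_embd", root / f"{prefix}_embd_parquet"


def load_clip_embeddings(pickle_dir: Path) -> dict[str, object]:
    """Load every .pickle file under pickle_dir, keyed by file stem.

    Only unpickle files you produced yourself: pickle can execute code.
    """
    embeddings: dict[str, object] = {}
    for path in sorted(pickle_dir.rglob("*.pickle")):
        with path.open("rb") as handle:
            embeddings[path.stem] = pickle.load(handle)
    return embeddings


# Demo with a stand-in pickle; real files are written by the stage.
with tempfile.TemporaryDirectory() as tmp:
    pickle_dir, parquet_dir = embedding_dirs(tmp, "cosmos-embed1")
    pickle_dir.mkdir(parents=True)
    with (pickle_dir / "example-clip.pickle").open("wb") as handle:
        pickle.dump([0.1, 0.2, 0.3], handle)
    loaded = load_clip_embeddings(pickle_dir)
    pickle_dir_name, parquet_dir_name = pickle_dir.name, parquet_dir.name

print(pickle_dir_name, parquet_dir_name, loaded)
```

The Parquet directory can be read with any Parquet reader, for example pandas.read_parquet, which is usually more convenient than per-clip pickles when loading many embeddings at once.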
Helpers#
Resolve Paths Programmatically#
Use helpers to construct paths consistently:
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

OUT = "/outputs"

# Clip media, previews, and per-clip metadata
clips_dir = ClipWriterStage.get_output_path_clips(OUT)
filtered_clips_dir = ClipWriterStage.get_output_path_clips(OUT, filtered=True)
previews_dir = ClipWriterStage.get_output_path_previews(OUT)
metas_dir = ClipWriterStage.get_output_path_metas(OUT, "v0")

# Per-chunk embedding Parquet files
iv2_parquet_dir = ClipWriterStage.get_output_path_iv2_embd_parquet(OUT)
ce1_parquet_dir = ClipWriterStage.get_output_path_ce1_embd_parquet(OUT)

# Video-level metadata and per-chunk statistics
processed_videos_dir = ClipWriterStage.get_output_path_processed_videos(OUT)
processed_chunks_dir = ClipWriterStage.get_output_path_processed_clip_chunks(OUT)