NeMo Curator writes clips, metadata, previews, and embeddings to a structured output directory. Use this guide to add the writer to your pipeline, understand the directories it creates, and prepare artifacts for training.
Use ClipWriterStage as the final stage in your pipeline.
The writer produces these directories under output_path:
clips/: Encoded clip media (.mp4).
filtered_clips/: Media for filtered-out clips.
previews/: Preview images (.webp).
metas/v0/: Per-clip metadata (.json).
ce1_embd/: Per-clip embeddings (.pickle).
ce1_embd_parquet/: Parquet batches with columns id and embedding.
processed_videos/, processed_clip_chunks/: Video-level metadata and per-chunk statistics.

Each clip writes a JSON file under metas/v0/ with clip- and window-level fields:
<model>_caption and <model>_enhanced_caption, where <model> comes from caption_models and enhanced_caption_models.

When dry_run=True, per-clip metadata is not written. Video- and chunk-level metadata are still written.
Video-level metadata and per-chunk statistics are written to processed_videos/ and processed_clip_chunks/.
Per-clip embeddings are written as .pickle files under ce1_embd/.
The writer also batches embeddings into Parquet files under ce1_embd_parquet/ with columns id and embedding and writes those files to disk.

Use helpers to construct paths consistently:
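As a rough illustration of the layout described above, the following standalone sketch builds the same directory paths with pathlib. It does not use the NeMo Curator API: output_dir, clip_meta_path, and caption_keys are hypothetical helper names, and the <clip_id>.json file naming and the "mymodel" placeholder are assumptions made for the example. Only the directory names and the caption key pattern come from this guide.

```python
from pathlib import Path

# Directory names as listed in this guide. Everything else in this sketch
# (function names, file naming) is hypothetical, not the NeMo Curator API.
SUBDIRS = {
    "clips": "clips",
    "filtered_clips": "filtered_clips",
    "previews": "previews",
    "metas": "metas/v0",
    "embeddings": "ce1_embd",
    "embeddings_parquet": "ce1_embd_parquet",
    "processed_videos": "processed_videos",
    "processed_clip_chunks": "processed_clip_chunks",
}


def output_dir(output_path: str, kind: str) -> Path:
    """Return the directory for one artifact kind under output_path."""
    return Path(output_path) / SUBDIRS[kind]


def clip_meta_path(output_path: str, clip_id: str) -> Path:
    """Return the per-clip metadata path.

    Assumption: the file is named <clip_id>.json; check the files your
    pipeline actually writes before relying on this.
    """
    return output_dir(output_path, "metas") / f"{clip_id}.json"


def caption_keys(model: str) -> tuple[str, str]:
    """Return the <model>_caption and <model>_enhanced_caption key names."""
    return f"{model}_caption", f"{model}_enhanced_caption"


if __name__ == "__main__":
    out = "/outputs"  # placeholder output_path
    for kind in SUBDIRS:
        print(f"{kind}: {output_dir(out, kind).as_posix()}")
    print(clip_meta_path(out, "example-clip").as_posix())
    print(caption_keys("mymodel"))  # "mymodel" is a placeholder model name
```

Keeping the directory names in one mapping means downstream readers such as training data loaders do not hard-code strings like metas/v0 in several places. If your installed version of the writer exposes its own path helpers, prefer those over this sketch.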