Captions and Preview
Prepare inputs, generate captions, optionally enhance them, and produce preview images.
Choosing a Captioning Model
The video captioning pipeline supports two model families. Pick a variant based on quality, GPU memory, and throughput:
Caption enhancement (the optional second-pass LLM rewrite) uses Qwen-LM (--enhance-captions-algorithm qwen_lm).
Quickstart
Use the pipeline stages or the example script flags to prepare captions and preview images.
Pipeline Stage
Script Flags
To use Nemotron instead, set model_variant="nemotron" (or one of nemotron-bf16, nemotron-fp8, nemotron-nvfp4) on both CaptionPreparationStage and CaptionGenerationStage. Nemotron weights are auto-downloaded from Hugging Face on first use.
Preparation and previews
- Prepare caption inputs from each clip window. This step splits clips into fixed windows, formats model‑ready inputs for the chosen VLM (Qwen‑VL or Nemotron), and optionally stores per‑window mp4 bytes for previews.
- Optionally generate .webp previews from each window’s mp4 bytes for quick QA and review.
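The fixed-window split described above can be illustrated with a minimal sketch. It is not the stage's implementation: the names `WindowSpan`, `split_into_windows`, `window_size`, and `min_remainder` are hypothetical, and the rule for keeping a short trailing window is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowSpan:
    """Hypothetical frame range for one caption window."""
    start_frame: int  # inclusive
    end_frame: int    # exclusive


def split_into_windows(num_frames: int, window_size: int,
                       min_remainder: int = 1) -> list[WindowSpan]:
    """Split [0, num_frames) into consecutive fixed-size windows.

    A trailing partial window is kept only if it contains at least
    `min_remainder` frames (an assumed policy, not the library's).
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    spans = []
    for start in range(0, num_frames, window_size):
        end = min(start + window_size, num_frames)
        length = end - start
        if length == window_size or length >= min_remainder:
            spans.append(WindowSpan(start, end))
    return spans
```

For example, a 600-frame clip with 256-frame windows yields three spans when any remainder is kept, and two spans when the 88-frame tail falls below `min_remainder=128`.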
Parameters
CaptionPreparationStage
PreviewStage
Caption generation and enhancement
- Generate window‑level captions with the chosen VLM (Qwen‑VL or Nemotron). This stage reads clip.windows[*].qwen_llm_input (created earlier) and writes window.caption["qwen"] (or window.caption["nemotron"], depending on the variant).
- Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads window.caption["qwen"] and writes window.enhanced_caption["qwen_lm"].
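The read/write contract between the two stages can be sketched with plain Python objects. This is an illustration of the data flow only: the `Window` dataclass and both functions are hypothetical stand-ins, and the `vlm` and `llm` callables are placeholders for the real models.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Window:
    """Hypothetical stand-in holding only the fields named in the text."""
    qwen_llm_input: Optional[dict] = None
    caption: dict[str, str] = field(default_factory=dict)
    enhanced_caption: dict[str, str] = field(default_factory=dict)


def generate_captions(windows: list[Window], vlm: Callable[[dict], str],
                      key: str = "qwen") -> None:
    """Read each window's prepared input and write window.caption[key]."""
    for window in windows:
        if window.qwen_llm_input is None:
            continue  # nothing was prepared for this window
        window.caption[key] = vlm(window.qwen_llm_input)


def enhance_captions(windows: list[Window], llm: Callable[[str], str],
                     source_key: str = "qwen",
                     target_key: str = "qwen_lm") -> None:
    """Read window.caption[source_key] and write window.enhanced_caption[target_key]."""
    for window in windows:
        text = window.caption.get(source_key)
        if text:
            window.enhanced_caption[target_key] = llm(text)
```

Keying captions by model name keeps the first-pass caption available after enhancement, so both versions can be compared or exported.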
Parameters
CaptionGenerationStage
CaptionEnhancementStage
Preview Generation
Generate lightweight .webp previews for each caption window to support review and QA workflows. A dedicated PreviewStage reads per-window mp4 bytes and encodes WebP using ffmpeg.
Preview Parameters
- target_fps (default 1.0): Target frames per second for preview generation.
- target_height (default 240): Output height. Width auto-scales to preserve aspect ratio.
- compression_level (range 0–6, default 6): WebP compression level. 0 is lossless; higher values reduce size with lower quality.
- quality (range 0–100, default 50): WebP quality. Higher values increase quality and size.
- num_cpus_per_worker (default 4.0): Number of CPU threads mapped to ffmpeg -threads.
- verbose (default False): Emit more logs.
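To make the parameters concrete, the sketch below shows one plausible way they could map onto an ffmpeg command line. The function `build_preview_command` is hypothetical, and the exact command PreviewStage runs is not shown in this page, so treat the flag layout as an assumption and check it against your ffmpeg build.

```python
def build_preview_command(input_path: str, output_path: str,
                          target_fps: float = 1.0, target_height: int = 240,
                          compression_level: int = 6, quality: int = 50,
                          num_threads: int = 4) -> list[str]:
    """Build an ffmpeg argument list for a WebP preview (illustrative only)."""
    if not 0 <= compression_level <= 6:
        raise ValueError("compression_level must be in 0..6")
    if not 0 <= quality <= 100:
        raise ValueError("quality must be in 0..100")
    return [
        "ffmpeg", "-y",
        "-i", input_path,
        # Resample to the target frame rate; -1 lets ffmpeg derive the
        # width so the aspect ratio is preserved.
        "-vf", f"fps={target_fps},scale=-1:{target_height}",
        "-c:v", "libwebp",
        "-compression_level", str(compression_level),
        "-quality", str(quality),
        "-threads", str(int(num_threads)),
        output_path,
    ]
```

Building the command as a list (rather than a shell string) avoids quoting problems when paths contain spaces and can be passed directly to `subprocess.run`.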
Behavior notes:
- If the input frame rate is lower than target_fps or the input height is lower than target_height, the stage logs a warning and preview quality can degrade.
- If ffmpeg fails, the stage logs the error and skips assigning preview bytes for that window.
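The two behaviors above can be mimicked in a few lines. This is a sketch of the described policy, not the stage's code: `check_input` and `encode_or_skip` are hypothetical names, and the `encode` callable stands in for whatever actually invokes ffmpeg.

```python
import logging
import subprocess
from typing import Callable, Optional

logger = logging.getLogger("preview_sketch")


def check_input(input_fps: float, input_height: int,
                target_fps: float = 1.0, target_height: int = 240) -> list[str]:
    """Return (and log) a warning for each input property below its target."""
    messages = []
    if input_fps < target_fps:
        messages.append(f"input fps {input_fps} is below target_fps {target_fps}")
    if input_height < target_height:
        messages.append(f"input height {input_height} is below target_height {target_height}")
    for message in messages:
        logger.warning(message)
    return messages


def encode_or_skip(encode: Callable[[bytes], bytes],
                   mp4_bytes: bytes) -> Optional[bytes]:
    """Run the encoder; on an ffmpeg failure, log the error and return None."""
    try:
        return encode(mp4_bytes)
    except subprocess.CalledProcessError as exc:
        logger.error("ffmpeg failed: %s", exc)
        return None
```

Returning `None` instead of raising lets the caller leave the window's preview unset and continue with the remaining windows.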
Example: Configure PreviewStage
Outputs
The stage writes .webp files under the previews/ directory that ClipWriterStage manages. Use the helper to resolve the path:
Refer to Save & Export for directory structure and file locations.
Requirements and Troubleshooting
- ffmpeg with WebP (libwebp) support must be available in the environment.
- If you observe warnings about low frame rate or height, consider lowering target_fps or target_height to better match inputs.
- On encoding errors, check logs for the ffmpeg command and output to diagnose missing encoders.
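One way to verify the libwebp requirement before running the pipeline is to inspect the output of `ffmpeg -encoders`. The helpers below are a hypothetical sketch, not part of the library; they assume the usual listing format in which the encoder name is the second whitespace-separated column.

```python
import shutil
import subprocess


def has_libwebp(encoders_output: str) -> bool:
    """Return True if an `ffmpeg -encoders` listing contains the libwebp encoder."""
    for line in encoders_output.splitlines():
        parts = line.split()
        # Listing rows look like: " V....D libwebp   libwebp WebP image (codec webp)"
        if len(parts) >= 2 and parts[1] == "libwebp":
            return True
    return False


def ffmpeg_supports_webp() -> bool:
    """Check the local ffmpeg binary, if one is on PATH, for libwebp support."""
    if shutil.which("ffmpeg") is None:
        return False
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-encoders"],
        capture_output=True, text=True, check=False,
    )
    return has_libwebp(result.stdout)
```

Calling `ffmpeg_supports_webp()` at startup turns a missing encoder into an early, explicit failure rather than a per-window encoding error in the logs.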