Captions and Preview#
Prepare inputs, generate captions, optionally enhance them, and produce preview images.
Quickstart#
Use the pipeline stages or the example script flags to prepare captions and preview images.
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
from nemo_curator.stages.video.preview.preview import PreviewStage

pipe = Pipeline(name="captions_preview")
pipe.add_stage(
    CaptionPreparationStage(
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,
        verbose=True,
    )
)
pipe.add_stage(PreviewStage(target_fps=1.0, target_height=240, verbose=True))
pipe.add_stage(
    CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
)
pipe.run()
python -m nemo_curator.examples.video.video_split_clip_example \
    ... \
    --generate-captions \
    --captioning-algorithm qwen \
    --captioning-window-size 256 \
    --captioning-remainder-threshold 128 \
    --captioning-sampling-fps 2.0 \
    --captioning-preprocess-dtype float16 \
    --captioning-batch-size 8 \
    --captioning-max-output-tokens 512 \
    --generate-previews \
    --preview-target-fps 1.0 \
    --preview-target-height 240
Preparation and previews#
Prepare caption inputs from each clip window. This step splits clips into fixed windows, formats model‑ready inputs for Qwen‑VL, and optionally stores per‑window mp4 bytes for previews.

from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.preview.preview import PreviewStage

prep = CaptionPreparationStage(
    model_variant="qwen",
    prompt_variant="default",
    prompt_text=None,
    sampling_fps=2.0,
    window_size=256,
    remainder_threshold=128,
    preprocess_dtype="float16",
    model_does_preprocess=False,
    generate_previews=True,
    verbose=True,
)
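The window_size and remainder_threshold settings decide how a clip's frames are grouped into caption windows. The helper below is only an illustration of the rule described on this page (full windows of window_size frames, plus one shorter final window when at least remainder_threshold frames remain); it is not the stage's actual implementation.

# Illustrative sketch of the windowing rule, not NeMo Curator internals.
def split_into_windows(num_frames: int, window_size: int = 256, remainder_threshold: int = 128) -> list[tuple[int, int]]:
    windows = []
    start = 0
    while start + window_size <= num_frames:
        windows.append((start, start + window_size))
        start += window_size
    # Keep a final shorter window only if enough frames are left over.
    if num_frames - start >= remainder_threshold:
        windows.append((start, num_frames))
    return windows

print(split_into_windows(650))  # [(0, 256), (256, 512), (512, 650)]
print(split_into_windows(600))  # [(0, 256), (256, 512)]; the 88 leftover frames fall below the threshold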
Optionally generate .webp previews from each window's mp4 bytes for quick QA and review.

preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    verbose=True,
)
Parameters#

CaptionPreparationStage parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_variant | str | – | Vision‑language model used to format inputs for captioning (currently qwen). |
| prompt_variant | {"default", "av", "av-surveillance"} | – | Built‑in prompt used to steer caption content when prompt_text is not set. |
| prompt_text | str or None | None | Custom prompt text. When set, overrides prompt_variant. |
| sampling_fps | float | 2.0 | Source sampling rate for creating per‑window inputs. |
| window_size | int | 256 | Number of frames per window before captioning. |
| remainder_threshold | int | 128 | Minimum leftover frames required to create a final shorter window. |
| model_does_preprocess | bool | – | Whether the downstream model performs its own preprocessing. |
| preprocess_dtype | str | – | Data type for any preprocessing performed here. |
| generate_previews | bool | – | When True, stores per‑window mp4 bytes used by the preview stage. |
| verbose | bool | – | Log additional setup and per‑clip details. |
PreviewStage parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| target_fps | float | 1.0 | Frames per second for preview encoding. |
| target_height | int | 240 | Output height in pixels; width auto‑scales to preserve aspect ratio. |
| compression_level | int (0–6) | 6 | WebP compression level (0 is lossless; higher values reduce size with lower quality). |
| quality | int (0–100) | 50 | WebP quality factor (higher values increase quality and size). |
| num_cpus_per_worker | float | 4.0 | CPU threads mapped to ffmpeg -threads. |
| verbose | bool | False | Log warnings and per‑window encoding details. |
Caption generation and enhancement#
Generate window‑level captions with a vision‑language model (Qwen‑VL). This stage reads the clip.windows[*].qwen_llm_input entries created earlier and writes window.caption["qwen"].

from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage

gen = CaptionGenerationStage(
    model_dir="/models",
    model_variant="qwen",
    caption_batch_size=8,
    fp8=False,
    max_output_tokens=512,
    model_does_preprocess=False,
    generate_stage2_caption=False,
    stage2_prompt_text=None,
    disable_mmcache=True,
)
Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads window.caption["qwen"] and writes window.enhanced_caption["qwen_lm"].

enh = CaptionEnhancementStage(
    model_dir="/models",
    model_variant="qwen",
    prompt_variant="default",
    prompt_text=None,
    model_batch_size=128,
    fp8=False,
    max_output_tokens=512,
    verbose=True,
)
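Because enhancement consumes the captions that generation writes, add the enhancement stage after the generation stage when you assemble a pipeline. The fragment below is a minimal sketch that reuses the prep, gen, and enh objects defined above; a full pipeline would also include the splitting and preview stages shown in the quickstart.

from nemo_curator.pipeline import Pipeline

pipe = Pipeline(name="captions_with_enhancement")
pipe.add_stage(prep)  # prepares clip.windows[*].qwen_llm_input
pipe.add_stage(gen)   # writes window.caption["qwen"]
pipe.add_stage(enh)   # reads it and writes window.enhanced_caption["qwen_lm"]
pipe.run()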
Parameters#

CaptionGenerationStage parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_dir | str | – | Directory for model weights; downloaded on each node if missing. |
| model_variant | {"qwen"} | – | Vision‑language model variant. |
| caption_batch_size | int | 16 | Batch size for caption generation. |
| fp8 | bool | – | Use FP8 weights when available. |
| max_output_tokens | int | 512 | Maximum number of tokens to generate per caption. |
| model_does_preprocess | bool | – | Whether the model performs its own preprocessing. |
| disable_mmcache | bool | – | Disable the multimodal cache for generation backends that support it. |
| generate_stage2_caption | bool | – | Enable a second‑pass caption for refinement. |
| stage2_prompt_text | str or None | None | Custom prompt for stage‑2 caption refinement. |
| verbose | bool | – | Emit additional logs during generation. |
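To request the second refinement pass controlled by generate_stage2_caption and stage2_prompt_text, enable them on the generation stage. The sketch below copies the generation example above and changes only the two stage‑2 parameters; the prompt string is an illustrative placeholder.

from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage

gen_two_pass = CaptionGenerationStage(
    model_dir="/models",
    model_variant="qwen",
    caption_batch_size=8,
    fp8=False,
    max_output_tokens=512,
    model_does_preprocess=False,
    generate_stage2_caption=True,  # run a second refinement pass over the first caption
    stage2_prompt_text="Rewrite the caption so it is concise and strictly factual.",  # illustrative prompt
    disable_mmcache=True,
)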
CaptionEnhancementStage parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| model_dir | str | – | Directory for language‑model weights; downloaded per node if missing. |
| model_variant | {"qwen"} | – | Language‑model variant. |
| prompt_variant | {"default", "av-surveillance"} | – | Built‑in enhancement prompt used when prompt_text is not set. |
| prompt_text | str or None | None | Custom enhancement prompt. When set, overrides prompt_variant. |
| model_batch_size | int | 128 | Batch size for enhancement generation. |
| fp8 | bool | – | Use FP8 weights when available. |
| max_output_tokens | int | 512 | Maximum number of tokens to generate per enhanced caption. |
| verbose | bool | – | Emit additional logs during enhancement. |
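After the pipeline runs, each window carries the caption fields named above. The loop below is an illustrative sketch for inspecting them; it assumes you are holding a clip object produced by the pipeline, and only the windows, caption, and enhanced_caption field names come from this page.

# Illustrative: print raw and enhanced captions for each window of a clip.
# `clip` is assumed to be a clip object emitted by the captioning pipeline.
for i, window in enumerate(clip.windows):
    print(f"window {i}: {window.caption['qwen']}")
    if "qwen_lm" in window.enhanced_caption:
        print(f"  enhanced: {window.enhanced_caption['qwen_lm']}")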
Preview Generation#
Generate lightweight .webp previews for each caption window to support review and QA workflows. A dedicated PreviewStage reads per-window mp4 bytes and encodes WebP using ffmpeg.
Preview Parameters#
- target_fps (default 1.0): Target frames per second for preview generation.
- target_height (default 240): Output height in pixels. Width auto-scales to preserve aspect ratio.
- compression_level (range 0–6, default 6): WebP compression level. 0 is lossless; higher values reduce size with lower quality.
- quality (range 0–100, default 50): WebP quality. Higher values increase quality and size.
- num_cpus_per_worker (default 4.0): Number of CPU threads mapped to ffmpeg -threads.
- verbose (default False): Emit more logs.
Behavior notes:
- If the input frame rate is lower than target_fps or the input height is lower than target_height, the stage logs a warning and preview quality can degrade.
- If ffmpeg fails, the stage logs the error and skips assigning preview bytes for that window.
Example: Configure PreviewStage#
from nemo_curator.stages.video.preview.preview import PreviewStage

preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
    verbose=False,
)
Outputs#
The stage writes .webp files under the previews/ directory that ClipWriterStage manages. Use the helper to resolve the path:
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
previews_dir = ClipWriterStage.get_output_path_previews("/outputs")
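After a run, you can spot-check the output with ordinary filesystem tools. The snippet below is a small sketch that relies only on the helper above and the .webp extension; "/outputs" is the same placeholder root used throughout this page.

from pathlib import Path

from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

# Count the preview files produced so far (a quick QA spot check).
previews_dir = Path(ClipWriterStage.get_output_path_previews("/outputs"))
webp_files = sorted(previews_dir.rglob("*.webp"))
print(f"{len(webp_files)} preview files under {previews_dir}")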
Refer to Save & Export for the full directory structure and file locations.
Requirements and Troubleshooting#
- ffmpeg with WebP (libwebp) support must be available in the environment; a quick check is sketched below.
- If you observe warnings about low frame rate or height, consider lowering target_fps or target_height to better match the inputs.
- On encoding errors, check the logs for the ffmpeg command and its output to diagnose missing encoders.
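One way to confirm the first requirement is to ask ffmpeg for its encoder list. The sketch below uses only the Python standard library and the stock ffmpeg -hide_banner and -encoders flags; adapt it to your environment as needed.

import shutil
import subprocess

# Confirm ffmpeg is on PATH and that it lists a libwebp encoder.
if shutil.which("ffmpeg") is None:
    raise RuntimeError("ffmpeg not found on PATH")

encoders = subprocess.run(
    ["ffmpeg", "-hide_banner", "-encoders"],
    capture_output=True,
    text=True,
    check=True,
).stdout
print("libwebp encoder available:", "libwebp" in encoders)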