
Copyright © 2026, NVIDIA Corporation.


Captions and Preview


Prepare inputs, generate captions, optionally enhance them, and produce preview images.


Quickstart

Use the pipeline stages or the example script flags to prepare captions and preview images.

Pipeline Stage
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
from nemo_curator.stages.video.preview.preview import PreviewStage

pipe = Pipeline(name="captions_preview")

# Split each clip into windows and build model-ready inputs.
pipe.add_stage(
    CaptionPreparationStage(
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,  # keep per-window mp4 bytes for PreviewStage
        verbose=True,
    )
)

# Encode a .webp preview for each window.
pipe.add_stage(PreviewStage(target_fps=1.0, target_height=240, verbose=True))

# Generate window-level captions.
pipe.add_stage(
    CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
)

# Optional: add a CaptionEnhancementStage here to refine the generated captions.
pipe.run()

Preparation and previews

  1. Prepare caption inputs from each clip window. This step splits clips into fixed windows, formats model‑ready inputs for Qwen‑VL, and optionally stores per‑window mp4 bytes for previews.

    from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
    from nemo_curator.stages.video.preview.preview import PreviewStage

    prep = CaptionPreparationStage(
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,
        verbose=True,
    )
  2. Optionally generate .webp previews from each window’s mp4 bytes for quick QA and review.

    preview = PreviewStage(
        target_fps=1.0,
        target_height=240,
        verbose=True,
    )

Parameters


Caption preparation parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_variant | str | "qwen" | Vision‑language model used to format inputs for captioning (currently qwen). |
| prompt_variant | str (built‑in variant name, such as "default" or "av-surveillance") | "default" | Built‑in prompt to steer caption content when prompt_text is not provided. |
| prompt_text | str or None | None | Custom prompt text; when provided, it is used instead of the built‑in prompt_variant. |
| sampling_fps | float | 2.0 | Source sampling rate for creating per‑window inputs. |
| window_size | int | 256 | Number of frames per window before captioning. |
| remainder_threshold | int | 128 | Minimum leftover frames required to create a final shorter window. |
| model_does_preprocess | bool | False | Whether the downstream model performs its own preprocessing. |
| preprocess_dtype | str | "float32" | Data type for any preprocessing performed here. |
| generate_previews | bool | True | When True, return per‑window mp4 bytes to enable preview generation. |
| verbose | bool | False | Log additional setup and per‑clip details. |
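
The interaction between window_size and remainder_threshold can be sketched in plain Python. This is a minimal illustration, not the stage's implementation: the function name split_into_windows is hypothetical, and the sketch assumes that leftover frames below the threshold are dropped and that the threshold comparison is inclusive.

```python
def split_into_windows(num_frames, window_size=256, remainder_threshold=128):
    """Return (start, end) frame ranges for one clip; end is exclusive."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    windows = []
    start = 0
    # Emit as many full windows as fit in the clip.
    while start + window_size <= num_frames:
        windows.append((start, start + window_size))
        start += window_size
    # Keep a final, shorter window only if enough frames remain
    # (assumption: the comparison is >= and shorter leftovers are dropped).
    leftover = num_frames - start
    if leftover > 0 and leftover >= remainder_threshold:
        windows.append((start, num_frames))
    return windows


print(split_into_windows(600))  # [(0, 256), (256, 512)]
print(split_into_windows(700))  # [(0, 256), (256, 512), (512, 700)]
```

With the defaults, a 600-frame clip leaves 88 frames after two full windows, which is below the threshold, while a 700-frame clip leaves 188 frames and gets a third, shorter window.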

Caption generation and enhancement

  1. Generate window‑level captions with a vision‑language model (Qwen‑VL). This stage reads clip.windows[*].qwen_llm_input created earlier and writes window.caption["qwen"].

    from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
    from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage

    gen = CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
  2. Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads window.caption["qwen"] and writes window.enhanced_caption["qwen_lm"].

    enh = CaptionEnhancementStage(
        model_dir="/models",
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        model_batch_size=128,
        fp8=False,
        max_output_tokens=512,
        verbose=True,
    )
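
The read and write fields named in the two steps above can be pictured with a small stand-in data model. This is only a sketch: the Window and Clip dataclasses, the two helper functions, and the contents of qwen_llm_input are hypothetical placeholders, and the lambdas stand in for the real models. Only the field names and dictionary keys come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    # Filled by the preparation step (contents here are a placeholder).
    qwen_llm_input: dict
    caption: dict = field(default_factory=dict)
    enhanced_caption: dict = field(default_factory=dict)


@dataclass
class Clip:
    windows: list = field(default_factory=list)


def generate_captions(clip, caption_fn):
    """Read each window's qwen_llm_input and write caption["qwen"]."""
    for window in clip.windows:
        window.caption["qwen"] = caption_fn(window.qwen_llm_input)


def enhance_captions(clip, enhance_fn):
    """Read caption["qwen"] and write enhanced_caption["qwen_lm"]."""
    for window in clip.windows:
        if "qwen" in window.caption:
            window.enhanced_caption["qwen_lm"] = enhance_fn(window.caption["qwen"])


clip = Clip(windows=[Window(qwen_llm_input={"prompt": "Describe the video."})])
generate_captions(clip, lambda model_input: "a car turns left")
enhance_captions(clip, lambda caption: caption + ", seen from a dashboard camera")
print(clip.windows[0].enhanced_caption["qwen_lm"])
```

Because enhancement reads what generation wrote, the enhancement stage needs to run after caption generation in the pipeline.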

Parameters


Caption generation parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_dir | str | "models/qwen" | Directory for model weights; downloaded on each node if missing. |
| model_variant | str ("qwen") | "qwen" | Vision‑language model variant. |
| caption_batch_size | int | 16 | Batch size for caption generation. |
| fp8 | bool | False | Use FP8 weights when available. |
| max_output_tokens | int | 512 | Maximum number of tokens to generate per caption. |
| model_does_preprocess | bool | False | Whether the model performs its own preprocessing. |
| disable_mmcache | bool | False | Disable multimodal cache for generation backends that support it. |
| generate_stage2_caption | bool | False | Enable a second‑pass caption for refinement. |
| stage2_prompt_text | str or None | None | Custom prompt text for the second‑pass caption. |
| verbose | bool | False | Emit additional logs during generation. |
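
To make the effect of caption_batch_size concrete, the sketch below groups window inputs into batches of at most that size. It is a generic batching helper (iter_batches is a hypothetical name), shown under the assumption that the stage sends one batch per model call; the stage's actual scheduling may differ.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 20 prepared window inputs with caption_batch_size=8.
window_inputs = [f"window-{i}" for i in range(20)]
batch_sizes = [len(batch) for batch in iter_batches(window_inputs, 8)]
print(batch_sizes)  # [8, 8, 4]
```

A larger batch size generally means fewer model calls at the cost of more GPU memory per call.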

Preview Generation

Generate lightweight .webp previews for each caption window to support review and QA workflows. A dedicated PreviewStage reads per-window mp4 bytes and encodes WebP using ffmpeg.

Preview Parameters

  • target_fps (default 1.0): Target frames per second for preview generation.
  • target_height (default 240): Output height. Width auto-scales to preserve aspect ratio.
  • compression_level (range 0–6, default 6): WebP compression level. 0 is lossless; higher values reduce size with lower quality.
  • quality (range 0–100, default 50): WebP quality. Higher values increase quality and size.
  • num_cpus_per_worker (default 4.0): Number of CPU threads mapped to ffmpeg -threads.
  • verbose (default False): Emit more logs.
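
As a rough picture of how these parameters could translate into an ffmpeg invocation, the sketch below assembles a command line from them. It is not the command PreviewStage actually runs: build_preview_command and the file-based input and output are assumptions made for illustration. The ffmpeg options themselves (the fps and scale filters, the libwebp encoder, -compression_level, -quality, -threads) are standard ffmpeg options.

```python
def build_preview_command(
    src_mp4,
    dst_webp,
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
):
    """Assemble an ffmpeg argument list for an animated WebP preview."""
    return [
        "ffmpeg",
        "-y",                          # overwrite the output file if present
        "-i", str(src_mp4),
        # Resample to target_fps; width -1 keeps the aspect ratio.
        "-vf", f"fps={target_fps},scale=-1:{target_height}",
        "-c:v", "libwebp",
        "-compression_level", str(compression_level),
        "-quality", str(quality),
        "-threads", str(int(num_cpus_per_worker)),
        str(dst_webp),
    ]


cmd = build_preview_command("window.mp4", "window.webp")
print(" ".join(cmd))
```

Passing the command as a list (for example to subprocess.run) avoids shell quoting issues with file paths.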

Behavior notes:

  • If the input frame rate is lower than target_fps or the input height is lower than target_height, the stage logs a warning and preview quality can degrade.
  • If ffmpeg fails, the stage logs the error and skips assigning preview bytes for that window.
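
The two behaviors above can be sketched as follows. The helper names, log messages, and exception types are hypothetical; the sketch only mirrors the described logic of warning when inputs fall below the targets and leaving the preview unset when encoding fails.

```python
import logging
import subprocess

logger = logging.getLogger("preview_sketch")


def preview_input_warnings(input_fps, input_height, target_fps=1.0, target_height=240):
    """Return (and log) warnings for inputs below the preview targets."""
    warnings = []
    if input_fps < target_fps:
        warnings.append(f"input fps {input_fps} is below target_fps {target_fps}")
    if input_height < target_height:
        warnings.append(f"input height {input_height} is below target_height {target_height}")
    for message in warnings:
        logger.warning(message)
    return warnings


def encode_or_skip(encode_fn, mp4_bytes):
    """Return preview bytes, or None if the encoder fails."""
    try:
        return encode_fn(mp4_bytes)
    except (subprocess.CalledProcessError, OSError) as exc:
        # Log the failure and skip this window instead of aborting the run.
        logger.error("preview encoding failed: %s", exc)
        return None


def failing_encoder(_mp4_bytes):
    raise OSError("ffmpeg not found")


print(preview_input_warnings(30.0, 720))         # []
print(len(preview_input_warnings(0.5, 120)))     # 2
print(encode_or_skip(failing_encoder, b"data"))  # None
```

Downstream code that consumes previews should therefore tolerate windows without preview bytes.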

Example: Configure PreviewStage

from nemo_curator.stages.video.preview.preview import PreviewStage

preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
    verbose=False,
)

Outputs

The stage writes .webp files under the previews/ directory that ClipWriterStage manages. Use the helper to resolve the path:

from nemo_curator.stages.video.io.clip_writer import ClipWriterStage

previews_dir = ClipWriterStage.get_output_path_previews("/outputs")
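
Once the previews directory is known, a quick way to review what was written is to list the .webp files under it. The sketch below assumes the resolved path is a directory on the local filesystem and makes no assumption about the layout beneath it, so it searches recursively; list_previews is a hypothetical helper, not part of NeMo Curator.

```python
from pathlib import Path


def list_previews(previews_dir):
    """Return all .webp files under previews_dir, sorted by path."""
    root = Path(previews_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.webp"))


# Example: "/outputs/previews" is a placeholder; pass the resolved previews path.
for preview_path in list_previews("/outputs/previews"):
    print(preview_path)
```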

Refer to Save & Export for directory structure and file locations.

Requirements and Troubleshooting

  • ffmpeg with WebP (libwebp) support must be available in the environment.
  • If you observe warnings about low frame rate or height, consider lowering target_fps or target_height to better match inputs.
  • On encoding errors, check logs for the ffmpeg command and output to diagnose missing encoders.
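
To check the first requirement before running a pipeline, one option is to ask ffmpeg for its encoder list and look for a libwebp entry. This is a minimal sketch with hypothetical helper names; it relies on the standard ffmpeg -encoders listing, in which the encoder name is the second column, and it reports False when ffmpeg is not on PATH.

```python
import shutil
import subprocess


def lists_libwebp(encoders_output):
    """Return True if an `ffmpeg -encoders` listing contains a libwebp encoder."""
    for line in encoders_output.splitlines():
        parts = line.split()
        # Listing rows look like: " V....D libwebp   libwebp WebP image (codec webp)"
        if len(parts) >= 2 and parts[1].startswith("libwebp"):
            return True
    return False


def ffmpeg_has_libwebp():
    """Return True if ffmpeg is on PATH and reports a libwebp encoder."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        return False
    try:
        result = subprocess.run(
            [ffmpeg, "-hide_banner", "-encoders"],
            capture_output=True,
            text=True,
            check=False,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return lists_libwebp(result.stdout)


print("libwebp available:", ffmpeg_has_libwebp())
```

If this prints False, install or rebuild ffmpeg with libwebp enabled before adding PreviewStage to a pipeline.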