> Generate clip captions with Qwen-VL or Nemotron VLMs, and optionally produce preview images

# Captions and Preview

Prepare inputs, generate captions, optionally enhance them, and produce preview images.

## Choosing a Captioning Model

The video captioning pipeline supports two model families. Pick a variant based on quality, GPU memory, and throughput:

| Variant                      | Model                               | Default Use Case                                         |
| ---------------------------- | ----------------------------------- | -------------------------------------------------------- |
| `qwen`                       | `Qwen/Qwen2.5-VL-7B-Instruct`       | Default — good quality/throughput balance                |
| `nemotron` / `nemotron-bf16` | Nemotron Nano 12B v2 VL (BF16)      | High-quality captions; auto-downloaded from Hugging Face |
| `nemotron-fp8`               | Nemotron Nano 12B v2 VL (FP8)       | Same model, FP8-quantized for lower memory               |
| `nemotron-nvfp4`             | Nemotron Nano 12B v2 VL (NVFP4-QAD) | NVFP4 quantization-aware-distilled checkpoint            |

Caption **enhancement** (the optional second-pass LLM rewrite) uses Qwen-LM (`--enhance-captions-algorithm qwen_lm`).
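
Because the variant is passed as a plain string, a typo may only surface when the stage starts. A minimal sketch of an up-front check follows. The variant names come from the table above; the `VLM_VARIANTS` mapping and the `check_variant` helper are hypothetical and not part of NeMo Curator.

```python
# Variant names are taken from the table above. The mapping and helper are
# illustrative only and are not part of the NeMo Curator API.
VLM_VARIANTS = {
    "qwen": "Qwen/Qwen2.5-VL-7B-Instruct",
    "nemotron": "Nemotron Nano 12B v2 VL (BF16)",
    "nemotron-bf16": "Nemotron Nano 12B v2 VL (BF16)",
    "nemotron-fp8": "Nemotron Nano 12B v2 VL (FP8)",
    "nemotron-nvfp4": "Nemotron Nano 12B v2 VL (NVFP4-QAD)",
}


def check_variant(name: str) -> str:
    """Return the model description for a variant, or raise on an unknown name."""
    try:
        return VLM_VARIANTS[name]
    except KeyError:
        valid = ", ".join(sorted(VLM_VARIANTS))
        raise ValueError(f"Unknown variant {name!r}; expected one of: {valid}") from None


print(check_variant("nemotron-fp8"))
```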

***

## Quickstart

Use the pipeline stages or the example script flags to prepare captions and preview images.

```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
from nemo_curator.stages.video.preview.preview import PreviewStage

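# Upstream stages that read videos and produce clips are omitted here;
# add them to the pipeline before the captioning stages.
pipe = Pipeline(name="captions_preview")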
pipe.add_stage(
    CaptionPreparationStage(
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,
        verbose=True,
    )
)
pipe.add_stage(PreviewStage(target_fps=1.0, target_height=240, verbose=True))
pipe.add_stage(
    CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
)
pipe.run()
```

To use Nemotron instead, set `model_variant="nemotron"` (or one of `nemotron-bf16`, `nemotron-fp8`, `nemotron-nvfp4`) on both `CaptionPreparationStage` and `CaptionGenerationStage`. Nemotron weights are downloaded automatically from Hugging Face on first use.

```bash
python tutorials/video/getting-started/video_split_clip_example.py \
  ... \
  --generate-captions \
  --captioning-algorithm qwen \
  --captioning-window-size 256 \
  --captioning-remainder-threshold 128 \
  --captioning-sampling-fps 2.0 \
  --captioning-preprocess-dtype float16 \
  --captioning-batch-size 8 \
  --captioning-max-output-tokens 512 \
  --generate-previews \
  --preview-target-fps 1.0 \
  --preview-target-height 240
```

`--captioning-algorithm` accepts: `qwen` (default, Qwen2.5-VL-7B-Instruct), `nemotron`, `nemotron-bf16`, `nemotron-fp8`, `nemotron-nvfp4`. To enable caption enhancement with the Qwen LM, also pass `--enhance-captions --enhance-captions-algorithm qwen_lm`.
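
When the example script is launched from another program, the flags above can be assembled in code. The following is a sketch only: `build_caption_flags` is a hypothetical helper, it emits just the flags shown on this page, and the script's other required arguments (elided as `...` in the command above) still have to be supplied separately.

```python
import shlex


def build_caption_flags(
    algorithm: str = "qwen",
    window_size: int = 256,
    remainder_threshold: int = 128,
    sampling_fps: float = 2.0,
    preprocess_dtype: str = "float16",
    batch_size: int = 8,
    max_output_tokens: int = 512,
    enhance: bool = False,
    previews: bool = True,
) -> list[str]:
    """Assemble the captioning and preview flags documented on this page."""
    flags = [
        "--generate-captions",
        "--captioning-algorithm", algorithm,
        "--captioning-window-size", str(window_size),
        "--captioning-remainder-threshold", str(remainder_threshold),
        "--captioning-sampling-fps", str(sampling_fps),
        "--captioning-preprocess-dtype", preprocess_dtype,
        "--captioning-batch-size", str(batch_size),
        "--captioning-max-output-tokens", str(max_output_tokens),
    ]
    if enhance:
        # Second-pass rewrite with the Qwen LM.
        flags += ["--enhance-captions", "--enhance-captions-algorithm", "qwen_lm"]
    if previews:
        flags += [
            "--generate-previews",
            "--preview-target-fps", "1.0",
            "--preview-target-height", "240",
        ]
    return flags


flags = build_caption_flags(algorithm="nemotron-fp8", enhance=True)
print(shlex.join(flags))
```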

## Preparation and Previews

1. Prepare caption inputs from each clip window. This step splits clips into fixed windows, formats model‑ready inputs for the chosen VLM (Qwen‑VL or Nemotron), and optionally stores per‑window `mp4` bytes for previews.

   ```python
   from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
   from nemo_curator.stages.video.preview.preview import PreviewStage

   prep = CaptionPreparationStage(
       model_variant="qwen",  # or "nemotron" / "nemotron-fp8" / ...
       prompt_variant="default",
       prompt_text=None,
       sampling_fps=2.0,
       window_size=256,
       remainder_threshold=128,
       preprocess_dtype="float16",
       model_does_preprocess=False,
       generate_previews=True,
       verbose=True,
   )
   ```

2. Optionally generate `.webp` previews from each window’s `mp4` bytes for quick QA and review.

   ```python
   preview = PreviewStage(
       target_fps=1.0,
       target_height=240,
       verbose=True,
   )
   ```

### Parameters

`CaptionPreparationStage` accepts these parameters:

| Parameter               | Type                                 | Default     | Description                                                                                                                |
| ----------------------- | ------------------------------------ | ----------- | -------------------------------------------------------------------------------------------------------------------------- |
| `model_variant`         | str                                  | `"qwen"`    | Vision‑language model used to format inputs. One of `qwen`, `nemotron`, `nemotron-bf16`, `nemotron-fp8`, `nemotron-nvfp4`. |
| `prompt_variant`        | {"default", "av", "av-surveillance"} | `"default"` | Built‑in prompt to steer caption content when `prompt_text` is not provided.                                               |
| `prompt_text`           | str \| None                          | `None`      | Custom prompt text. When set, overrides `prompt_variant`.                                                                  |
| `sampling_fps`          | float                                | 2.0         | Source sampling rate for creating per‑window inputs.                                                                       |
| `window_size`           | int                                  | 256         | Number of frames per window before captioning.                                                                             |
| `remainder_threshold`   | int                                  | 128         | Minimum leftover frames required to create a final shorter window.                                                         |
| `model_does_preprocess` | bool                                 | `False`     | Whether the downstream model performs its own preprocessing.                                                               |
| `preprocess_dtype`      | str                                  | `"float32"` | Data type for any preprocessing performed here.                                                                            |
| `generate_previews`     | bool                                 | `True`      | When `True`, return per‑window `mp4` bytes to enable preview generation.                                                   |
| `verbose`               | bool                                 | `False`     | Log additional setup and per‑clip details.                                                                                 |

`PreviewStage` accepts these parameters:

| Parameter             | Type        | Default | Description                                                          |
| --------------------- | ----------- | ------- | -------------------------------------------------------------------- |
| `target_fps`          | float       | 1.0     | Frames per second for preview encoding.                              |
| `target_height`       | int         | 240     | Output height in pixels; width auto‑scales to preserve aspect ratio. |
| `compression_level`   | int (0–6)   | 6       | WebP compression level (`0` = lossless, higher = smaller files).     |
| `quality`             | int (0–100) | 50      | WebP quality factor (`100` = best quality, larger files).            |
| `num_cpus_per_worker` | float       | 4.0     | CPU threads mapped to `ffmpeg -threads` for encoding.                |
| `verbose`             | bool        | `False` | Log warnings and per‑window encoding details.                        |
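
To make the interaction between `window_size` and `remainder_threshold` concrete, here is a small stand-alone sketch of fixed-size windowing. It is not the stage's implementation: `split_into_windows` is a hypothetical function, and it assumes that leftover frames below the threshold are simply left out. The actual stage may handle them differently, for example by merging them into the previous window.

```python
def split_into_windows(
    num_frames: int,
    window_size: int = 256,
    remainder_threshold: int = 128,
) -> list[tuple[int, int]]:
    """Return half-open (start, end) frame ranges for one clip."""
    # Full windows of exactly `window_size` frames.
    windows = [
        (start, start + window_size)
        for start in range(0, num_frames - window_size + 1, window_size)
    ]
    # Frames left after the last full window.
    covered = len(windows) * window_size
    leftover = num_frames - covered
    # Assumption: a leftover gets its own shorter window only when it
    # reaches the threshold; otherwise this sketch leaves it out.
    if leftover > 0 and leftover >= remainder_threshold:
        windows.append((covered, num_frames))
    return windows


print(split_into_windows(600))  # leftover of 88 frames is below the threshold
print(split_into_windows(700))  # leftover of 188 frames forms a third window
```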

## Caption Generation and Enhancement

1. Generate window‑level captions with the chosen VLM (Qwen‑VL or Nemotron). This stage reads `clip.windows[*].qwen_llm_input`, which `CaptionPreparationStage` creates, and writes `window.caption["qwen"]` (or `window.caption["nemotron"]`, depending on the variant).

   ```python
   from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
   from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage

   gen = CaptionGenerationStage(
       model_dir="/models",
       model_variant="qwen",  # or "nemotron" / "nemotron-fp8" / ...
       caption_batch_size=8,
       fp8=False,
       max_output_tokens=512,
       model_does_preprocess=False,
       generate_stage2_caption=False,
       stage2_prompt_text=None,
       disable_mmcache=True,
   )
   ```

2. Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads `window.caption["qwen"]` and writes `window.enhanced_caption["qwen_lm"]`.

   ```python
   enh = CaptionEnhancementStage(
       model_dir="/models",
       model_variant="qwen",
       prompt_variant="default",
       prompt_text=None,
       model_batch_size=128,
       fp8=False,
       max_output_tokens=512,
       verbose=True,
   )
   ```
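
The two steps above leave captions in per-window dictionaries keyed by model name. The sketch below shows one way to read them back, preferring the enhanced caption when it exists. The `Window` and `Clip` dataclasses are simplified stand-ins that mirror only the fields named on this page, and the caption strings are placeholder sample data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Window:
    """Stand-in for a caption window; only the fields named on this page."""
    caption: dict[str, str] = field(default_factory=dict)
    enhanced_caption: dict[str, str] = field(default_factory=dict)


@dataclass
class Clip:
    """Stand-in for a clip holding its caption windows."""
    windows: list[Window] = field(default_factory=list)


def best_captions(
    clip: Clip, vlm_key: str = "qwen", enhancer_key: str = "qwen_lm"
) -> list[Optional[str]]:
    """Per window, prefer the enhanced caption and fall back to the VLM caption."""
    return [
        w.enhanced_caption.get(enhancer_key) or w.caption.get(vlm_key)
        for w in clip.windows
    ]


clip = Clip(
    windows=[
        Window(
            caption={"qwen": "A car turns left."},
            enhanced_caption={"qwen_lm": "A red car turns left at a junction."},
        ),
        Window(caption={"qwen": "A dog runs."}),  # no enhancement for this window
    ]
)
print(best_captions(clip))
```

For a Nemotron variant, pass `vlm_key="nemotron"` to match the key that the generation stage writes.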

### Parameters

`CaptionGenerationStage` accepts these parameters:

| Parameter                 | Type        | Default         | Description                                                                                                  |
| ------------------------- | ----------- | --------------- | ------------------------------------------------------------------------------------------------------------ |
| `model_dir`               | str         | `"models/qwen"` | Directory for model weights; downloaded on each node if missing.                                             |
| `model_variant`           | str         | `"qwen"`        | Vision‑language model variant. One of `qwen`, `nemotron`, `nemotron-bf16`, `nemotron-fp8`, `nemotron-nvfp4`. |
| `caption_batch_size`      | int         | 16              | Batch size for caption generation.                                                                           |
| `fp8`                     | bool        | `False`         | Use FP8 weights when available.                                                                              |
| `max_output_tokens`       | int         | 512             | Maximum number of tokens to generate per caption.                                                            |
| `model_does_preprocess`   | bool        | `False`         | Whether the model performs its own preprocessing.                                                            |
| `disable_mmcache`         | bool        | `False`         | Disable multimodal cache for generation backends that support it.                                            |
| `generate_stage2_caption` | bool        | `False`         | Enable a second‑pass caption for refinement.                                                                 |
| `stage2_prompt_text`      | str \| None | `None`          | Custom prompt for stage‑2 caption refinement.                                                                |
| `verbose`                 | bool        | `False`         | Emit additional logs during generation.                                                                      |

`CaptionEnhancementStage` accepts these parameters:

| Parameter           | Type                           | Default         | Description                                                           |
| ------------------- | ------------------------------ | --------------- | --------------------------------------------------------------------- |
| `model_dir`         | str                            | `"models/qwen"` | Directory for language‑model weights; downloaded per node if missing. |
| `model_variant`     | {"qwen"}                       | `"qwen"`        | Language‑model variant.                                               |
| `prompt_variant`    | {"default", "av-surveillance"} | `"default"`     | Built‑in enhancement prompt when `prompt_text` is not provided.       |
| `prompt_text`       | str \| None                    | `None`          | Custom enhancement prompt. When set, overrides `prompt_variant`.      |
| `model_batch_size`  | int                            | 128             | Batch size for enhancement generation.                                |
| `fp8`               | bool                           | `False`         | Use FP8 weights when available.                                       |
| `max_output_tokens` | int                            | 512             | Maximum number of tokens to generate per enhanced caption.            |
| `verbose`           | bool                           | `False`         | Emit additional logs during enhancement.                              |
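
The precedence between `prompt_text` and `prompt_variant` can be sketched as follows. This is an illustration of the documented rule only: `resolve_prompt` is a hypothetical function, the prompt strings are placeholders because the built-in prompts are not shown on this page, and treating only `None` as "not set" is an assumption.

```python
from typing import Optional

# Placeholder texts; the real built-in enhancement prompts are not shown here.
BUILTIN_PROMPTS = {
    "default": "<built-in default prompt>",
    "av-surveillance": "<built-in av-surveillance prompt>",
}


def resolve_prompt(
    prompt_variant: str = "default", prompt_text: Optional[str] = None
) -> str:
    """Return the prompt to use: custom text wins over the built-in variant."""
    if prompt_text is not None:
        return prompt_text
    if prompt_variant not in BUILTIN_PROMPTS:
        valid = ", ".join(sorted(BUILTIN_PROMPTS))
        raise ValueError(f"Unknown prompt_variant {prompt_variant!r}; expected: {valid}")
    return BUILTIN_PROMPTS[prompt_variant]


print(resolve_prompt("av-surveillance"))
print(resolve_prompt("default", prompt_text="Describe the weather in detail."))
```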

## Preview Generation

Generate lightweight `.webp` previews for each caption window to support review and QA workflows. A dedicated `PreviewStage` reads the per-window `mp4` bytes and encodes them as WebP with `ffmpeg`.

### Preview Parameters

* `target_fps` (default `1.0`): Target frames per second for preview generation.
* `target_height` (default `240`): Output height. Width auto-scales to preserve aspect ratio.
* `compression_level` (range `0–6`, default `6`): WebP compression level. `0` is lossless; higher values reduce size with lower quality.
* `quality` (range `0–100`, default `50`): WebP quality. Higher values increase quality and size.
* `num_cpus_per_worker` (default `4.0`): Number of CPU threads mapped to `ffmpeg -threads`.
* `verbose` (default `False`): Emit more logs.

Behavior notes:

* If the input frame rate is lower than `target_fps` or the input height is lower than `target_height`, the stage logs a warning and preview quality can degrade.
* If `ffmpeg` fails, the stage logs the error and skips assigning preview bytes for that window.
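
The aspect-ratio scaling and the warning conditions above can be checked ahead of time with a little arithmetic. The sketch below is not the stage's code: `preview_plan` is a hypothetical function, and rounding the width to the nearest integer is an assumption about how the scaled width is chosen.

```python
def preview_plan(
    in_width: int,
    in_height: int,
    in_fps: float,
    target_fps: float = 1.0,
    target_height: int = 240,
) -> tuple[int, int, list[str]]:
    """Return (width, height, warnings) for a preview of the given input."""
    # Scale the width by the same factor as the height to keep the aspect ratio.
    width = max(1, round(in_width * target_height / in_height))
    warnings = []
    if in_fps < target_fps:
        warnings.append(f"input fps {in_fps} is below target_fps {target_fps}")
    if in_height < target_height:
        warnings.append(f"input height {in_height} is below target_height {target_height}")
    return width, target_height, warnings


print(preview_plan(1920, 1080, 30.0))  # 1080p source, no warnings
print(preview_plan(320, 180, 0.5))     # small, slow source triggers both warnings
```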

### Example: Configure PreviewStage

```python
from nemo_curator.stages.video.preview.preview import PreviewStage

preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
    verbose=False,
)
```

### Outputs

Preview `.webp` files are written under the `previews/` directory that `ClipWriterStage` manages. Use the helper to resolve the path:

```python
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
previews_dir = ClipWriterStage.get_output_path_previews("/outputs")
```
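
Once the previews directory is resolved, the files can be collected for review with the standard library. This is a sketch: `list_previews` is a hypothetical helper, and it searches recursively because the layout under `previews/` is not described on this page.

```python
from pathlib import Path


def list_previews(previews_dir: str) -> list[Path]:
    """Return all .webp files under a previews directory, sorted by path."""
    root = Path(previews_dir)
    if not root.is_dir():
        # Nothing written yet, or previews were not enabled for this run.
        return []
    return sorted(p for p in root.rglob("*.webp") if p.is_file())


# Replace the placeholder with the path returned by the helper above.
for path in list_previews("/path/to/previews"):
    print(path)
```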

Refer to [Save & Export](/curate-video/save-export) for directory structure and file locations.

### Requirements and Troubleshooting

* `ffmpeg` with WebP (`libwebp`) support must be available in the environment.
* If you observe warnings about low frame rate or height, consider lowering `target_fps` or `target_height` to better match inputs.
* On encoding errors, check logs for the `ffmpeg` command and output to diagnose missing encoders.
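
The first requirement can be verified before running the pipeline. The sketch below looks for a `libwebp` encoder in the output of `ffmpeg -encoders`. Both function names are hypothetical, and the parsing assumes the usual two-column layout of that listing (a flags column followed by the encoder name).

```python
import shutil
import subprocess


def encoders_include_libwebp(encoders_output: str) -> bool:
    """True if any row of `ffmpeg -encoders` output names a libwebp encoder."""
    for line in encoders_output.splitlines():
        fields = line.split()
        # Encoder rows start with a flags column, then the encoder name.
        if len(fields) >= 2 and fields[1] in {"libwebp", "libwebp_anim"}:
            return True
    return False


def ffmpeg_has_libwebp() -> bool:
    """Check that ffmpeg is on PATH and was built with a libwebp encoder."""
    ffmpeg = shutil.which("ffmpeg")
    if ffmpeg is None:
        return False
    result = subprocess.run(
        [ffmpeg, "-hide_banner", "-encoders"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0 and encoders_include_libwebp(result.stdout)


print("libwebp available:", ffmpeg_has_libwebp())
```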
