---
description: Generate clip captions with Qwen and optional preview images
categories:
  - video-curation
tags:
  - captions
  - qwen
  - preview
  - video
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: howto
modality: video-only
---

# Captions and Preview

Prepare inputs, generate captions, optionally enhance them, and produce preview images.

---

## Quickstart

Use the pipeline stages or the example script flags to prepare captions and preview images.

<Tabs>
  <Tab title="Pipeline Stage">
    ```python
    from nemo_curator.pipeline import Pipeline
    from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
    from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
    from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
    from nemo_curator.stages.video.preview.preview import PreviewStage

    pipe = Pipeline(name="captions_preview")
    pipe.add_stage(
        CaptionPreparationStage(
            model_variant="qwen",
            prompt_variant="default",
            prompt_text=None,
            sampling_fps=2.0,
            window_size=256,
            remainder_threshold=128,
            preprocess_dtype="float16",
            model_does_preprocess=False,
            generate_previews=True,
            verbose=True,
        )
    )
    pipe.add_stage(PreviewStage(target_fps=1.0, target_height=240, verbose=True))
    pipe.add_stage(
        CaptionGenerationStage(
            model_dir="/models",
            model_variant="qwen",
            caption_batch_size=8,
            fp8=False,
            max_output_tokens=512,
            model_does_preprocess=False,
            generate_stage2_caption=False,
            stage2_prompt_text=None,
            disable_mmcache=True,
        )
    )
    pipe.run()
    ```
  </Tab>

  <Tab title="Script Flags">
    ```bash
    python tutorials/video/getting-started/video_split_clip_example.py \
      ... \
      --generate-captions \
      --captioning-algorithm qwen \
      --captioning-window-size 256 \
      --captioning-remainder-threshold 128 \
      --captioning-sampling-fps 2.0 \
      --captioning-preprocess-dtype float16 \
      --captioning-batch-size 8 \
      --captioning-max-output-tokens 512 \
      --generate-previews \
      --preview-target-fps 1.0 \
      --preview-target-height 240
    ```
  </Tab>
</Tabs>

## Preparation and previews

1. Prepare caption inputs from each clip window. This step splits clips into fixed-size windows, formats model‑ready inputs for Qwen‑VL, and optionally stores per‑window `mp4` bytes for previews.

   ```python
   from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
   from nemo_curator.stages.video.preview.preview import PreviewStage

   prep = CaptionPreparationStage(
       model_variant="qwen",
       prompt_variant="default",
       prompt_text=None,
       sampling_fps=2.0,
       window_size=256,
       remainder_threshold=128,
       preprocess_dtype="float16",
       model_does_preprocess=False,
       generate_previews=True,
       verbose=True,
   )
   ```

2. Optionally generate `.webp` previews from each window’s `mp4` bytes for quick QA and review.

   ```python
   preview = PreviewStage(
       target_fps=1.0,
       target_height=240,
       verbose=True,
   )
   ```
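The windowing rule from step 1 can be sketched as follows. This is a simplified illustration under one stated assumption: leftovers shorter than `remainder_threshold` are dropped here, while the actual stage may handle them differently (the hypothetical `window_bounds` helper below is not part of NeMo Curator).

```python
def window_bounds(
    num_frames: int,
    window_size: int = 256,
    remainder_threshold: int = 128,
) -> list[tuple[int, int]]:
    """Split a clip into fixed-size caption windows.

    Leftovers at or above `remainder_threshold` become a final, shorter
    window; in this sketch, shorter leftovers are dropped.
    """
    bounds = []
    start = 0
    while start + window_size <= num_frames:
        bounds.append((start, start + window_size))
        start += window_size
    remainder = num_frames - start
    if remainder >= remainder_threshold:
        bounds.append((start, num_frames))
    return bounds

# 600 frames -> two full windows; the 88-frame leftover is below the threshold
print(window_bounds(600))  # [(0, 256), (256, 512)]
# 700 frames -> the 188-frame leftover clears the threshold and becomes a window
print(window_bounds(700))  # [(0, 256), (256, 512), (512, 700)]
```

Tuning `remainder_threshold` down keeps more trailing footage at the cost of very short final windows, which can produce weaker captions.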

### Parameters

<Tabs>
  <Tab title="CaptionPreparationStage">
    | Parameter               | Type                                 | Default     | Description                                                                    |
    | ----------------------- | ------------------------------------ | ----------- | ------------------------------------------------------------------------------ |
    | `model_variant`         | str                                  | `"qwen"`    | Vision‑language model used to format inputs for captioning (currently `qwen`). |
    | `prompt_variant`        | {"default", "av", "av-surveillance"} | `"default"` | Built‑in prompt to steer caption content when `prompt_text` is not provided.   |
    | `prompt_text`           | str \| None                          | `None`      | Custom prompt text. When set, overrides `prompt_variant`.                      |
    | `sampling_fps`          | float                                | 2.0         | Source sampling rate for creating per‑window inputs.                           |
    | `window_size`           | int                                  | 256         | Number of frames per window before captioning.                                 |
    | `remainder_threshold`   | int                                  | 128         | Minimum leftover frames required to create a final shorter window.             |
    | `model_does_preprocess` | bool                                 | `False`     | Whether the downstream model performs its own preprocessing.                   |
    | `preprocess_dtype`      | str                                  | `"float32"` | Data type for any preprocessing performed here.                                |
    | `generate_previews`     | bool                                 | `True`      | When `True`, return per‑window `mp4` bytes to enable preview generation.       |
    | `verbose`               | bool                                 | `False`     | Log additional setup and per‑clip details.                                     |
  </Tab>

  <Tab title="PreviewStage">
    | Parameter             | Type        | Default | Description                                                          |
    | --------------------- | ----------- | ------- | -------------------------------------------------------------------- |
    | `target_fps`          | float       | 1.0     | Frames per second for preview encoding.                              |
    | `target_height`       | int         | 240     | Output height in pixels; width auto‑scales to preserve aspect ratio. |
    | `compression_level`   | int (0–6)   | 6       | WebP compression effort (`0` = fastest encoding, `6` = best compression).  |
    | `quality`             | int (0–100) | 50      | WebP quality factor (`100` = best quality, larger files).            |
    | `num_cpus_per_worker` | float       | 4.0     | CPU threads mapped to `ffmpeg -threads` for encoding.                |
    | `verbose`             | bool        | `False` | Log warnings and per‑window encoding details.                        |
  </Tab>
</Tabs>

## Caption generation and enhancement

1. Generate window‑level captions with a vision‑language model (Qwen‑VL). This stage reads `clip.windows[*].qwen_llm_input` created earlier and writes `window.caption["qwen"]`.

   ```python
   from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
   from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage

   gen = CaptionGenerationStage(
       model_dir="/models",
       model_variant="qwen",
       caption_batch_size=8,
       fp8=False,
       max_output_tokens=512,
       model_does_preprocess=False,
       generate_stage2_caption=False,
       stage2_prompt_text=None,
       disable_mmcache=True,
   )
   ```

2. Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads `window.caption["qwen"]` and writes `window.enhanced_caption["qwen_lm"]`.

   ```python
   enh = CaptionEnhancementStage(
       model_dir="/models",
       model_variant="qwen",
       prompt_variant="default",
       prompt_text=None,
       model_batch_size=128,
       fp8=False,
       max_output_tokens=512,
       verbose=True,
   )
   ```
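After both stages run, each window carries its raw and enhanced captions under the keys named above. A minimal sketch of reading them back, using stand-in dataclasses to illustrate the data layout (the real clip and window classes live in NeMo Curator's video task definitions and have more fields):

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    # Mirrors window.caption["qwen"] and window.enhanced_caption["qwen_lm"]
    caption: dict[str, str] = field(default_factory=dict)
    enhanced_caption: dict[str, str] = field(default_factory=dict)


@dataclass
class Clip:
    windows: list[Window] = field(default_factory=list)


def collect_captions(clip: Clip, prefer_enhanced: bool = True) -> list[str]:
    """Return one caption per window, preferring the enhanced text when present."""
    out = []
    for w in clip.windows:
        if prefer_enhanced and "qwen_lm" in w.enhanced_caption:
            out.append(w.enhanced_caption["qwen_lm"])
        else:
            out.append(w.caption.get("qwen", ""))
    return out


clip = Clip(windows=[
    Window(caption={"qwen": "a car drives down a street"},
           enhanced_caption={"qwen_lm": "A red sedan drives down a rain-slicked city street."}),
    Window(caption={"qwen": "the car stops at a light"}),
])
print(collect_captions(clip))
```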

### Parameters

<Tabs>
  <Tab title="CaptionGenerationStage">
    | Parameter                 | Type        | Default         | Description                                                       |
    | ------------------------- | ----------- | --------------- | ----------------------------------------------------------------- |
    | `model_dir`               | str         | `"models/qwen"` | Directory for model weights; downloaded on each node if missing.  |
    | `model_variant`           | {"qwen"}    | `"qwen"`        | Vision‑language model variant.                                    |
    | `caption_batch_size`      | int         | 16              | Batch size for caption generation.                                |
    | `fp8`                     | bool        | `False`         | Use FP8 weights when available.                                   |
    | `max_output_tokens`       | int         | 512             | Maximum number of tokens to generate per caption.                 |
    | `model_does_preprocess`   | bool        | `False`         | Whether the model performs its own preprocessing.                 |
    | `disable_mmcache`         | bool        | `False`         | Disable multimodal cache for generation backends that support it. |
    | `generate_stage2_caption` | bool        | `False`         | Enable a second‑pass caption for refinement.                      |
    | `stage2_prompt_text`      | str \| None | `None`          | Custom prompt for stage‑2 caption refinement.                     |
    | `verbose`                 | bool        | `False`         | Emit additional logs during generation.                           |
  </Tab>

  <Tab title="CaptionEnhancementStage">
    | Parameter           | Type                           | Default         | Description                                                           |
    | ------------------- | ------------------------------ | --------------- | --------------------------------------------------------------------- |
    | `model_dir`         | str                            | `"models/qwen"` | Directory for language‑model weights; downloaded per node if missing. |
    | `model_variant`     | {"qwen"}                       | `"qwen"`        | Language‑model variant.                                               |
    | `prompt_variant`    | {"default", "av-surveillance"} | `"default"`     | Built‑in enhancement prompt when `prompt_text` is not provided.       |
    | `prompt_text`       | str \| None                    | `None`          | Custom enhancement prompt. When set, overrides `prompt_variant`.      |
    | `model_batch_size`  | int                            | 128             | Batch size for enhancement generation.                                |
    | `fp8`               | bool                           | `False`         | Use FP8 weights when available.                                       |
    | `max_output_tokens` | int                            | 512             | Maximum number of tokens to generate per enhanced caption.            |
    | `verbose`           | bool                           | `False`         | Emit additional logs during enhancement.                              |
  </Tab>
</Tabs>

## Preview Generation

Generate lightweight `.webp` previews for each caption window to support review and QA workflows. A dedicated `PreviewStage` reads per-window `mp4` bytes and encodes WebP using `ffmpeg`.

### Preview Parameters

* `target_fps` (default `1.0`): Target frames per second for preview generation.
* `target_height` (default `240`): Output height. Width auto-scales to preserve aspect ratio.
* `compression_level` (range `0–6`, default `6`): WebP compression effort. Higher values spend more encoding time for better compression; output quality is controlled separately by `quality`.
* `quality` (range `0–100`, default `50`): WebP quality. Higher values increase quality and size.
* `num_cpus_per_worker` (default `4.0`): Number of CPU threads mapped to `ffmpeg -threads`.
* `verbose` (default `False`): Emit more logs.

Behavior notes:

* If the input frame rate is lower than `target_fps` or the input height is lower than `target_height`, the stage logs a warning and preview quality can degrade.
* If `ffmpeg` fails, the stage logs the error and skips assigning preview bytes for that window.
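The preview parameters map onto an `ffmpeg` invocation along these lines. The stage's exact command is internal, so treat this argument list as an illustrative assumption rather than the stage's code; the `preview_ffmpeg_args` helper is hypothetical.

```python
def preview_ffmpeg_args(src: str, dst: str, target_fps: float = 1.0,
                        target_height: int = 240, compression_level: int = 6,
                        quality: int = 50, threads: int = 4) -> list[str]:
    """Build an illustrative ffmpeg argument list for WebP preview encoding.

    scale=-2:<height> lets ffmpeg pick an even width that preserves
    the input aspect ratio.
    """
    return [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"fps={target_fps},scale=-2:{target_height}",
        "-c:v", "libwebp",
        "-compression_level", str(compression_level),
        "-quality", str(quality),
        "-threads", str(threads),
        dst,
    ]

print(" ".join(preview_ffmpeg_args("window.mp4", "preview.webp")))
```

Running the printed command by hand against one window's `mp4` is a quick way to reproduce and diagnose encoder errors outside the pipeline.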

### Example: Configure PreviewStage

```python
from nemo_curator.stages.video.preview.preview import PreviewStage

preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
    verbose=False,
)
```

### Outputs

The stage writes `.webp` files under the `previews/` directory that `ClipWriterStage` manages. Use the helper to resolve the path:

```python
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
previews_dir = ClipWriterStage.get_output_path_previews("/outputs")
```

Refer to [Save & Export](/curate-video/save-export) for directory structure and file locations.

### Requirements and Troubleshooting

* `ffmpeg` with WebP (`libwebp`) support must be available in the environment.
* If you observe warnings about low frame rate or height, consider lowering `target_fps` or `target_height` to better match inputs.
* On encoding errors, check logs for the `ffmpeg` command and output to diagnose missing encoders.

{/* end */}
