---
description: Generate clip captions with Qwen and optional preview images
categories:
  - video-curation
tags:
  - captions
  - qwen
  - preview
  - video
personas:
  - data-scientist-focused
  - mle-focused
difficulty: intermediate
content_type: howto
modality: video-only
---
# Captions and Preview
Prepare inputs, generate captions, optionally enhance them, and produce preview images.
## Quickstart
Use the pipeline stages directly, or pass the example script's flags, to generate captions and preview images.
```python
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.preview.preview import PreviewStage
pipe = Pipeline(name="captions_preview")
pipe.add_stage(
    CaptionPreparationStage(
        model_variant="qwen",
        prompt_variant="default",
        prompt_text=None,
        sampling_fps=2.0,
        window_size=256,
        remainder_threshold=128,
        preprocess_dtype="float16",
        model_does_preprocess=False,
        generate_previews=True,
        verbose=True,
    )
)
pipe.add_stage(PreviewStage(target_fps=1.0, target_height=240, verbose=True))
pipe.add_stage(
    CaptionGenerationStage(
        model_dir="/models",
        model_variant="qwen",
        caption_batch_size=8,
        fp8=False,
        max_output_tokens=512,
        model_does_preprocess=False,
        generate_stage2_caption=False,
        stage2_prompt_text=None,
        disable_mmcache=True,
    )
)
pipe.run()
```
```bash
python tutorials/video/getting-started/video_split_clip_example.py \
... \
--generate-captions \
--captioning-algorithm qwen \
--captioning-window-size 256 \
--captioning-remainder-threshold 128 \
--captioning-sampling-fps 2.0 \
--captioning-preprocess-dtype float16 \
--captioning-batch-size 8 \
--captioning-max-output-tokens 512 \
--generate-previews \
--preview-target-fps 1.0 \
--preview-target-height 240
```
## Preparation and previews
1. Prepare caption inputs from each clip window. This step splits clips into fixed-size windows, formats model‑ready inputs for Qwen‑VL, and optionally stores per‑window `mp4` bytes for previews.
```python
from nemo_curator.stages.video.caption.caption_preparation import CaptionPreparationStage
from nemo_curator.stages.video.preview.preview import PreviewStage
prep = CaptionPreparationStage(
    model_variant="qwen",
    prompt_variant="default",
    prompt_text=None,
    sampling_fps=2.0,
    window_size=256,
    remainder_threshold=128,
    preprocess_dtype="float16",
    model_does_preprocess=False,
    generate_previews=True,
    verbose=True,
)
```
2. Optionally generate `.webp` previews from each window’s `mp4` bytes for quick QA and review.
```python
preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    verbose=True,
)
```
### Parameters

`CaptionPreparationStage`:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_variant` | str | `"qwen"` | Vision‑language model used to format inputs for captioning (currently `qwen`). |
| `prompt_variant` | {"default", "av", "av-surveillance"} | `"default"` | Built‑in prompt to steer caption content when `prompt_text` is not provided. |
| `prompt_text` | str \| None | `None` | Custom prompt text. When set, overrides `prompt_variant`. |
| `sampling_fps` | float | 2.0 | Source sampling rate for creating per‑window inputs. |
| `window_size` | int | 256 | Number of frames per window before captioning. |
| `remainder_threshold` | int | 128 | Minimum leftover frames required to create a final shorter window. |
| `model_does_preprocess` | bool | `False` | Whether the downstream model performs its own preprocessing. |
| `preprocess_dtype` | str | `"float32"` | Data type for any preprocessing performed here. |
| `generate_previews` | bool | `True` | When `True`, return per‑window `mp4` bytes to enable preview generation. |
| `verbose` | bool | `False` | Log additional setup and per‑clip details. |

`PreviewStage`:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `target_fps` | float | 1.0 | Frames per second for preview encoding. |
| `target_height` | int | 240 | Output height in pixels; width auto‑scales to preserve aspect ratio. |
| `compression_level` | int (0–6) | 6 | WebP compression level (`0` = lossless, higher = smaller files). |
| `quality` | int (0–100) | 50 | WebP quality factor (`100` = best quality, larger files). |
| `num_cpus_per_worker` | float | 4.0 | CPU threads mapped to `ffmpeg -threads` for encoding. |
| `verbose` | bool | `False` | Log warnings and per‑window encoding details. |
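The interaction between `window_size` and `remainder_threshold` can be sketched with a small helper. This is an illustration of the splitting rule described above, not the stage's actual implementation:

```python
def split_into_windows(
    num_frames: int,
    window_size: int = 256,
    remainder_threshold: int = 128,
) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges for caption windows.

    Full windows of `window_size` frames are emitted first; a final,
    shorter window is kept only when the leftover frame count reaches
    `remainder_threshold`.
    """
    windows = [
        (start, start + window_size)
        for start in range(0, num_frames - window_size + 1, window_size)
    ]
    remainder = num_frames % window_size
    if remainder >= remainder_threshold:
        windows.append((num_frames - remainder, num_frames))
    return windows
```

With the defaults, a 640-frame clip yields two full 256-frame windows plus a final 128-frame window, while a 600-frame clip drops its 88-frame remainder.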
## Caption generation and enhancement
1. Generate window‑level captions with a vision‑language model (Qwen‑VL). This stage reads `clip.windows[*].qwen_llm_input` created earlier and writes `window.caption["qwen"]`.
```python
from nemo_curator.stages.video.caption.caption_generation import CaptionGenerationStage
from nemo_curator.stages.video.caption.caption_enhancement import CaptionEnhancementStage
gen = CaptionGenerationStage(
    model_dir="/models",
    model_variant="qwen",
    caption_batch_size=8,
    fp8=False,
    max_output_tokens=512,
    model_does_preprocess=False,
    generate_stage2_caption=False,
    stage2_prompt_text=None,
    disable_mmcache=True,
)
```
2. Optionally enhance captions with a text‑based LLM (Qwen‑LM) to expand and refine descriptions. This stage reads `window.caption["qwen"]` and writes `window.enhanced_caption["qwen_lm"]`.
```python
enh = CaptionEnhancementStage(
    model_dir="/models",
    model_variant="qwen",
    prompt_variant="default",
    prompt_text=None,
    model_batch_size=128,
    fp8=False,
    max_output_tokens=512,
    verbose=True,
)
```
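The hand-off between the two stages can be illustrated with plain data structures. The `Window` class and `enhance` function below are hypothetical stand-ins; only the `caption["qwen"]` and `enhanced_caption["qwen_lm"]` keys mirror the fields named above:

```python
from dataclasses import dataclass, field


@dataclass
class Window:
    # Per-window caption fields, keyed by model name.
    caption: dict = field(default_factory=dict)
    enhanced_caption: dict = field(default_factory=dict)


def enhance(text: str) -> str:
    # Stand-in for the Qwen-LM call made by CaptionEnhancementStage.
    return text + " (expanded with additional scene detail)"


# CaptionGenerationStage populates caption["qwen"]; enhancement then
# reads it and writes enhanced_caption["qwen_lm"].
windows = [Window(caption={"qwen": "A car drives down a rainy street."})]
for window in windows:
    window.enhanced_caption["qwen_lm"] = enhance(window.caption["qwen"])
```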
### Parameters

`CaptionGenerationStage`:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_dir` | str | `"models/qwen"` | Directory for model weights; downloaded on each node if missing. |
| `model_variant` | {"qwen"} | `"qwen"` | Vision‑language model variant. |
| `caption_batch_size` | int | 16 | Batch size for caption generation. |
| `fp8` | bool | `False` | Use FP8 weights when available. |
| `max_output_tokens` | int | 512 | Maximum number of tokens to generate per caption. |
| `model_does_preprocess` | bool | `False` | Whether the model performs its own preprocessing. |
| `disable_mmcache` | bool | `False` | Disable multimodal cache for generation backends that support it. |
| `generate_stage2_caption` | bool | `False` | Enable a second‑pass caption for refinement. |
| `stage2_prompt_text` | str \| None | `None` | Custom prompt for stage‑2 caption refinement. |
| `verbose` | bool | `False` | Emit additional logs during generation. |

`CaptionEnhancementStage`:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model_dir` | str | `"models/qwen"` | Directory for language‑model weights; downloaded per node if missing. |
| `model_variant` | {"qwen"} | `"qwen"` | Language‑model variant. |
| `prompt_variant` | {"default", "av-surveillance"} | `"default"` | Built‑in enhancement prompt when `prompt_text` is not provided. |
| `prompt_text` | str \| None | `None` | Custom enhancement prompt. When set, overrides `prompt_variant`. |
| `model_batch_size` | int | 128 | Batch size for enhancement generation. |
| `fp8` | bool | `False` | Use FP8 weights when available. |
| `max_output_tokens` | int | 512 | Maximum number of tokens to generate per enhanced caption. |
| `verbose` | bool | `False` | Emit additional logs during enhancement. |
## Preview Generation
Generate lightweight `.webp` previews for each caption window to support review and QA workflows. A dedicated `PreviewStage` reads per-window `mp4` bytes and encodes WebP using `ffmpeg`.
### Preview Parameters
* `target_fps` (default `1.0`): Target frames per second for preview generation.
* `target_height` (default `240`): Output height. Width auto-scales to preserve aspect ratio.
* `compression_level` (range `0–6`, default `6`): WebP compression level. `0` is lossless; higher values reduce size with lower quality.
* `quality` (range `0–100`, default `50`): WebP quality. Higher values increase quality and size.
* `num_cpus_per_worker` (default `4.0`): Number of CPU threads mapped to `ffmpeg -threads`.
* `verbose` (default `False`): Emit more logs.
Behavior notes:
* If the input frame rate is lower than `target_fps` or the input height is lower than `target_height`, the stage logs a warning and preview quality can degrade.
* If `ffmpeg` fails, the stage logs the error and skips assigning preview bytes for that window.
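The condition in the first note can be expressed as a small standalone check. This is a hypothetical helper mirroring the described behavior, not the stage's code:

```python
def preview_warnings(
    input_fps: float,
    input_height: int,
    target_fps: float = 1.0,
    target_height: int = 240,
) -> list[str]:
    """Return warnings when the source cannot fully satisfy the preview targets."""
    warnings = []
    if input_fps < target_fps:
        warnings.append(f"input fps {input_fps} is below target fps {target_fps}")
    if input_height < target_height:
        warnings.append(f"input height {input_height} is below target height {target_height}")
    return warnings
```

When either warning fires, lowering `target_fps` or `target_height` to match the source avoids the degraded output.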
### Example: Configure PreviewStage
```python
from nemo_curator.stages.video.preview.preview import PreviewStage
preview = PreviewStage(
    target_fps=1.0,
    target_height=240,
    compression_level=6,
    quality=50,
    num_cpus_per_worker=4.0,
    verbose=False,
)
```
### Outputs
The stage writes `.webp` files under the `previews/` directory that `ClipWriterStage` manages. Use the helper to resolve the path:
```python
from nemo_curator.stages.video.io.clip_writer import ClipWriterStage
previews_dir = ClipWriterStage.get_output_path_previews("/outputs")
```
Refer to Save & Export for directory structure and file locations: [Save & Export](/curate-video/save-export).
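After a run, the generated previews can be gathered for spot checks with a short helper. This is a minimal sketch assuming the `previews/` directory convention noted above; `list_previews` is a hypothetical name:

```python
from pathlib import Path


def list_previews(output_root: str) -> list[Path]:
    """Return all generated .webp previews under the output root."""
    previews_dir = Path(output_root) / "previews"
    return sorted(previews_dir.rglob("*.webp"))
```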
### Requirements and Troubleshooting
* `ffmpeg` with WebP (`libwebp`) support must be available in the environment.
* If you observe warnings about low frame rate or height, consider lowering `target_fps` or `target_height` to better match inputs.
* On encoding errors, check logs for the `ffmpeg` command and output to diagnose missing encoders.