nemo_curator.stages.video.caption.caption_enhancement


Module Contents

Classes

| Name | Description |
| --- | --- |
| `CaptionEnhancementStage` | Stage that enhances video captions using language models. |

Functions

| Name | Description |
| --- | --- |
| `_get_enhance_prompt` | - |

Data

_ENHANCE_PROMPTS

API

class nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage(
model_dir: str = 'models/qwen',
model_variant: str = 'qwen',
prompt_variant: str = 'default',
prompt_text: str | None = None,
model_batch_size: int = 128,
fp8: bool = False,
max_output_tokens: int = 512,
verbose: bool = False,
name: str = 'caption_enhancement'
)
Dataclass

Bases: ProcessingStage[VideoTask, VideoTask]

Stage that enhances video captions using language models.

This stage takes existing captions and uses a language model (e.g., Qwen) to generate more detailed and refined descriptions of the video content.

fp8: bool = False
max_output_tokens: int = 512
model_batch_size: int = 128
model_dir: str = 'models/qwen'
model_variant: str = 'qwen'
name: str = 'caption_enhancement'
prompt_text: str | None = None
prompt_variant: str = 'default'
verbose: bool = False
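`model_batch_size` presumably caps how many caption prompts are submitted to the language model in a single call. The sketch below illustrates plain fixed-size chunking under that assumption; the helper `iter_batches` is hypothetical and not part of NeMo Curator.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive chunks of at most `batch_size` items (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start : start + batch_size])


# Example: 300 prompts split with the default model_batch_size of 128.
prompts = [f"caption {i}" for i in range(300)]
sizes = [len(batch) for batch in iter_batches(prompts, 128)]
print(sizes)  # [128, 128, 44]
```

The last batch is simply shorter; whether the real stage batches this way, or leaves batching to the inference engine, is not stated here.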
nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage.__post_init__() -> None
nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage._generate_and_assign_captions(
video: nemo_curator.tasks.video.Video,
mapping: dict[int, tuple[int, int]],
inputs: list[dict[str, typing.Any]]
) -> None

Generate enhanced captions and assign them to video windows.
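The `mapping` argument has type `dict[int, tuple[int, int]]`. One plausible reading, assumed here, is that it maps each flattened input index to a `(clip_idx, window_idx)` pair so that generated text can be routed back to the window it came from. The sketch uses nested lists of dicts in place of `Video`, `Clip`, and `_Window`; `assign_outputs` and the `"enhanced_caption"` key are hypothetical.

```python
def assign_outputs(
    clips: list[list[dict[str, str]]],
    mapping: dict[int, tuple[int, int]],
    outputs: list[str],
) -> None:
    """Write each generated caption back to its (clip, window) slot in place.

    `clips[c][w]` stands in for window `w` of clip `c` (hypothetical layout).
    """
    for input_idx, (clip_idx, window_idx) in mapping.items():
        clips[clip_idx][window_idx]["enhanced_caption"] = outputs[input_idx]


clips = [[{"caption": "a dog runs"}], [{"caption": "a car"}, {"caption": "rain"}]]
mapping = {0: (0, 0), 1: (1, 0), 2: (1, 1)}
outputs = ["A brown dog sprints across a lawn.", "A red car parks.", "Heavy rain falls."]
assign_outputs(clips, mapping, outputs)
print(clips[1][1]["enhanced_caption"])  # Heavy rain falls.
```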

nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage._initialize_model() -> None
nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage._is_valid_window_caption(
clip: nemo_curator.tasks.video.Clip,
window: nemo_curator.tasks.video._Window,
window_idx: int
) -> bool

Check whether the window has valid caption data.

nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage._prepare_caption_inputs(
video: nemo_curator.tasks.video.Video
) -> tuple[dict[int, tuple[int, int]], list[dict[str, typing.Any]]]

Prepare caption inputs from video clips and windows.
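Given the return type `tuple[dict[int, tuple[int, int]], list[dict[str, Any]]]`, a plausible shape is an index mapping back to `(clip_idx, window_idx)` plus a flat list of model inputs. The sketch below assumes that shape; `prepare_inputs`, the `"prompt"` key, and the rule of skipping blank captions are hypothetical.

```python
from typing import Any


def prepare_inputs(
    clips: list[list[str]],
    system_prompt: str,
) -> tuple[dict[int, tuple[int, int]], list[dict[str, Any]]]:
    """Flatten per-window captions into model inputs plus a back-mapping.

    `clips[c][w]` holds the existing caption for window `w` of clip `c`.
    Windows with blank captions are skipped, so input indices stay dense.
    """
    mapping: dict[int, tuple[int, int]] = {}
    inputs: list[dict[str, Any]] = []
    for clip_idx, windows in enumerate(clips):
        for window_idx, caption in enumerate(windows):
            if not caption.strip():
                continue  # nothing to enhance for this window
            mapping[len(inputs)] = (clip_idx, window_idx)
            inputs.append({"prompt": f"{system_prompt}\n{caption}"})
    return mapping, inputs


mapping, inputs = prepare_inputs([["a dog runs", ""], ["rain"]], "Enhance this caption:")
print(mapping)  # {0: (0, 0), 1: (1, 0)}
print(len(inputs))  # 2
```

Keying the mapping by position in `inputs` keeps the model call a simple list-in, list-out operation while still allowing results to be written back to the correct window.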

nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage.inputs() -> tuple[list[str], list[str]]
nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage.outputs() -> tuple[list[str], list[str]]
nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage.process(
task: nemo_curator.tasks.video.VideoTask
) -> nemo_curator.tasks.video.VideoTask
nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage.setup(
worker_metadata: nemo_curator.backends.base.WorkerMetadata | None = None
) -> None
nemo_curator.stages.video.caption.caption_enhancement.CaptionEnhancementStage.setup_on_node(
node_info: nemo_curator.backends.base.NodeInfo,
worker_metadata: nemo_curator.backends.base.WorkerMetadata
) -> None

Download weights and initialize vLLM once per node to avoid torch.compile race conditions.
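The once-per-node guarantee presumably comes from the execution backend scheduling `setup_on_node`, not from code inside the stage. The underlying idea, running a shared, side-effecting initializer exactly once instead of in every concurrent worker, can be illustrated in-process with a lock-guarded flag. This is a swapped-in analogue using `threading.Lock`; the class `OncePerProcess` is hypothetical.

```python
import threading
from collections.abc import Callable


class OncePerProcess:
    """Run an initializer at most once, even if many threads call it concurrently.

    In-process analogue only; it does not coordinate across processes or nodes.
    """

    def __init__(self, init: Callable[[], None]) -> None:
        self._init = init
        self._lock = threading.Lock()
        self._done = False

    def __call__(self) -> None:
        with self._lock:
            if not self._done:
                self._init()  # e.g. download weights or warm up the model
                self._done = True


calls: list[int] = []
setup_once = OncePerProcess(lambda: calls.append(1))
threads = [threading.Thread(target=setup_once) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # 1
```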

nemo_curator.stages.video.caption.caption_enhancement._get_enhance_prompt(
prompt_variant: str,
prompt_text: str | None,
verbose: bool = False
) -> str
nemo_curator.stages.video.caption.caption_enhancement._ENHANCE_PROMPTS = {'default': '\n You are a chatbot that enhances video caption inputs, add...