nemo_curator.models.prompt_formatter

View as Markdown

Module Contents

Classes

NameDescription
PromptFormatterUnified prompt formatter for VLM models using HuggingFace AutoProcessor.

Data

VARIANT_MAPPING

API

class nemo_curator.models.prompt_formatter.PromptFormatter(
prompt_variant: str
)

Unified prompt formatter for VLM models using HuggingFace AutoProcessor.

Supports both Qwen and Nemotron model variants. Uses AutoProcessor.from_pretrained() to load the appropriate tokenizer and chat template from HuggingFace Hub or a local path.

processor
nemo_curator.models.prompt_formatter.PromptFormatter._convert_to_numpy(
video_inputs: torch.Tensor | numpy.ndarray
) -> numpy.ndarray

Convert video inputs to numpy array in (T, H, W, C) format.

nemo_curator.models.prompt_formatter.PromptFormatter._create_qwen_message(
prompt: str
) -> list[dict[str, typing.Any]]

Create a message for Qwen models.

nemo_curator.models.prompt_formatter.PromptFormatter._generate_nemotron_inputs(
prompt: str,
video_inputs: torch.Tensor | numpy.ndarray | None,
fps: float
) -> dict[str, typing.Any]

Generate inputs for Nemotron models.

Nemotron requires video metadata (fps, frames_indices) for vLLM processing.

nemo_curator.models.prompt_formatter.PromptFormatter._generate_qwen_inputs(
prompt: str,
video_inputs: torch.Tensor | None,
override_text_prompt: bool,
fps: float = 2.0
) -> dict[str, typing.Any]

Generate inputs for Qwen models.

nemo_curator.models.prompt_formatter.PromptFormatter.generate_inputs(
prompt: str,
video_inputs: torch.Tensor | numpy.ndarray | None = None,
override_text_prompt: bool = False,
fps: float = 2.0
) -> dict[str, typing.Any]

Generate inputs for video and text data based on prompt_variant.

Parameters:

prompt
str

Text prompt to be included with the input.

video_inputs
torch.Tensor | np.ndarray | NoneDefaults to None

Pre-processed video inputs (tensor or numpy array).

override_text_prompt
boolDefaults to False

Whether to regenerate the text prompt even if cached.

fps
floatDefaults to 2.0

Frames per second of the input video (used for Nemotron metadata).

Returns: dict[str, Any]

dict containing:

  • “prompt”: The processed text prompt with chat template applied
  • “multi_modal_data”: Dictionary containing processed “video” inputs
nemo_curator.models.prompt_formatter.VARIANT_MAPPING: dict[str, str] = {'qwen2.5': 'Qwen/Qwen2.5-VL-7B-Instruct', 'qwen3': 'Qwen/Qwen3-VL-8B-Instruct',...