nemo_curator.models.prompt_formatter

View as Markdown

Module Contents

Classes

NameDescription
PromptFormatter-

Data

VARIANT_MAPPING

API

class nemo_curator.models.prompt_formatter.PromptFormatter(
prompt_variant: str
)
processor
nemo_curator.models.prompt_formatter.PromptFormatter.create_message(
prompt: str
) -> list[dict[str, typing.Any]]

Create a message.

Parameters:

text_input

The text input to create a message for.

Returns: list[dict[str, Any]]

List of messages for the VLM model including the text prompt and video.

nemo_curator.models.prompt_formatter.PromptFormatter.generate_inputs(
prompt: str,
video_inputs: torch.Tensor | None = None,
override_text_prompt: bool = False
) -> dict[str, typing.Any]

Generate inputs for video and text data based on prompt_variant.

Processes video and text inputs to create the input for the model. It handles both video and image inputs, decoding video and applying preprocessing if needed, and creates a structured input dictionary containing the processed prompt and multimodal data.

Parameters:

prompt
str

Text prompt to be included with the input.

fps

Frames per second of the input video.

preprocess_dtype

Data type to use for preprocessing the video/image inputs.

num_frames_to_use

Number of frames to extract from the video. If 0, uses all frames.

flip_input

Whether to flip the input video/image horizontally.

video_inputs
torch.Tensor | NoneDefaults to None

Pre-processed video inputs. If None, and video data is to be passed to the model, then video cannot be None.

override_text_prompt
boolDefaults to False

whether the text prompt should be overridden

Returns: dict[str, Any]

dict containing:

  • “prompt”: The processed text prompt with chat template applied
  • “multi_modal_data”: Dictionary containing processed “image” and/or “video” inputs
nemo_curator.models.prompt_formatter.VARIANT_MAPPING = {'qwen': 'Qwen/Qwen2.5-VL-7B-Instruct'}