nemo_automodel.components.models.llama_nemotron_vl.processor

Module Contents

Classes

Name	Description
`Conversation`	Manages prompt construction with system messages and multi-turn dialogues.
`LlamaNemotronVLImageProcessor`	Fast batched image processor for Llama Nemotron VL retrieval inputs.
`LlamaNemotronVLProcessor`	Processor for LlamaNemotronVL model.
`LlamaNemotronVLProcessorConfig`	Dummy configuration for LlamaNemotronVLProcessor, used only to register the processor with AutoProcessor.

Functions

Name	Description
`_register_with_hf_auto_classes`	-
`dynamic_preprocess`	Dynamically preprocess an image into a list of image tiles, with a thumbnail if needed.
`find_closest_aspect_ratio`	Select the closest target aspect ratio. The previous version mainly focused on aspect ratio; this version also considers area ratio.
`get_conv_template`	Initialize a conversation instance with default configuration.
`load_image`	Load an image from a file, a URL, a base64 string, or a bytes object.

Data

Name	Description
`IMAGENET_MEAN`	-
`IMAGENET_STD`	-
`SIGLIP_MEAN`	-
`SIGLIP_STD`	-

API

class nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation(
    system_message: str = '',
    roles: typing.Tuple[str, str] = ('', ''),
    messages: typing.List[typing.List[str]] = list(),
    sep: str = '',
    stop_token_ids: typing.List[int] = None
)

Dataclass

Manages prompt construction with system messages and multi-turn dialogues.

messages

List[List[str]] = field(default_factory=list)

roles

Tuple[str, str] = ('', '')

sep

str = ''

stop_token_ids

List[int] = None

system_message

str = ''

nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation.append_message(
    role: str,
    message: str
)

Add a message turn to the dialogue history.

nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation.get_prompt() -> str

Construct the formatted prompt string from system message and dialogue history.
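The following is a minimal sketch of how a dataclass with these fields could accumulate turns and build a prompt. `SketchConversation` is a hypothetical stand-in, not the library class, and the prompt layout (system message, then each turn as role plus message, each followed by `sep`) is an assumption; the actual `get_prompt` format depends on the selected template.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchConversation:
    """Hypothetical stand-in that mirrors the documented Conversation fields."""

    system_message: str = ""
    roles: Tuple[str, str] = ("", "")
    # default_factory gives every instance its own list (no shared mutable default).
    messages: List[List[str]] = field(default_factory=list)
    sep: str = ""
    stop_token_ids: Optional[List[int]] = None

    def append_message(self, role: str, message: str) -> None:
        # Store each turn as a [role, message] pair.
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        # Assumed layout: system message first, then every turn, each ending with sep.
        parts = [self.system_message + self.sep] if self.system_message else []
        for role, message in self.messages:
            parts.append(f"{role}{message}{self.sep}")
        return "".join(parts)


conv = SketchConversation(
    system_message="You are helpful.",
    roles=("user: ", "assistant: "),
    sep="\n",
)
conv.append_message(conv.roles[0], "query: what is a tile?")
prompt = conv.get_prompt()
```

Using `field(default_factory=list)` for `messages` matches the attribute listing above and avoids sharing one list between instances.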

class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor(
    image_size: int = 512,
    max_num_tiles: int = 6,
    use_thumbnail: bool = True,
    dynamic_image_size: bool = True,
    norm_type: str = 'siglip',
    resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None,
    kwargs = {}
)

Bases: BaseImageProcessorFast

Fast batched image processor for Llama Nemotron VL retrieval inputs.

model_input_names

= ['pixel_values']

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor._preprocess(
    images: transformers.image_utils.ImageInput,
    image_size: typing.Optional[int] = None,
    max_num_tiles: typing.Optional[int] = None,
    use_thumbnail: typing.Optional[bool] = None,
    dynamic_image_size: typing.Optional[bool] = None,
    do_rescale: typing.Optional[bool] = None,
    rescale_factor: typing.Optional[float] = None,
    do_normalize: typing.Optional[bool] = None,
    image_mean: typing.Optional[typing.Union[float, typing.List[float]]] = None,
    image_std: typing.Optional[typing.Union[float, typing.List[float]]] = None,
    resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None,
    return_tensors: typing.Optional[typing.Union[str, transformers.utils.TensorType]] = None,
    kwargs = {}
) -> transformers.image_processing_base.BatchFeature

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor.dynamic_preprocess(
    image: torch.Tensor,
    image_size: int = 512,
    max_num_tiles: int = 6,
    use_thumbnail: bool = True,
    resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None
) -> typing.List[torch.Tensor]

Split one channel-first image tensor into dynamically sized square tiles.

class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor(
    tokenizer: typing.Any,
    config: typing.Optional[nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessorConfig] = None,
    q_max_length: typing.Optional[int] = None,
    p_max_length: typing.Optional[int] = None,
    pad_to_multiple_of: typing.Optional[int] = None,
    query_prefix: str = 'query:',
    passage_prefix: str = 'passage:',
    max_input_tiles: int = 6,
    num_image_token: int = 256,
    dynamic_image_size: bool = True,
    image_size: int = 512,
    use_thumbnail: bool = True,
    template: str = 'bidirectional-llama-retrie...,
    num_channels: int = 3,
    norm_type: str = 'siglip',
    system_message: str = '',
    padding: typing.Union[bool, str] = True,
    kwargs = {}
)

Bases: ProcessorMixin

Processor for LlamaNemotronVL model.

attributes

= ['tokenizer']

image_processor

tokenizer_class

= 'AutoTokenizer'

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.__call__(
    text: typing.Optional[typing.List[str]] = None,
    images: typing.Optional[typing.List[typing.Any]] = None,
    text_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    images_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    common_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    kwargs = {}
) -> typing.Dict[str, typing.Any]

Process text and/or image inputs into model-ready features.

This method provides compatibility with the standard HuggingFace processor interface used by Sentence Transformers. For image inputs, it delegates to process_documents. For text-only inputs, it tokenizes directly (assuming any task prefix has already been applied by the caller).

Args:
    text: List of text strings. For text-only inputs, these should already include any task prefix (e.g. "query: " or "passage: ").
    images: List of PIL Images for document encoding.
    text_kwargs: Keyword arguments for text processing (e.g. padding, truncation).
    images_kwargs: Keyword arguments for image processing (unused, for API compatibility).
    common_kwargs: Common keyword arguments (e.g. return_tensors).
    **kwargs: Additional keyword arguments (ignored).

Returns:
    Dict with "input_ids", "attention_mask", and optionally "pixel_values".

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.add_dummy_labels(
    questions,
    merged_batch_dict
)

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.merge_batch_dict(
    query_batch_dict,
    doc_batch_dict
)

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_documents(
    documents: typing.Union[typing.Dict, typing.List[typing.Dict]],
    return_tensors: typing.Literal['pt', 'np'] = 'pt',
    padding: bool | str | None = None,
    truncation: bool = True,
    pixel_values_layout: typing.Literal['per_image', 'flat_tiles'] = 'flat_tiles',
    kwargs = {}
) -> typing.Dict[str, typing.Any]

Process documents into model inputs with tokenized text and pixel values.

Args:
    documents: Either a dict with "images" and "texts" lists, or a list of dicts each with "image" and "text" keys. Images can be PIL Images, file paths, or None/empty string for text-only documents.
    return_tensors: Output format: "pt" for PyTorch tensors, "np" for NumPy arrays.
    padding: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
    truncation: Whether to truncate sequences to p_max_length.
    pixel_values_layout: How to structure the pixel values output:
        "flat_tiles": All image tiles concatenated into a single tensor of shape (total_tiles, C, H, W). Different images may contribute different numbers of tiles. None if no images are present. This is the format expected by the model's forward() method.
        "per_image": A list aligned with the input documents, where each entry is either a tensor of shape (num_tiles, C, H, W) or None.

Returns:
    Dict with "input_ids", "attention_mask", and "pixel_values".
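To make the two accepted `documents` shapes and the two pixel-value layouts concrete, here is a small sketch using plain Python lists as stand-ins for tensors. `normalize_documents` and `flatten_tiles` are hypothetical helpers written for illustration; they are not part of the processor, and the real method may handle these steps differently.

```python
from typing import Any, Dict, List, Optional, Union

Doc = Dict[str, Any]


def normalize_documents(documents: Union[Dict[str, List[Any]], List[Doc]]) -> List[Doc]:
    """Convert the dict-of-lists form into the list-of-dicts form (hypothetical helper)."""
    if isinstance(documents, dict):
        images, texts = documents["images"], documents["texts"]
        if len(images) != len(texts):
            raise ValueError("'images' and 'texts' must have the same length")
        return [{"image": image, "text": text} for image, text in zip(images, texts)]
    # Already a list of {"image": ..., "text": ...} dicts; copy to avoid mutating input.
    return [dict(doc) for doc in documents]


def flatten_tiles(per_image: List[Optional[List[Any]]]) -> Optional[List[Any]]:
    """Mimic the 'flat_tiles' layout: concatenate tiles, or None if there are no images."""
    flat = [tile for tiles in per_image if tiles is not None for tile in tiles]
    return flat or None


docs = normalize_documents(
    {"images": ["page1.png", None], "texts": ["passage: a", "passage: b"]}
)

# 'per_image' layout: one entry per document; None marks a text-only document.
per_image = [["p1_tile0", "p1_tile1", "p1_thumb"], None]
flat = flatten_tiles(per_image)
```

In the `per_image` form the list length always equals the number of documents, while the `flat_tiles` form loses that alignment and keeps only the concatenated tiles.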

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_queries(
    queries: typing.List[str],
    return_tensors: typing.Literal['pt', 'np'] = 'pt',
    padding: bool | str | None = None,
    truncation: bool = True,
    kwargs = {}
) -> transformers.BatchEncoding

Process queries into model inputs with tokenized text.

Args:
    queries: List of query strings.
    return_tensors: Output format: "pt" for PyTorch tensors, "np" for NumPy arrays.
    padding: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
    truncation: Whether to truncate sequences to q_max_length.

Returns:
    Dict with "input_ids" and "attention_mask".

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_queries_documents_biencoder(
    features: typing.Dict,
    kwargs = {}
) -> typing.Dict[str, typing.Any]

Example `features` input: a list of dicts, one per query. Each dict holds "image" (a list of PIL images), "text" (a list of passage strings, one per image), and "question" (the query string, including its prefix).

    [
        {
            'image': [<PIL.Image.Image image mode=RGB size=1275x1650>, <PIL.Image.Image image mode=RGB size=1275x1650>, <PIL.Image.Image image mode=RGB size=1275x1650>],
            'text': ['passage: ', 'passage: ', 'passage: '],
            'question': "query: What change did Carl Rey suggest for the Strategic Plan's website objective deadline?"
        },
        {
            'image': [<PIL.Image.Image image mode=RGB size=1275x1650>, <PIL.Image.Image image mode=RGB size=1275x1650>, <PIL.Image.Image image mode=RGB size=1275x1650>],
            'text': ['passage: ', 'passage: ', 'passage: '],
            'question': 'query: What are the name and TIN requirements for individuals with real estate transactions?'
        },
        {
            'image': [<PIL.Image.Image image mode=RGB size=1275x1650>, <PIL.Image.Image image mode=RGB size=1275x1650>, <PIL.Image.Image image mode=RGB size=1275x1650>],
            'text': ['passage: ', 'passage: ', 'passage: '],
            'question': 'query: How does Richard Hooker view human inclinations?'
        }
    ]
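As a rough illustration of the `features` structure, the sketch below separates such a batch into a list of queries and a flat list of documents. `split_biencoder_features` is a hypothetical helper and strings stand in for PIL images; how the actual method batches queries and documents internally is not stated here, so treat this only as one plausible reading of the input layout.

```python
from typing import Any, Dict, List, Tuple


def split_biencoder_features(
    features: List[Dict[str, Any]],
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Split a features batch into queries and per-document dicts (hypothetical helper)."""
    # One query string per feature entry.
    queries = [feature["question"] for feature in features]
    # Each entry carries several candidate documents; pair every image with its text.
    documents = [
        {"image": image, "text": text}
        for feature in features
        for image, text in zip(feature["image"], feature["text"])
    ]
    return queries, documents


# Strings stand in for PIL images so the sketch has no image dependency.
features = [
    {"image": ["img_a0", "img_a1"], "text": ["passage: ", "passage: "], "question": "query: first?"},
    {"image": ["img_b0", "img_b1"], "text": ["passage: ", "passage: "], "question": "query: second?"},
]
queries, documents = split_biencoder_features(features)
```

With two documents per query, document `i` belongs to query `i // 2`, which is the kind of grouping a bi-encoder loss would need to recover.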

class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessorConfig()

Bases: PretrainedConfig

Dummy Configuration for LlamaNemotronVLProcessor, just to register the processor with AutoProcessor.

nemo_automodel.components.models.llama_nemotron_vl.processor._register_with_hf_auto_classes()

nemo_automodel.components.models.llama_nemotron_vl.processor.dynamic_preprocess(
    image,
    min_num = 1,
    max_num = 6,
    image_size = 448,
    use_thumbnail = False
)

Dynamically preprocess an image into a list of image tiles, with a thumbnail if needed.
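The tiling idea can be sketched without any image library: enumerate the tile grids allowed by `min_num` and `max_num`, then compute the crop box of each tile in the resized image. The helper names below are hypothetical, and the rule that a thumbnail is added only when there is more than one tile is an assumption borrowed from similar dynamic-tiling preprocessors, not something this page confirms.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (columns, rows)
Box = Tuple[int, int, int, int]  # (left, upper, right, lower)


def candidate_grids(min_num: int = 1, max_num: int = 6) -> List[Grid]:
    """All (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Sort by tile count first so smaller grids come first.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def tile_boxes(grid: Grid, image_size: int = 448) -> List[Box]:
    """Crop boxes of square tiles, row by row, for an image resized to the grid."""
    cols, rows = grid
    return [
        (c * image_size, r * image_size, (c + 1) * image_size, (r + 1) * image_size)
        for r in range(rows)
        for c in range(cols)
    ]


def tile_count(grid: Grid, use_thumbnail: bool = False) -> int:
    """Number of output tiles; assumes a thumbnail is added only for multi-tile grids."""
    n = grid[0] * grid[1]
    return n + 1 if use_thumbnail and n > 1 else n


grids = candidate_grids(1, 6)
boxes = tile_boxes((2, 1), image_size=448)
```

Each box can be passed to a crop call after the image has been resized to `(cols * image_size, rows * image_size)`.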

nemo_automodel.components.models.llama_nemotron_vl.processor.find_closest_aspect_ratio(
    aspect_ratio,
    target_ratios,
    width,
    height,
    image_size
)

The previous version mainly focused on aspect ratio. This version also considers area ratio.
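One way to combine aspect ratio and area is sketched below: pick the grid whose aspect ratio is closest to the image's, and on a tie prefer the larger grid when the image has enough pixels to fill it. This is an assumed scheme for illustration only; `closest_ratio_sketch` is a hypothetical name, and the function documented here may weigh area differently.

```python
from typing import List, Tuple


def closest_ratio_sketch(
    aspect_ratio: float,
    target_ratios: List[Tuple[int, int]],
    width: int,
    height: int,
    image_size: int,
) -> Tuple[int, int]:
    """Pick a (cols, rows) grid by aspect ratio, using image area to break ties."""
    best_diff = float("inf")
    best_ratio = (1, 1)
    area = width * height
    for cols, rows in target_ratios:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff = diff
            best_ratio = (cols, rows)
        elif diff == best_diff:
            # Tie on aspect ratio: take the larger grid only if the image
            # covers more than half of that grid's pixel area.
            if area > 0.5 * image_size * image_size * cols * rows:
                best_ratio = (cols, rows)
    return best_ratio


square_grids = [(1, 1), (2, 2)]
large = closest_ratio_sketch(1.0, square_grids, 1000, 1000, image_size=448)
small = closest_ratio_sketch(1.0, square_grids, 300, 300, image_size=448)
wide = closest_ratio_sketch(2.0, [(1, 1), (2, 1), (1, 2)], 800, 400, image_size=448)
```

In this sketch a large square image is split into a 2x2 grid while a small one stays as a single tile, even though both have the same aspect ratio.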

nemo_automodel.components.models.llama_nemotron_vl.processor.get_conv_template(
    name: str
) -> nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation

Initialize a conversation instance with default configuration.

nemo_automodel.components.models.llama_nemotron_vl.processor.load_image(
    image
)

Load an image from a file, a URL, a base64 string, or a bytes object.

nemo_automodel.components.models.llama_nemotron_vl.processor.IMAGENET_MEAN = (0.485, 0.456, 0.406)

nemo_automodel.components.models.llama_nemotron_vl.processor.IMAGENET_STD = (0.229, 0.224, 0.225)

nemo_automodel.components.models.llama_nemotron_vl.processor.SIGLIP_MEAN = (0.5, 0.5, 0.5)

nemo_automodel.components.models.llama_nemotron_vl.processor.SIGLIP_STD = (0.5, 0.5, 0.5)
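The effect of the two constant pairs can be shown on a single RGB pixel already rescaled to [0, 1]. The sketch assumes the usual per-channel `(value - mean) / std` normalization and redefines the constants locally so it runs on its own; `normalize_pixel` is a hypothetical helper, and the mapping from `norm_type` to a constant pair is not shown on this page.

```python
from typing import Sequence, Tuple

# Local copies of the documented values so the sketch is self-contained.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
SIGLIP_MEAN = (0.5, 0.5, 0.5)
SIGLIP_STD = (0.5, 0.5, 0.5)


def normalize_pixel(
    rgb: Sequence[float], mean: Sequence[float], std: Sequence[float]
) -> Tuple[float, ...]:
    """Per-channel (value - mean) / std for one pixel with values in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


# SigLIP-style statistics map [0, 1] linearly onto [-1, 1].
siglip_out = normalize_pixel((0.0, 0.5, 1.0), SIGLIP_MEAN, SIGLIP_STD)

# A pixel equal to the ImageNet mean normalizes to zero in every channel.
imagenet_out = normalize_pixel(IMAGENET_MEAN, IMAGENET_MEAN, IMAGENET_STD)
```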