nemo_automodel.components.models.llama_nemotron_vl.processor

Module Contents

Classes

| Name | Description |
| --- | --- |
| Conversation | Manages prompt construction with system messages and multi-turn dialogues. |
| LlamaNemotronVLImageProcessor | Fast batched image processor for Llama Nemotron VL retrieval inputs. |
| LlamaNemotronVLProcessor | Processor for LlamaNemotronVL model. |
| LlamaNemotronVLProcessorConfig | Dummy configuration for LlamaNemotronVLProcessor, just to register the processor with AutoProcessor. |

Functions

| Name | Description |
| --- | --- |
| _register_with_hf_auto_classes | - |
| dynamic_preprocess | Dynamically preprocess an image into a list of image tiles, with a thumbnail if needed. |
| find_closest_aspect_ratio | The previous version focused mainly on the aspect ratio. |
| get_conv_template | Initialize a conversation instance with default configuration. |
| load_image | Load an image from a file, a URL, a base64 string, or a bytes object. |

Data

IMAGENET_MEAN

IMAGENET_STD

SIGLIP_MEAN

SIGLIP_STD

API

class nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation(
system_message: str = '',
roles: typing.Tuple[str, str] = ('', ''),
messages: typing.List[typing.List[str]] = list(),
sep: str = '',
stop_token_ids: typing.List[int] = None
)
Dataclass

Manages prompt construction with system messages and multi-turn dialogues.

messages
List[List[str]] = field(default_factory=list)
roles
Tuple[str, str] = ('', '')
sep
str = ''
stop_token_ids
List[int] = None
system_message
str = ''
nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation.append_message(
role: str,
message: str
)

Add a message turn to the dialogue history.

nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation.get_prompt() -> str

Construct the formatted prompt string from system message and dialogue history.
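The fields and methods above can be illustrated with a small stand-in dataclass. This is a sketch only: `ConversationSketch` is a hypothetical name, and the prompt layout in `get_prompt` (system message first, then each turn as role + message + separator) is an assumption, not the format the module actually produces.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConversationSketch:
    """Hypothetical stand-in mirroring the documented Conversation fields."""

    system_message: str = ""
    roles: Tuple[str, str] = ("", "")
    # default_factory gives each instance its own list instead of a shared one.
    messages: List[List[str]] = field(default_factory=list)
    sep: str = ""
    stop_token_ids: Optional[List[int]] = None

    def append_message(self, role: str, message: str) -> None:
        """Add one [role, message] turn to the dialogue history."""
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        """Join the system message and turns into a single string.

        ASSUMPTION: the layout is "<system><sep>" followed by
        "<role><message><sep>" per turn; the real template may differ.
        """
        parts = [self.system_message + self.sep] if self.system_message else []
        for role, message in self.messages:
            parts.append(f"{role}{message}{self.sep}")
        return "".join(parts)


conv = ConversationSketch(
    system_message="sys", roles=("user: ", "assistant: "), sep="\n"
)
conv.append_message(conv.roles[0], "hi")
prompt = conv.get_prompt()
```

Using `field(default_factory=list)` for `messages`, as the attribute listing above shows, keeps separate instances from sharing one history list.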

class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor(
image_size: int = 512,
max_num_tiles: int = 6,
use_thumbnail: bool = True,
dynamic_image_size: bool = True,
norm_type: str = 'siglip',
resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None,
kwargs = {}
)

Bases: BaseImageProcessorFast

Fast batched image processor for Llama Nemotron VL retrieval inputs.

model_input_names
= ['pixel_values']
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor._preprocess(
images: transformers.image_utils.ImageInput,
image_size: typing.Optional[int] = None,
max_num_tiles: typing.Optional[int] = None,
use_thumbnail: typing.Optional[bool] = None,
dynamic_image_size: typing.Optional[bool] = None,
do_rescale: typing.Optional[bool] = None,
rescale_factor: typing.Optional[float] = None,
do_normalize: typing.Optional[bool] = None,
image_mean: typing.Optional[typing.Union[float, typing.List[float]]] = None,
image_std: typing.Optional[typing.Union[float, typing.List[float]]] = None,
resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None,
return_tensors: typing.Optional[typing.Union[str, transformers.utils.TensorType]] = None,
kwargs = {}
) -> transformers.image_processing_base.BatchFeature
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor.dynamic_preprocess(
image: torch.Tensor,
image_size: int = 512,
max_num_tiles: int = 6,
use_thumbnail: bool = True,
resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None
) -> typing.List[torch.Tensor]

Split one channel-first image tensor into dynamically sized square tiles.

class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor(
tokenizer: typing.Any,
config: typing.Optional[nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessorConfig] = None,
q_max_length: typing.Optional[int] = None,
p_max_length: typing.Optional[int] = None,
pad_to_multiple_of: typing.Optional[int] = None,
query_prefix: str = 'query:',
passage_prefix: str = 'passage:',
max_input_tiles: int = 6,
num_image_token: int = 256,
dynamic_image_size: bool = True,
image_size: int = 512,
use_thumbnail: bool = True,
template: str = 'bidirectional-llama-retrie...,
num_channels: int = 3,
norm_type: str = 'siglip',
system_message: str = '',
padding: typing.Union[bool, str] = True,
kwargs = {}
)

Bases: ProcessorMixin

Processor for LlamaNemotronVL model.

attributes
= ['tokenizer']
image_processor
tokenizer_class
= 'AutoTokenizer'
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.__call__(
text: typing.Optional[typing.List[str]] = None,
images: typing.Optional[typing.List[typing.Any]] = None,
text_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
images_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
common_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
kwargs = {}
) -> typing.Dict[str, typing.Any]

Process text and/or image inputs into model-ready features.

This method provides compatibility with the standard HuggingFace processor interface used by Sentence Transformers. For image inputs, it delegates to process_documents. For text-only inputs, it tokenizes directly (assuming any task prefix has already been applied by the caller).

Args:

  • text: List of text strings. For text-only inputs, these should already include any task prefix (e.g. “query: ” or “passage: ”).
  • images: List of PIL Images for document encoding.
  • text_kwargs: Keyword arguments for text processing (e.g. padding, truncation).
  • images_kwargs: Keyword arguments for image processing (unused, for API compatibility).
  • common_kwargs: Common keyword arguments (e.g. return_tensors).
  • **kwargs: Additional keyword arguments (ignored).

Returns: Dict with “input_ids”, “attention_mask”, and optionally “pixel_values”.

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.add_dummy_labels(
questions,
merged_batch_dict
)
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.merge_batch_dict(
query_batch_dict,
doc_batch_dict
)
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_documents(
documents: typing.Union[typing.Dict, typing.List[typing.Dict]],
return_tensors: typing.Literal['pt', 'np'] = 'pt',
padding: bool | str | None = None,
truncation: bool = True,
pixel_values_layout: typing.Literal['per_image', 'flat_tiles'] = 'flat_tiles',
kwargs = {}
) -> typing.Dict[str, typing.Any]

Process documents into model inputs with tokenized text and pixel values.

Args:

  • documents: Either a dict with “images” and “texts” lists, or a list of dicts each with “image” and “text” keys. Images can be PIL Images, file paths, or None/empty string for text-only documents.
  • return_tensors: Output format: “pt” for PyTorch tensors, “np” for numpy arrays.
  • padding: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
  • truncation: Whether to truncate sequences to p_max_length.
  • pixel_values_layout: How to structure the pixel values output:
      • “flat_tiles”: All image tiles concatenated into a single tensor of shape (total_tiles, C, H, W). Different images may contribute different numbers of tiles. None if no images are present. This is the format expected by the model’s forward() method.
      • “per_image”: A list aligned with the input documents, where each entry is either a tensor of shape (num_tiles, C, H, W) or None.

Returns: Dict with “input_ids”, “attention_mask”, and “pixel_values”.
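To make the two pixel_values_layout options concrete, the sketch below converts a “per_image” structure into a “flat_tiles” one. It uses plain Python lists of strings as stand-ins for tensors, and `to_flat_tiles` and `tiles_per_document` are hypothetical helper names, not part of the module.

```python
from typing import List, Optional, Sequence

Tile = str  # stand-in for one (C, H, W) tile tensor


def to_flat_tiles(
    per_image: Sequence[Optional[Sequence[Tile]]],
) -> Optional[List[Tile]]:
    """Concatenate the tiles of all documents in order.

    Text-only documents (None entries) contribute nothing. The result is
    None when no document has an image, matching the description above.
    """
    flat = [
        tile for tiles in per_image if tiles is not None for tile in tiles
    ]
    return flat or None


def tiles_per_document(
    per_image: Sequence[Optional[Sequence[Tile]]],
) -> List[int]:
    """Count tiles per document, needed to map flat tiles back to documents."""
    return [0 if tiles is None else len(tiles) for tiles in per_image]


# Three documents: two tiles, text-only, one tile.
per_image = [["a0", "a1"], None, ["c0"]]
flat = to_flat_tiles(per_image)
counts = tiles_per_document(per_image)
```

Because different images may contribute different numbers of tiles, a per-document tile count like `counts` is one way to recover document boundaries from the flat layout.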
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_queries(
queries: typing.List[str],
return_tensors: typing.Literal['pt', 'np'] = 'pt',
padding: bool | str | None = None,
truncation: bool = True,
kwargs = {}
) -> transformers.BatchEncoding

Process queries into model inputs with tokenized text.

Args:

  • queries: List of query strings.
  • return_tensors: Output format: “pt” for PyTorch tensors, “np” for numpy arrays.
  • padding: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
  • truncation: Whether to truncate sequences to q_max_length.

Returns: Dict with “input_ids” and “attention_mask”.

nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_queries_documents_biencoder(
features: typing.Dict,
kwargs = {}
) -> typing.Dict[str, typing.Any]

The source docstring contains only a debugger (Pdb) dump of an example features argument. From that dump, features is a list of dicts, one per training example, each with these keys:

  • “image”: A list of PIL Images (three RGB images of size 1275x1650 per example in the dump).
  • “text”: A list of passage strings aligned with the images (each is ‘passage: ’ in the dump).
  • “question”: A single query string that already carries the “query: ” prefix, e.g. ‘query: How does Richard Hooker view human inclinations?’.
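The shape of features shown above can be separated into a query list and a document list, as in the sketch below. `split_features` is a hypothetical helper, and whether the method performs this exact split internally is an assumption. Strings stand in for PIL Images so the example runs without extra dependencies.

```python
from typing import Any, Dict, List, Tuple


def split_features(
    features: List[Dict[str, Any]],
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Split bi-encoder features into queries and per-page documents.

    Each example contributes one query ("question") and one document per
    aligned ("image", "text") pair.
    """
    queries = [example["question"] for example in features]
    documents = [
        {"image": image, "text": text}
        for example in features
        for image, text in zip(example["image"], example["text"])
    ]
    return queries, documents


features = [
    {
        "image": ["page_0", "page_1", "page_2"],  # placeholders for PIL Images
        "text": ["passage: ", "passage: ", "passage: "],
        "question": "query: How does Richard Hooker view human inclinations?",
    }
]
queries, documents = split_features(features)
```

The resulting `documents` list matches the list-of-dicts form that process_documents accepts (dicts with “image” and “text” keys).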

class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessorConfig()

Bases: PretrainedConfig

Dummy Configuration for LlamaNemotronVLProcessor, just to register the processor with AutoProcessor.

nemo_automodel.components.models.llama_nemotron_vl.processor._register_with_hf_auto_classes()
nemo_automodel.components.models.llama_nemotron_vl.processor.dynamic_preprocess(
image,
min_num = 1,
max_num = 6,
image_size = 448,
use_thumbnail = False
)

Dynamically preprocess an image into a list of image tiles, with a thumbnail if needed.
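As a rough illustration of the tiling step, the sketch below computes crop boxes for a grid of square tiles. It is an assumption about the layout rather than the module's code: `tile_boxes`, `num_outputs`, and the `cols`/`rows` arguments are hypothetical names, and the image is presumed to have been resized to `cols * image_size` by `rows * image_size` beforehand. It also assumes the thumbnail is appended only when the grid has more than one tile.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def tile_boxes(cols: int, rows: int, image_size: int = 448) -> List[Box]:
    """Return crop boxes in row-major order for a cols x rows tile grid."""
    boxes = []
    for idx in range(cols * rows):
        col, row = idx % cols, idx // cols
        boxes.append(
            (
                col * image_size,
                row * image_size,
                (col + 1) * image_size,
                (row + 1) * image_size,
            )
        )
    return boxes


def num_outputs(cols: int, rows: int, use_thumbnail: bool = False) -> int:
    """Count returned images: all tiles plus an optional thumbnail.

    ASSUMPTION: a thumbnail is only added when there is more than one tile.
    """
    n_tiles = cols * rows
    return n_tiles + 1 if use_thumbnail and n_tiles > 1 else n_tiles


# A portrait page split into 2 columns and 3 rows of 512-pixel tiles.
boxes = tile_boxes(cols=2, rows=3, image_size=512)
```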

nemo_automodel.components.models.llama_nemotron_vl.processor.find_closest_aspect_ratio(
aspect_ratio,
target_ratios,
width,
height,
image_size
)

The previous version focused mainly on the aspect ratio. This version also considers the area ratio.
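One way to weigh both aspect ratio and area when choosing a tile grid is sketched below. The scoring rule is an assumption made for illustration; the module's actual formula is not documented here. `candidate_grids`, `pick_grid`, and the `area_cap` value of 0.6 are all hypothetical.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (columns, rows)


def candidate_grids(min_num: int = 1, max_num: int = 6) -> List[Grid]:
    """List all (cols, rows) grids whose tile count is in [min_num, max_num]."""
    grids = {
        (c, r)
        for c in range(1, max_num + 1)
        for r in range(1, max_num + 1)
        if min_num <= c * r <= max_num
    }
    # Sort by tile count, then by grid, so iteration order is deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def pick_grid(
    width: int,
    height: int,
    image_size: int,
    grids: List[Grid],
    area_cap: float = 0.6,  # ASSUMPTION: hypothetical cap on the area term
) -> Grid:
    """Pick the grid with the best combined aspect-ratio and area score."""
    aspect_ratio = width / height
    best, best_score = (1, 1), float("-inf")
    for cols, rows in grids:
        target = cols / rows
        # 1.0 when the grid shape matches the image shape, smaller otherwise.
        closeness = min(target / aspect_ratio, aspect_ratio / target)
        # Fraction of the original image area that the tiled canvas covers.
        area_ratio = cols * rows * image_size * image_size / (width * height)
        score = min(area_ratio, area_cap) * closeness
        if score > best_score:
            best, best_score = (cols, rows), score
    return best


grids = candidate_grids(min_num=1, max_num=6)
portrait = pick_grid(1275, 1650, 512, grids)
```

Under these assumptions, a 1275x1650 portrait page with 512-pixel tiles maps to a grid of 2 columns by 3 rows, since a small grid that matches the shape well is penalized for covering too little of the image area.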

nemo_automodel.components.models.llama_nemotron_vl.processor.get_conv_template(
name: str
) -> nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation

Initialize a conversation instance with default configuration.

nemo_automodel.components.models.llama_nemotron_vl.processor.load_image(
image
)

Load an image from a file, a URL, a base64 string, or a bytes object.

nemo_automodel.components.models.llama_nemotron_vl.processor.IMAGENET_MEAN = (0.485, 0.456, 0.406)
nemo_automodel.components.models.llama_nemotron_vl.processor.IMAGENET_STD = (0.229, 0.224, 0.225)
nemo_automodel.components.models.llama_nemotron_vl.processor.SIGLIP_MEAN = (0.5, 0.5, 0.5)
nemo_automodel.components.models.llama_nemotron_vl.processor.SIGLIP_STD = (0.5, 0.5, 0.5)
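
The mean and standard deviation constants are the kind used for per-channel normalization, (x - mean) / std, after pixel values are rescaled to [0, 1]. The sketch below applies that formula to a single RGB pixel. `normalize_pixel` is a hypothetical helper, division by 255 is an assumed rescale step, and “imagenet” as a norm_type value is an assumption (only 'siglip' appears as a default above).

```python
from typing import Sequence, Tuple

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
SIGLIP_MEAN = (0.5, 0.5, 0.5)
SIGLIP_STD = (0.5, 0.5, 0.5)


def normalize_pixel(
    rgb: Sequence[float], norm_type: str = "siglip"
) -> Tuple[float, ...]:
    """Rescale one 0-255 RGB pixel to [0, 1], then normalize per channel."""
    if norm_type == "siglip":
        mean, std = SIGLIP_MEAN, SIGLIP_STD
    elif norm_type == "imagenet":  # ASSUMPTION: hypothetical option name
        mean, std = IMAGENET_MEAN, IMAGENET_STD
    else:
        raise ValueError(f"unknown norm_type: {norm_type!r}")
    return tuple(
        (channel / 255.0 - m) / s for channel, m, s in zip(rgb, mean, std)
    )


# With the SigLIP constants, the 0-255 range maps onto [-1, 1].
siglip_pixel = normalize_pixel((255, 0, 127.5))
```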