nemo_automodel.components.models.llama_nemotron_vl.processor
Module Contents
Classes
Functions
Data
API
Manages prompt construction with system messages and multi-turn dialogues.
Add a message turn to the dialogue history.
Construct the formatted prompt string from system message and dialogue history.
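The prompt-construction pattern described above (a system message plus an ordered list of turns, joined into one string) can be sketched as follows. This is a minimal illustration, not the module's implementation: the class name SimpleConversation, its method names, and the "role: text" template are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleConversation:
    """Hypothetical stand-in for a prompt-building conversation object."""

    system_message: str = ""
    # Each turn is a (role, text) pair, kept in insertion order.
    messages: list = field(default_factory=list)

    def add_message(self, role: str, text: str) -> None:
        """Append one turn to the dialogue history."""
        self.messages.append((role, text))

    def get_prompt(self) -> str:
        """Join the system message and all turns into a single string.

        The "role: text" line format is an assumption made for this sketch.
        """
        parts = []
        if self.system_message:
            parts.append(f"system: {self.system_message}")
        for role, text in self.messages:
            parts.append(f"{role}: {text}")
        return "\n".join(parts)


conv = SimpleConversation(system_message="You are a retrieval assistant.")
conv.add_message("user", "query: How does tiling work?")
conv.add_message("assistant", "")  # empty turn left open for the model
prompt = conv.get_prompt()
```

Keeping the history as plain (role, text) pairs means the template can change without touching the code that records turns.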
Bases: BaseImageProcessorFast
Fast batched image processor for Llama Nemotron VL retrieval inputs.
Split one channel-first image tensor into dynamically sized square tiles.
Bases: ProcessorMixin
Processor for LlamaNemotronVL model.
Process text and/or image inputs into model-ready features.

This method provides compatibility with the standard HuggingFace processor interface used by Sentence Transformers. For image inputs, it delegates to process_documents. For text-only inputs, it tokenizes directly (assuming any task prefix has already been applied by the caller).

Args:
- text: List of text strings. For text-only inputs, these should already include any task prefix (e.g. "query: " or "passage: ").
- images: List of PIL Images for document encoding.
- text_kwargs: Keyword arguments for text processing (e.g. padding, truncation).
- images_kwargs: Keyword arguments for image processing (unused, for API compat).
- common_kwargs: Common keyword arguments (e.g. return_tensors).
- **kwargs: Additional keyword arguments (ignored).

Returns: Dict with "input_ids", "attention_mask", and optionally "pixel_values".
Process documents into model inputs with tokenized text and pixel values.

Args:
- documents: Either a dict with "images" and "texts" lists, or a list of dicts each with "image" and "text" keys. Images can be PIL Images, file paths, or None/empty string for text-only documents.
- return_tensors: Output format: "pt" for PyTorch tensors, "np" for numpy arrays.
- padding: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
- truncation: Whether to truncate sequences to p_max_length.
- pixel_values_layout: How to structure the pixel values output:
  - "flat_tiles": All image tiles concatenated into a single tensor of shape (total_tiles, C, H, W). Different images may contribute different numbers of tiles. None if no images are present. This is the format expected by the model's forward() method.
  - "per_image": A list aligned with the input documents, where each entry is either a tensor of shape (num_tiles, C, H, W) or None.

Returns: Dict with "input_ids", "attention_mask", and "pixel_values".
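The two accepted document shapes and the two pixel-value layouts described above can be sketched in plain Python. This is only an illustration under stated assumptions: normalize_documents and arrange_tiles are hypothetical helper names, and tiles are represented by placeholder strings rather than tensors of shape (C, H, W).

```python
def normalize_documents(documents):
    """Return (images, texts) lists from either accepted input shape.

    Hypothetical helper: accepts a dict with "images"/"texts" lists or a
    list of dicts with "image"/"text" keys.
    """
    if isinstance(documents, dict):
        images = list(documents["images"])
        texts = list(documents["texts"])
    else:
        images = [doc.get("image") for doc in documents]
        texts = [doc.get("text", "") for doc in documents]
    # Treat None and the empty string as "no image" (text-only document).
    images = [
        None if img is None or (isinstance(img, str) and img == "") else img
        for img in images
    ]
    return images, texts


def arrange_tiles(per_image_tiles, layout="flat_tiles"):
    """Arrange per-document tile lists according to the requested layout.

    per_image_tiles holds one entry per document: a list of tiles, or None
    for a text-only document.
    """
    if layout == "per_image":
        # One entry per document, aligned with the input order.
        return list(per_image_tiles)
    if layout == "flat_tiles":
        # Concatenate all tiles; documents without an image contribute none.
        flat = [
            tile
            for tiles in per_image_tiles
            if tiles is not None
            for tile in tiles
        ]
        return flat or None
    raise ValueError(f"unknown pixel_values_layout: {layout!r}")


images, texts = normalize_documents(
    [{"image": "page1.png", "text": "passage: "}, {"image": "", "text": "passage: notes"}]
)
tiles = [["a0", "a1"], None, ["c0"]]  # placeholder tiles for three documents
flat = arrange_tiles(tiles, "flat_tiles")
aligned = arrange_tiles(tiles, "per_image")
```

With real tensors, the "flat_tiles" branch would concatenate along the first dimension instead of extending a list; the alignment logic stays the same.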
Process queries into model inputs with tokenized text.

Args:
- queries: List of query strings.
- return_tensors: Output format: "pt" for PyTorch tensors, "np" for numpy arrays.
- padding: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
- truncation: Whether to truncate sequences to q_max_length.

Returns: Dict with "input_ids" and "attention_mask".
(Pdb) features
[
  {'image': [<PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5C3A0>, <PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5C580>, <PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5C940>],
   'text': ['passage: ', 'passage: ', 'passage: '],
   'question': "query: What change did Carl Rey suggest for the Strategic Plan's website objective deadline?"},
  {'image': [<PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5C0D0>, <PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5DC00>, <PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5EBF0>],
   'text': ['passage: ', 'passage: ', 'passage: '],
   'question': 'query: What are the name and TIN requirements for individuals with real estate transactions?'},
  {'image': [<PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5D390>, <PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5C850>, <PIL.Image.Image image mode=RGB size=1275x1650 at 0x155059A5C070>],
   'text': ['passage: ', 'passage: ', 'passage: '],
   'question': 'query: How does Richard Hooker view human inclinations?'}
]
Bases: PretrainedConfig
Dummy Configuration for LlamaNemotronVLProcessor, just to register the processor with AutoProcessor.
Dynamically preprocess an image into a list of image tiles, with a thumbnail if needed.
The previous version focused mainly on the ratio. This version also considers the area ratio.
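As a rough illustration of dynamic tiling, the sketch below picks a tile grid by scoring each candidate on both shape similarity and covered area, then lists the crop boxes for that grid. The scoring formula, the area_cap value of 0.6, the tile size of 448, the thumbnail rule, and all function names are assumptions made for this example, not the module's confirmed behavior.

```python
def candidate_grids(min_tiles, max_tiles):
    """Enumerate (cols, rows) grids whose tile count lies in [min_tiles, max_tiles]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if min_tiles <= cols * rows <= max_tiles
    }
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def pick_grid(width, height, grids, tile_size, area_cap=0.6):
    """Pick the grid with the best combined shape and area score (assumed formula)."""
    aspect = width / height
    best_score, best = float("-inf"), (1, 1)
    for cols, rows in grids:
        grid_aspect = cols / rows
        # 1.0 when the grid has exactly the image's aspect ratio, lower otherwise.
        shape_score = min(grid_aspect / aspect, aspect / grid_aspect)
        # Fraction of the image area the grid covers, capped so that very
        # large grids gain nothing beyond the cap.
        area_score = min(cols * rows * tile_size * tile_size / (width * height), area_cap)
        score = shape_score * area_score
        if score > best_score:
            best_score, best = score, (cols, rows)
    return best


def tile_boxes(cols, rows, tile_size):
    """Return (left, top, right, bottom) crop boxes in row-major order."""
    return [
        (c * tile_size, r * tile_size, (c + 1) * tile_size, (r + 1) * tile_size)
        for r in range(rows)
        for c in range(cols)
    ]


def output_count(cols, rows, use_thumbnail=True):
    """Tiles plus one thumbnail when the image was split into more than one tile."""
    n = cols * rows
    return n + 1 if use_thumbnail and n > 1 else n


grids = candidate_grids(1, 6)
grid = pick_grid(1275, 1650, grids, tile_size=448)  # (2, 3) for this page size
boxes = tile_boxes(*grid, tile_size=448)
```

The boxes are expressed in the coordinate space of the image after it has been resized to (cols * tile_size, rows * tile_size), so each crop is exactly one square tile.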
Initialize a conversation instance with default configuration.
Load an image from a file, a URL, a base64 string, or a bytes object.
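One way to dispatch over the four source types is sketched below using only the standard library. It resolves a source to raw bytes rather than a decoded image, and the function name load_image_bytes, the order of the checks, and the 10-second timeout are assumptions for illustration; the module's own loader may differ.

```python
import base64
import binascii
import os
from urllib.request import urlopen


def load_image_bytes(source):
    """Resolve a file path, URL, base64 string, or bytes object to raw bytes."""
    if isinstance(source, (bytes, bytearray)):
        # Already raw bytes: return an immutable copy.
        return bytes(source)
    if not isinstance(source, str):
        raise TypeError(f"unsupported image source type: {type(source).__name__}")
    if source.startswith(("http://", "https://")):
        # Remote image: download the body.
        with urlopen(source, timeout=10) as response:
            return response.read()
    if os.path.isfile(source):
        # Local file on disk.
        with open(source, "rb") as handle:
            return handle.read()
    try:
        # Fall back to treating the string as base64-encoded data.
        return base64.b64decode(source, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("source is not a file, URL, or valid base64 string") from exc


encoded = base64.b64encode(b"raw image bytes").decode("ascii")
decoded = load_image_bytes(encoded)
```

The resulting bytes could then be handed to an image library for decoding. Checking for a URL and an existing file before attempting base64 avoids misreading a short path as encoded data.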