# nemo_automodel.components.models.llama_nemotron_vl.processor

## Module Contents

### Classes

| Name                                                                                                                             | Description                                                                |
| -------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- |
| [`Conversation`](#nemo_automodel-components-models-llama_nemotron_vl-processor-Conversation)                                     | Manages prompt construction with system messages and multi-turn dialogues. |
| [`LlamaNemotronVLImageProcessor`](#nemo_automodel-components-models-llama_nemotron_vl-processor-LlamaNemotronVLImageProcessor)   | Fast batched image processor for Llama Nemotron VL retrieval inputs.       |
| [`LlamaNemotronVLProcessor`](#nemo_automodel-components-models-llama_nemotron_vl-processor-LlamaNemotronVLProcessor)             | Processor for LlamaNemotronVL model.                                       |
| [`LlamaNemotronVLProcessorConfig`](#nemo_automodel-components-models-llama_nemotron_vl-processor-LlamaNemotronVLProcessorConfig) | Dummy configuration used to register the processor with `AutoProcessor`.   |

### Functions

| Name                                                                                                                             | Description                                                                             |
| -------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| [`_register_with_hf_auto_classes`](#nemo_automodel-components-models-llama_nemotron_vl-processor-_register_with_hf_auto_classes) | -                                                                                       |
| [`dynamic_preprocess`](#nemo_automodel-components-models-llama_nemotron_vl-processor-dynamic_preprocess)                         | Dynamically preprocess an image into a list of image tiles, with a thumbnail if needed. |
| [`find_closest_aspect_ratio`](#nemo_automodel-components-models-llama_nemotron_vl-processor-find_closest_aspect_ratio)           | Find the closest target aspect ratio, also taking area ratio into account.              |
| [`get_conv_template`](#nemo_automodel-components-models-llama_nemotron_vl-processor-get_conv_template)                           | Initialize a conversation instance with default configuration.                          |
| [`load_image`](#nemo_automodel-components-models-llama_nemotron_vl-processor-load_image)                                         | Load an image from a file, a URL, a base64 string, or a bytes object.                   |

### Data

[`IMAGENET_MEAN`](#nemo_automodel-components-models-llama_nemotron_vl-processor-IMAGENET_MEAN)

[`IMAGENET_STD`](#nemo_automodel-components-models-llama_nemotron_vl-processor-IMAGENET_STD)

[`SIGLIP_MEAN`](#nemo_automodel-components-models-llama_nemotron_vl-processor-SIGLIP_MEAN)

[`SIGLIP_STD`](#nemo_automodel-components-models-llama_nemotron_vl-processor-SIGLIP_STD)

### API

```python
class nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation(
    system_message: str = '',
    roles: typing.Tuple[str, str] = ('', ''),
    messages: typing.List[typing.List[str]] = list(),
    sep: str = '',
    stop_token_ids: typing.List[int] = None
)
```

Dataclass

Manages prompt construction with system messages and multi-turn dialogues.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation.append_message(
    role: str,
    message: str
)
```

Add a message turn to the dialogue history.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation.get_prompt() -> str
```

Construct the formatted prompt string from system message and dialogue history.
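The reference does not document the prompt layout, so the following is only a minimal sketch of how a dataclass with these fields might behave. `SketchConversation` is a hypothetical stand-in, and the layout used in `get_prompt` (system message, then each turn, each followed by `sep`) is an assumption, not the library's actual format.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchConversation:
    """Hypothetical stand-in for Conversation; the prompt layout is assumed."""

    system_message: str = ""
    roles: Tuple[str, str] = ("", "")
    # default_factory gives every instance its own list rather than a shared one.
    messages: List[List[str]] = field(default_factory=list)
    sep: str = ""
    stop_token_ids: Optional[List[int]] = None

    def append_message(self, role: str, message: str) -> None:
        # Store each turn as a [role, message] pair.
        self.messages.append([role, message])

    def get_prompt(self) -> str:
        # Assumed layout: optional system message, then every turn, each ending in `sep`.
        parts = [self.system_message + self.sep] if self.system_message else []
        for role, message in self.messages:
            parts.append(f"{role}{message}{self.sep}")
        return "".join(parts)


conv = SketchConversation(
    system_message="You are a retriever.",
    roles=("query: ", "passage: "),
    sep="\n",
)
conv.append_message(conv.roles[0], "What is tiling?")
prompt = conv.get_prompt()
```

The sketch uses `field(default_factory=list)` so that two conversations never share one message history.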

```python
class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor(
    image_size: int = 512,
    max_num_tiles: int = 6,
    use_thumbnail: bool = True,
    dynamic_image_size: bool = True,
    norm_type: str = 'siglip',
    resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None,
    kwargs = {}
)
```

**Bases:** `BaseImageProcessorFast`

Fast batched image processor for Llama Nemotron VL retrieval inputs.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor._preprocess(
    images: transformers.image_utils.ImageInput,
    image_size: typing.Optional[int] = None,
    max_num_tiles: typing.Optional[int] = None,
    use_thumbnail: typing.Optional[bool] = None,
    dynamic_image_size: typing.Optional[bool] = None,
    do_rescale: typing.Optional[bool] = None,
    rescale_factor: typing.Optional[float] = None,
    do_normalize: typing.Optional[bool] = None,
    image_mean: typing.Optional[typing.Union[float, typing.List[float]]] = None,
    image_std: typing.Optional[typing.Union[float, typing.List[float]]] = None,
    resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None,
    return_tensors: typing.Optional[typing.Union[str, transformers.utils.TensorType]] = None,
    kwargs = {}
) -> transformers.image_processing_base.BatchFeature
```

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLImageProcessor.dynamic_preprocess(
    image: torch.Tensor,
    image_size: int = 512,
    max_num_tiles: int = 6,
    use_thumbnail: bool = True,
    resample: typing.Optional[typing.Union[transformers.image_utils.PILImageResampling, int]] = None
) -> typing.List[torch.Tensor]
```

Split one channel-first image tensor into dynamically sized square tiles.

```python
class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor(
    tokenizer: typing.Any,
    config: typing.Optional[nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessorConfig] = None,
    q_max_length: typing.Optional[int] = None,
    p_max_length: typing.Optional[int] = None,
    pad_to_multiple_of: typing.Optional[int] = None,
    query_prefix: str = 'query:',
    passage_prefix: str = 'passage:',
    max_input_tiles: int = 6,
    num_image_token: int = 256,
    dynamic_image_size: bool = True,
    image_size: int = 512,
    use_thumbnail: bool = True,
    template: str = 'bidirectional-llama-retrie...,
    num_channels: int = 3,
    norm_type: str = 'siglip',
    system_message: str = '',
    padding: typing.Union[bool, str] = True,
    kwargs = {}
)
```

**Bases:** `ProcessorMixin`

Processor for LlamaNemotronVL model.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.__call__(
    text: typing.Optional[typing.List[str]] = None,
    images: typing.Optional[typing.List[typing.Any]] = None,
    text_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    images_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    common_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None,
    kwargs = {}
) -> typing.Dict[str, typing.Any]
```

Process text and/or image inputs into model-ready features.

This method provides compatibility with the standard HuggingFace processor interface used by Sentence Transformers. For image inputs, it delegates to `process_documents`. For text-only inputs, it tokenizes directly, assuming any task prefix has already been applied by the caller.

**Args:**

* `text`: List of text strings. For text-only inputs, these should already include any task prefix (e.g. `"query: "` or `"passage: "`).
* `images`: List of PIL Images for document encoding.
* `text_kwargs`: Keyword arguments for text processing (e.g. `padding`, `truncation`).
* `images_kwargs`: Keyword arguments for image processing (unused, kept for API compatibility).
* `common_kwargs`: Common keyword arguments (e.g. `return_tensors`).
* `**kwargs`: Additional keyword arguments (ignored).

**Returns:** Dict with `"input_ids"`, `"attention_mask"`, and optionally `"pixel_values"`.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.add_dummy_labels(
    questions,
    merged_batch_dict
)
```

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.merge_batch_dict(
    query_batch_dict,
    doc_batch_dict
)
```

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_documents(
    documents: typing.Union[typing.Dict, typing.List[typing.Dict]],
    return_tensors: typing.Literal['pt', 'np'] = 'pt',
    padding: bool | str | None = None,
    truncation: bool = True,
    pixel_values_layout: typing.Literal['per_image', 'flat_tiles'] = 'flat_tiles',
    kwargs = {}
) -> typing.Dict[str, typing.Any]
```

Process documents into model inputs with tokenized text and pixel values.

**Args:**

* `documents`: Either a dict with `"images"` and `"texts"` lists, or a list of dicts each with `"image"` and `"text"` keys. Images can be PIL Images, file paths, or `None`/empty string for text-only documents.
* `return_tensors`: Output format: `"pt"` for PyTorch tensors, `"np"` for NumPy arrays.
* `padding`: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
* `truncation`: Whether to truncate sequences to `p_max_length`.
* `pixel_values_layout`: How to structure the pixel values output:
  * `"flat_tiles"`: All image tiles concatenated into a single tensor of shape `(total_tiles, C, H, W)`. Different images may contribute different numbers of tiles. `None` if no images are present. This is the format expected by the model's `forward()` method.
  * `"per_image"`: A list aligned with the input documents, where each entry is either a tensor of shape `(num_tiles, C, H, W)` or `None`.

**Returns:** Dict with `"input_ids"`, `"attention_mask"`, and `"pixel_values"`.
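The relationship between the two `pixel_values_layout` options can be illustrated with plain Python lists standing in for tensors. This is a sketch only: `per_image_to_flat` and `flat_to_per_image` are hypothetical helpers, not part of the processor, and real outputs would hold tile tensors rather than strings.

```python
from typing import List, Optional, Sequence


def per_image_to_flat(per_image: Sequence[Optional[List[str]]]) -> Optional[List[str]]:
    """Concatenate per-document tile lists, skipping text-only (None) entries."""
    flat = [tile for tiles in per_image if tiles is not None for tile in tiles]
    return flat or None  # None when no document carried an image


def flat_to_per_image(
    flat: Optional[List[str]], tile_counts: Sequence[int]
) -> List[Optional[List[str]]]:
    """Split a flat tile list back into per-document entries.

    tile_counts[i] is the number of tiles of document i (0 means text-only).
    """
    out: List[Optional[List[str]]] = []
    start = 0
    for count in tile_counts:
        out.append(flat[start:start + count] if count else None)
        start += count
    return out


# Three documents: one with 3 tiles, one text-only, one with a single tile.
per_image = [["a0", "a1", "a2"], None, ["c0"]]
flat = per_image_to_flat(per_image)
restored = flat_to_per_image(flat, tile_counts=[3, 0, 1])
```

Going back from the flat layout requires the per-document tile counts, which is why the flat form alone cannot tell which tiles belong to which document.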

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_queries(
    queries: typing.List[str],
    return_tensors: typing.Literal['pt', 'np'] = 'pt',
    padding: bool | str | None = None,
    truncation: bool = True,
    kwargs = {}
) -> transformers.BatchEncoding
```

Process queries into model inputs with tokenized text.

**Args:**

* `queries`: List of query strings.
* `return_tensors`: Output format: `"pt"` for PyTorch tensors, `"np"` for NumPy arrays.
* `padding`: Padding strategy passed to the tokenizer. Defaults to the value set in the processor constructor.
* `truncation`: Whether to truncate sequences to `q_max_length`.

**Returns:** Dict with `"input_ids"` and `"attention_mask"`.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessor.process_queries_documents_biencoder(
    features: typing.Dict,
    kwargs = {}
) -> typing.Dict[str, typing.Any]
```

The source docstring for this method contains only a debugger capture of an example `features` value, not a description. In that capture, `features` is a list of dicts, one per example, each with:

* `image`: a list of PIL images (three RGB pages of size 1275x1650 per example in the capture).
* `text`: a list of passage strings aligned with `image` (each is `'passage: '` in the capture).
* `question`: a single query string prefixed with `query: `, for example `'query: How does Richard Hooker view human inclinations?'`.

Note that the signature annotates `features` as `typing.Dict`, while the captured value is a list.
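The method's behavior is not documented, so the following is only a guess at one plausible first step: splitting a batch shaped like the example above into a list of queries and a flat list of `{"image", "text"}` documents, which is the list-of-dicts form that `process_documents` accepts. `split_biencoder_batch` is a hypothetical helper, and strings stand in for the PIL images.

```python
from typing import Any, Dict, List, Tuple


def split_biencoder_batch(
    features: List[Dict[str, Any]],
) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Split examples into queries and a flat list of image/text documents."""
    queries = [example["question"] for example in features]
    documents = [
        {"image": image, "text": text}
        for example in features
        # Pair each page image with the passage text at the same position.
        for image, text in zip(example["image"], example["text"])
    ]
    return queries, documents


# Placeholder strings replace the PIL images of the captured batch.
features = [
    {
        "image": ["page-0", "page-1", "page-2"],
        "text": ["passage: ", "passage: ", "passage: "],
        "question": "query: How does Richard Hooker view human inclinations?",
    },
]
queries, documents = split_biencoder_batch(features)
```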

```python
class nemo_automodel.components.models.llama_nemotron_vl.processor.LlamaNemotronVLProcessorConfig()
```

**Bases:** `PretrainedConfig`

Dummy configuration for `LlamaNemotronVLProcessor`, used only to register the processor with `AutoProcessor`.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor._register_with_hf_auto_classes()
```

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.dynamic_preprocess(
    image,
    min_num = 1,
    max_num = 6,
    image_size = 448,
    use_thumbnail = False
)
```

Dynamically preprocess an image into a list of image tiles, with a thumbnail if needed.
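As a rough illustration of the tiling step, the sketch below computes crop boxes for a grid of square tiles in an already-resized image. It follows the row-major layout common in InternVL-style preprocessing, and it assumes a thumbnail is added only when the grid has more than one tile. Both are assumptions about this function, and `tile_boxes` and `num_output_tiles` are hypothetical names.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def tile_boxes(cols: int, rows: int, image_size: int) -> List[Box]:
    """Return row-major crop boxes for a cols x rows grid of square tiles."""
    boxes = []
    for i in range(cols * rows):
        left = (i % cols) * image_size
        top = (i // cols) * image_size
        boxes.append((left, top, left + image_size, top + image_size))
    return boxes


def num_output_tiles(cols: int, rows: int, use_thumbnail: bool) -> int:
    """Count tiles, adding one thumbnail when the grid has more than one tile (assumed rule)."""
    n = cols * rows
    return n + 1 if use_thumbnail and n != 1 else n


# A 2 x 1 grid of 448-pixel tiles covers an 896 x 448 resized image.
boxes = tile_boxes(cols=2, rows=1, image_size=448)
```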

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.find_closest_aspect_ratio(
    aspect_ratio,
    target_ratios,
    width,
    height,
    image_size
)
```

The previous version mainly focused on aspect ratio. This version also considers area ratio.
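The exact selection rule is not documented here. The sketch below shows one widely used InternVL-style variant, in which the grid with the nearest aspect ratio wins and ties go to the larger grid when the image area is large enough. The tie-break threshold, `candidate_grids`, and `closest_grid` are assumptions for illustration and may differ from this module's implementation.

```python
from typing import List, Sequence, Tuple

Grid = Tuple[int, int]  # (columns, rows)


def candidate_grids(min_num: int, max_num: int) -> List[Grid]:
    """Enumerate (cols, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (c, r)
        for c in range(1, max_num + 1)
        for r in range(1, max_num + 1)
        if min_num <= c * r <= max_num
    }
    # Sort by tile count, then by shape, so the order is deterministic.
    return sorted(grids, key=lambda g: (g[0] * g[1], g))


def closest_grid(
    aspect_ratio: float,
    target_ratios: Sequence[Grid],
    width: int,
    height: int,
    image_size: int,
) -> Grid:
    """Pick the grid closest in aspect ratio, using image area to break ties."""
    best, best_diff = (1, 1), float("inf")
    area = width * height
    for cols, rows in target_ratios:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and area > 0.5 * image_size * image_size * cols * rows:
            # Same ratio: prefer the larger grid if the image has enough pixels.
            best = (cols, rows)
    return best


grids = candidate_grids(min_num=1, max_num=6)
# A portrait page of 1275 x 1650 pixels.
page_grid = closest_grid(1275 / 1650, grids, 1275, 1650, image_size=448)
```

With these assumptions, a large square image is split into a 2 x 2 grid while a small square image stays as a single tile, which is the effect of considering area in addition to ratio.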

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.get_conv_template(
    name: str
) -> nemo_automodel.components.models.llama_nemotron_vl.processor.Conversation
```

Initialize a conversation instance with default configuration.

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.load_image(
    image
)
```

Load an image from a file, a URL, a base64 string, or a bytes object.
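To show how the four input kinds might be told apart, here is a standard-library sketch that returns the raw image bytes without decoding them. The order of the checks and the helper name `read_image_bytes` are assumptions. The real `load_image` is not documented beyond the sentence above and presumably returns a decoded image.

```python
import base64
import os
import urllib.request
from typing import Union


def read_image_bytes(image: Union[str, bytes, bytearray]) -> bytes:
    """Return raw image bytes from bytes, a URL, a file path, or a base64 string."""
    if isinstance(image, (bytes, bytearray)):
        return bytes(image)
    if image.startswith(("http://", "https://")):
        with urllib.request.urlopen(image, timeout=10) as response:
            return response.read()
    if image.startswith("data:image"):
        # Data URI: the base64 payload follows the first comma.
        return base64.b64decode(image.split(",", 1)[1])
    if os.path.isfile(image):
        with open(image, "rb") as handle:
            return handle.read()
    # Last resort: treat the string as bare base64 (raises binascii.Error if invalid).
    return base64.b64decode(image, validate=True)


payload = b"\x89PNG fake image bytes"
encoded = base64.b64encode(payload).decode("ascii")
from_bytes = read_image_bytes(payload)
from_base64 = read_image_bytes(encoded)
from_data_uri = read_image_bytes("data:image/png;base64," + encoded)
```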

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.IMAGENET_MEAN = (0.485, 0.456, 0.406)
```

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.IMAGENET_STD = (0.229, 0.224, 0.225)
```

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.SIGLIP_MEAN = (0.5, 0.5, 0.5)
```

```python
nemo_automodel.components.models.llama_nemotron_vl.processor.SIGLIP_STD = (0.5, 0.5, 0.5)
```
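These constants are the usual per-channel statistics for normalizing pixels as `(value - mean) / std`. The sketch below applies that formula to a single RGB pixel already rescaled to `[0, 1]`. `normalize_pixel` and `stats_for` are hypothetical helpers, and the `'imagenet'` key is an assumed `norm_type` value (only `'siglip'` appears as a default in this reference).

```python
from typing import Tuple

Triple = Tuple[float, float, float]

IMAGENET_MEAN: Triple = (0.485, 0.456, 0.406)
IMAGENET_STD: Triple = (0.229, 0.224, 0.225)
SIGLIP_MEAN: Triple = (0.5, 0.5, 0.5)
SIGLIP_STD: Triple = (0.5, 0.5, 0.5)


def stats_for(norm_type: str) -> Tuple[Triple, Triple]:
    """Map a norm_type name to (mean, std); 'imagenet' is an assumed name."""
    table = {
        "siglip": (SIGLIP_MEAN, SIGLIP_STD),
        "imagenet": (IMAGENET_MEAN, IMAGENET_STD),
    }
    return table[norm_type]


def normalize_pixel(rgb: Triple, mean: Triple, std: Triple) -> Triple:
    """Apply (value - mean) / std to each channel of one RGB pixel."""
    r, g, b = ((v - m) / s for v, m, s in zip(rgb, mean, std))
    return (r, g, b)


mean, std = stats_for("siglip")
# With mean 0.5 and std 0.5, the range [0, 1] maps onto [-1, 1].
siglip_pixel = normalize_pixel((0.0, 0.5, 1.0), mean, std)
```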