> For clean Markdown of any page, append .md to the page URL.
> For a complete documentation index, see https://docs.nvidia.com/nemo/automodel/llms.txt.
> For AI client integration (Claude Code, Cursor, etc.), connect to the MCP server at https://docs.nvidia.com/nemo/automodel/_mcp/server.

# nemo_automodel.components.models.mistral3_vlm.model

FP8-native Mistral3 VLM (dawn-ridge / Mistral-3.5 128B).

Custom wrapper around HF's `Mistral3ForConditionalGeneration` that:

* Inherits the full VLM architecture (vision\_tower + multi\_modal\_projector
  * Ministral3 language\_model) so image inputs flow through Pixtral.
* Attaches `Mistral3FP8StateDictAdapter.for_vlm_full()` so FP8 dequant
  runs inside the standard DCP load path (avoids HF's FineGrainedFP8
  loader, which materializes the full BF16 model on every rank pre-PP-split
  and OOMs on 80 GB H100).
* Attaches a one-shot forward pre-hook on every rotary submodule to
  recompute `inv_freq` on first call — needed because HF's Ministral3 /
  Pixtral rotaries compute `inv_freq` in `__init__`, so meta-init +
  `to_empty` leaves the buffer at uninitialized memory.

## Module Contents

### Classes

| Name                                                                                                                                    | Description                                                               |
| --------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| [`Mistral3FP8VLMForConditionalGeneration`](#nemo_automodel-components-models-mistral3_vlm-model-Mistral3FP8VLMForConditionalGeneration) | Full-VLM (vision + text) FP8 loader for Mistral3ForConditionalGeneration. |

### Functions

| Name                                                                                                        | Description                                                          |
| ----------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| [`_rotary_reinit_self_hook`](#nemo_automodel-components-models-mistral3_vlm-model-_rotary_reinit_self_hook) | One-shot forward pre-hook that recomputes *this rotary module's* own |

### Data

[`logger`](#nemo_automodel-components-models-mistral3_vlm-model-logger)

### API

```python
class nemo_automodel.components.models.mistral3_vlm.model.Mistral3FP8VLMForConditionalGeneration(
    config: transformers.PretrainedConfig
)
```

**Bases:** `_HFMistral3ForConditionalGeneration`

Full-VLM (vision + text) FP8 loader for Mistral3ForConditionalGeneration.

Used when the user instantiates through
`NeMoAutoModelForImageTextToText.from_pretrained` on an FP8-native
Mistral3 VLM checkpoint (e.g. dawn-ridge-128B).

```python
nemo_automodel.components.models.mistral3_vlm.model.Mistral3FP8VLMForConditionalGeneration.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    pixel_values: typing.Optional[torch.FloatTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    image_sizes: typing.Optional[torch.Tensor] = None,
    output_hidden_states: typing.Optional[bool] = None,
    kwargs = {}
) -> transformers.models.mistral3.modeling_mistral3.Mistral3CausalLMOutputWithPast
```

Forward pass with memory-efficient fused cross-entropy (cut-CE) support.

Overrides HF's `Mistral3ForConditionalGeneration.forward` so the
`train_ft` recipe can enable `FusedLinearCrossEntropy`. The recipe
only does so when (a) `forward` exposes a `logits_to_keep` parameter
and (b) calling the model returns an output that carries the FINAL hidden
states (full sequence) while `logits` cover only the kept positions.

HF's stock forward gates `hidden_states` on a per-call
`output_hidden_states` kwarg (which the recipe does not pass) and emits
the full per-layer tuple. Here we instead resolve `output_hidden_states`
from the text sub-config and surface the inner model's `last_hidden_state`
(the single `[B, S, H]` tensor fed to `lm_head`) directly, which is
what `get_final_hidden_states` consumes.

**Parameters:**

Input token IDs `[B, S]`.

Optional image pixel values for the vision tower.

Optional attention mask.

Optional position indices.

Optional cached key/values.

Optional pre-computed embeddings.

Optional labels for loss computation.

Whether to use KV caching.

Number of final logits to compute (0=all, N=last N tokens).

Optional image sizes for the vision tower.

Whether to surface the final hidden states on the
output (defaults to the text sub-config's `output_hidden_states`).

Additional arguments forwarded to the base model.

**Returns:** `Mistral3CausalLMOutputWithPast`

class:`~transformers.models.mistral3.modeling_mistral3.Mistral3CausalLMOutputWithPast`

```python
nemo_automodel.components.models.mistral3_vlm.model.Mistral3FP8VLMForConditionalGeneration.supports_config(
    config: transformers.PretrainedConfig
) -> bool
```

classmethod

Claim FP8-native Mistral3 VLM configs.

Matches `Mistral3Config` (outer VLM) with a ministral3 text backbone
and `quantization_config.quant_method == 'fp8'`.

```python
nemo_automodel.components.models.mistral3_vlm.model._rotary_reinit_self_hook(
    module,
    args,
    kwargs
)
```

One-shot forward pre-hook that recomputes *this rotary module's* own
`inv_freq` on first call.

Attached per-rotary rather than on the outer VLM so it fires correctly
under pipeline parallelism, where the outer model's `forward` is never
called directly — the PP schedule dispatches each stage's sub-modules
individually, and rotary modules run inside every attention layer.

Background: HF's Ministral3 / Pixtral rotary classes initialise
`inv_freq` (and related attributes) in their `__init__`. Under
`accelerate.init_empty_weights` that becomes a meta tensor, and the
subsequent `to_empty(device)` call leaves it uninitialised device
memory. Neither class exposes `rope_init_fn` as an attribute, so the
generic `_reinit_non_persistent_buffers` helper doesn't match. We
recover correctness by re-running the module's own `__init__` on the
target device outside the init\_empty\_weights context — both Ministral3
(YaRN) and Pixtral (2D patch positions) produce the right values this
way, since we defer to the class's authoritative init logic.

```python
nemo_automodel.components.models.mistral3_vlm.model.logger = logging.getLogger(__name__)
```