nemo_automodel.components.models.mistral3_vlm.model#

FP8-native Mistral3 VLM (dawn-ridge / Mistral-3.5 128B).

Custom wrapper around HF’s Mistral3ForConditionalGeneration that:

  • Inherits the full VLM architecture (vision_tower + multi_modal_projector + Ministral3 language_model) so image inputs flow through Pixtral.

  • Attaches Mistral3FP8StateDictAdapter.for_vlm_full() so FP8 dequant runs inside the standard DCP load path (avoids HF’s FineGrainedFP8 loader, which materializes the full BF16 model on every rank pre-PP-split and OOMs on 80 GB H100).

  • Attaches a one-shot forward pre-hook on every rotary submodule to recompute inv_freq on first call. This is needed because HF’s Ministral3 / Pixtral rotaries compute inv_freq in __init__, so meta-init followed by to_empty leaves the buffer holding uninitialized memory.

Module Contents#

Classes#

Mistral3FP8VLMForConditionalGeneration

Full-VLM (vision + text) FP8 loader for Mistral3ForConditionalGeneration.

Functions#

_rotary_reinit_self_hook

One-shot forward pre-hook that recomputes this rotary module’s own inv_freq on first call.

Data#

API#

nemo_automodel.components.models.mistral3_vlm.model.logger#

‘getLogger(…)’

nemo_automodel.components.models.mistral3_vlm.model._rotary_reinit_self_hook(module, args, kwargs)#

One-shot forward pre-hook that recomputes this rotary module’s own inv_freq on first call.

Attached per-rotary rather than on the outer VLM so it fires correctly under pipeline parallelism, where the outer model’s forward is never called directly — the PP schedule dispatches each stage’s sub-modules individually, and rotary modules run inside every attention layer.
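The difference between hooking the outer model and hooking each rotary can be illustrated with a minimal PyTorch sketch. ToyRotary and ToyVLM are hypothetical stand-ins, and calling the submodule directly only mimics how a pipeline stage bypasses the outer forward; it is not the actual PP schedule.

```python
import torch
from torch import nn


class ToyRotary(nn.Module):
    """Hypothetical placeholder for a rotary embedding submodule."""

    def forward(self, x):
        return x


class ToyVLM(nn.Module):
    """Hypothetical outer model that owns one rotary submodule."""

    def __init__(self):
        super().__init__()
        self.rotary = ToyRotary()

    def forward(self, x):
        return self.rotary(x)


fired = []  # records which pre-hooks ran, in order
model = ToyVLM()
model.register_forward_pre_hook(lambda mod, args: fired.append("outer"))
model.rotary.register_forward_pre_hook(lambda mod, args: fired.append("rotary"))

# Mimic a pipeline stage: the submodule is invoked without going through
# the outer model's forward, so only the per-rotary hook runs.
model.rotary(torch.zeros(1))
print(fired)  # ['rotary']
```

A hook registered only on the outer model would never run in this situation, which is why a per-rotary attachment is the safer choice under pipeline parallelism.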

Background: HF’s Ministral3 / Pixtral rotary classes initialize inv_freq (and related attributes) in their __init__. Under accelerate.init_empty_weights, inv_freq becomes a meta tensor, and the subsequent to_empty(device) call allocates it as uninitialized device memory. Neither class exposes rope_init_fn as an attribute, so the generic _reinit_non_persistent_buffers helper doesn’t match. We recover correctness by re-running the module’s own __init__ on the target device outside the init_empty_weights context. Both Ministral3 (YaRN) and Pixtral (2D patch positions) produce the right values this way, since we defer to the class’s authoritative init logic.
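The failure mode and the recovery can be reproduced on a toy module. The sketch below is not the library's hook: ToyRotary, reinit_once, and the _inv_freq_ready flag are hypothetical, the meta device stands in for init_empty_weights, and the hook rebuilds inv_freq by constructing a fresh instance of the same class, which is only one possible way to defer to the class's own init logic.

```python
import torch
from torch import nn


class ToyRotary(nn.Module):
    """Hypothetical stand-in for a rotary class that fills inv_freq in __init__."""

    def __init__(self, dim=8, base=10000.0, device=None):
        super().__init__()
        self.dim, self.base = dim, base
        exponents = torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim
        self.register_buffer("inv_freq", 1.0 / (base ** exponents), persistent=False)

    def forward(self, positions):
        return torch.outer(positions.float(), self.inv_freq)


def reinit_once(module, args, kwargs):
    """One-shot forward pre-hook (assumed design): rebuild inv_freq on first call."""
    if getattr(module, "_inv_freq_ready", False):
        return None  # already repaired, leave the inputs untouched
    device = module.inv_freq.device
    # Re-run the class's own init logic on the real device.
    fresh = type(module)(dim=module.dim, base=module.base, device=device)
    module.inv_freq = fresh.inv_freq
    module._inv_freq_ready = True
    return None


# Meta-init followed by to_empty: inv_freq now holds arbitrary memory.
rotary = ToyRotary(device="meta")
rotary.to_empty(device="cpu")
rotary.register_forward_pre_hook(reinit_once, with_kwargs=True)

angles = rotary(torch.arange(4))  # the hook repairs inv_freq before forward runs
expected = 1.0 / (10000.0 ** (torch.arange(0, 8, 2, dtype=torch.float32) / 8))
```

Using a flag rather than removing the hook handle keeps the sketch short; either approach makes the repair run only once per module.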

class nemo_automodel.components.models.mistral3_vlm.model.Mistral3FP8VLMForConditionalGeneration(config: transformers.PretrainedConfig)#

Bases: transformers.models.mistral3.modeling_mistral3.Mistral3ForConditionalGeneration

Full-VLM (vision + text) FP8 loader for Mistral3ForConditionalGeneration.

Used when the user instantiates through NeMoAutoModelForImageTextToText.from_pretrained on an FP8-native Mistral3 VLM checkpoint (e.g. dawn-ridge-128B).

Initialization

_skip_init_weights_on_load#

True

classmethod supports_config(config: transformers.PretrainedConfig) → bool#

Claim FP8-native Mistral3 VLM configs.

Matches Mistral3Config (outer VLM) with a ministral3 text backbone and quantization_config.quant_method == 'fp8'.
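A rough sketch of the matching rule described above, written against plain attribute lookups. This is an illustration, not the method's actual body: the function name is hypothetical, the model_type strings "mistral3" and "ministral3" are assumed identifiers for the outer config and the text backbone, and the real implementation may use isinstance checks instead.

```python
from types import SimpleNamespace


def looks_like_fp8_mistral3_vlm(config) -> bool:
    """Hypothetical helper mirroring the three conditions in the description."""
    # Outer config must be the VLM wrapper (assumed model_type string).
    if getattr(config, "model_type", None) != "mistral3":
        return False
    # Text backbone must be Ministral3 (assumed model_type string).
    text_config = getattr(config, "text_config", None)
    if getattr(text_config, "model_type", None) != "ministral3":
        return False
    # quantization_config may be a plain dict or an object with attributes.
    quant = getattr(config, "quantization_config", None)
    if isinstance(quant, dict):
        method = quant.get("quant_method")
    else:
        method = getattr(quant, "quant_method", None)
    return method == "fp8"


fp8_vlm = SimpleNamespace(
    model_type="mistral3",
    text_config=SimpleNamespace(model_type="ministral3"),
    quantization_config={"quant_method": "fp8"},
)
bf16_vlm = SimpleNamespace(
    model_type="mistral3",
    text_config=SimpleNamespace(model_type="ministral3"),
)
print(looks_like_fp8_mistral3_vlm(fp8_vlm), looks_like_fp8_mistral3_vlm(bf16_vlm))
```

Checking all three conditions keeps the class from claiming unquantized Mistral3 checkpoints or VLMs with a different text backbone.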