nemo_automodel.components.models.mistral3_vlm.model#
FP8-native Mistral3 VLM (dawn-ridge / Mistral-3.5 128B).
Custom wrapper around HF’s Mistral3ForConditionalGeneration that:
- Inherits the full VLM architecture (vision_tower + multi_modal_projector + Ministral3 language_model) so image inputs flow through Pixtral.
- Attaches Mistral3FP8StateDictAdapter.for_vlm_full() so FP8 dequant runs inside the standard DCP load path. This avoids HF’s FineGrainedFP8 loader, which materializes the full BF16 model on every rank before the PP split and OOMs on 80 GB H100.
- Attaches a one-shot forward pre-hook on every rotary submodule to recompute inv_freq on first call. This is needed because HF’s Ministral3 / Pixtral rotaries compute inv_freq in __init__, so meta-init + to_empty leaves the buffer as uninitialized memory.
Module Contents#
Classes#
Mistral3FP8VLMForConditionalGeneration | Full-VLM (vision + text) FP8 loader for Mistral3ForConditionalGeneration.
Functions#
_rotary_reinit_self_hook | One-shot forward pre-hook that recomputes this rotary module’s own inv_freq on first call.
Data#
API#
- nemo_automodel.components.models.mistral3_vlm.model.logger#
‘getLogger(…)’
- nemo_automodel.components.models.mistral3_vlm.model._rotary_reinit_self_hook(module, args, kwargs)#
One-shot forward pre-hook that recomputes this rotary module’s own inv_freq on first call.
Attached per-rotary rather than on the outer VLM so it fires correctly under pipeline parallelism, where the outer model’s forward is never called directly: the PP schedule dispatches each stage’s sub-modules individually, and rotary modules run inside every attention layer.
Background: HF’s Ministral3 / Pixtral rotary classes initialise inv_freq (and related attributes) in their __init__. Under accelerate.init_empty_weights that becomes a meta tensor, and the subsequent to_empty(device) call leaves it as uninitialised device memory. Neither class exposes rope_init_fn as an attribute, so the generic _reinit_non_persistent_buffers helper doesn’t match. We recover correctness by re-running the module’s own __init__ on the target device outside the init_empty_weights context. Both Ministral3 (YaRN) and Pixtral (2D patch positions) produce the right values this way, since we defer to the class’s authoritative init logic.
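The one-shot pattern described above can be illustrated with a toy module. This is a minimal sketch, not the library's implementation: ToyRotary, reinit_once_hook, attach_reinit_hook, and the _reinit_handle attribute are hypothetical names, and the toy constructor takes (dim, base, device) rather than the (config, device) style arguments a real HF rotary class would take. Only standard PyTorch calls are used (register_forward_pre_hook with with_kwargs=True, to_empty, and the torch.device context manager).

```python
import torch
from torch import nn


class ToyRotary(nn.Module):
    """Toy rotary module that, like the HF classes, builds inv_freq in __init__."""

    def __init__(self, dim: int, base: float = 10000.0, device=None):
        super().__init__()
        self.dim, self.base = dim, base
        exponents = torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim
        # Non-persistent buffer: it is not in the checkpoint, so loading cannot restore it.
        self.register_buffer("inv_freq", 1.0 / (base**exponents), persistent=False)

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        return torch.outer(positions.float(), self.inv_freq)


def reinit_once_hook(module, args, kwargs):
    """Hypothetical one-shot pre-hook: detach itself, then rebuild inv_freq."""
    module._reinit_handle.remove()  # one-shot: later forwards skip this hook
    device = module.inv_freq.device  # the buffer is on the real device after to_empty
    # Defer to the class's own __init__, now running outside any meta-device context.
    type(module).__init__(module, module.dim, module.base, device=device)
    return None  # leave args and kwargs unchanged


def attach_reinit_hook(module: nn.Module) -> None:
    # Keep the handle on the module so the hook can remove itself.
    module._reinit_handle = module.register_forward_pre_hook(
        reinit_once_hook, with_kwargs=True
    )


# Simulate meta-init followed by to_empty, which leaves inv_freq uninitialised.
with torch.device("meta"):
    rotary = ToyRotary(dim=8)
rotary = rotary.to_empty(device="cpu")
attach_reinit_hook(rotary)

angles = rotary(torch.arange(4))  # first call triggers the re-init
expected = 1.0 / (10000.0 ** (torch.arange(0, 8, 2, dtype=torch.float32) / 8))
```

Because the hook sits on the rotary module itself, it fires whenever that module's forward runs, regardless of whether any enclosing model's forward is ever invoked.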
- class nemo_automodel.components.models.mistral3_vlm.model.Mistral3FP8VLMForConditionalGeneration(
- config: transformers.PretrainedConfig,
Bases: transformers.models.mistral3.modeling_mistral3.Mistral3ForConditionalGeneration
Full-VLM (vision + text) FP8 loader for Mistral3ForConditionalGeneration.
Used when the user instantiates through NeMoAutoModelForImageTextToText.from_pretrained on an FP8-native Mistral3 VLM checkpoint (e.g. dawn-ridge-128B).
Initialization
- _skip_init_weights_on_load#
True
- classmethod supports_config(config: transformers.PretrainedConfig) bool#
Claim FP8-native Mistral3 VLM configs.
Matches Mistral3Config (outer VLM) with a ministral3 text backbone and quantization_config.quant_method == 'fp8'.
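The matching rule can be sketched as a plain predicate. This is an illustration only and may differ from how supports_config is actually written: looks_like_fp8_mistral3_vlm is a hypothetical name, and identifying the outer and text configs by the model_type strings "mistral3" and "ministral3" is an assumption (the real method may use isinstance checks instead). The sketch accepts quantization_config as either a dict or an object, since both forms occur in practice.

```python
from types import SimpleNamespace


def looks_like_fp8_mistral3_vlm(config) -> bool:
    """Hypothetical stand-in for the documented matching rule."""
    # Assumption: the outer VLM config reports model_type "mistral3".
    if getattr(config, "model_type", None) != "mistral3":
        return False
    # Assumption: the text backbone config reports model_type "ministral3".
    text_config = getattr(config, "text_config", None)
    if getattr(text_config, "model_type", None) != "ministral3":
        return False
    # quantization_config may be a plain dict or an object with attributes.
    quant = getattr(config, "quantization_config", None)
    if isinstance(quant, dict):
        method = quant.get("quant_method")
    else:
        method = getattr(quant, "quant_method", None)
    return method == "fp8"


# Stand-in configs built from SimpleNamespace, not real transformers objects.
fp8_vlm = SimpleNamespace(
    model_type="mistral3",
    text_config=SimpleNamespace(model_type="ministral3"),
    quantization_config={"quant_method": "fp8"},
)
bf16_vlm = SimpleNamespace(
    model_type="mistral3",
    text_config=SimpleNamespace(model_type="ministral3"),
)
other_backbone = SimpleNamespace(
    model_type="mistral3",
    text_config=SimpleNamespace(model_type="mistral"),
    quantization_config={"quant_method": "fp8"},
)
```

All three conditions must hold together: an unquantized Mistral3 VLM, or an FP8 checkpoint with a different text backbone, is not claimed by this class.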