nemo_automodel.components.models.minimax_m3_vl.model
MiniMax M3 (mixed sparse/dense MoE) text backbone.
Stage 1 implements MiniMaxM3TextModel and the standalone
MiniMaxM3SparseForCausalLM so the language path can be parity-tested against
the sglang reference before the vision tower / VLM wrapper (Stage 3) embeds the
text model as language_model.
Module Contents
Classes
Functions
Data
API
Forward output carrying the primary logits and optional per-depth MTP logits.
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
Standalone M3 text backbone for causal LM (Stage 1 parity target).
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
MiniMax M3 VL: CLIP-style vision tower + projector/merger + M3 text backbone.
Vision features (vision_tower(pixel_values, grid_thw)) are spliced into
the text embeddings at image_token_index / video_token_index
positions, then run through the (sparse/dense MoE) language model + lm_head.
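The splice step described above can be illustrated with a framework-free sketch. It is not the module's actual code: `splice_vision_features`, the placeholder id `100`, and the list-of-lists representation are assumptions made for illustration, and a real implementation would operate on batched tensors.

```python
def splice_vision_features(token_ids, text_embeds, vision_feats, placeholder_ids):
    """Return a copy of text_embeds with placeholder rows replaced by vision rows.

    token_ids:       list[int], one id per sequence position
    text_embeds:     list[list[float]], one embedding row per position
    vision_feats:    list[list[float]], one row per placeholder position, in order
    placeholder_ids: set[int], e.g. {image_token_index, video_token_index}
    """
    n_slots = sum(1 for t in token_ids if t in placeholder_ids)
    if n_slots != len(vision_feats):
        raise ValueError(
            f"{n_slots} placeholder tokens but {len(vision_feats)} vision feature rows"
        )
    feats = iter(vision_feats)
    # Walk the sequence once; consume the next vision row at each placeholder.
    return [
        next(feats) if tok in placeholder_ids else emb
        for tok, emb in zip(token_ids, text_embeds)
    ]


# Hypothetical ids: 100 stands in for image_token_index.
token_ids = [5, 100, 100, 7]
text_embeds = [[0.0, 0.0] for _ in token_ids]
vision_feats = [[1.0, 1.0], [2.0, 2.0]]
merged = splice_vision_features(token_ids, text_embeds, vision_feats, {100})
```

The count check matters: a mismatch between placeholder tokens and vision feature rows usually points to a preprocessing error, and failing early is easier to debug than a silently misaligned sequence.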
True when this is a partial pipeline stage (some text modules nulled).
Rewrite auto-generated pipeline FQNs to M3’s real module paths.
M3’s text stack lives directly under self.model and the vision tower
is a top-level sibling (vision_tower). The framework, seeing the
language_model property, derives a nested "model.language_model."
prefix for the text modules and a "model." prefix for the multimodal
encoders. Map both back to M3’s actual paths so per-stage module nulling
keeps/drops the correct submodules.
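A minimal sketch of the prefix mapping described above, assuming the rewrite is plain string manipulation on fully qualified names. The function name `rewrite_fqn` and the `ENCODER_NAMES` tuple are hypothetical; only `vision_tower` is named in the source, so other encoder modules are left out.

```python
TEXT_PREFIX = "model.language_model."  # prefix the framework derives for text modules
ENCODER_NAMES = ("vision_tower",)      # assumption: top-level multimodal siblings


def rewrite_fqn(fqn: str) -> str:
    """Map an auto-generated FQN back to the module path that actually exists."""
    # Text modules: model.language_model.X -> model.X
    if fqn.startswith(TEXT_PREFIX):
        return "model." + fqn[len(TEXT_PREFIX):]
    # Multimodal encoders: model.vision_tower[.X] -> vision_tower[.X]
    for name in ENCODER_NAMES:
        generated = "model." + name
        if fqn == generated or fqn.startswith(generated + "."):
            return fqn[len("model."):]
    # Anything else (e.g. lm_head, or an already-correct path) is left untouched.
    return fqn


rewritten = [
    rewrite_fqn(f)
    for f in (
        "model.language_model.layers.0",
        "model.vision_tower",
        "model.vision_tower.blocks.1",
        "lm_head",
    )
]
```

Matching on a whole name plus a trailing dot avoids accidentally rewriting an unrelated module whose name merely starts with the same characters.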
Per-stage input/output meta tensors for the PP schedule’s shape inference.
First stage consumes token ids [mb, seq]; later stages consume hidden
states [mb, seq, hidden]. The final stage (owning lm_head) emits
logits [mb, seq, vocab]; earlier stages emit hidden states.
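The shape rules above reduce to two independent choices, sketched here with plain tuples. `stage_io_shapes` and its parameters are hypothetical names; the real method presumably wraps such shapes in meta tensors (for example via `torch.empty(shape, device="meta")`), which is omitted so the sketch runs without PyTorch.

```python
def stage_io_shapes(is_first, has_lm_head, mb, seq, hidden, vocab):
    """Return (input_shape, output_shape) for one pipeline stage."""
    # First stage reads token ids; every later stage reads hidden states.
    in_shape = (mb, seq) if is_first else (mb, seq, hidden)
    # The stage that owns lm_head emits logits; all others emit hidden states.
    out_shape = (mb, seq, vocab) if has_lm_head else (mb, seq, hidden)
    return in_shape, out_shape


first = stage_io_shapes(True, False, mb=2, seq=8, hidden=16, vocab=100)
last = stage_io_shapes(False, True, mb=2, seq=8, hidden=16, vocab=100)
```

A single-stage run is covered as well: with both flags set, the stage consumes token ids and emits logits.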
Bases: Module
Embedding + decoder stack + final norm for the M3 text backbone.
Per-depth MTP logits from the final hidden states (shares lm_head).
Build the routed-expert MoEConfig for the M3 backbone.
Shared experts are handled in :class:~...layers.Block (SwiGLU-OAI), so
n_shared_experts is 0 here. Routed experts use the swigluoai
activation, gate * sigmoid(alpha * gate) * (up + 1), applied to the concatenated
grouped gate/up projection produced by MoESplitExpertsStateDictMixin.
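For reference, the activation formula quoted above can be written out for scalars as follows. This is a sketch of the stated expression only: `swigluoai` here is an illustrative function, `alpha` is passed explicitly because the source does not give its value, and the split of the concatenated projection into `gate` and `up` is assumed to have happened already (the layout is not described). Any clamping the real kernel may apply is not shown.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swigluoai(gate: float, up: float, alpha: float) -> float:
    """gate * sigmoid(alpha * gate) * (up + 1), as stated in the docstring."""
    return gate * _sigmoid(alpha * gate) * (up + 1.0)


# Elementwise application over already-split gate/up values.
gates = [0.0, 1.0, 2.0]
ups = [3.0, 0.0, -1.0]
activated = [swigluoai(g, u, alpha=1.0) for g, u in zip(gates, ups)]
```

Two properties follow directly from the formula: the output is zero whenever `gate` is zero, and also whenever `up` equals -1, because of the `(up + 1)` shift.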