nemo_automodel.components.models.minimax_m3_vl.vision_encoder
MiniMax M3 VL vision tower (CLIP-style, Conv3d patch embed + 3D RoPE).
Mirrors the canonical sglang reference sglang.srt.models.minimax_vl_common:
a Conv3d patch embedding over pre-patchified pixel values, pre_layrnorm, a
stack of bidirectional CLIP encoder layers with axis-split 3D RoPE, then a
2-layer GELU multimodal projector (vision -> text hidden) and a spatial
patch-merger (spatial_merge_size**2 tokens -> 1).
Vision weights are stored unquantized (head_dim is not MXFP8-aligned), and the
checkpoint keeps separate q/k/v/out_proj (no QKV fusion).
Module Contents
Classes
Functions
API
Bases: Module
Bidirectional multi-head attention with separate q/k/v/out projections + 3D RoPE.
Bases: Module
Conv3d patch embedding over pre-patchified pixel values ([N, C*T*P*P]).
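A minimal sketch of why a Conv3d over pre-patchified rows can be read as a per-patch linear map. It assumes the kernel and stride both equal the full patch extent (T, P, P), which this page does not confirm, and that each row is one flattened patch of length C*T*P*P. The name `patch_embed` and the plain-list representation are hypothetical, chosen only to keep the example dependency-free.

```python
def patch_embed(pixel_values, weight, bias=None):
    """Embed flattened patches with one dot product per output channel.

    pixel_values: N rows, each a flattened patch of length C*T*P*P.
    weight: hidden rows, each a Conv3d kernel flattened in the same
            (C, T, P, P) order as the patch.
    bias: optional list of `hidden` floats (assumption: may be absent).
    """
    patch_len = len(weight[0])
    out = []
    for row in pixel_values:
        if len(row) != patch_len:
            raise ValueError("patch length does not match kernel size")
        # A kernel that covers the whole patch with stride == kernel size
        # produces exactly one output position per patch: a dot product.
        emb = [sum(p * w for p, w in zip(row, w_row)) for w_row in weight]
        if bias is not None:
            emb = [e + b for e, b in zip(emb, bias)]
        out.append(emb)
    return out


# Two patches of length 4 (e.g. C=1, T=1, P=2), hidden size 2.
patches = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
kernels = [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
embedded = patch_embed(patches, kernels)
```

Under that assumption the output has shape [N, hidden], one embedding per input row.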
Bases: Module
CLIP-style encoder block: pre-norm attention + pre-norm GELU MLP (fc1/fc2).
Bases: Module
Vision tower: ViT + multimodal projector + patch merger (returns text-dim image tokens).
Bases: Module
Conv3d embeddings + pre_layrnorm + bidirectional CLIP encoder with 3D RoPE.
Attention mask that is bidirectional within each image, with no cross-image attention.
Note: this materializes a dense [1, 1, total, total] mask (O(total^2)
memory). It is only built for multi-image batches; single-image inputs use
attn_mask=None. For large multi-image batches a cu_seqlens / varlen
attention path (as in sglang) would avoid the quadratic mask.
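The trade-off described in the note can be sketched in plain Python. The helper names `block_diagonal_mask` and `cu_seqlens` are hypothetical, and the leading [1, 1] broadcast dimensions are omitted. Nested lists stand in for tensors so the example stays dependency-free.

```python
from itertools import accumulate


def block_diagonal_mask(seq_lens):
    """Return a total x total grid; True means the query may attend to the key.

    seq_lens: number of patch tokens contributed by each image.
    Returns None for a single image, where full bidirectional attention
    needs no mask at all.
    """
    if len(seq_lens) <= 1:
        return None
    total = sum(seq_lens)
    # image_id[i] is the index of the image that token i belongs to.
    image_id = [img for img, n in enumerate(seq_lens) for _ in range(n)]
    # O(total^2) entries: this is the quadratic cost the note warns about.
    return [[image_id[q] == image_id[k] for k in range(total)]
            for q in range(total)]


def cu_seqlens(seq_lens):
    """Cumulative sequence boundaries, the O(num_images) input that
    variable-length attention kernels typically take instead of a mask."""
    return [0] + list(accumulate(seq_lens))


mask = block_diagonal_mask([2, 3])   # two images: 2 tokens and 3 tokens
bounds = cu_seqlens([2, 3])
```

The dense mask holds `total * total` entries, while the boundary list holds only `num_images + 1` integers.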
Per-token [seq, 3*axis_dim/2] frequencies (t/h/w), spatial-merge-aware.
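One way to read "spatial-merge-aware" is that tokens are ordered so each merge window is contiguous, and each token then gets one block of angles per axis. The sketch below works under that assumption. The function names, the base `theta=10000.0`, and the t/h/w concatenation order are illustrative choices, not taken from the module.

```python
def merge_aware_positions(t, h, w, merge):
    """(t, h, w) position ids, ordered so that every merge x merge spatial
    window occupies consecutive slots in the sequence."""
    if h % merge or w % merge:
        raise ValueError("h and w must be divisible by the merge size")
    positions = []
    for ti in range(t):
        for h0 in range(0, h, merge):          # window row
            for w0 in range(0, w, merge):      # window column
                for dh in range(merge):        # row inside the window
                    for dw in range(merge):    # column inside the window
                        positions.append((ti, h0 + dh, w0 + dw))
    return positions


def rope_angles(positions, axis_dim, theta=10000.0):
    """Per-token angles of length 3 * axis_dim / 2: a t block, then an
    h block, then a w block, each with axis_dim / 2 frequencies."""
    half = axis_dim // 2
    inv_freq = [theta ** (-2.0 * i / axis_dim) for i in range(half)]
    return [[p * f for p in pos for f in inv_freq] for pos in positions]


pos = merge_aware_positions(t=1, h=4, w=4, merge=2)
angles = rope_angles(pos, axis_dim=8)
```

With this ordering, the first four entries of `pos` cover the top-left 2x2 window before the sequence moves on to the next window.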
Bases: Module
2-layer GELU projector: vision_hidden -> projector_hidden -> text_hidden.
Bases: Module
Merge spatial_merge_size**2 projected tokens then GELU-MLP back to text_hidden.
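The merge step can be pictured as grouping consecutive tokens and concatenating their features, so that [N, D] becomes [N / m**2, m**2 * D] before the MLP maps it back to text_hidden. This is a sketch under two assumptions: that merging means concatenation rather than pooling, and that tokens already arrive with each merge window contiguous. `merge_tokens` is a hypothetical helper and the MLP itself is left out.

```python
def merge_tokens(tokens, spatial_merge_size):
    """Concatenate each run of spatial_merge_size**2 token vectors.

    tokens: N feature vectors of equal length D.
    Returns N / m**2 vectors of length m**2 * D.
    """
    group = spatial_merge_size ** 2
    if len(tokens) % group:
        raise ValueError("token count must be a multiple of merge_size**2")
    return [
        [value for tok in tokens[i:i + group] for value in tok]
        for i in range(0, len(tokens), group)
    ]


# 8 tokens of width 2, merged 4-to-1 (spatial_merge_size = 2).
tokens = [[float(i), i + 0.5] for i in range(8)]
merged = merge_tokens(tokens, spatial_merge_size=2)
```

Here the 8 input tokens become 2 output tokens of width 8, which a following MLP would then project back to the text hidden size.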
Apply 3D RoPE to the first rope_dim channels of q/k ([S, H, D]).
NEOX-style half rotation: cat([-x2, x1]) (matches the duplicated cos/sin).
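The half rotation and the partial application to the first rope_dim channels can be illustrated on a single head vector. This is a plain-Python sketch, not the module's code: `apply_rope_partial` and its argument layout are assumptions, and a real implementation would run on [S, H, D] tensors.

```python
import math


def rotate_half(x):
    """NEOX-style rotation: split into halves (x1, x2), return [-x2, x1]."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return [-v for v in x2] + x1


def apply_rope_partial(vec, angles, rope_dim):
    """Rotate the first rope_dim channels of one head vector and pass the
    remaining channels through unchanged.

    angles: rope_dim / 2 rotation angles for this token.
    """
    if rope_dim != 2 * len(angles) or rope_dim > len(vec):
        raise ValueError("rope_dim must equal 2 * len(angles) and fit in vec")
    # cos/sin are duplicated so channel i and channel i + rope_dim/2
    # share one angle, matching the [-x2, x1] pairing above.
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot, rest = vec[:rope_dim], vec[rope_dim:]
    rot_h = rotate_half(rot)
    rotated = [r * c + h * s for r, c, h, s in zip(rot, cos, rot_h, sin)]
    return rotated + rest


# Channels (0, 2) form one pair and rotate by pi/2; channel 4 is untouched.
out = apply_rope_partial([1.0, 0.0, 0.0, 0.0, 5.0],
                         angles=[math.pi / 2, 0.0], rope_dim=4)
```

Because each channel pair undergoes a plain 2D rotation, the norm of the rotated slice is preserved and only the relative phase between q and k changes.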