nemo_automodel.components.models.minimax_m3_vl.vision_encoder


MiniMax M3 VL vision tower (CLIP-style, Conv3d patch embed + 3D RoPE).

Mirrors the sglang reference implementation, sglang.srt.models.minimax_vl_common. The pipeline is: a Conv3d patch embedding over pre-patchified pixel values; pre_layrnorm; a stack of bidirectional CLIP encoder layers with axis-split 3D RoPE; a 2-layer GELU multimodal projector (vision -> text hidden); and a spatial patch merger (spatial_merge_size**2 tokens -> 1).
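
The effect of the patch merger on sequence length can be worked through with plain arithmetic. The helper below is hypothetical (it is not part of this module) and assumes each grid_thw entry is a [t, h, w] patch-grid size whose patch count divides evenly by spatial_merge_size**2.

```python
def merged_token_count(grid_thw, spatial_merge_size):
    """Hypothetical helper: image tokens left after spatial_merge_size**2 -> 1 merging."""
    group = spatial_merge_size ** 2
    total = 0
    for t, h, w in grid_thw:
        patches = t * h * w
        # Assumption: every image's patch count is a multiple of the merge group.
        assert patches % group == 0, "patch grid must divide into merge groups"
        total += patches // group
    return total


# Two images: 1x4x6 = 24 patches and 1x2x2 = 4 patches, merged 4 -> 1.
count = merged_token_count([[1, 4, 6], [1, 2, 2]], spatial_merge_size=2)
print(count)  # 7
```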

Vision weights are stored unquantized (head_dim is not MXFP8-aligned), and the checkpoint keeps separate q/k/v/out_proj (no QKV fusion).

Module Contents

Classes

| Name | Description |
| --- | --- |
| MiniMaxM3VisionAttention | Bidirectional multi-head attention with separate q/k/v/out projections + 3D RoPE. |
| MiniMaxM3VisionEmbeddings | Conv3d patch embedding over pre-patchified pixel values ([N, C*T*P*P]). |
| MiniMaxM3VisionEncoderLayer | CLIP-style encoder block: pre-norm attention + pre-norm GELU MLP (fc1/fc2). |
| MiniMaxM3VisionModel | Vision tower: ViT + multimodal projector + patch merger (returns text-dim image tokens). |
| MiniMaxM3VisionTransformer | Conv3d embeddings + pre_layrnorm + bidirectional CLIP encoder with 3D RoPE. |
| MiniMaxVLMultiModalProjector | 2-layer GELU projector: vision_hidden -> projector_hidden -> text_hidden. |
| MiniMaxVLPatchMerger | Merge spatial_merge_size**2 projected tokens then GELU-MLP back to text_hidden. |

Functions

| Name | Description |
| --- | --- |
| _apply_vision_rope | Apply 3D RoPE to the first rope_dim channels of q/k ([S, H, D]). |
| _rotate_half | NEOX-style half rotation: cat([-x2, x1]) (matches the duplicated cos/sin). |

API

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionAttention(
config: typing.Any
)

Bases: Module

Bidirectional multi-head attention with separate q/k/v/out projections + 3D RoPE.

head_dim = config.hidden_size // self.num_heads
k_proj
num_heads = config.num_attention_heads
out_proj
q_proj
scale = self.head_dim ** -0.5
v_proj

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionAttention.forward(
x: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
attn_mask: torch.Tensor | None
)
class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEmbeddings(
config: typing.Any
)

Bases: Module

Conv3d patch embedding over pre-patchified pixel values ([N, C*T*P*P]).

num_channels = config.num_channels
patch_embedding
patch_size = config.patch_size
temporal_patch_size

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEmbeddings.forward(
pixel_values: torch.Tensor
) -> torch.Tensor
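
The sketch below illustrates how a Conv3d can embed patches that are already flattened. It assumes each row is laid out as channels x temporal patch x patch x patch, that the kernel equals the stride so each patch yields exactly one token, and that the convolution has no bias. The sizes are made up for the example.

```python
import torch
from torch import nn

# Hypothetical sizes: channels, temporal patch, spatial patch, hidden width.
C, T, P, hidden = 3, 2, 14, 32

# Kernel == stride, so each [C, T, P, P] patch collapses to a single vector.
patch_embedding = nn.Conv3d(C, hidden, kernel_size=(T, P, P), stride=(T, P, P), bias=False)

pixel_values = torch.randn(10, C * T * P * P)       # 10 pre-patchified patches
patches = pixel_values.view(-1, C, T, P, P)         # restore the 5D patch layout
tokens = patch_embedding(patches).view(-1, hidden)  # [10, hidden]
```

With kernel equal to stride and a single output position, the convolution is equivalent to a linear layer over the flattened patch, which is why the input can be stored pre-patchified.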
class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEncoderLayer(
config: typing.Any
)

Bases: Module

CLIP-style encoder block: pre-norm attention + pre-norm GELU MLP (fc1/fc2).

act = ACT2FN[config.hidden_act]
layer_norm1
layer_norm2
mlp = nn.Module()
self_attn = MiniMaxM3VisionAttention(config)

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEncoderLayer.forward(
x,
cos,
sin,
attn_mask
)
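
A pre-norm block of this kind can be sketched as follows. The residual connections follow the usual CLIP layout and are an assumption here; the class name, the intermediate_size argument, and the pluggable attn module are illustrative only.

```python
import torch
from torch import nn


class EncoderLayerSketch(nn.Module):
    """Illustrative pre-norm block; attn is any module mapping [S, hidden] -> [S, hidden]."""

    def __init__(self, hidden_size: int, intermediate_size: int, attn: nn.Module):
        super().__init__()
        self.layer_norm1 = nn.LayerNorm(hidden_size)
        self.self_attn = attn
        self.layer_norm2 = nn.LayerNorm(hidden_size)
        self.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.fc2 = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, then add the sublayer output back to the residual stream.
        x = x + self.self_attn(self.layer_norm1(x))
        x = x + self.fc2(self.act(self.fc1(self.layer_norm2(x))))
        return x


# nn.Identity stands in for the attention sublayer to keep the example small.
layer = EncoderLayerSketch(hidden_size=16, intermediate_size=64, attn=nn.Identity())
y = layer(torch.randn(5, 16))  # [5, 16]
```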
class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionModel(
config: typing.Any,
text_hidden_size: int,
projector_hidden_size: int,
projector_hidden_act: str = 'gelu',
multimodal_projector_bias: bool = True,
patch_merge_bias: bool = True
)

Bases: Module

Vision tower: ViT + multimodal projector + patch merger (returns text-dim image tokens).

multi_modal_projector
patch_merge_mlp
vision_model = MiniMaxM3VisionTransformer(config)

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionModel.forward(
pixel_values: torch.Tensor,
grid_thw: list[list[int]]
) -> torch.Tensor
class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer(
config: typing.Any
)

Bases: Module

Conv3d embeddings + pre_layrnorm + bidirectional CLIP encoder with 3D RoPE.

embeddings = MiniMaxM3VisionEmbeddings(config)
encoder = nn.Module()
pre_layrnorm
spatial_merge_size

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer._block_diag_mask(
grid_thw: list[list[int]],
device
) -> torch.Tensor
staticmethod

Bidirectional within each image, no cross-image attention.

Note: this materializes a dense [1, 1, total, total] mask (O(total^2) memory). It is only built for multi-image batches; single-image inputs use attn_mask=None. For large multi-image batches a cu_seqlens / varlen attention path (as in sglang) would avoid the quadratic mask.
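
One way to build such a mask is sketched below. It assumes each grid_thw entry is [t, h, w] with t*h*w tokens per image, and it returns a boolean mask where True means "may attend"; the actual method may use a different dtype or an additive mask.

```python
import torch


def block_diag_mask(grid_thw, device="cpu"):
    """Illustrative dense mask: tokens attend only within their own image."""
    lengths = torch.tensor([t * h * w for t, h, w in grid_thw], device=device)
    # Label every token with the index of the image it belongs to.
    image_ids = torch.repeat_interleave(torch.arange(len(grid_thw), device=device), lengths)
    same_image = image_ids[:, None] == image_ids[None, :]  # [total, total]
    return same_image[None, None]                          # [1, 1, total, total]


mask = block_diag_mask([[1, 2, 2], [1, 1, 2]])  # images with 4 and 2 tokens
```

The mask holds total**2 entries regardless of how many are True, which is the quadratic cost the note refers to.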

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer._rope_position_freqs(
grid_thw: list[list[int]],
device
) -> torch.Tensor

Per-token [seq, 3*axis_dim/2] frequencies (t/h/w), spatial-merge-aware.
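
The output shape can be illustrated with a simplified sketch. It assumes a standard RoPE inverse-frequency schedule with base theta=10000 and enumerates positions in plain (t, h, w) raster order; the spatial-merge-aware token ordering of the real method is not reproduced, and the function name and parameters are hypothetical.

```python
import torch


def rope_position_freqs_sketch(grid_thw, axis_dim, theta=10000.0):
    """Illustrative per-token angles: axis_dim/2 frequencies for each of t, h, w."""
    exponents = torch.arange(0, axis_dim, 2, dtype=torch.float32) / axis_dim
    inv_freq = 1.0 / (theta ** exponents)  # [axis_dim / 2]
    chunks = []
    for t, h, w in grid_thw:
        tt, hh, ww = torch.meshgrid(
            torch.arange(t), torch.arange(h), torch.arange(w), indexing="ij"
        )
        pos = torch.stack([tt, hh, ww], dim=-1).reshape(-1, 3).float()  # [t*h*w, 3]
        freqs = pos[:, :, None] * inv_freq                              # [n, 3, axis_dim/2]
        chunks.append(freqs.reshape(pos.shape[0], -1))                  # [n, 3*axis_dim/2]
    return torch.cat(chunks, dim=0)


freqs = rope_position_freqs_sketch([[1, 2, 3]], axis_dim=4)  # [6, 6]
```

Taking cos and sin of these angles, each duplicated along the last dimension, gives tensors of the kind the attention layers accept as cos and sin.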

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer.forward(
pixel_values: torch.Tensor,
grid_thw: list[list[int]]
) -> torch.Tensor
class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLMultiModalProjector(
vision_hidden: int,
text_hidden: int,
projector_hidden: int,
act: str,
bias: bool
)

Bases: Module

2-layer GELU projector: vision_hidden -> projector_hidden -> text_hidden.

act = ACT2FN[act]
linear_1
linear_2

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLMultiModalProjector.forward(
x: torch.Tensor
) -> torch.Tensor
class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLPatchMerger(
spatial_merge_size: int,
text_hidden: int,
projector_hidden: int,
act: str,
bias: bool
)

Bases: Module

Merge spatial_merge_size**2 projected tokens then GELU-MLP back to text_hidden.

act = ACT2FN[act]
linear_1
linear_2

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLPatchMerger.forward(
x: torch.Tensor
) -> torch.Tensor
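
A sketch of the merge step, under two assumptions that this page does not confirm: the spatial_merge_size**2 tokens of each group are adjacent in the sequence, and they are merged by concatenation along the feature dimension. The class is illustrative and hard-codes GELU rather than looking it up by name.

```python
import torch
from torch import nn


class PatchMergerSketch(nn.Module):
    """Illustrative merger: concatenate each group of tokens, then a 2-layer MLP."""

    def __init__(self, spatial_merge_size: int, text_hidden: int, projector_hidden: int, bias: bool = True):
        super().__init__()
        self.group = spatial_merge_size ** 2
        self.linear_1 = nn.Linear(text_hidden * self.group, projector_hidden, bias=bias)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(projector_hidden, text_hidden, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [N, text_hidden] -> [N / group, group * text_hidden]
        x = x.reshape(-1, self.group * x.shape[-1])
        return self.linear_2(self.act(self.linear_1(x)))


merger = PatchMergerSketch(spatial_merge_size=2, text_hidden=16, projector_hidden=32)
merged = merger(torch.randn(12, 16))  # 12 tokens -> 3 tokens of width 16
```
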
nemo_automodel.components.models.minimax_m3_vl.vision_encoder._apply_vision_rope(
q: torch.Tensor,
k: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor
)

Apply 3D RoPE to the first rope_dim channels of q/k ([S, H, D]).
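
A minimal sketch of partial rotary embedding, assuming cos and sin have shape [S, rope_dim] with the two halves duplicated and that channels beyond rope_dim pass through unchanged. The function names are illustrative.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)


def apply_partial_rope(q, k, cos, sin):
    """Illustrative: rotate the first rope_dim channels of [S, H, D] tensors."""
    rope_dim = cos.shape[-1]
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over heads: [S, 1, rope_dim]

    def rotate(x):
        x_rot, x_pass = x[..., :rope_dim], x[..., rope_dim:]
        x_rot = x_rot * cos + rotate_half(x_rot) * sin
        return torch.cat([x_rot, x_pass], dim=-1)

    return rotate(q), rotate(k)


S, H, D, rope_dim = 5, 2, 8, 6
q, k = torch.randn(S, H, D), torch.randn(S, H, D)
angles = torch.randn(S, rope_dim // 2)
emb = torch.cat([angles, angles], dim=-1)  # duplicated halves
q_out, k_out = apply_partial_rope(q, k, emb.cos(), emb.sin())
```

The rotation leaves the per-head vector norm unchanged, and the trailing D - rope_dim channels are returned untouched.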

nemo_automodel.components.models.minimax_m3_vl.vision_encoder._rotate_half(
x: torch.Tensor
) -> torch.Tensor

NEOX-style half rotation: cat([-x2, x1]) (matches the duplicated cos/sin).
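
The half rotation described above can be written in a few lines; this sketch assumes the split is along the last dimension and uses an illustrative name.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Split the last dimension into halves (x1, x2) and return cat([-x2, x1])."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)


x = torch.tensor([1.0, 2.0, 3.0, 4.0])
rotated = rotate_half(x)
print(rotated)  # tensor([-3., -4.,  1.,  2.])
```

Channel i is paired with channel i + D/2, which is why cos and sin are built by duplicating the same angles across both halves.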