nemo_automodel.components.models.minimax_m3_vl.vision_encoder

MiniMax M3 VL vision tower (CLIP-style, Conv3d patch embed + 3D RoPE).

Mirrors the canonical sglang reference sglang.srt.models.minimax_vl_common. The vision transformer applies a Conv3d patch embedding over pre-patchified pixel values, then pre_layrnorm, then a stack of bidirectional CLIP encoder layers with axis-split 3D RoPE. Its output passes through a 2-layer GELU multimodal projector (vision -> text hidden) and a spatial patch merger (spatial_merge_size**2 tokens -> 1).

Vision weights are stored unquantized because head_dim is not MXFP8-aligned, and the checkpoint keeps separate q_proj/k_proj/v_proj/out_proj weights (no QKV fusion).

Module Contents

Classes

Name	Description
`MiniMaxM3VisionAttention`	Bidirectional multi-head attention with separate q/k/v/out projections + 3D RoPE.
`MiniMaxM3VisionEmbeddings`	Conv3d patch embedding over pre-patchified pixel values ([N, C*T*P*P]).
`MiniMaxM3VisionEncoderLayer`	CLIP-style encoder block: pre-norm attention + pre-norm GELU MLP (fc1/fc2).
`MiniMaxM3VisionModel`	Vision tower: ViT + multimodal projector + patch merger (returns text-dim image tokens).
`MiniMaxM3VisionTransformer`	Conv3d embeddings + pre_layrnorm + bidirectional CLIP encoder with 3D RoPE.
`MiniMaxVLMultiModalProjector`	2-layer GELU projector: vision_hidden -> projector_hidden -> text_hidden.
`MiniMaxVLPatchMerger`	Merge `spatial_merge_size**2` projected tokens then GELU-MLP back to text_hidden.

Functions

Name	Description
`_apply_vision_rope`	Apply 3D RoPE to the first `rope_dim` channels of q/k ([S, H, D]).
`_rotate_half`	NEOX-style half rotation: `cat([-x2, x1])` (matches the duplicated cos/sin).

API

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionAttention(
    config: typing.Any
)

Bases: Module

Bidirectional multi-head attention with separate q/k/v/out projections + 3D RoPE.

head_dim

= config.hidden_size // self.num_heads

k_proj

num_heads

= config.num_attention_heads

out_proj

q_proj

scale

= self.head_dim ** -0.5

v_proj

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionAttention.forward(
    x: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor,
    attn_mask: torch.Tensor | None
)
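As a rough illustration of the attention described above, the sketch below uses separate q/k/v/out projections, the `head_dim ** -0.5` scale, and no causal mask. `TinyVisionAttention` is a hypothetical stand-in, not the class documented here: it omits RoPE, and it assumes a `[S, hidden]` input and a boolean mask where `True` means "may attend".

```python
import torch
import torch.nn as nn


class TinyVisionAttention(nn.Module):
    """Hypothetical stand-in for illustration only; RoPE is omitted."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.scale = self.head_dim ** -0.5
        # Separate projections, mirroring the unfused q/k/v/out_proj layout.
        self.q_proj = nn.Linear(hidden_size, hidden_size)
        self.k_proj = nn.Linear(hidden_size, hidden_size)
        self.v_proj = nn.Linear(hidden_size, hidden_size)
        self.out_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, attn_mask=None):
        seq_len = x.shape[0]

        def split_heads(t):
            # [S, hidden] -> [1, H, S, D]
            return t.view(seq_len, self.num_heads, self.head_dim).transpose(0, 1).unsqueeze(0)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))
        scores = (q @ k.transpose(-1, -2)) * self.scale  # [1, H, S, S]
        if attn_mask is not None:
            # Assumption: boolean mask, True where attention is allowed.
            scores = scores.masked_fill(~attn_mask, float("-inf"))
        probs = scores.softmax(dim=-1)
        # [1, H, S, D] -> [S, H * D]
        out = (probs @ v).squeeze(0).transpose(0, 1).reshape(seq_len, -1)
        return self.out_proj(out)


torch.manual_seed(0)
attn = TinyVisionAttention(hidden_size=16, num_heads=4)
x = torch.randn(5, 16)
y = attn(x)  # bidirectional: every token sees every other token
```

With `attn_mask=None` all tokens attend to each other, which corresponds to the single-image case.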

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEmbeddings(
    config: typing.Any
)

Bases: Module

Conv3d patch embedding over pre-patchified pixel values ([N, C*T*P*P]).

num_channels

= config.num_channels

patch_embedding

patch_size

= config.patch_size

temporal_patch_size

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEmbeddings.forward(
    pixel_values: torch.Tensor
) -> torch.Tensor
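A minimal sketch of a Conv3d patch embedding over pre-patchified input, assuming each row of `pixel_values` is one patch flattened in channel, time, height, width order (C*T*P*P values). The sizes and the `bias=False` choice are illustrative assumptions, not values taken from the model config.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the real values come from the model config.
num_channels, temporal_patch_size, patch_size, hidden_size = 3, 2, 14, 32

patch_embedding = nn.Conv3d(
    num_channels,
    hidden_size,
    kernel_size=(temporal_patch_size, patch_size, patch_size),
    stride=(temporal_patch_size, patch_size, patch_size),
    bias=False,  # assumption: CLIP-style patch embeddings often have no bias
)


def embed(pixel_values: torch.Tensor) -> torch.Tensor:
    # [N, C*T*P*P] -> [N, C, T, P, P]: one flattened patch per row.
    patches = pixel_values.view(
        -1, num_channels, temporal_patch_size, patch_size, patch_size
    )
    # The kernel covers the whole patch, so the output is [N, hidden, 1, 1, 1].
    return patch_embedding(patches).view(-1, hidden_size)


pixel_values = torch.randn(
    6, num_channels * temporal_patch_size * patch_size * patch_size
)
tokens = embed(pixel_values)  # one hidden vector per patch
```

Because the kernel and the input have the same extent, this convolution is equivalent to a single matrix multiply against the flattened kernel weights.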

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEncoderLayer(
    config: typing.Any
)

Bases: Module

CLIP-style encoder block: pre-norm attention + pre-norm GELU MLP (fc1/fc2).

act

= ACT2FN[config.hidden_act]

layer_norm1

layer_norm2

mlp

= nn.Module()

self_attn

= MiniMaxM3VisionAttention(config)

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionEncoderLayer.forward(
    x,
    cos,
    sin,
    attn_mask
)
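The pre-norm residual pattern of this block can be sketched as follows. `TinyEncoderLayer` is hypothetical: the attention is replaced by a placeholder `nn.Linear`, RoPE and masking are left out, and hanging `fc1`/`fc2` off a bare `nn.Module()` is an assumption based on the `mlp = nn.Module()` attribute listed above.

```python
import torch
import torch.nn as nn


class TinyEncoderLayer(nn.Module):
    """Hypothetical pre-norm block; `self_attn` is a placeholder layer."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.layer_norm1 = nn.LayerNorm(hidden_size)
        self.layer_norm2 = nn.LayerNorm(hidden_size)
        self.self_attn = nn.Linear(hidden_size, hidden_size)  # stand-in only
        # A bare container keeps parameter names of the form mlp.fc1.weight.
        self.mlp = nn.Module()
        self.mlp.fc1 = nn.Linear(hidden_size, intermediate_size)
        self.mlp.fc2 = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize, transform, then add the residual (pre-norm), twice.
        x = x + self.self_attn(self.layer_norm1(x))
        x = x + self.mlp.fc2(self.act(self.mlp.fc1(self.layer_norm2(x))))
        return x


layer = TinyEncoderLayer(hidden_size=16, intermediate_size=64)
hidden = torch.randn(5, 16)
out = layer(hidden)
```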

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionModel(
    config: typing.Any,
    text_hidden_size: int,
    projector_hidden_size: int,
    projector_hidden_act: str = 'gelu',
    multimodal_projector_bias: bool = True,
    patch_merge_bias: bool = True
)

Bases: Module

Vision tower: ViT + multimodal projector + patch merger (returns text-dim image tokens).

multi_modal_projector

patch_merge_mlp

vision_model

= MiniMaxM3VisionTransformer(config)

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionModel.forward(
    pixel_values: torch.Tensor,
    grid_thw: list[list[int]]
) -> torch.Tensor
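As a small worked example of the token arithmetic, the sketch below estimates how many image tokens come out of the tower. It assumes each `grid_thw` entry is `[t, h, w]` in patch units, that an image contributes `t * h * w` patches, and that the merger reduces every `spatial_merge_size**2` patches to one token. `merged_token_count` is a hypothetical helper, not part of this module.

```python
def merged_token_count(grid_thw, spatial_merge_size):
    """Estimate output tokens under the assumptions stated above."""
    group = spatial_merge_size ** 2
    total = 0
    for t, h, w in grid_thw:
        if h % spatial_merge_size or w % spatial_merge_size:
            raise ValueError("h and w must be divisible by spatial_merge_size")
        total += (t * h * w) // group
    return total


grid_thw = [[1, 4, 6], [1, 8, 8]]  # two images: 24 and 64 patches
n_tokens = merged_token_count(grid_thw, spatial_merge_size=2)  # 6 + 16
```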

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer(
    config: typing.Any
)

Bases: Module

Conv3d embeddings + pre_layrnorm + bidirectional CLIP encoder with 3D RoPE.

embeddings

= MiniMaxM3VisionEmbeddings(config)

encoder

= nn.Module()

pre_layrnorm

spatial_merge_size

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer._block_diag_mask(
    grid_thw: list[list[int]],
    device
) -> torch.Tensor

staticmethod

Bidirectional within each image, no cross-image attention.

Note: this materializes a dense [1, 1, total, total] mask (O(total^2) memory). It is only built for multi-image batches; single-image inputs use attn_mask=None. For large multi-image batches a cu_seqlens / varlen attention path (as in sglang) would avoid the quadratic mask.
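One way such a mask could be built is shown below, assuming each image contributes `t * h * w` tokens and that the mask is boolean with `True` meaning "may attend"; the actual method may use a different dtype or convention. The same sketch also computes the cumulative sequence lengths that a varlen attention path would consume instead of the dense mask.

```python
import torch


def block_diag_mask(grid_thw, device=None):
    # Tokens per image (assumption: t * h * w patches each).
    lengths = torch.tensor([t * h * w for t, h, w in grid_thw], device=device)
    # Label every token with the index of the image it belongs to.
    image_ids = torch.repeat_interleave(
        torch.arange(len(grid_thw), device=device), lengths
    )
    # Two tokens may attend to each other only if they share an image.
    mask = image_ids[:, None] == image_ids[None, :]
    return mask[None, None]  # [1, 1, total, total]


grid_thw = [[1, 2, 2], [1, 2, 3]]  # 4 tokens and 6 tokens
mask = block_diag_mask(grid_thw)

# Compact alternative: O(num_images) boundaries instead of O(total^2) entries.
lengths = torch.tensor([t * h * w for t, h, w in grid_thw])
cu_seqlens = torch.cat([torch.zeros(1, dtype=lengths.dtype), lengths.cumsum(0)])
```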

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer._rope_position_freqs(
    grid_thw: list[list[int]],
    device
) -> torch.Tensor

Per-token [seq, 3*axis_dim/2] frequencies (t/h/w), spatial-merge-aware.
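The following is a sketch of how per-token t/h/w frequencies of shape `[seq, 3*axis_dim/2]` might be produced. The merge-aware ordering (patches of one merge window placed next to each other) follows the pattern used by Qwen2-VL-style vision towers and is an assumption here, as are the `theta` default and the function name `rope_position_freqs`.

```python
import torch


def rope_position_freqs(grid_thw, axis_dim, spatial_merge_size, theta=10000.0):
    m = spatial_merge_size
    # One inverse-frequency vector of length axis_dim / 2, shared by all axes.
    inv_freq = 1.0 / (
        theta ** (torch.arange(0, axis_dim, 2, dtype=torch.float32) / axis_dim)
    )
    per_image = []
    for t, h, w in grid_thw:
        hpos = torch.arange(h).unsqueeze(1).expand(h, w)
        wpos = torch.arange(w).unsqueeze(0).expand(h, w)
        # Reorder so the m*m patches of each merge window are consecutive.
        hpos = hpos.reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()
        wpos = wpos.reshape(h // m, m, w // m, m).permute(0, 2, 1, 3).flatten()
        tpos = torch.arange(t).repeat_interleave(h * w)
        pos = torch.stack([tpos, hpos.repeat(t), wpos.repeat(t)], dim=-1).float()
        # [seq, 3, axis_dim/2] -> [seq, 3*axis_dim/2] as t | h | w blocks.
        freqs = pos[:, :, None] * inv_freq[None, None, :]
        per_image.append(freqs.flatten(1))
    return torch.cat(per_image, dim=0)


freqs = rope_position_freqs([[1, 4, 4]], axis_dim=8, spatial_merge_size=2)
```

In this layout the first four rows belong to the top-left 2x2 merge window, so a merger that groups consecutive tokens sees spatial neighbours.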

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxM3VisionTransformer.forward(
    pixel_values: torch.Tensor,
    grid_thw: list[list[int]]
) -> torch.Tensor

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLMultiModalProjector(
    vision_hidden: int,
    text_hidden: int,
    projector_hidden: int,
    act: str,
    bias: bool
)

Bases: Module

2-layer GELU projector: vision_hidden -> projector_hidden -> text_hidden.

act

= ACT2FN[act]

linear_1

linear_2

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLMultiModalProjector.forward(
    x: torch.Tensor
) -> torch.Tensor
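A minimal sketch of the two-layer projector shape flow (vision_hidden -> projector_hidden -> text_hidden). `TinyProjector` is a hypothetical stand-in that hard-codes `nn.GELU` instead of looking the activation up in `ACT2FN`; the sizes are illustrative.

```python
import torch
import torch.nn as nn


class TinyProjector(nn.Module):
    """Hypothetical stand-in showing only the shape flow."""

    def __init__(self, vision_hidden, text_hidden, projector_hidden, bias=True):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden, projector_hidden, bias=bias)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(projector_hidden, text_hidden, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear_2(self.act(self.linear_1(x)))


projector = TinyProjector(vision_hidden=16, text_hidden=8, projector_hidden=32)
vision_tokens = torch.randn(12, 16)
text_tokens = projector(vision_tokens)  # same token count, text-sized features
```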

class nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLPatchMerger(
    spatial_merge_size: int,
    text_hidden: int,
    projector_hidden: int,
    act: str,
    bias: bool
)

Bases: Module

Merge spatial_merge_size**2 projected tokens then GELU-MLP back to text_hidden.

act

= ACT2FN[act]

linear_1

linear_2

nemo_automodel.components.models.minimax_m3_vl.vision_encoder.MiniMaxVLPatchMerger.forward(
    x: torch.Tensor
) -> torch.Tensor
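The merge step can be sketched as a reshape followed by a small MLP. This assumes the `spatial_merge_size**2` tokens of one merge window are consecutive in the sequence and that `linear_1` takes the concatenated features; both are guesses from the constructor signature, and `TinyPatchMerger` is a hypothetical stand-in.

```python
import torch
import torch.nn as nn


class TinyPatchMerger(nn.Module):
    """Hypothetical stand-in; layer sizes are inferred, not confirmed."""

    def __init__(self, spatial_merge_size, text_hidden, projector_hidden, bias=True):
        super().__init__()
        self.group = spatial_merge_size ** 2
        self.linear_1 = nn.Linear(text_hidden * self.group, projector_hidden, bias=bias)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(projector_hidden, text_hidden, bias=bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # [N, text_hidden] -> [N / m^2, m^2 * text_hidden]: concatenate each
        # run of m^2 consecutive tokens into one wide feature vector.
        x = x.reshape(-1, self.group * x.shape[-1])
        return self.linear_2(self.act(self.linear_1(x)))


merger = TinyPatchMerger(spatial_merge_size=2, text_hidden=8, projector_hidden=32)
projected = torch.randn(16, 8)  # 16 projected tokens
merged = merger(projected)      # 16 / 4 = 4 merged tokens
```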

nemo_automodel.components.models.minimax_m3_vl.vision_encoder._apply_vision_rope(
    q: torch.Tensor,
    k: torch.Tensor,
    cos: torch.Tensor,
    sin: torch.Tensor
)

Apply 3D RoPE to the first rope_dim channels of q/k ([S, H, D]).
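A sketch of partial rotary embedding in the spirit of this description: only the first `rope_dim` channels of each head are rotated and the remaining channels pass through unchanged. It assumes `cos` and `sin` have shape `[S, rope_dim]` and broadcast over heads; `apply_partial_rope` and `rotate_half` are local illustrative names.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)


def apply_partial_rope(q, k, cos, sin):
    rope_dim = cos.shape[-1]
    # [S, rope_dim] -> [S, 1, rope_dim] so the angles broadcast over heads.
    cos, sin = cos[:, None, :], sin[:, None, :]

    def rotate(x):
        x_rot, x_pass = x[..., :rope_dim], x[..., rope_dim:]
        x_rot = x_rot * cos + rotate_half(x_rot) * sin
        return torch.cat([x_rot, x_pass], dim=-1)

    return rotate(q), rotate(k)


torch.manual_seed(0)
seq_len, num_heads, head_dim, rope_dim = 5, 2, 16, 12
q = torch.randn(seq_len, num_heads, head_dim)
k = torch.randn(seq_len, num_heads, head_dim)
angles = torch.randn(seq_len, rope_dim // 2)
emb = torch.cat([angles, angles], dim=-1)  # duplicated halves
q_rot, k_rot = apply_partial_rope(q, k, emb.cos(), emb.sin())
```

Since each channel pair is rotated by a single angle, the per-head vector norm is preserved.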

nemo_automodel.components.models.minimax_m3_vl.vision_encoder._rotate_half(
    x: torch.Tensor
) -> torch.Tensor

NEOX-style half rotation: cat([-x2, x1]) (matches the duplicated cos/sin).
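For illustration, here is the half rotation on a 4-element vector and why duplicated cos/sin fit it: channel `i` in the first half is paired with channel `i` in the second half, so both must see the same angle. The function below is a local re-implementation of the formula above, not the module's own code.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension into halves (x1, x2) and return [-x2, x1].
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)


x = torch.tensor([1.0, 2.0, 3.0, 4.0])
rotated = rotate_half(x)  # tensor([-3., -4.,  1.,  2.])

# One angle per pair, repeated for both halves of the vector.
angles = torch.tensor([0.5, 1.5])
emb = torch.cat([angles, angles])
out = x * emb.cos() + rotate_half(x) * emb.sin()
# out[0] and out[2] are the 2-D rotation of the pair (x[0], x[2]) by 0.5 rad.
```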