nemo_automodel.components.models.minimax_m3_vl.config

Typed configuration classes for the MiniMax M3 VL family.

The released checkpoint ships configuration_minimax_m3_vl.py, which coerces the vision_config and text_config sub-dicts into generic PretrainedConfig instances (the text backbone’s model_type="minimax_m2" is not in HF’s CONFIG_MAPPING). The native AutoModel implementation instead declares typed sub-configs, so that fields such as sparse_attention_config, moe_layer_freq, and the SwiGLU-OAI parameters are real attributes with defaults.

This module mirrors the canonical sglang reference, sglang.srt.configs.minimax_vl, and the field set in the checkpoint’s config.json; keep them in sync.
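The difference between a generic config and a typed sub-config can be illustrated with a small standard-library sketch. The names `TextConfigSketch` and `coerce` are hypothetical stand-ins, not part of this module, and the handling of unknown keys is an assumption made for brevity; only the field names and defaults are taken from the signatures documented on this page.

```python
from dataclasses import dataclass, fields
from typing import Optional, Union


@dataclass
class TextConfigSketch:
    """Hypothetical stand-in for a typed text sub-config (not the real class)."""

    hidden_size: int = 6144
    moe_layer_freq: Optional[list] = None
    swiglu_alpha: float = 1.702
    swiglu_limit: float = 7.0
    sparse_attention_config: Optional[dict] = None


def coerce(value: Union[None, dict, TextConfigSketch]) -> TextConfigSketch:
    """Accept None, a plain dict, or a typed instance; always return a typed instance."""
    if value is None:
        return TextConfigSketch()
    if isinstance(value, TextConfigSketch):
        return value
    known = {f.name for f in fields(TextConfigSketch)}
    # Assumption: unknown keys are dropped here; a real config may keep them.
    return TextConfigSketch(**{k: v for k, v in value.items() if k in known})


cfg = coerce({"hidden_size": 1024, "unknown_key": 1})
# Fields missing from the dict still exist as attributes with their defaults.
print(cfg.hidden_size, cfg.swiglu_limit)
```

With a generic config, a field absent from the dict would raise AttributeError on access; with typed defaults, downstream code can read it unconditionally.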

Module Contents

Classes

| Name | Description |
| --- | --- |
| MiniMaxM3VLConfig | Top-level configuration for MiniMax M3 vision-language checkpoints. |
| MiniMaxM3VLTextConfig | Configuration for the MiniMax M3 (mixed sparse/dense MoE) text backbone. |
| MiniMaxM3VLVisionConfig | Configuration for the MiniMax M3 VL CLIP-style vision tower (Conv3d + 3D RoPE). |

Functions

| Name | Description |
| --- | --- |
| _json_safe_value | Convert config values that are valid in-memory but not JSON serializable. |

API

class nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLConfig(
vision_config: typing.Optional[typing.Union[dict, nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLVisionConfig]] = None,
text_config: typing.Optional[typing.Union[dict, nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLTextConfig]] = None,
image_token_index: int = 200025,
video_token_index: int = 200026,
image_seq_length: int = 576,
process_image_mode: str = 'dynamic_res',
projector_hidden_act: str = 'gelu',
projector_hidden_size: int = 6144,
multimodal_projector_bias: bool = True,
patch_merge_bias: bool = True,
vision_feature_layer: int = -1,
vision_feature_select_strategy: str = 'full',
img_token_compression_config: typing.Optional[dict] = None,
image_grid_pinpoints: typing.Optional[str] = None,
**kwargs
)

Bases: PretrainedConfig

Top-level configuration for MiniMax M3 vision-language checkpoints.

hidden_size = text_config.hidden_size
img_token_compression_config = img_token_compression_config or {}
max_position_embeddings = text_config.max_position_embeddings
model_type = 'minimax_m3_vl'
sub_configs
nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLConfig.to_dict()

class nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLTextConfig(
hidden_size: int = 6144,
intermediate_size: int = 3072,
dense_intermediate_size: int = 12288,
shared_intermediate_size: int = 3072,
num_hidden_layers: int = 60,
num_attention_heads: int = 64,
num_key_value_heads: int = 4,
head_dim: int = 128,
vocab_size: int = 200064,
max_position_embeddings: int = 524288,
rms_norm_eps: float = 1e-06,
use_gemma_norm: bool = True,
attention_output_gate: bool = False,
rope_theta: float = 5000000.0,
rotary_dim: int = 64,
partial_rotary_factor: float = 0.5,
hidden_act: str = 'swigluoai',
use_qk_norm: bool = True,
qk_norm_type: str = 'per_head',
tie_word_embeddings: bool = False,
num_local_experts: int = 128,
num_experts_per_tok: int = 4,
n_shared_experts: int = 1,
scoring_func: str = 'sigmoid',
use_routing_bias: bool = True,
routed_scaling_factor: float = 2.0,
moe_layer_freq: typing.Optional[list[int]] = None,
swiglu_alpha: float = 1.702,
swiglu_limit: float = 7.0,
sparse_attention_config: typing.Optional[dict] = None,
num_mtp_modules: int = 1,
pad_token_id: typing.Optional[int] = None,
**kwargs
)
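The defaults in the signature above imply a few relationships that are easy to check by hand. The sketch below only does arithmetic on those defaults; the interpretation (grouped-query attention, partial rotary embedding) follows common conventions and is an assumption, not something this page states.

```python
# Defaults copied from the MiniMaxM3VLTextConfig signature.
hidden_size = 6144
num_attention_heads = 64
num_key_value_heads = 4
head_dim = 128
rotary_dim = 64
partial_rotary_factor = 0.5

# Query heads that would share one key/value head under grouped-query attention.
queries_per_kv = num_attention_heads // num_key_value_heads

# Projection widths if every head carries head_dim channels.
q_width = num_attention_heads * head_dim
kv_width = num_key_value_heads * head_dim

# The two rotary fields agree with each other: half of each head is rotated.
assert int(partial_rotary_factor * head_dim) == rotary_dim

# head_dim is not hidden_size // num_attention_heads (that would be 96),
# so the query projection width differs from hidden_size.
assert q_width != hidden_size

print(queries_per_kv, q_width, kv_width)
```

When overriding any of these fields, it is worth re-checking that rotary_dim and partial_rotary_factor still agree, since both are stored as independent attributes.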

Bases: PretrainedConfig

Configuration for the MiniMax M3 (mixed sparse/dense MoE) text backbone.

architectures = ['MiniMaxM3SparseForCausalLM']
model_type = 'minimax_m3'
nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLTextConfig.to_dict()

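For readers unfamiliar with the swiglu_alpha and swiglu_limit fields, the following is a minimal scalar sketch of a clamped SwiGLU in the form used by the public gpt-oss reference code, whose defaults (1.702 and 7.0) match the ones above. That MiniMax M3 applies exactly this formulation is an assumption; the function names are hypothetical and not part of this module.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def swiglu_oai_sketch(gate: float, up: float, alpha: float = 1.702, limit: float = 7.0) -> float:
    """Clamped SwiGLU on one (gate, up) pair; assumed formulation, see lead-in."""
    gate = min(gate, limit)            # gate branch is clamped from above only
    up = max(-limit, min(up, limit))   # linear branch is clamped on both sides
    glu = gate * _sigmoid(alpha * gate)
    return glu * (up + 1.0)


# Inputs beyond the limit produce the same output as inputs at the limit.
print(swiglu_oai_sketch(100.0, 0.0) == swiglu_oai_sketch(7.0, 0.0))
print(swiglu_oai_sketch(1.0, 100.0) == swiglu_oai_sketch(1.0, 7.0))
```

In this form alpha sets the sharpness of the sigmoid gate and limit bounds the activation magnitude.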
class nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLVisionConfig(
hidden_size: int = 1280,
num_attention_heads: int = 16,
num_hidden_layers: int = 32,
intermediate_size: int = 5120,
patch_size: int = 14,
image_size: int = 672,
projection_dim: int = 6144,
num_channels: int = 3,
position_embedding_type: str = 'rope',
rope_mode: str = '3d',
rope_theta: float = 10000.0,
attention_dropout: float = 0.0,
hidden_act: str = 'gelu',
layer_norm_eps: float = 1e-05,
img_token_compression_config: typing.Optional[dict] = None,
vision_segment_max_frames: int = 4,
**kwargs
)

Bases: PretrainedConfig

Configuration for the MiniMax M3 VL CLIP-style vision tower (Conv3d + 3D RoPE).

img_token_compression_config
model_type = 'minimax_m3_vision'
nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLVisionConfig.to_dict()

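The vision defaults can be related to the top-level image_seq_length with a little arithmetic. The sketch below assumes a square image at the default image_size and non-overlapping patches; reading the resulting factor of 4 as a 2x2 patch merge is an inference from the numbers (and from the presence of patch_merge_bias in the top-level config), not something this page confirms.

```python
# Defaults copied from MiniMaxM3VLVisionConfig and MiniMaxM3VLConfig.
image_size = 672
patch_size = 14
hidden_size = 1280
num_attention_heads = 16
image_seq_length = 576  # top-level config default

# Non-overlapping patches along one side, and in the whole image.
patches_per_side = image_size // patch_size
num_patches = patches_per_side**2

# Ratio between raw patch count and the default image sequence length.
merge_ratio = num_patches // image_seq_length

# Per-head width in the vision tower under the usual even split.
vision_head_dim = hidden_size // num_attention_heads

print(patches_per_side, num_patches, merge_ratio, vision_head_dim)
```

Because process_image_mode defaults to 'dynamic_res', the actual number of image tokens per input may differ from this fixed-resolution figure.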
nemo_automodel.components.models.minimax_m3_vl.config._json_safe_value(
value: typing.Any
) -> typing.Any

Convert config values that are valid in-memory but not JSON serializable.
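This page does not show the implementation of _json_safe_value, so the following is only a sketch of what a helper with this description could look like. The function name `json_safe_value_sketch` and the set of handled types (tuples, sets, enums, paths, arbitrary objects via str) are assumptions chosen for illustration.

```python
import json
from enum import Enum
from pathlib import Path
from typing import Any


def json_safe_value_sketch(value: Any) -> Any:
    """Recursively convert a value into something json.dumps accepts (hypothetical)."""
    if isinstance(value, dict):
        return {str(k): json_safe_value_sketch(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe_value_sketch(v) for v in value]
    if isinstance(value, (set, frozenset)):
        # Sort by repr so the output is deterministic even for mixed types.
        return [json_safe_value_sketch(v) for v in sorted(value, key=repr)]
    if isinstance(value, Enum):
        return json_safe_value_sketch(value.value)
    if isinstance(value, Path):
        return str(value)
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Fallback for objects such as dtypes: store their string form.
    return str(value)


raw = {"a": (1, 2), "p": Path("x"), "s": {3, 1}}
print(json.dumps(json_safe_value_sketch(raw)))
```

A helper like this is typically applied inside to_dict() so that the result can be passed directly to json.dumps when a config is saved.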