
# nemo_automodel.components.models.minimax_m3_vl.config

Typed configuration classes for the MiniMax M3 VL family.

The released checkpoint ships `configuration_minimax_m3_vl.py`, which coerces
the `vision_config` and `text_config` sub-dicts into generic `PretrainedConfig`
instances (the text backbone's `model_type="minimax_m2"` is not in Hugging
Face's `CONFIG_MAPPING`). The native AutoModel implementation instead declares
typed sub-configs, so that fields such as `sparse_attention_config`,
`moe_layer_freq`, and the SwiGLU-OAI parameters (`swiglu_alpha`,
`swiglu_limit`) are real attributes with defaults.

This module mirrors the canonical sglang reference `sglang.srt.configs.minimax_vl`
and the field set in the checkpoint's `config.json`; keep all three in sync.
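
The difference between a generic and a typed sub-config can be illustrated with a small standard-library sketch. `TextConfigSketch` and `TopLevelSketch` are hypothetical stand-ins, not the classes in this module; the default values are copied from the signatures on this page, and the dict-to-typed coercion is only an assumption about how such a promotion might be done.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Union


@dataclass
class TextConfigSketch:
    """Hypothetical stand-in for a typed text sub-config."""

    hidden_size: int = 6144
    swiglu_alpha: float = 1.702
    swiglu_limit: float = 7.0
    moe_layer_freq: Optional[list] = None
    sparse_attention_config: Optional[dict] = None


class TopLevelSketch:
    """Hypothetical stand-in for a top-level config owning a typed sub-config."""

    def __init__(self, text_config: Union[dict, TextConfigSketch, None] = None):
        if text_config is None:
            # No sub-config given: fall back to all defaults.
            text_config = TextConfigSketch()
        elif isinstance(text_config, dict):
            # A plain dict (e.g. parsed from JSON) is promoted to the typed class.
            text_config = TextConfigSketch(**text_config)
        self.text_config = text_config

    def to_dict(self) -> dict:
        # Serialize the nested sub-config back into plain dicts.
        return {"text_config": asdict(self.text_config)}


cfg = TopLevelSketch(text_config={"hidden_size": 4096})
print(cfg.text_config.hidden_size)   # value supplied in the dict
print(cfg.text_config.swiglu_limit)  # defaulted attribute, always present
```

In this sketch an unknown key in the dict raises `TypeError`; a real configuration class would more likely route unknown keys through its keyword arguments.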

## Module Contents

### Classes

| Name                                                                                                        | Description                                                                     |
| ----------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| [`MiniMaxM3VLConfig`](#nemo_automodel-components-models-minimax_m3_vl-config-MiniMaxM3VLConfig)             | Top-level configuration for MiniMax M3 vision-language checkpoints.             |
| [`MiniMaxM3VLTextConfig`](#nemo_automodel-components-models-minimax_m3_vl-config-MiniMaxM3VLTextConfig)     | Configuration for the MiniMax M3 (mixed sparse/dense MoE) text backbone.        |
| [`MiniMaxM3VLVisionConfig`](#nemo_automodel-components-models-minimax_m3_vl-config-MiniMaxM3VLVisionConfig) | Configuration for the MiniMax M3 VL CLIP-style vision tower (Conv3d + 3D RoPE). |

### Functions

| Name                                                                                          | Description                                                               |
| --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------- |
| [`_json_safe_value`](#nemo_automodel-components-models-minimax_m3_vl-config-_json_safe_value) | Convert config values that are valid in-memory but not JSON serializable. |

### API

```python
class nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLConfig(
    vision_config: typing.Optional[typing.Union[dict, nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLVisionConfig]] = None,
    text_config: typing.Optional[typing.Union[dict, nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLTextConfig]] = None,
    image_token_index: int = 200025,
    video_token_index: int = 200026,
    image_seq_length: int = 576,
    process_image_mode: str = 'dynamic_res',
    projector_hidden_act: str = 'gelu',
    projector_hidden_size: int = 6144,
    multimodal_projector_bias: bool = True,
    patch_merge_bias: bool = True,
    vision_feature_layer: int = -1,
    vision_feature_select_strategy: str = 'full',
    img_token_compression_config: typing.Optional[dict] = None,
    image_grid_pinpoints: typing.Optional[str] = None,
    **kwargs
)
```

**Bases:** `PretrainedConfig`

Top-level configuration for MiniMax M3 vision-language checkpoints.

```python
nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLConfig.to_dict()
```

```python
class nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLTextConfig(
    hidden_size: int = 6144,
    intermediate_size: int = 3072,
    dense_intermediate_size: int = 12288,
    shared_intermediate_size: int = 3072,
    num_hidden_layers: int = 60,
    num_attention_heads: int = 64,
    num_key_value_heads: int = 4,
    head_dim: int = 128,
    vocab_size: int = 200064,
    max_position_embeddings: int = 524288,
    rms_norm_eps: float = 1e-06,
    use_gemma_norm: bool = True,
    attention_output_gate: bool = False,
    rope_theta: float = 5000000.0,
    rotary_dim: int = 64,
    partial_rotary_factor: float = 0.5,
    hidden_act: str = 'swigluoai',
    use_qk_norm: bool = True,
    qk_norm_type: str = 'per_head',
    tie_word_embeddings: bool = False,
    num_local_experts: int = 128,
    num_experts_per_tok: int = 4,
    n_shared_experts: int = 1,
    scoring_func: str = 'sigmoid',
    use_routing_bias: bool = True,
    routed_scaling_factor: float = 2.0,
    moe_layer_freq: typing.Optional[list[int]] = None,
    swiglu_alpha: float = 1.702,
    swiglu_limit: float = 7.0,
    sparse_attention_config: typing.Optional[dict] = None,
    num_mtp_modules: int = 1,
    pad_token_id: typing.Optional[int] = None,
    **kwargs
)
```

**Bases:** `PretrainedConfig`

Configuration for the MiniMax M3 (mixed sparse/dense MoE) text backbone.
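
A few quantities follow directly from the defaults in the signature above. The sketch below only does arithmetic on those defaults; reading them as projection widths, grouped-query attention, and expert sparsity is the conventional interpretation of these field names and is an assumption, not something this page states.

```python
# Default values copied from the MiniMaxM3VLTextConfig signature above.
hidden_size = 6144
num_attention_heads = 64
num_key_value_heads = 4
head_dim = 128
rotary_dim = 64
partial_rotary_factor = 0.5
num_local_experts = 128
num_experts_per_tok = 4

# Combined width of all query heads and of all key/value heads.
q_width = num_attention_heads * head_dim    # 8192
kv_width = num_key_value_heads * head_dim   # 512

# Number of query heads sharing each key/value head.
queries_per_kv = num_attention_heads // num_key_value_heads  # 16

# rotary_dim agrees with head_dim scaled by partial_rotary_factor.
rotary_from_factor = int(head_dim * partial_rotary_factor)   # 64

# Fraction of routed experts selected per token.
active_fraction = num_experts_per_tok / num_local_experts    # 0.03125

print(q_width, kv_width, queries_per_kv, rotary_from_factor, active_fraction)
```

Note that `hidden_size / num_attention_heads` would be 96, not 128, so `head_dim` is an independent setting here rather than a derived one.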

```python
nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLTextConfig.to_dict()
```
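
The `hidden_act='swigluoai'`, `swiglu_alpha`, and `swiglu_limit` fields in the signature above suggest a clamped SwiGLU variant. The scalar sketch below assumes the formulation published with the gpt-oss reference implementation (gate clamped from above, linear branch clamped on both sides, then `gate * sigmoid(alpha * gate) * (up + 1)`); this page does not confirm that MiniMax M3 uses exactly this form, and `swiglu_oai_sketch` is a hypothetical name.

```python
import math


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def swiglu_oai_sketch(
    gate: float, up: float, alpha: float = 1.702, limit: float = 7.0
) -> float:
    """Clamped SwiGLU on a single (gate, up) pair; an assumed formulation."""
    gate = min(gate, limit)             # gate is clamped from above only
    up = max(-limit, min(up, limit))    # linear branch is clamped on both sides
    glu = gate * _sigmoid(alpha * gate)
    return glu * (up + 1.0)


print(swiglu_oai_sketch(0.5, 2.0))
print(swiglu_oai_sketch(100.0, 0.0) == swiglu_oai_sketch(7.0, 0.0))
```

The clamp bounds the magnitude of both branches, which is the role `swiglu_limit` plays under this assumption; `swiglu_alpha` scales the sigmoid's input.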

```python
class nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLVisionConfig(
    hidden_size: int = 1280,
    num_attention_heads: int = 16,
    num_hidden_layers: int = 32,
    intermediate_size: int = 5120,
    patch_size: int = 14,
    image_size: int = 672,
    projection_dim: int = 6144,
    num_channels: int = 3,
    position_embedding_type: str = 'rope',
    rope_mode: str = '3d',
    rope_theta: float = 10000.0,
    attention_dropout: float = 0.0,
    hidden_act: str = 'gelu',
    layer_norm_eps: float = 1e-05,
    img_token_compression_config: typing.Optional[dict] = None,
    vision_segment_max_frames: int = 4,
    **kwargs
)
```

**Bases:** `PretrainedConfig`

Configuration for the MiniMax M3 VL CLIP-style vision tower (Conv3d + 3D RoPE).
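
The vision defaults in the signature above imply a fixed patch grid at the nominal `image_size`. The arithmetic below is a sketch on those defaults only; how the tower handles other resolutions (for example under `process_image_mode='dynamic_res'`) and whether it derives the per-head width this way are assumptions, not documented here.

```python
# Defaults copied from the MiniMaxM3VLVisionConfig signature above.
patch_size = 14
image_size = 672
hidden_size = 1280
num_attention_heads = 16

# Default copied from the top-level MiniMaxM3VLConfig signature.
image_seq_length = 576

patches_per_side = image_size // patch_size        # 48
patches_per_frame = patches_per_side ** 2          # 2304
head_width = hidden_size // num_attention_heads    # 80

# Ratio between raw patches and the top-level image_seq_length default.
patch_to_token_ratio = patches_per_frame / image_seq_length  # 4.0

print(patches_per_side, patches_per_frame, head_width, patch_to_token_ratio)
```

A ratio of 4 would be consistent with a 2x2 spatial merge of patches before projection, but this page does not state how `image_seq_length` is obtained.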

```python
nemo_automodel.components.models.minimax_m3_vl.config.MiniMaxM3VLVisionConfig.to_dict()
```

```python
nemo_automodel.components.models.minimax_m3_vl.config._json_safe_value(
    value: typing.Any
) -> typing.Any
```

Convert config values that are valid in-memory but not JSON serializable.
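
This page does not say which value types the helper handles. As a rough illustration of the general technique, the sketch below recursively rewrites common in-memory-only values into JSON-compatible ones. `json_safe_sketch` is a hypothetical function, and the set of handled types (tuples, sets, enums, and a string fallback) is an assumption; the actual `_json_safe_value` may behave differently.

```python
import json
from enum import Enum
from pathlib import Path
from typing import Any


def json_safe_sketch(value: Any) -> Any:
    """Return a JSON-serializable copy of `value` (illustrative only)."""
    if isinstance(value, dict):
        # JSON object keys must be strings.
        return {str(k): json_safe_sketch(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [json_safe_sketch(v) for v in value]
    if isinstance(value, (set, frozenset)):
        # Sets have no JSON form; emit a deterministically ordered list.
        return sorted((json_safe_sketch(v) for v in value), key=repr)
    if isinstance(value, Enum):
        return json_safe_sketch(value.value)
    if value is None or isinstance(value, (str, int, float, bool)):
        return value
    # Fallback for anything else (paths, dtype-like objects, ...).
    return str(value)


raw = {"grid": (2, 2), "modes": {"dynamic_res"}, "path": Path("ckpt")}
print(json.dumps(json_safe_sketch(raw)))
```

A helper of this kind would typically be applied inside a `to_dict()` override so that the resulting dictionary can be written out as `config.json`.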