# nemo_automodel.components.models.mistral4.model

## Module Contents

### Classes

| Name                                                                                                                    | Description                                                                             |
| ----------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
| [`Mistral3ForConditionalGeneration`](#nemo_automodel-components-models-mistral4-model-Mistral3ForConditionalGeneration) | Full multimodal Mistral 4: Pixtral vision + projector + Mistral4 MLA/MoE text backbone. |
| [`Mistral3Model`](#nemo_automodel-components-models-mistral4-model-Mistral3Model)                                       | VLM wrapper composing vision tower + projector + Mistral4 text backend.                 |
| [`Mistral4Block`](#nemo_automodel-components-models-mistral4-model-Mistral4Block)                                       | Block using Mistral4MLA instead of MLA.                                                 |
| [`Mistral4ForCausalLM`](#nemo_automodel-components-models-mistral4-model-Mistral4ForCausalLM)                           | -                                                                                       |
| [`Mistral4MLA`](#nemo_automodel-components-models-mistral4-model-Mistral4MLA)                                           | MLA with Llama 4 attention scaling for Mistral 4.                                       |
| [`Mistral4Model`](#nemo_automodel-components-models-mistral4-model-Mistral4Model)                                       | -                                                                                       |
| [`Mistral4TextModelBackend`](#nemo_automodel-components-models-mistral4-model-Mistral4TextModelBackend)                 | Backend-aware Mistral4 text model for use inside the multimodal wrapper.                |

### Functions

| Name                                                                                                  | Description                                                                                |
| ----------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| [`_build_moe_config`](#nemo_automodel-components-models-mistral4-model-_build_moe_config)             | Build MoEConfig from a Mistral4 text config.                                               |
| [`_get_llama_4_attn_scale`](#nemo_automodel-components-models-mistral4-model-_get_llama_4_attn_scale) | Position-dependent attention scaling for long-context extrapolation (Llama 4 / Mistral 4). |

### Data

[`ModelClass`](#nemo_automodel-components-models-mistral4-model-ModelClass)

[`_HF_MISTRAL3_AVAILABLE`](#nemo_automodel-components-models-mistral4-model-_HF_MISTRAL3_AVAILABLE)

### API

```python
class nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

Full multimodal Mistral 4: Pixtral vision + projector + Mistral4 MLA/MoE text backbone.

Follows the `KimiK25VLForConditionalGeneration` pattern: inherits from `nn.Module`
rather than the Hugging Face `PreTrainedModel` to avoid FSDP conflicts.

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.forward(
    input_ids: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    pixel_values: torch.Tensor | None = None,
    image_sizes: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    **kwargs: typing.Any
) -> torch.Tensor
```

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.from_config(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.get_input_embeddings()
```

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.get_output_embeddings()
```

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.set_output_embeddings(
    new_embeddings
)
```

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.supports_config(
    config
) -> bool
```

classmethod

Only handle configs whose text backbone is Mistral4 (MoE + MLA).
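
As a rough illustration of what a "MoE + MLA" gate on a config could look like, the sketch below inspects a nested text config for two markers. The helper name and the attribute names (`text_config`, `n_routed_experts`, `kv_lora_rank`) are assumptions chosen for the example; the actual criteria used by `supports_config` are not documented here and may differ.

```python
from types import SimpleNamespace


def looks_like_mistral4_text_backbone(config) -> bool:
    """Hypothetical check, not the library's logic."""
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        return False
    # Assumption: an MoE backbone exposes a routed-expert count.
    has_moe = getattr(text_config, "n_routed_experts", None) is not None
    # Assumption: an MLA backbone exposes a low-rank KV dimension.
    has_mla = getattr(text_config, "kv_lora_rank", None) is not None
    return has_moe and has_mla


# Toy configs standing in for real Hugging Face config objects.
mistral4_like = SimpleNamespace(
    text_config=SimpleNamespace(n_routed_experts=8, kv_lora_rank=256)
)
dense_like = SimpleNamespace(text_config=SimpleNamespace(hidden_size=4096))

print(looks_like_mistral4_text_backbone(mistral4_like))
print(looks_like_mistral4_text_backbone(dense_like))
```

A gate of this kind lets a model registry route dense Mistral 3 checkpoints elsewhere while this class handles only the MoE + MLA variant.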

```python
nemo_automodel.components.models.mistral4.model.Mistral3ForConditionalGeneration.update_moe_gate_bias() -> None
```

```python
class nemo_automodel.components.models.mistral4.model.Mistral3Model(
    config,
    vision_tower,
    multi_modal_projector,
    language_model
)
```

**Bases:** `Module`

VLM wrapper composing vision tower + projector + Mistral4 text backend.

Follows the `KimiK25VLModel` pattern: a plain `nn.Module` rather than a Hugging Face
`PreTrainedModel`, which avoids FSDP conflicts caused by `PreTrainedModel`'s module
registration hooks. The vision processing logic is replicated from the Hugging Face `Mistral3Model`.

```python
nemo_automodel.components.models.mistral4.model.Mistral3Model._get_image_features(
    pixel_values,
    image_sizes,
    vision_feature_layer = -1
)
```

Encode images through vision tower + projector (from HF Mistral3Model).

```python
nemo_automodel.components.models.mistral4.model.Mistral3Model.forward(
    input_ids = None,
    pixel_values = None,
    attention_mask = None,
    position_ids = None,
    past_key_values = None,
    inputs_embeds = None,
    image_sizes = None,
    padding_mask = None,
    **kwargs
)
```

```python
nemo_automodel.components.models.mistral4.model.Mistral3Model.get_input_embeddings()
```

```python
class nemo_automodel.components.models.mistral4.model.Mistral4Block(
    layer_idx,
    config,
    moe_config,
    backend
)
```

**Bases:** [Block](/nemo-automodel/nemo_automodel/components/models/deepseek_v3/model#nemo_automodel-components-models-deepseek_v3-model-Block)

Block using Mistral4MLA instead of MLA.

```python
class nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.forward(
    input_ids: torch.Tensor,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    **attn_kwargs: typing.Any
) -> transformers.modeling_outputs.CausalLMOutputWithPast
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.from_config(
    config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.get_input_embeddings()
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.get_output_embeddings()
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.set_output_embeddings(
    new_embeddings
)
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4ForCausalLM.update_moe_gate_bias() -> None
```

```python
class nemo_automodel.components.models.mistral4.model.Mistral4MLA(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** [MLA](/nemo-automodel/nemo_automodel/components/models/deepseek_v3/layers#nemo_automodel-components-models-deepseek_v3-layers-MLA)

MLA with Llama 4 attention scaling for Mistral 4.

Compared to the DeepSeek V3 MLA, this adds position-dependent scaling to `q_pe` after RoPE
(`llama_4_scaling_beta`). RoPE itself uses the same complex-number approach as DeepSeek V3.
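
To make the order of operations concrete, here is a minimal scalar sketch in plain Python: one pair of query features is treated as a complex number, rotated by a position-dependent angle (the complex-number form of RoPE), and then multiplied by a scale factor. The function names and the single-pair, single-frequency setup are simplifications assumed for illustration; the real module works on batched tensors and precomputed `freqs_cis`.

```python
import cmath
import math


def rotate_pair(x0: float, x1: float, position: int, theta: float) -> tuple[float, float]:
    """Rotate the feature pair (x0, x1) by angle position * theta via complex multiplication."""
    z = complex(x0, x1) * cmath.exp(1j * position * theta)
    return z.real, z.imag


def scaled_rope_pair(
    x0: float, x1: float, position: int, theta: float, scale: float
) -> tuple[float, float]:
    """Apply the rotation first, then the position-dependent scale (passed in by the caller)."""
    r0, r1 = rotate_pair(x0, x1, position, theta)
    return scale * r0, scale * r1


# The rotation preserves the pair's norm; the scale then stretches it.
rotated = rotate_pair(1.0, 0.0, position=1, theta=math.pi / 2)
scaled = scaled_rope_pair(1.0, 0.0, position=1, theta=math.pi / 2, scale=2.0)
print(math.hypot(*rotated), math.hypot(*scaled))
```

Because the rotation is norm-preserving, any change in the magnitude of the rotated query comes only from the scale factor.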

```python
nemo_automodel.components.models.mistral4.model.Mistral4MLA.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    **attn_kwargs: typing.Any
)
```

```python
class nemo_automodel.components.models.mistral4.model.Mistral4Model(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)
```

**Bases:** `Module`

```python
nemo_automodel.components.models.mistral4.model.Mistral4Model.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    **attn_kwargs: typing.Any
) -> torch.Tensor
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4Model.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4Model.update_moe_gate_bias() -> None
```

```python
class nemo_automodel.components.models.mistral4.model.Mistral4TextModelBackend(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)
```

**Bases:** `Module`

Backend-aware Mistral4 text model for use inside the multimodal wrapper.

Wraps `Mistral4Model` in `self.model` (as `KimiK25VLLanguageModelBackend` wraps
`DeepseekV3Model`). As a result, `embed_tokens`, `layers`, and `norm` are accessed through
`@property` aliases rather than as direct `nn.Module` children, which avoids
FSDP double root initialization when the parallelizer wraps both `embed_tokens` and
this module.
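
The aliasing pattern can be sketched without PyTorch. In the example below, plain Python objects stand in for `nn.Module` instances and the class names are hypothetical. The point carried over to PyTorch is that a class-level `@property` is never assigned on the instance, so the wrapper holds exactly one attribute (`model`) and the aliases only forward to it.

```python
class InnerTextModel:
    """Stand-in for the wrapped text model (hypothetical, not Mistral4Model itself)."""

    def __init__(self) -> None:
        self.embed_tokens = object()
        self.layers = [object(), object()]
        self.norm = object()


class TextBackendSketch:
    """Wrapper that owns only `model` and exposes the rest through property aliases."""

    def __init__(self) -> None:
        self.model = InnerTextModel()

    @property
    def embed_tokens(self):
        return self.model.embed_tokens

    @property
    def layers(self):
        return self.model.layers

    @property
    def norm(self):
        return self.model.norm


backend = TextBackendSketch()
# The alias resolves to the very same object held by the inner model.
print(backend.embed_tokens is backend.model.embed_tokens)
# Only `model` lives on the instance; the aliases are class-level properties.
print(list(vars(backend)))
```

With real `nn.Module` classes, the same layout means `embed_tokens` appears in the module tree only under `model`, not as a second direct child of the wrapper.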

```python
nemo_automodel.components.models.mistral4.model.Mistral4TextModelBackend.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    past_key_values = None,
    use_cache: bool | None = None,
    output_attentions: bool | None = None,
    output_hidden_states: bool | None = None,
    return_dict: bool | None = None,
    cache_position: torch.Tensor | None = None,
    **kwargs: typing.Any
) -> transformers.modeling_outputs.BaseModelOutputWithPast
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4TextModelBackend.get_input_embeddings()
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4TextModelBackend.init_weights(
    buffer_device: torch.device | None = None
)
```

```python
nemo_automodel.components.models.mistral4.model.Mistral4TextModelBackend.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.mistral4.model._build_moe_config(
    config,
    moe_overrides: dict | None = None
) -> nemo_automodel.components.moe.config.MoEConfig
```

Build MoEConfig from a Mistral4 text config.

```python
nemo_automodel.components.models.mistral4.model._get_llama_4_attn_scale(
    position_ids: torch.Tensor,
    beta: float,
    max_position_embeddings: int
) -> torch.Tensor
```

Position-dependent attention scaling for long-context extrapolation (Llama 4 / Mistral 4).
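
A scalar sketch of this kind of scaling is shown below. It assumes the form `1 + beta * log(1 + floor(position / max_position_embeddings))`, which is how Llama 4 style scaling is commonly written; the formula is an assumption about this function rather than something stated on this page, and the real function operates on a tensor of `position_ids` and may reshape its output for broadcasting.

```python
import math


def llama4_attn_scale(position: int, beta: float, max_position_embeddings: int) -> float:
    """Assumed formula: 1 + beta * log(1 + floor(position / max_position_embeddings))."""
    return 1.0 + beta * math.log(1.0 + math.floor(position / max_position_embeddings))


# Illustrative values only; beta and the window size are not taken from a real config.
for position in (0, 8191, 8192, 32768):
    print(position, llama4_attn_scale(position, beta=0.1, max_position_embeddings=8192))
```

Under this form the scale stays at exactly 1 for every position inside the original context window and grows logarithmically, in steps, once positions exceed it.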

```python
nemo_automodel.components.models.mistral4.model.ModelClass = Mistral4ForCausalLM
```

```python
nemo_automodel.components.models.mistral4.model._HF_MISTRAL3_AVAILABLE = True
```