bridge.models.ministral3.ministral3_provider#

Ministral 3 Model Provider configurations for Megatron-Core.

This module provides configuration classes for Ministral 3 models (3B, 8B, and 14B variants), compatible with HuggingFace’s Ministral-3 model configurations.

Reference: https://huggingface.co/mistralai/Ministral-3-3B-Base-2512

Ministral 3 Key Features:

  • Vision-language capabilities with separate language model and vision encoder

  • Large context window (up to 256k tokens)

  • Available in Base, Instruct, and Reasoning variants

  • Edge-optimized for deployment on various hardware
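A minimal usage sketch (not taken from this page): pick one of the variant providers listed below and use it as the model configuration. The import path mirrors the fully qualified names used on this page and is assumed to be importable as shown.

    # Hedged sketch: choose a Ministral 3 variant; each provider class is a
    # preconfigured model config (defaults are listed in the class entries below).
    from megatron.bridge.models.ministral3.ministral3_provider import (
        Ministral3ModelProvider3B,
        Ministral3ModelProvider8B,
        Ministral3ModelProvider14B,
    )

    provider = Ministral3ModelProvider8B()
    print(provider.hidden_size, provider.num_layers)  # 4096 34, per the 8B entry below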

Module Contents#

Classes#

Ministral3ModelProvider

Base model provider for Ministral 3 Vision-Language Models.

Ministral3ModelProvider3B

Config for Ministral 3 3B Vision-Language Model.

Ministral3ModelProvider8B

Config for Ministral 3 8B Vision-Language Model.

Ministral3ModelProvider14B

Config for Ministral 3 14B Vision-Language Model.

MinistralTEDotProductAttention

Implementation of the TEDotProductAttention mechanism for Ministral 3 (Mistral 3) models with Llama 4-style attention scaling.

Functions#

ministral_layer_spec

Layer spec for Ministral 3 models.

Data#

API#

bridge.models.ministral3.ministral3_provider.logger#

‘getLogger(…)’

bridge.models.ministral3.ministral3_provider.ministral_layer_spec(
config: megatron.bridge.models.gpt_provider.GPTModelProvider,
) → megatron.core.transformer.ModuleSpec#

Layer spec for Ministral 3 models.
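A hedged sketch of calling the layer spec builder; it assumes a Ministral provider instance is accepted where the GPTModelProvider-typed config parameter is expected (the providers below inherit from MistralModelProvider).

    # Sketch: build the transformer layer spec from a provider config.
    from megatron.bridge.models.ministral3.ministral3_provider import (
        Ministral3ModelProvider3B,
        ministral_layer_spec,
    )

    config = Ministral3ModelProvider3B()
    layer_spec = ministral_layer_spec(config)  # megatron.core.transformer.ModuleSpec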

class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider#

Bases: megatron.bridge.models.mistral.mistral_provider.MistralModelProvider

Base model provider for Ministral 3 Vision-Language Models.

Ministral 3 is a family of edge-optimized vision-language models combining a language model with a vision encoder for multimodal capabilities.

Reference:

  • https://huggingface.co/mistralai/Ministral-3-3B-Base-2512

  • https://huggingface.co/mistralai/Ministral-3-8B-Base-2512

  • https://huggingface.co/mistralai/Ministral-3-14B-Base-2512
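The provider exposes its settings as the fields listed below. Assuming it behaves like other dataclass-style providers in this package, individual defaults can be overridden at construction time; a hedged sketch:

    # Hedged sketch: override selected defaults by keyword (field names are taken
    # from the attribute list below; keyword construction is an assumption).
    from megatron.bridge.models.ministral3.ministral3_provider import Ministral3ModelProvider8B

    provider = Ministral3ModelProvider8B(
        seq_length=65536,        # default: 32768
        attention_dropout=0.1,   # default: 0.0
    )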

transformer_layer_spec: Union[megatron.core.transformer.ModuleSpec, Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.ModuleSpec]]#

None

normalization: str#

‘RMSNorm’

activation_func: Callable#

None

add_bias_linear: bool#

False

gated_linear_unit: bool#

True

num_attention_heads: int#

32

num_query_groups: int#

8

kv_channels: int#

128

seq_length: int#

32768

position_embedding_type: str#

‘yarn’

rotary_base: int#

1000000

yarn_rotary_scaling_factor: float#

16.0

yarn_original_max_position_embeddings: int#

16384

yarn_beta_fast: float#

32.0

yarn_beta_slow: float#

1.0

yarn_correction_range_round_to_int: bool#

False

yarn_mscale: Optional[float]#

1.0

yarn_mscale_all_dim: Optional[float]#

1.0
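As a consistency check (an interpretation, not stated in the source), the YaRN defaults above line up with the “up to 256k tokens” context window mentioned in the module overview: the original 16384-position window scaled by a factor of 16 gives 262144 positions.

    # 16384 original positions * 16.0 YaRN scaling factor = 262144 = 256 * 1024
    yarn_original_max_position_embeddings = 16384
    yarn_rotary_scaling_factor = 16.0
    print(int(yarn_original_max_position_embeddings * yarn_rotary_scaling_factor))  # 262144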

attention_dropout: float#

0.0

hidden_dropout: float#

0.0

share_embeddings_and_output_weights: bool#

False

init_method_std: float#

0.02

layernorm_epsilon: float#

1e-05

params_dtype: torch.dtype#

None

bf16: bool#

True

scatter_embedding_sequence_parallel: bool#

False

hf_config: Optional[transformers.models.mistral3.configuration_mistral3.Mistral3Config]#

None

image_token_id: int#

10

freeze_language_model: bool#

False

freeze_vision_model: bool#

False

freeze_vision_projection: bool#

False

provide(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.bridge.models.ministral3.modeling_ministral3.Ministral3Model#

Provide a Ministral3Model instance with vision and language components.

Parameters:
  • pre_process – Whether this is the first stage in pipeline parallelism

  • post_process – Whether this is the last stage in pipeline parallelism

  • vp_stage – Virtual pipeline stage number

Returns:

Ministral3Model instance with HF vision encoder and Megatron language model
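A hedged sketch for a single pipeline stage (no pipeline parallelism), in which case the one stage is both the first and the last; distributed and parallel-state initialization is omitted.

    # Sketch: build the full vision-language model on a single pipeline stage.
    provider = Ministral3ModelProvider8B()
    vlm = provider.provide(pre_process=True, post_process=True, vp_stage=None)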

provide_language_model(
pre_process=None,
post_process=None,
vp_stage=None,
) → megatron.core.models.gpt.GPTModel#

Provide just the language model component without vision.

Parameters:
  • pre_process – Whether this is the first stage in pipeline parallelism

  • post_process – Whether this is the last stage in pipeline parallelism

  • vp_stage – Virtual pipeline stage number

Returns:

MCoreGPTModel instance (language model only)
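For text-only use, the same pipeline flags apply; a hedged sketch that skips the vision encoder:

    # Sketch: language model only, single pipeline stage.
    provider = Ministral3ModelProvider8B()
    gpt_model = provider.provide_language_model(pre_process=True, post_process=True)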

class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider3B#

Bases: bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider

Config for Ministral 3 3B Vision-Language Model.

Reference: https://huggingface.co/mistralai/Ministral-3-3B-Base-2512

Model specs:

  • 3.4B Language Model + 0.4B Vision Encoder

hidden_size: int#

3072

ffn_hidden_size: int#

9216

num_layers: int#

26

share_embeddings_and_output_weights: bool#

True

class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider8B#

Bases: bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider

Config for Ministral 3 8B Vision-Language Model.

Reference: https://huggingface.co/mistralai/Ministral-3-8B-Base-2512

Model specs:

  • 8.4B Language Model + 0.4B Vision Encoder

hidden_size: int#

4096

ffn_hidden_size: int#

14336

num_layers: int#

34

class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider14B#

Bases: bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider

Config for Ministral 3 14B Vision-Language Model.

Reference: https://huggingface.co/mistralai/Ministral-3-14B-Base-2512

Model specs:

  • 13.5B Language Model + 0.4B Vision Encoder

hidden_size: int#

5120

ffn_hidden_size: int#

16384

num_layers: int#

40

rotary_base: int#

1000000000.0

class bridge.models.ministral3.ministral3_provider.MinistralTEDotProductAttention(
config,
layer_number: int,
attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
attention_type: str,
attention_dropout: Optional[float] = None,
**kwargs,
)#

Bases: megatron.core.extensions.transformer_engine.TEDotProductAttention

Implementation of the TEDotProductAttention mechanism for Ministral 3 (Mistral 3) models with Llama 4-style attention scaling.

This class extends MCoreTEDotProductAttention by introducing the Llama 4 attention scaling factor, which is essential for robust long-context training. During the forward pass, a position-dependent scaling (1 + beta * log(1 + floor(positions / max_position_embeddings))) is applied to the query vectors. This approach, introduced in Llama 4, helps maintain stability and performance as context length increases, enabling effective training and inference on extended sequences (e.g., up to 256k tokens).

Key difference from MCoreTEDotProductAttention:

  • Applies the Llama 4 scaling factor to the queries prior to the standard attention computation for improved long-context capability (see the sketch below).
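A minimal sketch of that scaling, written directly from the formula quoted above; the function name, tensor layout, and the broadcasting shown in the final comment are assumptions rather than details of this class's implementation.

    import torch

    def llama4_style_attn_scale(position_ids: torch.Tensor,
                                beta: float,
                                max_position_embeddings: int) -> torch.Tensor:
        # 1 + beta * log(1 + floor(positions / max_position_embeddings))
        return 1.0 + beta * torch.log1p(
            torch.floor(position_ids.float() / max_position_embeddings)
        )

    # Applied to the queries before the standard attention computation, e.g.:
    # query = query * llama4_style_attn_scale(position_ids, beta, max_pos).view(-1, 1, 1, 1)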

Initialization

_get_llama_4_attn_scale(
positions_ids: torch.Tensor,
beta: float,
max_position_embeddings: int,
) → torch.Tensor#

Compute the position-dependent Llama 4-style scaling factor, 1 + beta * log(1 + floor(positions / max_position_embeddings)), for the given position IDs.

forward(
query: torch.Tensor,
key: torch.Tensor,
value: torch.Tensor,
attention_mask: torch.Tensor,
attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
**kwargs,
)#

Apply the Llama 4-style scaling factor to the query tensor, then run the standard TEDotProductAttention forward pass.