bridge.models.ministral3.ministral3_provider#
Ministral 3 Model Provider configurations for Megatron-Core.
This module provides configuration classes for Ministral 3 models (3B, 8B, 14B variants), compatible with HuggingFace’s Ministral-3 model configurations.
Reference: https://huggingface.co/mistralai/Ministral-3-3B-Base-2512
Ministral 3 Key Features:
Vision-language capabilities with separate language model and vision encoder
Large context window (up to 256k tokens)
Available in Base, Instruct, and Reasoning variants
Edge-optimized for deployment on various hardware
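A minimal usage sketch. The import path is inferred from the fully qualified names used on this page, the attribute values in the comments come from the documentation below, and a real run additionally requires Megatron parallel state to be initialized first:

```python
# Sketch only: import path assumed from the fully qualified names on this page.
from megatron.bridge.models.ministral3.ministral3_provider import (
    Ministral3ModelProvider3B,
    Ministral3ModelProvider8B,
    Ministral3ModelProvider14B,
)

# Each provider is a configuration object whose defaults mirror the
# attribute values documented below (RMSNorm, GQA with 8 query groups, YaRN RoPE, ...).
provider = Ministral3ModelProvider8B()
print(provider.num_layers, provider.num_query_groups)  # 34, 8 per the attributes below
```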
Module Contents#
Classes#
Ministral3ModelProvider: Base model provider for Ministral 3 Vision-Language Models.
Ministral3ModelProvider3B: Config for Ministral 3 3B Vision-Language Model.
Ministral3ModelProvider8B: Config for Ministral 3 8B Vision-Language Model.
Ministral3ModelProvider14B: Config for Ministral 3 14B Vision-Language Model.
MinistralTEDotProductAttention: Implementation of the TEDotProductAttention mechanism for Ministral (Mistral) 3 models with Llama 4-style attention scaling.
Functions#
ministral_layer_spec: Layer spec for Ministral 3 models.
Data#
logger
API#
- bridge.models.ministral3.ministral3_provider.logger#
‘getLogger(…)’
- bridge.models.ministral3.ministral3_provider.ministral_layer_spec(config: megatron.bridge.models.gpt_provider.GPTModelProvider)#
Layer spec for Ministral 3 models.
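Since Ministral3ModelProvider.transformer_layer_spec (documented below) accepts either a ModuleSpec or a callable taking a GPTModelProvider, a plausible, illustrative use is to pass this function as that callable, assuming the provider is a dataclass that accepts field overrides at construction:

```python
from megatron.bridge.models.ministral3.ministral3_provider import (
    Ministral3ModelProvider8B,
    ministral_layer_spec,
)

# Illustrative only: ministral_layer_spec matches the documented callable form
# Callable[[GPTModelProvider], ModuleSpec] of transformer_layer_spec, so it can
# be supplied as the layer spec factory for the provider.
provider = Ministral3ModelProvider8B(transformer_layer_spec=ministral_layer_spec)
```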
- class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider#
Bases:
megatron.bridge.models.mistral.mistral_provider.MistralModelProvider
Base model provider for Ministral 3 Vision-Language Models.
Ministral 3 is a family of edge-optimized vision-language models combining a language model with a vision encoder for multimodal capabilities.
Reference:
https://huggingface.co/mistralai/Ministral-3-3B-Base-2512
https://huggingface.co/mistralai/Ministral-3-8B-Base-2512
https://huggingface.co/mistralai/Ministral-3-14B-Base-2512
- transformer_layer_spec: Union[megatron.core.transformer.ModuleSpec, Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.ModuleSpec]]#
None
- normalization: str#
‘RMSNorm’
- activation_func: Callable#
None
- add_bias_linear: bool#
False
- gated_linear_unit: bool#
True
- num_attention_heads: int#
32
- num_query_groups: int#
8
- kv_channels: int#
128
- seq_length: int#
32768
- position_embedding_type: str#
‘yarn’
- rotary_base: int#
1000000
- yarn_rotary_scaling_factor: float#
16.0
- yarn_original_max_position_embeddings: int#
16384
- yarn_beta_fast: float#
32.0
- yarn_beta_slow: float#
1.0
- yarn_correction_range_round_to_int: bool#
False
- yarn_mscale: Optional[float]#
1.0
- yarn_mscale_all_dim: Optional[float]#
1.0
- attention_dropout: float#
0.0
- hidden_dropout: float#
0.0
- share_embeddings_and_output_weights: bool#
False
- init_method_std: float#
0.02
- layernorm_epsilon: float#
1e-05
- params_dtype: torch.dtype#
None
- bf16: bool#
True
- scatter_embedding_sequence_parallel: bool#
False
- hf_config: Optional[transformers.models.mistral3.configuration_mistral3.Mistral3Config]#
None
- image_token_id: int#
10
- freeze_language_model: bool#
False
- freeze_vision_model: bool#
False
- freeze_vision_projection: bool#
False
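As a quick sanity check, the YaRN defaults above line up with the 256k-token context window mentioned at the top of this page, assuming the usual YaRN convention that the scaling factor multiplies the original maximum position embeddings:

```python
# Arithmetic only, no API calls: the YaRN context extension implied by the defaults above.
yarn_original_max_position_embeddings = 16384
yarn_rotary_scaling_factor = 16.0

extended_context = int(yarn_original_max_position_embeddings * yarn_rotary_scaling_factor)
print(extended_context)  # 262144 tokens, i.e. the "up to 256k" context window
```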
- provide(pre_process=None, post_process=None, vp_stage=None)#
Provide a Ministral3Model instance with vision and language components.
- Parameters:
pre_process – Whether this is the first stage in pipeline parallelism
post_process – Whether this is the last stage in pipeline parallelism
vp_stage – Virtual pipeline stage number
- Returns:
Ministral3Model instance with HF vision encoder and Megatron language model
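A hedged sketch of building the full vision-language model on a single pipeline stage; Megatron parallel state is assumed to be initialized beforehand:

```python
from megatron.bridge.models.ministral3.ministral3_provider import Ministral3ModelProvider3B

provider = Ministral3ModelProvider3B()

# Single pipeline stage: this rank is both the first stage (pre_process builds the
# embedding) and the last stage (post_process builds the output head); vp_stage is
# only relevant with virtual pipeline parallelism.
vlm = provider.provide(pre_process=True, post_process=True, vp_stage=None)
```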
- provide_language_model(pre_process=None, post_process=None, vp_stage=None)#
Provide just the language model component without vision.
- Parameters:
pre_process – Whether this is the first stage in pipeline parallelism
post_process – Whether this is the last stage in pipeline parallelism
vp_stage – Virtual pipeline stage number
- Returns:
MCoreGPTModel instance (language model only)
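A matching sketch for the text-only path, which skips the HF vision encoder and returns only the Megatron-Core language model (same assumptions as above):

```python
from megatron.bridge.models.ministral3.ministral3_provider import Ministral3ModelProvider3B

provider = Ministral3ModelProvider3B()

# Same pipeline flags as provide(), but only the GPT language model is built.
lm = provider.provide_language_model(pre_process=True, post_process=True, vp_stage=None)
```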
- class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider3B#
Bases:
bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider
Config for Ministral 3 3B Vision-Language Model.
Reference: https://huggingface.co/mistralai/Ministral-3-3B-Base-2512
Model specs:
3.4B Language Model + 0.4B Vision Encoder
- hidden_size: int#
3072
- ffn_hidden_size: int#
9216
- num_layers: int#
26
True
- class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider8B#
Bases:
bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider
Config for Ministral 3 8B Vision-Language Model.
Reference: https://huggingface.co/mistralai/Ministral-3-8B-Base-2512
Model specs:
8.4B Language Model + 0.4B Vision Encoder
- hidden_size: int#
4096
- ffn_hidden_size: int#
14336
- num_layers: int#
34
- class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider14B#
Bases:
bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider
Config for Ministral 3 14B Vision-Language Model.
Reference: https://huggingface.co/mistralai/Ministral-3-14B-Base-2512
Model specs:
13.5B Language Model + 0.4B Vision Encoder
- hidden_size: int#
5120
- ffn_hidden_size: int#
16384
- num_layers: int#
40
- rotary_base: int#
1000000000.0
- class bridge.models.ministral3.ministral3_provider.MinistralTEDotProductAttention(
- config,
- layer_number: int,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- attention_type: str,
- attention_dropout: Optional[float] = None,
- **kwargs,
)#
Bases:
megatron.core.extensions.transformer_engine.TEDotProductAttention
Implementation of the TEDotProductAttention mechanism for Ministral (Mistral) 3 models with Llama 4-style attention scaling.
This class extends MCoreTEDotProductAttention by introducing the Llama 4 attention scaling factor, which is essential for robust long-context training. During the forward pass, a position-dependent scaling (1 + beta * log(1 + floor(positions / max_position_embeddings))) is applied to the query vectors. This approach, introduced in Llama 4, helps maintain stability and performance as context length increases, enabling effective training and inference on extended sequences (e.g., up to 256k tokens).
Key difference from MCoreTEDotProductAttention:
Applies the Llama 4 scaling factor to the queries prior to standard attention computation for improved long-context capability.
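A minimal sketch of the scaling described above; the function and variable names are illustrative (the class computes this internally via _get_llama_4_attn_scale) and beta is a placeholder value:

```python
import torch


def llama4_attn_scale(position_ids: torch.Tensor, beta: float, max_position_embeddings: int) -> torch.Tensor:
    # scale = 1 + beta * log(1 + floor(positions / max_position_embeddings))
    # Grows slowly with absolute position, keeping attention well-behaved at long context.
    return 1.0 + beta * torch.log(1.0 + torch.floor(position_ids.float() / max_position_embeddings))


positions = torch.arange(65536)
scale = llama4_attn_scale(positions, beta=0.1, max_position_embeddings=16384)
# The scale is applied to the query vectors before standard attention, e.g.
# query = query * scale.unsqueeze(-1) (exact shape handling depends on the query layout).
```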
Initialization
- _get_llama_4_attn_scale(positions_ids: torch.Tensor, beta: float, max_position_embeddings: int)#
- forward(
- query: torch.Tensor,
- key: torch.Tensor,
- value: torch.Tensor,
- attention_mask: torch.Tensor,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- **kwargs,
)#