bridge.models.ministral3.ministral3_provider#
Ministral 3 Model Provider configuration for Megatron-Core.
This module provides a provider class for Ministral 3 models, compatible with HuggingFace’s Ministral-3 model configurations.
Reference: https://huggingface.co/mistralai/Ministral-3-3B-Base-2512
Ministral 3 Key Features:
Vision-language capabilities with separate language model and vision encoder
Large context window (up to 256k tokens)
Available in Base, Instruct, and Reasoning variants
Edge-optimized for deployment on various hardware
Module Contents#
Classes#
Ministral3ModelProvider | Base model provider for Ministral 3 Vision-Language Models.
MinistralTEDotProductAttention | Implementation of the TEDotProductAttention mechanism for Ministral (Mistral) 3 models with Llama 4-style attention scaling.
Functions#
ministral_layer_spec | Layer spec for Ministral 3 models.
Data#
API#
- bridge.models.ministral3.ministral3_provider.logger#
‘getLogger(…)’
- bridge.models.ministral3.ministral3_provider.ministral_layer_spec(
- config: megatron.bridge.models.gpt_provider.GPTModelProvider,
- )
Layer spec for Ministral 3 models.
- class bridge.models.ministral3.ministral3_provider.Ministral3ModelProvider#
Bases:
megatron.bridge.models.mistral.mistral_provider.MistralModelProvider
Base model provider for Ministral 3 Vision-Language Models.
Ministral 3 is a family of edge-optimized vision-language models combining a language model with a vision encoder for multimodal capabilities.
Reference:
https://huggingface.co/mistralai/Ministral-3-3B-Base-2512
https://huggingface.co/mistralai/Ministral-3-8B-Base-2512
https://huggingface.co/mistralai/Ministral-3-14B-Base-2512
- transformer_layer_spec: Union[megatron.core.transformer.ModuleSpec, Callable[[megatron.bridge.models.gpt_provider.GPTModelProvider], megatron.core.transformer.ModuleSpec]]#
None
- normalization: str#
‘RMSNorm’
- activation_func: Callable#
None
- add_bias_linear: bool#
False
- gated_linear_unit: bool#
True
- num_attention_heads: int#
32
- num_query_groups: int#
8
- kv_channels: int#
128
- seq_length: int#
32768
- position_embedding_type: str#
‘yarn’
- rotary_base: int#
1000000
- yarn_rotary_scaling_factor: float#
16.0
- yarn_original_max_position_embeddings: int#
16384
- yarn_beta_fast: float#
32.0
- yarn_beta_slow: float#
1.0
- yarn_correction_range_round_to_int: bool#
False
- yarn_mscale: Optional[float]#
1.0
- yarn_mscale_all_dim: Optional[float]#
1.0
- attention_dropout: float#
0.0
0.0
False
- init_method_std: float#
0.02
- layernorm_epsilon: float#
1e-05
- params_dtype: torch.dtype#
None
- bf16: bool#
True
- scatter_embedding_sequence_parallel: bool#
False
- hf_config: Optional[transformers.models.mistral3.configuration_mistral3.Mistral3Config]#
None
- image_token_id: int#
10
- freeze_language_model: bool#
False
- freeze_vision_model: bool#
False
- freeze_vision_projection: bool#
False
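The defaults listed above imply a few derived quantities that are easy to check by hand. The sketch below copies a handful of the values into a hypothetical `Ministral3Defaults` dataclass (it is not part of the module) and computes the grouped-query attention ratio, the projection widths, and the YaRN-extended context length. It assumes the usual YaRN convention that the target context is the original maximum position count multiplied by the scaling factor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ministral3Defaults:
    """Hypothetical stand-in holding a few defaults copied from the listing above."""

    num_attention_heads: int = 32
    num_query_groups: int = 8
    kv_channels: int = 128
    yarn_rotary_scaling_factor: float = 16.0
    yarn_original_max_position_embeddings: int = 16384

    @property
    def query_heads_per_group(self) -> int:
        # Grouped-query attention: each key/value group serves this many query heads.
        return self.num_attention_heads // self.num_query_groups

    @property
    def query_projection_size(self) -> int:
        # Total width of the query projection: heads * per-head channels.
        return self.num_attention_heads * self.kv_channels

    @property
    def kv_projection_size(self) -> int:
        # Width of the key (or value) projection: groups * per-head channels.
        return self.num_query_groups * self.kv_channels

    @property
    def yarn_extended_context(self) -> int:
        # Assumption: YaRN target context = original length * scaling factor.
        return int(
            self.yarn_original_max_position_embeddings
            * self.yarn_rotary_scaling_factor
        )


defaults = Ministral3Defaults()
print(defaults.query_heads_per_group)  # 4
print(defaults.query_projection_size)  # 4096
print(defaults.kv_projection_size)     # 1024
print(defaults.yarn_extended_context)  # 262144, i.e. 256k tokens
```

Under that assumption, 16384 × 16 = 262144 positions, which matches the 256k-token context window mentioned in the module overview.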
- provide(
- pre_process=None,
- post_process=None,
- vp_stage=None,
- )
Provide a Ministral3Model instance with vision and language components.
- Parameters:
pre_process – Whether this is the first stage in pipeline parallelism
post_process – Whether this is the last stage in pipeline parallelism
vp_stage – Virtual pipeline stage number
- Returns:
Ministral3Model instance with HF vision encoder and Megatron language model
- provide_language_model(
- pre_process=None,
- post_process=None,
- vp_stage=None,
- )
Provide just the language model component without vision.
- Parameters:
pre_process – Whether this is the first stage in pipeline parallelism
post_process – Whether this is the last stage in pipeline parallelism
vp_stage – Virtual pipeline stage number
- Returns:
MCoreGPTModel instance (language model only)
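To make the `pre_process` and `post_process` parameters concrete, here is a minimal sketch of how the two flags map onto pipeline stages, based only on the parameter descriptions above (first stage and last stage, respectively). The helper `stage_flags` is hypothetical and not part of this module. Both parameters default to `None` in the signatures, and this page does not state how `None` is resolved.

```python
from typing import Tuple


def stage_flags(pp_rank: int, pp_size: int) -> Tuple[bool, bool]:
    """Return (pre_process, post_process) for one pipeline stage.

    Hypothetical helper: pre_process is True only on the first stage,
    post_process only on the last stage.
    """
    if not 0 <= pp_rank < pp_size:
        raise ValueError("pp_rank must satisfy 0 <= pp_rank < pp_size")
    pre_process = pp_rank == 0
    post_process = pp_rank == pp_size - 1
    return pre_process, post_process


# Flags for each stage of a 4-stage pipeline.
for rank in range(4):
    print(rank, stage_flags(rank, 4))
```

With a single stage (`pp_size=1`) both flags are `True`, since that stage is both first and last.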
- class bridge.models.ministral3.ministral3_provider.MinistralTEDotProductAttention(
- config,
- layer_number: int,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- attention_type: str,
- attention_dropout: Optional[float] = None,
- **kwargs,
- )
Bases:
megatron.core.extensions.transformer_engine.TEDotProductAttention
Implementation of the TEDotProductAttention mechanism for Ministral (Mistral) 3 models with Llama 4-style attention scaling.
This class extends MCoreTEDotProductAttention by introducing the Llama 4 attention scaling factor, which is essential for robust long-context training. During the forward pass, a position-dependent scaling (1 + beta * log(1 + floor(positions / max_position_embeddings))) is applied to the query vectors. This approach, introduced in Llama 4, helps maintain stability and performance as context length increases, enabling effective training and inference on extended sequences (e.g., up to 256k tokens).
Key difference from MCoreTEDotProductAttention:
Applies the Llama 4 scaling factor to the queries prior to standard attention computation for improved long-context capability.
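The scaling formula quoted above can be illustrated with a small, framework-free sketch. The function `llama_4_attn_scale` below is a hypothetical pure-Python analogue, not the module's `_get_llama_4_attn_scale`: it returns one scale per position and omits the reshaping to the query tensor's shape that the real method presumably performs via `query_shape`. The `beta` and `max_position_embeddings` values in the usage lines are illustrative assumptions, not documented defaults.

```python
import math
from typing import List, Sequence


def llama_4_attn_scale(
    positions: Sequence[int],
    beta: float,
    max_position_embeddings: int,
) -> List[float]:
    """Per-position query scale: 1 + beta * log(1 + floor(pos / max_pos))."""
    return [
        1.0 + beta * math.log(1 + pos // max_position_embeddings)
        for pos in positions
    ]


# Illustrative values only (assumptions, not taken from the module).
positions = [0, 16383, 16384, 32768, 262143]
scales = llama_4_attn_scale(positions, beta=0.1, max_position_embeddings=16384)
for pos, scale in zip(positions, scales):
    print(pos, round(scale, 4))
```

Positions below `max_position_embeddings` get a scale of exactly 1, so queries in the original context range are left unchanged. The scale then grows in logarithmic steps each time the position crosses another multiple of `max_position_embeddings`.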
Initialization
- static _get_llama_4_attn_scale(
- positions_ids: torch.Tensor,
- beta: float,
- max_position_embeddings: int,
- query_shape: tuple,
- )
- forward(
- query: torch.Tensor,
- key: torch.Tensor,
- value: torch.Tensor,
- attention_mask: torch.Tensor,
- attn_mask_type: megatron.core.transformer.enums.AttnMaskType,
- **kwargs,
- )