core.models.mamba.mamba_model#
Module Contents#
Classes#
MambaModel | Mamba language model.
Data#
API#
- core.models.mamba.mamba_model.logger#
'getLogger(…)'
- class core.models.mamba.mamba_model.MambaModel(
- config: megatron.core.transformer.TransformerConfig,
- mamba_stack_spec: megatron.core.transformer.spec_utils.ModuleSpec,
- vocab_size: int,
- max_sequence_length: int,
- hybrid_layer_pattern: Optional[str] = None,
- hybrid_attention_ratio: Optional[float] = None,
- hybrid_mlp_ratio: Optional[float] = None,
- hybrid_override_pattern: Optional[str] = None,
- pre_process: bool = True,
- post_process: bool = True,
- fp16_lm_cross_entropy: bool = False,
- parallel_output: bool = True,
- share_embeddings_and_output_weights: bool = False,
- position_embedding_type: Literal['learned_absolute', 'rope', 'none'] = 'none',
- rotary_percent: float = 1.0,
- rotary_base: int = 10000,
- scatter_embedding_sequence_parallel: bool = True,
- seq_len_interpolation_factor: Optional[float] = None,
- pg_collection: Optional[megatron.core.process_groups_config.ProcessGroupCollection] = None,
- vp_stage: Optional[int] = None,
- )
Bases:
megatron.core.models.common.language_module.language_module.LanguageModule

Mamba language model.
- Parameters:
config (TransformerConfig) – Model config
mamba_stack_spec (ModuleSpec) – Specifies the modules to use for the various layer types
vocab_size (int) – Vocabulary size
max_sequence_length (int) – Maximum sequence length; used for positional embeddings.
hybrid_layer_pattern (str) –
Unified hybrid layer pattern with optional MTP and pipeline stage boundaries. Format: “<main_pattern>/<mtp_pattern>/<mtp_pattern>/…” The main pattern may contain “|” to define pipeline stage boundaries. .. rubric:: Examples
”MM” -> main decoder only, no MTP
”MM/MM/MM” -> main=”MM”, mtp=”MM”, 2 depths
”M-M-|M-M*-|M-M-|M-M*-” -> 4 pipeline segments
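The pattern format above can be illustrated with a small parser. This is a hypothetical helper written for this documentation (`parse_hybrid_layer_pattern` is not a Megatron-Core function); it only demonstrates how the "/" and "|" separators decompose a pattern string:

```python
def parse_hybrid_layer_pattern(pattern: str) -> dict:
    """Illustrative parser for the hybrid_layer_pattern format
    (hypothetical helper, not the Megatron-Core implementation).

    "<main>/<mtp>/<mtp>/..." with optional "|" pipeline stage
    boundaries inside the main pattern.
    """
    main, *mtp_patterns = pattern.split("/")
    # "|" marks pipeline stage boundaries; drop empties from a trailing "|".
    pipeline_segments = [s for s in main.split("|") if s]
    return {
        "main": main.replace("|", ""),
        "pipeline_segments": pipeline_segments,
        "mtp_patterns": mtp_patterns,
        "mtp_depth": len(mtp_patterns),
    }

print(parse_hybrid_layer_pattern("MM"))
# main decoder only, no MTP
print(parse_hybrid_layer_pattern("MM/MM/MM"))
# main="MM", two MTP depths of pattern "MM"
print(parse_hybrid_layer_pattern("M-M-|M-M*-|M-M-|M-M*-"))
# four pipeline segments
```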
hybrid_attention_ratio (float, optional) – Deprecated. Use hybrid_layer_pattern instead. If set to a value > 0.0 and hybrid_layer_pattern is None, a pattern will be generated from the ratio with a deprecation warning.
hybrid_mlp_ratio (float, optional) – Deprecated. Use hybrid_layer_pattern instead. If set to a value > 0.0 and hybrid_layer_pattern is None, a pattern will be generated from the ratio with a deprecation warning.
hybrid_override_pattern (str, optional) – Deprecated. Use hybrid_layer_pattern instead. If set and hybrid_layer_pattern is None, the value is copied to hybrid_layer_pattern with a deprecation warning.
pre_process (bool, optional) – Include embedding layer (used with pipeline parallelism). Defaults to True.
post_process (bool, optional) – Include an output layer (used with pipeline parallelism). Defaults to True.
fp16_lm_cross_entropy (bool, optional) – If True, compute the unreduced language-model cross-entropy loss in fp16. Defaults to False.
parallel_output (bool, optional) – Do not gather the outputs, keep them split across tensor parallel ranks. Defaults to True.
share_embeddings_and_output_weights (bool, optional) – When True, input embeddings and output logit weights are shared. Defaults to False.
position_embedding_type (Literal['learned_absolute', 'rope', 'none'], optional) – Position embedding type. Defaults to 'none'.
rotary_percent (float, optional) – Percent of rotary dimension to use for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 1.0.
rotary_base (int, optional) – Base period for rotary position embeddings. Ignored unless position_embedding_type is ‘rope’. Defaults to 10000.
seq_len_interpolation_factor (Optional[float], optional) – Scale factor for linearly interpolating RoPE positions to support longer sequences. Must be a float greater than 1.0. Defaults to None.
pg_collection (ProcessGroupCollection, optional) – Model communication process groups.
vp_stage (Optional[int], optional) – Virtual pipeline stage index. Defaults to None.
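To make the rotary parameters concrete, the sketch below shows how `rotary_percent` and `rotary_base` are used in the standard RoPE formulation (an assumption based on common RoPE implementations, not code copied from Megatron-Core): only the first `rotary_percent` fraction of the head dimension receives rotary embeddings, and inverse frequencies follow `base ** (-2i / rotary_dim)`:

```python
def rope_inverse_frequencies(head_dim: int,
                             rotary_percent: float = 1.0,
                             rotary_base: int = 10000) -> list:
    """Sketch of standard RoPE frequency computation (hypothetical
    helper; assumes the conventional RoPE formulation).

    rotary_percent restricts rotary embeddings to a leading fraction
    of the head dimension; rotary_base sets the base period.
    """
    rotary_dim = int(head_dim * rotary_percent)
    # One inverse frequency per rotated pair of dimensions.
    return [rotary_base ** (-2 * i / rotary_dim)
            for i in range(rotary_dim // 2)]

freqs = rope_inverse_frequencies(64, rotary_percent=0.5, rotary_base=10000)
# rotary_dim = 32, so 16 frequencies, starting at 1.0 and decreasing
```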
Initialization
- set_input_tensor(input_tensor: torch.Tensor) -> None#
Sets input tensor to the model.
See megatron.model.transformer.set_input_tensor()
- Parameters:
input_tensor (Tensor) – Sets the input tensor for the model.
- forward(
- input_ids: torch.Tensor,
- position_ids: torch.Tensor,
- attention_mask: torch.Tensor,
- decoder_input: torch.Tensor = None,
- labels: torch.Tensor = None,
- inference_context: megatron.core.inference.contexts.BaseInferenceContext = None,
- runtime_gather_output: Optional[bool] = None,
- *,
- inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
- loss_mask: Optional[torch.Tensor] = None,
- packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
- padding_mask: Optional[torch.Tensor] = None,
- )
Forward function of the Mamba model. The input tensors pass through the embedding layer, then the decoder, and finally the (optional) post-processing layer.
It returns the loss if labels are given, otherwise the final hidden states.
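The return contract of `forward()` can be summarized with a toy stand-in (purely illustrative, not the Megatron-Core implementation; the stand-in "loss" is arbitrary): when `labels` is supplied the call yields a loss, otherwise it yields the hidden states.

```python
def forward_contract(hidden_states, labels=None, loss_mask=None):
    """Hypothetical sketch of the forward() return contract."""
    if labels is None:
        # No labels: return the final hidden states.
        return hidden_states
    # Stand-in per-token "loss" (squared error), optionally masked;
    # the real model computes an LM cross-entropy loss here.
    losses = [(h - l) ** 2 for h, l in zip(hidden_states, labels)]
    if loss_mask is not None:
        losses = [v * m for v, m in zip(losses, loss_mask)]
    return sum(losses)
```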