nemo_automodel.components.models.hy_mt2.model
HyMT2ForCausalLM — Tencent Hy-MT2-30B-A3B (translation MoE) SFT support.
Architecture (from tencent/Hy-MT2-30B-A3B config.json):
48 transformer layers; layer 0 is dense, layers 1-47 are MoE
MoE: 128 routed experts + 1 shared expert, top-8 activated
Sigmoid routing with expert-bias correction, router_scaling_factor=2.826
route_norm = True (normalize top-k weights to sum to 1)
GQA: 32 Q heads, 4 KV heads, head_dim=128, hidden_size=2048
Per-head Q/K RMSNorm (qk_norm=True) before RoPE
256K context, rope_theta=11158840
dense intermediate_size=6912, moe_intermediate_size=expert_hidden_dim=768
vocab_size=120832
enable_lm_head_fp32 = True (HF reference upcasts lm_head to fp32)
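The routing entries in the list above can be read as the following minimal sketch in plain Python. It assumes the common sigmoid-gate recipe: the correction bias (if any) influences only which experts are selected, the combine weights come from the unbiased sigmoid scores, are normalized to sum to 1 (route_norm), and are then multiplied by router_scaling_factor. The function name, argument names, and the order of operations are illustrative assumptions, not the library's gate implementation.

```python
import math


def route_token(logits, bias=None, top_k=8, route_norm=True, scaling_factor=2.826):
    """Return (expert_indices, combine_weights) for one token (illustrative only)."""
    # Sigmoid score per routed expert.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Assumption: the correction bias only affects which experts are picked.
    biased = scores if bias is None else [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Combine weights use the unbiased scores of the chosen experts.
    weights = [scores[i] for i in chosen]
    if route_norm:
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen, [w * scaling_factor for w in weights]


# 128 routed experts with strictly increasing logits, top-8 selection.
logits = [0.01 * i for i in range(128)]
experts, weights = route_token(logits)
```

Under these assumptions the eight weights always sum to router_scaling_factor, regardless of the raw logits.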
Notes vs. components/models/hy_v3 (Hy3-preview 295B):
Smaller everywhere (48L / 128 experts / 32+4 heads / hidden=2048).
Adds an in-model enable_lm_head_fp32 fallback (applies when the YAML’s lm_head_precision is not set). The preferred path is to set distributed.moe.lm_head_precision: float32 in the YAML, which the MoE parallelizer handles via MixedPrecisionPolicy.
score_func is driven by config.moe_router_use_sigmoid instead of being hard-coded.
Module Contents
Classes
Block | Single Hy-MT2 transformer block: attention + (dense MLP | MoE) + residual norms.
HyMT2Model | Hy-MT2 backbone: token embeddings + transformer blocks + final RMSNorm.
HyMT2ForCausalLM | Hy-MT2-30B-A3B causal-LM wrapper.
Functions
_resolve_score_func | Map config.moe_router_use_sigmoid to a gate score_func name.
Data
ModelClass
API
- nemo_automodel.components.models.hy_mt2.model._resolve_score_func(config: Any) -> str
Map config.moe_router_use_sigmoid to a gate score_func name.
Returns “sigmoid” when the flag is True (Hy-MT2 default) and “softmax” otherwise. The bias-aware variants (“sigmoid_with_bias” / “softmax_with_bias”) are selected at the gate level by the presence of e_score_correction_bias plus expert-group routing, which Hy-MT2 does not use (n_expert_groups=0).
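The mapping described for _resolve_score_func can be sketched as below. This is a hypothetical re-implementation for illustration only: the name resolve_score_func_sketch is made up, and treating a missing flag as False is an assumption that the real helper may not share.

```python
from types import SimpleNamespace
from typing import Any


def resolve_score_func_sketch(config: Any) -> str:
    """Map config.moe_router_use_sigmoid to "sigmoid" or "softmax"."""
    # Assumption: a missing flag is treated as False; the real helper may differ.
    use_sigmoid = bool(getattr(config, "moe_router_use_sigmoid", False))
    return "sigmoid" if use_sigmoid else "softmax"


# Stand-ins for an HF config object; only the one flag matters here.
hy_mt2_like = SimpleNamespace(moe_router_use_sigmoid=True)
softmax_like = SimpleNamespace(moe_router_use_sigmoid=False)

print(resolve_score_func_sketch(hy_mt2_like))
print(resolve_score_func_sketch(softmax_like))
```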
- class nemo_automodel.components.models.hy_mt2.model.Block(
- layer_idx: int,
- config: Any,
- moe_config: nemo_automodel.components.moe.config.MoEConfig,
- backend: nemo_automodel.components.models.common.BackendConfig,
- )
Bases: torch.nn.Module
Single Hy-MT2 transformer block: attention + (dense MLP | MoE) + residual norms.
Initialization
- forward(
- x: torch.Tensor,
- *,
- freqs_cis: torch.Tensor,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
- )
- _mlp(
- x: torch.Tensor,
- padding_mask: torch.Tensor | None,
- )
- init_weights(buffer_device: torch.device)
- class nemo_automodel.components.models.hy_mt2.model.HyMT2Model(
- config: Any,
- backend: nemo_automodel.components.models.common.BackendConfig,
- *,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- moe_overrides: dict | None = None,
- )
Bases: torch.nn.Module
Hy-MT2 backbone: token embeddings + transformer blocks + final RMSNorm.
The MoE / dense split is governed by config.first_k_dense_replace (layer 0 dense, the rest MoE for the published Hy-MT2-30B-A3B). The MoE configuration is assembled from the HF config fields and forwarded to every MoE-bearing Block.
Initialization
- forward(
- input_ids: torch.Tensor,
- *,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
- )
- init_weights(buffer_device: torch.device | None = None) -> None
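The dense / MoE split that HyMT2Model derives from config.first_k_dense_replace can be illustrated with a small sketch. It assumes the usual convention that layers with index below first_k_dense_replace are dense and all others are MoE, and that the published checkpoint uses first_k_dense_replace=1 (implied by "layer 0 dense, the rest MoE"). The helper name layer_kinds and the num_hidden_layers argument name are hypothetical.

```python
def layer_kinds(num_hidden_layers: int, first_k_dense_replace: int) -> list[str]:
    """Label each layer index as "dense" or "moe" (illustrative only)."""
    # Assumption: indices below first_k_dense_replace use a dense MLP.
    return [
        "dense" if idx < first_k_dense_replace else "moe"
        for idx in range(num_hidden_layers)
    ]


# 48 layers with one leading dense layer, matching the description above.
kinds = layer_kinds(num_hidden_layers=48, first_k_dense_replace=1)
print(kinds[0], kinds[1], kinds.count("moe"))
```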
- class nemo_automodel.components.models.hy_mt2.model.HyMT2ForCausalLM(
- config: Any,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
- )
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin
Hy-MT2-30B-A3B causal-LM wrapper.
Mixes in MoEFSDPSyncMixin so EP / FSDP2 expert-gradient sync works out of the box (set distributed.ep_size in the YAML; must divide num_experts=128). The HFCheckpointingMixin provides from_pretrained / save_pretrained over the HF safetensors layout.
Initialization
- classmethod from_config(
- config: Any,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
- )
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
- )
- get_input_embeddings()
- set_input_embeddings(value)
- get_output_embeddings()
- set_output_embeddings(new_embeddings)
- forward(
- input_ids: torch.Tensor,
- *,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
- )
- update_moe_gate_bias() -> None
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
- )
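The HyMT2ForCausalLM description requires distributed.ep_size to divide num_experts=128. A pre-flight check for that constraint might look like the sketch below. The helper name experts_per_rank is hypothetical, and the even split of experts across expert-parallel ranks is an assumption about how the constraint is used, not a description of the library's sharding code.

```python
def experts_per_rank(num_experts: int, ep_size: int) -> int:
    """Return how many routed experts each EP rank would hold (illustrative only)."""
    if ep_size <= 0:
        raise ValueError(f"ep_size must be positive, got {ep_size}")
    if num_experts % ep_size != 0:
        raise ValueError(
            f"ep_size={ep_size} does not divide num_experts={num_experts}"
        )
    # Assumption: experts are split evenly across expert-parallel ranks.
    return num_experts // ep_size


# Valid choices for 128 experts among a few candidate ep_size values.
valid = [ep for ep in (1, 2, 3, 4, 6, 8, 16, 32, 64, 128) if 128 % ep == 0]
print(valid, experts_per_rank(128, 8))
```

Because 128 is a power of two, only power-of-two ep_size values up to 128 pass this check.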
- nemo_automodel.components.models.hy_mt2.model.ModelClass
None