nemo_automodel.components.models.hy_mt2.model
HyMT2ForCausalLM — Tencent Hy-MT2-30B-A3B (translation MoE) SFT support.
Architecture (from tencent/Hy-MT2-30B-A3B config.json):
- 48 transformer layers; layer 0 is dense, layers 1-47 are MoE
- MoE: 128 routed experts + 1 shared expert, top-8 activated
- Sigmoid routing with expert-bias correction, router_scaling_factor=2.826
- route_norm = True (normalize top-k weights to sum to 1)
- GQA: 32 Q heads, 4 KV heads, head_dim=128, hidden_size=2048
- Per-head Q/K RMSNorm (qk_norm=True) before RoPE
- 256K context, rope_theta=11158840
- dense intermediate_size=6912, moe_intermediate_size=expert_hidden_dim=768
- vocab_size=120832
- enable_lm_head_fp32 = True (HF reference upcasts lm_head to fp32)
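The routing bullets above (sigmoid scores, expert-bias correction, top-8 selection, route_norm, router_scaling_factor) can be sketched for a single token as follows. This is a minimal standard-library illustration, not the module's gate implementation: route_token is a hypothetical helper, and two details are assumptions rather than facts from this page, namely that the bias affects only which experts are selected (not their weights) and that the scaling factor is applied after normalization.

```python
import math


def route_token(logits, bias, top_k=8, route_norm=True, scaling_factor=2.826):
    """Hypothetical per-token router sketch; not the library's gate."""
    # Sigmoid score per routed expert.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Assumption: the correction bias is used only to rank experts.
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Combine weights come from the unbiased scores of the chosen experts.
    weights = [scores[i] for i in chosen]
    if route_norm:
        total = sum(weights)
        weights = [w / total for w in weights]  # top-k weights now sum to 1
    # Assumption: scaling is applied after normalization.
    weights = [w * scaling_factor for w in weights]
    return chosen, weights


# Toy example with 4 experts and top-2 routing (the real model uses 128 and 8).
chosen, weights = route_token(
    logits=[2.0, -1.0, 0.5, 0.0], bias=[0.0, 0.0, 0.0, 1.0], top_k=2
)
```

In the toy call, expert 3 is selected because of its bias even though its raw score is lower than expert 2's, which is the load-balancing effect the correction is meant to have.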
Notes vs. components/models/hy_v3 (Hy3-preview 295B):
- Smaller everywhere (48L / 128 experts / 32+4 heads / hidden=2048).
- Adds an in-model enable_lm_head_fp32 fallback (applies when the YAML's lm_head_precision is not set). The preferred path is to set distributed.moe.lm_head_precision: float32 in the YAML, which the MoE parallelizer handles via MixedPrecisionPolicy.
- score_func is driven by config.moe_router_use_sigmoid instead of being hard-coded.
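The precedence described in the notes above (YAML setting first, in-model flag as the fallback) can be illustrated with a small sketch. resolve_lm_head_precision is a hypothetical helper written for this example, and the "bfloat16" default for the remaining case is an assumption; neither is taken from the module.

```python
def resolve_lm_head_precision(yaml_lm_head_precision, enable_lm_head_fp32,
                              compute_dtype="bfloat16"):
    """Hypothetical helper showing which setting wins; not library code."""
    if yaml_lm_head_precision is not None:
        # Preferred path: the YAML value is set and takes priority.
        return yaml_lm_head_precision
    if enable_lm_head_fp32:
        # Fallback: the in-model flag applies only when the YAML is silent.
        return "float32"
    # Assumption: otherwise lm_head follows the general compute dtype.
    return compute_dtype


from_yaml = resolve_lm_head_precision("float32", enable_lm_head_fp32=False)
from_flag = resolve_lm_head_precision(None, enable_lm_head_fp32=True)
neither = resolve_lm_head_precision(None, enable_lm_head_fp32=False)
```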
Module Contents
Classes
Functions
Data
API
Bases: Module
Single Hy-MT2 transformer block: attention + (dense MLP | MoE) + residual norms.
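For the attention part of the block, the GQA figures from the architecture summary (32 Q heads, 4 KV heads, head_dim=128, hidden_size=2048) imply the projection widths and head grouping sketched below. The contiguous grouping in kv_head_for is the common GQA convention and is assumed here rather than confirmed by this page; kv_head_for is a hypothetical name.

```python
NUM_Q_HEADS = 32
NUM_KV_HEADS = 4
HEAD_DIM = 128
HIDDEN_SIZE = 2048

# Each KV head is shared by a group of query heads.
GROUP_SIZE = NUM_Q_HEADS // NUM_KV_HEADS

# Output widths of the Q and K/V projections. Note that the Q width
# (heads * head_dim) is larger than hidden_size for these values.
Q_PROJ_OUT = NUM_Q_HEADS * HEAD_DIM
KV_PROJ_OUT = NUM_KV_HEADS * HEAD_DIM


def kv_head_for(q_head):
    """Assumed contiguous grouping: query heads 0..7 share KV head 0, etc."""
    return q_head // GROUP_SIZE
```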
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
Hy-MT2-30B-A3B causal-LM wrapper.
Mixes in MoEFSDPSyncMixin so EP / FSDP2 expert-gradient sync works
out of the box (set distributed.ep_size in the YAML; must divide
num_experts=128). The HFCheckpointingMixin provides
from_pretrained / save_pretrained over the HF safetensors layout.
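Because distributed.ep_size must divide num_experts=128, the admissible values can be enumerated with a short check. The helpers below are hypothetical and only illustrate the divisibility constraint; they are not part of the library.

```python
NUM_EXPERTS = 128


def valid_ep_sizes(num_experts=NUM_EXPERTS):
    """All expert-parallel sizes that split the experts evenly."""
    return [n for n in range(1, num_experts + 1) if num_experts % n == 0]


def experts_per_rank(ep_size, num_experts=NUM_EXPERTS):
    """Number of routed experts held by each EP rank."""
    if num_experts % ep_size != 0:
        raise ValueError(f"ep_size={ep_size} does not divide {num_experts}")
    return num_experts // ep_size
```

For example, ep_size=8 would place 16 routed experts on each rank, while ep_size=6 would be rejected.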
Forward pass returning CausalLMOutputWithPast.
Parameters:
Token IDs. BSHD: [B, S]; THD: [1, T] (squeezed internally).
Optional position indices.
Optional 2D padding mask.
Optional padding mask used by the THD squeeze helper.
If 0 (default) compute logits for all positions; if > 0
only compute logits for the last logits_to_keep positions
(avoids materialising the full logit matrix during generation /
enables fused-linear cross-entropy in training).
Whether to return the final hidden states (the
input to lm_head) on the output. Defaults to the config flag.
Additional attention kwargs forwarded to the base model (e.g. qkv_format, cu_seqlens, CP kwargs).
Returns: CausalLMOutputWithPast
A transformers.modeling_outputs.CausalLMOutputWithPast instance.
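The logits_to_keep behaviour described in the parameter list can be sketched with plain Python slicing over a sequence of per-position hidden states. select_positions is a hypothetical helper; the real forward pass operates on tensors, and how it slices them is not shown on this page.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Return the positions for which logits would be computed.

    hidden_states is any sequence indexed by position (a list here,
    standing in for the sequence dimension of a tensor).
    """
    if logits_to_keep > 0:
        # Keep only the trailing logits_to_keep positions.
        return hidden_states[-logits_to_keep:]
    # 0 (the default) keeps every position.
    return hidden_states[:]


positions = list(range(6))
all_kept = select_positions(positions)
last_two = select_positions(positions, logits_to_keep=2)
```

Only the selected positions would then be passed through lm_head, which is what avoids materialising the full logit matrix.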
Bases: Module
Hy-MT2 backbone: token embeddings + transformer blocks + final RMSNorm.
The MoE / dense split is governed by config.first_k_dense_replace
(layer 0 dense, the rest MoE for the published Hy-MT2-30B-A3B). The
MoE configuration is assembled from the HF config fields and forwarded
to every MoE-bearing Block.
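The dense / MoE split can be illustrated as below. The sketch assumes that layers with index below first_k_dense_replace are dense and the rest are MoE, and that the published checkpoint uses first_k_dense_replace=1; both are inferred from "layer 0 dense, the rest MoE" rather than stated as config values here. layer_kinds is a hypothetical helper.

```python
def layer_kinds(num_layers=48, first_k_dense_replace=1):
    """Label each transformer layer as 'dense' or 'moe' (assumed rule)."""
    return [
        "dense" if layer_idx < first_k_dense_replace else "moe"
        for layer_idx in range(num_layers)
    ]


kinds = layer_kinds()
```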
Map config.moe_router_use_sigmoid to a gate score_func name.
Returns “sigmoid” when the flag is True (Hy-MT2 default) and “softmax”
otherwise. The bias-aware variants (“sigmoid_with_bias” /
“softmax_with_bias”) are selected at the gate level by the presence of
e_score_correction_bias plus expert-group routing, which Hy-MT2 does
not use (n_expert_groups=0).
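A minimal sketch of the mapping described above, using a stand-in config object. score_func_name is a hypothetical name for illustration; the actual function name and signature are not given on this page.

```python
from types import SimpleNamespace


def score_func_name(config):
    """Map config.moe_router_use_sigmoid to a gate score_func name."""
    return "sigmoid" if config.moe_router_use_sigmoid else "softmax"


# Stand-in configs; a real run would pass the HF config object instead.
hy_mt2_like = SimpleNamespace(moe_router_use_sigmoid=True)
softmax_like = SimpleNamespace(moe_router_use_sigmoid=False)
```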