nemo_automodel.components.models.hy_v3.model
HYV3ForCausalLM: supervised fine-tuning (SFT) support for Tencent Hy3-preview (295B MoE).
Architecture (from tencent/Hy3-preview config.json):
- 80 transformer layers; layer 0 is dense, layers 1-79 are MoE
- MoE: 192 routed experts + 1 shared expert, top-8 activated
- Sigmoid routing with expert-bias correction (e_score_correction_bias)
- GQA: 64 Q heads, 8 KV heads, head_dim=128
- Per-head QK RMSNorm before RoPE
- 256K context, rope_theta=11158840
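The sigmoid routing with expert-bias correction listed above can be illustrated with a small pure-Python sketch for a single token. This is not the module's implementation. The function name route_token is hypothetical. The choice to let the bias affect only expert selection, and to compute gate weights from the unbiased scores renormalised over the selected experts, is an assumption borrowed from other sigmoid-routed MoE designs.

```python
import math


def route_token(logits, correction_bias, top_k):
    """Select top_k experts for one token from raw router logits."""
    # Sigmoid scores each expert independently in (0, 1), unlike softmax.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The correction bias is added only for ranking the experts.
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Assumption: gate weights use the unbiased scores, renormalised to sum to 1.
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights


# Toy setting: 4 experts, top-2 (the real model uses 192 routed experts, top-8).
logits = [2.0, 1.0, 0.0, -1.0]
bias = [0.0, 0.0, 0.5, 0.0]
experts, weights = route_token(logits, bias, top_k=2)
unbiased_experts, _ = route_token(logits, [0.0] * 4, top_k=2)
```

In this toy case the bias promotes expert 2 into the selected set, while its gate weight still reflects its lower unbiased score.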
Module Contents
Classes
Data
API
Bases: Module
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
Forward pass returning a transformers.modeling_outputs.CausalLMOutputWithPast.
Parameters:
Input token IDs. BSHD: [B, S]; THD: [1, T] (squeezed internally).
Optional position indices.
Optional 2D padding mask [B, S].
Optional padding mask used by the THD squeeze helper.
If 0 (default), compute logits for all positions; otherwise compute logits only for the last logits_to_keep token positions (avoids materialising the full logit matrix during generation).
Whether to carry the final hidden states on the output.
Additional arguments forwarded to the base model (e.g. qkv_format, cu_seqlens, CP kwargs).
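As a rough illustration of the THD layout mentioned for the input IDs, the sketch below packs several variable-length sequences into one [1, T] row and builds the cumulative sequence lengths that mark the boundaries. The helper pack_thd is hypothetical, and the exact tensor types and keyword arguments the model expects are not specified here; only the general packed-sequence convention is shown.

```python
from itertools import accumulate


def pack_thd(sequences):
    """Concatenate token sequences into one row and record their boundaries."""
    # Flatten all sequences back to back: T is the total token count.
    tokens = [tok for seq in sequences for tok in seq]
    # cu_seqlens[i] is the start offset of sequence i; the last entry equals T.
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    # Wrap in an outer list to mimic the [1, T] shape.
    return [tokens], cu_seqlens


sequences = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
input_ids, cu_seqlens = pack_thd(sequences)
# Sequence i can be recovered as input_ids[0][cu_seqlens[i]:cu_seqlens[i + 1]].
second = input_ids[0][cu_seqlens[1]:cu_seqlens[2]]
```

No padding tokens are needed in this layout, which is why the boundaries have to travel alongside the token IDs.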
Returns: CausalLMOutputWithPast
A transformers.modeling_outputs.CausalLMOutputWithPast with logits
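The effect of logits_to_keep can be sketched with plain Python slicing. The helper positions_to_project is hypothetical, and whether this model uses exactly this slicing idiom is an assumption. It relies on the fact that slice(-0, None) is the same as slice(0, None), so a value of 0 selects every position.

```python
def positions_to_project(seq_len, logits_to_keep=0):
    """Return the sequence positions whose hidden states reach the LM head."""
    # -0 == 0, so logits_to_keep=0 yields slice(0, None): keep everything.
    keep = slice(-logits_to_keep, None)
    return list(range(seq_len))[keep]


all_positions = positions_to_project(6)
last_only = positions_to_project(6, logits_to_keep=1)
last_two = positions_to_project(6, logits_to_keep=2)
```

During generation only the final position is needed to sample the next token, so projecting a single hidden state avoids building a logit matrix over the full sequence and vocabulary.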
Bases: Module