nemo_automodel.components.models.ling_v2.model
BailingMoeV2 model (Ling 2.0 family).
Architecture summary (from the public inclusionAI/Ling-{mini,flash,1T}-2.0
checkpoints):
- GQA attention with per-head QK-RMSNorm and partial RoPE (rotates the first head_dim * partial_rotary_factor channels only).
- first_k_dense_replace dense MLP layers at the start of the stack; the remaining layers are sigmoid-routed grouped MoE with shared experts and an aux-loss-free per-expert bias (DeepSeek-V3-style routing).
- Single shared expert with intermediate size moe_intermediate_size.
- MTP heads (num_nextn_predict_layers) are disabled in all published checkpoints and intentionally not modeled here.
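The partial RoPE item above can be illustrated with a minimal pure-Python sketch for a single head vector. It is not the module's implementation: the function name apply_partial_rope, the base of 10000, and the "rotate-half" channel pairing are assumptions chosen for illustration. Only the idea that the first head_dim * partial_rotary_factor channels are rotated comes from the summary.

```python
import math


def apply_partial_rope(x, position, partial_rotary_factor=0.5, base=10000.0):
    """Rotate the first rotary_dim channels of one head vector; pass the rest through.

    Hypothetical helper: the name, `base`, and the rotate-half pairing are
    assumptions, not taken from the Ling/Bailing source code.
    """
    head_dim = len(x)
    rotary_dim = int(head_dim * partial_rotary_factor)
    if rotary_dim % 2 != 0:
        raise ValueError("rotary_dim must be even to form channel pairs")

    rot, passthrough = x[:rotary_dim], x[rotary_dim:]
    half = rotary_dim // 2
    out = [0.0] * rotary_dim
    for i in range(half):
        # Each pair (i, i + half) is rotated by a position-dependent angle.
        inv_freq = base ** (-2.0 * i / rotary_dim)
        angle = position * inv_freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = rot[i], rot[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    # Channels beyond rotary_dim carry no positional rotation.
    return out + list(passthrough)


if __name__ == "__main__":
    head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(apply_partial_rope(head, position=3, partial_rotary_factor=0.5))
```

With partial_rotary_factor=0.5 and head_dim=8, only the first four channels change with position; the last four are returned untouched.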
Example (YAML):
Module Contents
Classes
Data
API
Bases: HFCheckpointingMixin, Module, MoEFSDPSyncMixin
Causal-LM head wrapping BailingMoeV2Model.
Forward pass returning CausalLMOutputWithPast.
Supports BSHD (input_ids shape [B, S]) and THD (squeezed to [T]
when attn_kwargs["qkv_format"] == "thd") formats.
Parameters:
Input token IDs.
Optional position indices.
Optional 2D padding mask.
Optional padding mask used by the THD squeeze helper.
If > 0, only compute logits for the last
logits_to_keep positions (avoids materialising the full logit
matrix during generation / fused-CE training). 0 computes all
positions.
Whether to return the final hidden states (the
input to lm_head) on the output. Required by the fused
cross-entropy (cut-CE) training path.
Additional arguments forwarded to the base model.
Returns: CausalLMOutputWithPast
A transformers.modeling_outputs.CausalLMOutputWithPast instance.
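The logits_to_keep semantics described in the parameter list can be sketched on plain nested lists. This is an illustration of the slicing rule only, assuming a [B, S] layout; the helper name select_positions is hypothetical and the real forward pass operates on tensors before applying lm_head.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Keep the last `logits_to_keep` positions per sequence; 0 keeps all.

    Hypothetical helper operating on a [B][S] nested list for illustration.
    """
    if logits_to_keep > 0:
        keep = slice(-logits_to_keep, None)
    else:
        keep = slice(None)  # 0 means "compute every position"
    return [seq[keep] for seq in hidden_states]


if __name__ == "__main__":
    batch = [["h0", "h1", "h2", "h3"], ["g0", "g1", "g2", "g3"]]
    print(select_positions(batch, logits_to_keep=1))  # last position only
    print(select_positions(batch, logits_to_keep=0))  # all positions
```

During generation only the final position is needed to pick the next token, which is why a value of 1 avoids materialising the full logit matrix.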
Return parallelism capabilities for a specific Ling/Bailing-MoE config.
Three checkpoint variants share this class:
- inclusionAI/Ling-1T: 1T-param MoE, requires PP. Demonstrated by examples/llm_finetune/ling/ling_1t_sft.yaml (pp=4) and ling_1t_lora_pp.yaml (pp=8), both with ep_size>=8.
- inclusionAI/Ling-flash-2.0: mid-size MoE, single-rank EP only. Demonstrated by ling_flash_2_0_sft.yaml / ling_flash_2_0_lora.yaml (pp=1, ep=8-32).
- inclusionAI/Ling-mini-2.0: small MoE, single-rank EP only. Demonstrated by ling_mini_2_0_{hellaswag,sft,squad}.yaml (pp=1, ep=4-8).
Dispatch is on num_hidden_layers since Ling-1T (~80 layers) is well separated from Ling-flash-2.0 (~32) and Ling-mini-2.0 (~20).
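A layer-count dispatch of this kind might look like the sketch below. The ParallelismHint type, the parallelism_hint function, and the threshold of 64 layers are all hypothetical; the source only states that the approximate layer counts (~80 versus ~32 and ~20) are well separated and that Ling-1T requires PP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelismHint:
    """Hypothetical result type; not the library's actual return value."""

    requires_pp: bool
    note: str


def parallelism_hint(num_hidden_layers: int, pp_layer_threshold: int = 64) -> ParallelismHint:
    """Classify a config by depth. The threshold is an assumed cut-off
    placed between ~32 layers (flash) and ~80 layers (1T)."""
    if num_hidden_layers >= pp_layer_threshold:
        return ParallelismHint(True, "Ling-1T-sized stack: pipeline parallelism expected")
    return ParallelismHint(False, "flash/mini-sized stack: pp=1 with expert parallelism")


if __name__ == "__main__":
    for layers in (20, 32, 80):
        print(layers, parallelism_hint(layers))
```

Because the three published depths are far apart, any cut-off between the flash and 1T layer counts separates them; a config with an unusual depth would need an explicit check instead.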
Bases: Module
Embedding + decoder stack + final norm. No LM head.
No-op for SFT; published Ling checkpoints freeze the expert_bias buffer.
Bases: Module
Single transformer block: attention + (dense MLP or MoE) + residuals.