nemo_automodel.components.models.ling_v2.model
BailingMoeV2 model (Ling 2.0 family).
Architecture summary (from the public inclusionAI/Ling-{mini,flash,1T}-2.0
checkpoints):
- GQA attention with per-head QK-RMSNorm and partial RoPE (rotates the first head_dim * partial_rotary_factor channels only).
- first_k_dense_replace dense MLP layers at the start of the stack; the remaining layers are sigmoid-routed grouped MoE with shared experts and an aux-loss-free per-expert bias (DeepSeek-V3-style routing).
- Single shared expert with intermediate size moe_intermediate_size.
- MTP heads (num_nextn_predict_layers) are disabled in all published checkpoints and intentionally not modeled here.
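The partial RoPE item above can be illustrated with a minimal, dependency-free sketch. It is not the module's implementation: the function name `partial_rope`, the `base` value, and the "rotate-half" channel pairing are assumptions made for illustration, and the real code works on batched tensors with precomputed `freqs_cis`.

```python
import math


def partial_rope(x, position, partial_rotary_factor, base=10000.0):
    """Rotate only the leading channels of a single head vector.

    Hypothetical helper: the name, the base, and the rotate-half pairing
    (channel i is paired with channel i + rot_dim // 2) are assumptions.
    """
    head_dim = len(x)
    rot_dim = int(head_dim * partial_rotary_factor)
    assert rot_dim % 2 == 0, "rotated width must be even to form pairs"
    half = rot_dim // 2

    out = list(x)  # channels at index >= rot_dim are copied through unchanged
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dim)
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = b * math.cos(theta) + a * math.sin(theta)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = partial_rope(x, position=3, partial_rotary_factor=0.5)
print(y[4:])  # the last four channels are passed through untouched
```

With `head_dim = 8` and `partial_rotary_factor = 0.5`, only channels 0 to 3 carry positional information; the remaining channels behave as position-independent features.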
Example (YAML):
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: inclusionAI/Ling-mini-2.0
Module Contents
Classes
Block | Single transformer block: attention + (dense MLP or MoE) + residuals.
BailingMoeV2Model | Embedding + decoder stack + final norm. No LM head.
BailingMoeV2ForCausalLM | Causal-LM head wrapping BailingMoeV2Model.
Data
ModelClass
API
- class nemo_automodel.components.models.ling_v2.model.Block(
- layer_idx: int,
- config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig,
- backend: nemo_automodel.components.models.common.BackendConfig,
)
Bases: torch.nn.Module
Single transformer block: attention + (dense MLP or MoE) + residuals.
Initialization
- forward(
- x: torch.Tensor,
- *,
- freqs_cis: torch.Tensor,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
)
- _mlp(
- x: torch.Tensor,
- padding_mask: torch.Tensor | None,
)
- init_weights(buffer_device: torch.device) -> None
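The "dense MLP or MoE" choice made per Block follows from the architecture summary: the first first_k_dense_replace layers are dense and the rest are MoE. A minimal sketch of that layout rule, assuming a hypothetical helper `layer_kinds` and a hypothetical `num_hidden_layers` argument (the values below are illustrative and not taken from any checkpoint):

```python
def layer_kinds(num_hidden_layers, first_k_dense_replace):
    """Return 'dense' or 'moe' for each layer index (hypothetical helper)."""
    return [
        "dense" if layer_idx < first_k_dense_replace else "moe"
        for layer_idx in range(num_hidden_layers)
    ]


# Illustrative values only.
kinds = layer_kinds(num_hidden_layers=6, first_k_dense_replace=1)
print(kinds)
```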
- class nemo_automodel.components.models.ling_v2.model.BailingMoeV2Model(
- config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
- backend: nemo_automodel.components.models.common.BackendConfig,
- *,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- moe_overrides: dict | None = None,
)
Bases: torch.nn.Module
Embedding + decoder stack + final norm. No LM head.
Initialization
- forward(
- input_ids: torch.Tensor | None = None,
- *,
- inputs_embeds: torch.Tensor | None = None,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
)
- update_moe_gate_bias() -> None
No-op for SFT; published Ling checkpoints freeze the expert_bias buffer.
- init_weights(buffer_device: torch.device | None = None) -> None
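To make the role of the per-expert bias behind `update_moe_gate_bias` concrete, here is a rough sketch of DeepSeek-V3-style aux-loss-free routing for a single token. It is an illustration under stated assumptions, not the module's code: the function `route` is hypothetical, expert grouping and any scaling factor are omitted, and normalizing the selected weights to sum to 1 is assumed. The key property shown is that the bias affects only which experts are selected, while the mixing weights come from the unbiased sigmoid scores.

```python
import math


def route(logits, expert_bias, top_k):
    """Pick top_k experts for one token (hypothetical, ungrouped sketch)."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid gate
    biased = [s + b for s, b in zip(scores, expert_bias)]  # selection only
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    # Assumption: weights use the unbiased scores, normalized over the chosen experts.
    return {e: scores[e] / total for e in chosen}


logits = [2.0, 1.0, 0.0, -1.0]
no_bias = route(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route(logits, [0.0, 0.0, 0.0, 1.0], top_k=2)
print(sorted(no_bias), sorted(with_bias))
```

If the bias buffer stays frozen during SFT, as the docstring above states for published checkpoints, the selection offsets presumably remain whatever the checkpoint shipped with; in this sketch that corresponds to calling `route` with a constant `expert_bias`.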
- class nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM(
- config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
)
Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin
Causal-LM head wrapping BailingMoeV2Model.
Initialization
- _keep_in_fp32_modules_strict
['e_score_correction_bias']
- _pp_keep_self_forward: bool
True
- classmethod from_config(
- config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
)
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
)
- get_input_embeddings()
- set_input_embeddings(value)
- get_output_embeddings()
- set_output_embeddings(new_embeddings)
- forward(
- input_ids: torch.Tensor,
- *,
- position_ids: torch.Tensor | None = None,
- attention_mask: torch.Tensor | None = None,
- padding_mask: torch.Tensor | None = None,
- **attn_kwargs: Any,
)
- update_moe_gate_bias() -> None
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
)
- nemo_automodel.components.models.ling_v2.model.ModelClass
None