# nemo_automodel.components.models.ling_v2.model

BailingMoeV2 model (Ling 2.0 family).

Architecture summary (from the public `inclusionAI/Ling-{mini,flash,1T}-2.0`
checkpoints):

* GQA attention with per-head QK-RMSNorm and partial RoPE
  (rotates the first `head_dim * partial_rotary_factor` channels only).
* `first_k_dense_replace` dense MLP layers at the start of the stack;
  the remaining layers are sigmoid-routed grouped MoE with shared experts
  and an aux-loss-free per-expert bias (DeepSeek-V3-style routing).
* Single shared expert with intermediate size `moe_intermediate_size`.
* MTP heads (`num_nextn_predict_layers`) are disabled in all published
  checkpoints and intentionally not modeled here.
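
The partial RoPE bullet above can be illustrated with a small, dependency-free sketch. It rotates only the first `head_dim * partial_rotary_factor` channels of one head vector and passes the rest through unchanged. The function name, the half-split channel pairing (channel `i` paired with `i + rot_dim / 2`), and the `base` value are assumptions for illustration, not taken from the module source.

```python
import math


def apply_partial_rope(x, position, partial_rotary_factor, base=10000.0):
    """Rotate the leading channels of one head vector (hypothetical helper).

    x: list of floats of length head_dim.
    Assumes half-split pairing; the real module may pair channels differently.
    """
    head_dim = len(x)
    rot_dim = int(head_dim * partial_rotary_factor)
    rot_dim -= rot_dim % 2  # rotation works on channel pairs, so keep it even
    half = rot_dim // 2

    out = list(x)  # channels >= rot_dim are copied through untouched
    for i in range(half):
        theta = position * base ** (-2.0 * i / rot_dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * cos_t - b * sin_t
        out[i + half] = b * cos_t + a * sin_t
    return out


q = [1.0] * 8
rotated = apply_partial_rope(q, position=3, partial_rotary_factor=0.5)
print(rotated[4:])  # pass-through channels: [1.0, 1.0, 1.0, 1.0]
```

Because each pair undergoes a plain 2D rotation, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.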

Example (YAML):

```yaml
model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: inclusionAI/Ling-mini-2.0
```
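
To make the config above easier to read, here is a rough sketch of how a `_target_` entry could be resolved, assuming it follows the Hydra-style convention: the dotted path names a callable and the remaining keys become keyword arguments. `resolve_target` and `instantiate` are hypothetical helpers, not the NeMo AutoModel loader, and the demo uses a standard-library target so it runs without the package installed.

```python
import importlib


def resolve_target(path):
    """Turn 'pkg.module.Class.method' into the object it names (hypothetical)."""
    parts = path.split(".")
    # Try the longest importable module prefix, then walk attributes.
    for split in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:split]))
        except ImportError:
            continue
        for name in parts[split:]:
            obj = getattr(obj, name)
        return obj
    raise ImportError(f"cannot resolve {path!r}")


def instantiate(node):
    """Call the `_target_` callable with every other key as a keyword argument."""
    kwargs = {key: value for key, value in node.items() if key != "_target_"}
    return resolve_target(node["_target_"])(**kwargs)


# Same shape as the YAML above (module.Class.classmethod plus kwargs),
# but with a standard-library target.
demo = {"_target_": "fractions.Fraction.from_float", "f": 0.5}
print(instantiate(demo))
```

Under that assumption, the YAML example amounts to calling `nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="inclusionAI/Ling-mini-2.0")`.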

## Module Contents

### Classes

| Name                                                                                                 | Description                                                           |
| ---------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------- |
| [`BailingMoeV2ForCausalLM`](#nemo_automodel-components-models-ling_v2-model-BailingMoeV2ForCausalLM) | Causal-LM head wrapping `BailingMoeV2Model`.                          |
| [`BailingMoeV2Model`](#nemo_automodel-components-models-ling_v2-model-BailingMoeV2Model)             | Embedding + decoder stack + final norm.  No LM head.                  |
| [`Block`](#nemo_automodel-components-models-ling_v2-model-Block)                                     | Single transformer block: attention + (dense MLP or MoE) + residuals. |

### Data

[`ModelClass`](#nemo_automodel-components-models-ling_v2-model-ModelClass)

### API

```python
class nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM(
    config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

Causal-LM head wrapping `BailingMoeV2Model`.

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.forward(
    input_ids: torch.Tensor,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    attn_kwargs: typing.Any = {}
) -> transformers.modeling_outputs.CausalLMOutputWithPast
```

Forward pass returning `CausalLMOutputWithPast`.

Supports BSHD (`input_ids` shape `[B, S]`) and THD (squeezed to `[T]`
when `attn_kwargs["qkv_format"] == "thd"`) formats.

**Parameters:**

* `input_ids`: Input token IDs.
* `position_ids`: Optional position indices.
* `attention_mask`: Optional 2D padding mask.
* `padding_mask`: Optional padding mask used by the THD squeeze helper.
* `logits_to_keep`: If > 0, only compute logits for the last
  `logits_to_keep` positions (avoids materializing the full logit
  matrix during generation or fused-CE training). `0` computes all
  positions.
* `output_hidden_states`: Whether to return the final hidden states (the
  input to `lm_head`) on the output. Required by the fused
  cross-entropy (cut-CE) training path.
* `attn_kwargs`: Additional arguments forwarded to the base model.

**Returns:** `CausalLMOutputWithPast`

A `transformers.modeling_outputs.CausalLMOutputWithPast` carrying the logits and, when `output_hidden_states` is set, the final hidden states.
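
The effect of `logits_to_keep` can be sketched without any tensor library. The example below covers only the integer case and assumes the common convention of slicing the hidden states with `slice(-logits_to_keep, None)` before the head projection; whether this module does exactly that is an assumption. `select_positions` and `lm_head` are hypothetical stand-ins that work on plain lists.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Keep the last `logits_to_keep` positions; 0 keeps every position.

    hidden_states: list of per-position vectors for one sequence.
    slice(-0, None) equals slice(0, None), so 0 naturally selects everything.
    """
    return hidden_states[slice(-logits_to_keep, None)]


def lm_head(hidden_states, weight):
    """Project each hidden vector onto the vocabulary rows of `weight`."""
    return [
        [sum(h * w for h, w in zip(vec, row)) for row in weight]
        for vec in hidden_states
    ]


hidden = [[float(t), 1.0] for t in range(5)]  # 5 positions, hidden size 2
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocabulary of 3 tokens

all_logits = lm_head(select_positions(hidden, 0), weight)
last_logits = lm_head(select_positions(hidden, 1), weight)
print(len(all_logits), len(last_logits))  # 5 1
```

Projecting only the kept positions is what saves memory: the full `[S, vocab]` logit matrix is never built when just the final token is needed.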

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.from_config(
    config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    model_args = (),
    kwargs = {}
)
```

classmethod

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.get_capabilities(
    config
) -> nemo_automodel._transformers.model_capabilities.ModelCapabilities
```

classmethod

Return parallelism capabilities for a specific Ling/Bailing-MoE config.

Three checkpoint variants share this class:

1. `inclusionAI/Ling-1T` -- 1T-param MoE, requires PP.
   Demonstrated by `examples/llm_finetune/ling/ling_1t_sft.yaml` (pp=4)
   and `ling_1t_lora_pp.yaml` (pp=8); both with `ep_size>=8`.
2. `inclusionAI/Ling-flash-2.0` -- mid-size MoE, single-rank EP only.
   Demonstrated by `ling_flash_2_0_sft.yaml` / `ling_flash_2_0_lora.yaml`
   (pp=1, ep=8-32).
3. `inclusionAI/Ling-mini-2.0` -- small MoE, single-rank EP only.
   Demonstrated by `ling_mini_2_0_{hellaswag,sft,squad}.yaml`
   (pp=1, ep=4-8).

Dispatch is on `num_hidden_layers`, since Ling-1T (about 80 layers) is well
separated from Ling-flash-2.0 (about 32) and Ling-mini-2.0 (about 20).
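
The layer-count dispatch described above might look roughly like the following. `Capabilities` is a hypothetical stand-in for `ModelCapabilities` (its real fields are not shown on this page), and the cutoff of 64 is an assumed value chosen only because it falls between the approximate layer counts quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical stand-in; field names are assumptions."""

    supports_pp: bool
    supports_ep: bool


# Assumed cutoff: anything this deep is treated as the 1T variant.
PP_LAYER_THRESHOLD = 64


def capabilities_for(num_hidden_layers):
    """Pick capabilities from depth alone, mirroring the dispatch idea."""
    return Capabilities(
        supports_pp=num_hidden_layers >= PP_LAYER_THRESHOLD,
        supports_ep=True,  # every listed variant runs with expert parallelism
    )


for layers in (80, 32, 20):
    print(layers, capabilities_for(layers))
```

Keying on depth avoids matching checkpoint names, so a locally renamed or re-exported checkpoint with the same config still resolves to the same capabilities.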

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.get_input_embeddings()
```

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.get_output_embeddings()
```

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.set_output_embeddings(
    new_embeddings
)
```

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2ForCausalLM.update_moe_gate_bias() -> None
```

```python
class nemo_automodel.components.models.ling_v2.model.BailingMoeV2Model(
    config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)
```

**Bases:** `Module`

Embedding + decoder stack + final norm.  No LM head.

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2Model.forward(
    input_ids: torch.Tensor | None = None,
    inputs_embeds: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2Model.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
nemo_automodel.components.models.ling_v2.model.BailingMoeV2Model.update_moe_gate_bias() -> None
```

No-op for SFT; published Ling checkpoints freeze the `expert_bias` buffer.
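
To show where a per-expert bias enters routing, here is a rough sketch of DeepSeek-V3-style sigmoid routing with group-limited top-k. The bias shifts which experts are selected, while the combine weights come from the unbiased sigmoid scores. The function name, the parameter names, the top-2 group score, and the weight normalization are assumptions drawn from the general technique, not from this module's source.

```python
import math


def route(logits, expert_bias, n_group, topk_group, top_k):
    """Return {expert_index: weight} for one token (illustrative only)."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid gate
    biased = [s + b for s, b in zip(scores, expert_bias)]  # selection only

    # Score each group by the sum of its two best biased scores.
    size = len(logits) // n_group
    group_scores = [
        sum(sorted(biased[g * size:(g + 1) * size], reverse=True)[:2])
        for g in range(n_group)
    ]
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)
    kept = kept[:topk_group]

    # Pick the top_k experts among the kept groups, still using biased scores.
    candidates = [e for g in kept for e in range(g * size, (g + 1) * size)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Combine weights use the unbiased scores, normalized to sum to 1.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


logits = [2.0, 1.0, 0.0, 0.0, -1.0, -1.0, 0.5, 0.5]
no_bias = route(logits, [0.0] * 8, n_group=4, topk_group=2, top_k=2)
boosted = route(logits, [0, 0, 0, 0, 1.0, 0, 0, 0], 4, 2, 2)
print(sorted(no_bias), sorted(boosted))  # [0, 1] [0, 4]
```

With a frozen bias, selection depends only on the gate logits and fixed offsets, which is consistent with the bias update being a no-op during SFT.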

```python
class nemo_automodel.components.models.ling_v2.model.Block(
    layer_idx: int,
    config: nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config,
    moe_config: nemo_automodel.components.moe.config.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Single transformer block: attention + (dense MLP or MoE) + residuals.
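
As a minimal sketch of the block structure, the code below shows how `first_k_dense_replace` splits the stack into dense and MoE layers and how the two residual branches compose. The pre-norm ordering (normalize, run the sublayer, add back) is an assumption, and `layer_kinds` and `block_forward` are hypothetical helpers that take plain callables rather than real modules.

```python
def layer_kinds(num_hidden_layers, first_k_dense_replace):
    """The first `first_k_dense_replace` layers are dense; the rest are MoE."""
    return [
        "dense" if layer_idx < first_k_dense_replace else "moe"
        for layer_idx in range(num_hidden_layers)
    ]


def block_forward(x, attn, mlp, attn_norm, mlp_norm):
    """Two residual branches: attention, then dense MLP or MoE (assumed pre-norm)."""
    x = x + attn(attn_norm(x))
    x = x + mlp(mlp_norm(x))
    return x


print(layer_kinds(4, 1))  # ['dense', 'moe', 'moe', 'moe']

# Scalar stand-ins make the residual arithmetic easy to follow.
identity = lambda h: h
print(block_forward(1.0, attn=lambda h: 2 * h, mlp=lambda h: h + 1,
                    attn_norm=identity, mlp_norm=identity))  # 7.0
```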

```python
nemo_automodel.components.models.ling_v2.model.Block._mlp(
    x: torch.Tensor,
    padding_mask: torch.Tensor | None
) -> torch.Tensor
```

```python
nemo_automodel.components.models.ling_v2.model.Block.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.ling_v2.model.Block.init_weights(
    buffer_device: torch.device
) -> None
```

```python
nemo_automodel.components.models.ling_v2.model.ModelClass = BailingMoeV2ForCausalLM
```