# nemo_automodel.components.models.glm_moe_dsa.model

## Module Contents

### Classes

| Name                                                                                               | Description |
| -------------------------------------------------------------------------------------------------- | ----------- |
| [`Block`](#nemo_automodel-components-models-glm_moe_dsa-model-Block)                               | -           |
| [`GlmMoeDsaForCausalLM`](#nemo_automodel-components-models-glm_moe_dsa-model-GlmMoeDsaForCausalLM) | -           |
| [`GlmMoeDsaModel`](#nemo_automodel-components-models-glm_moe_dsa-model-GlmMoeDsaModel)             | -           |

### Data

[`ModelClass`](#nemo_automodel-components-models-glm_moe_dsa-model-ModelClass)

### API

```python
class nemo_automodel.components.models.glm_moe_dsa.model.Block(
    layer_idx: int,
    config: transformers.models.glm_moe_dsa.configuration_glm_moe_dsa.GlmMoeDsaConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

```python
nemo_automodel.components.models.glm_moe_dsa.model.Block._mlp(
    x: torch.Tensor,
    padding_mask: torch.Tensor | None
) -> torch.Tensor
```

```python
nemo_automodel.components.models.glm_moe_dsa.model.Block.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    prev_topk_indices: torch.Tensor | None = None,
    **attn_kwargs: typing.Any
) -> tuple[torch.Tensor, torch.Tensor]
```

Run the block and return `(hidden_states, topk_indices)`.

`topk_indices` is this layer's DSA selection: freshly computed on "full" layers, or
`prev_topk_indices` passed through unchanged on "shared" layers. The caller threads it
to subsequent shared layers (GLM IndexShare).
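
The threading pattern described above can be sketched with a pure-Python toy. `ToyBlock`, `run_stack`, and the "pick the k largest values" selection rule are hypothetical stand-ins for illustration only; they are not the library's `Block` or its DSA indexer.

```python
from typing import Optional


class ToyBlock:
    """Hypothetical stand-in for a decoder block (not the real Block)."""

    def __init__(self, kind: str, k: int = 2):
        assert kind in ("full", "shared")
        self.kind = kind
        self.k = k

    def forward(self, x: list[int], prev_topk_indices: Optional[list[int]] = None):
        if self.kind == "full":
            # "full" layer: compute a fresh selection (toy rule: k largest values).
            ranked = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
            topk = sorted(ranked[: self.k])
        else:
            # "shared" layer: reuse the selection handed in by the caller.
            if prev_topk_indices is None:
                raise ValueError("a shared layer needs prev_topk_indices")
            topk = prev_topk_indices
        # Toy "attention": only the selected positions are updated.
        out = [v + 10 if i in topk else v for i, v in enumerate(x)]
        return out, topk


def run_stack(blocks, x, prev_topk_indices=None):
    # The caller threads each layer's returned selection into the next layer.
    topk = prev_topk_indices
    for block in blocks:
        x, topk = block.forward(x, prev_topk_indices=topk)
    return x, topk


blocks = [ToyBlock("full"), ToyBlock("shared"), ToyBlock("shared")]
hidden, topk = run_stack(blocks, [1, 9, 5, 3])
```

In this toy, both shared layers update exactly the positions chosen by the first full layer, which is the behavior the return contract is meant to enable.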

```python
nemo_automodel.components.models.glm_moe_dsa.model.Block.init_weights(
    buffer_device: torch.device
)
```

```python
class nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM(
    config: transformers.models.glm_moe_dsa.configuration_glm_moe_dsa.GlmMoeDsaConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

**Bases:** [HFCheckpointingMixin](/nemo-automodel/nemo_automodel/components/models/common/hf_checkpointing_mixin#nemo_automodel-components-models-common-hf_checkpointing_mixin-HFCheckpointingMixin), `Module`, [MoEFSDPSyncMixin](/nemo-automodel/nemo_automodel/components/moe/fsdp_mixin#nemo_automodel-components-moe-fsdp_mixin-MoEFSDPSyncMixin)

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM._is_pipeline_parallel_stage() -> bool
```

Returns `True` when this module is a trimmed pipeline-parallel stage rather than the whole model.

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.forward(
    input_ids: torch.Tensor,
    *carry: torch.Tensor,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    **attn_kwargs: typing.Any
) -> transformers.modeling_outputs.CausalLMOutputWithPast | tuple[torch.Tensor, ...] | torch.Tensor
```

Forward pass.

Single process (no pipeline parallelism): returns
`transformers.modeling_outputs.CausalLMOutputWithPast`, threading the IndexShare
top-k internally (seeded with `None`).

Pipeline parallelism: `input_ids` is the upstream hidden state on non-first stages and
`*carry` holds the previous stage's running top-k selection. Non-last stages return
`(hidden_states, topk_indices)` and the last stage returns the `logits` tensor.

**Parameters:**

`input_ids`: Token IDs (BSHD `[B, S]` / THD `[1, T]`) on the first stage, or the
upstream hidden state on later pipeline stages.

`carry`: Optional `(topk_indices,)` carried from the previous pipeline stage.

`position_ids`, `attention_mask`, `padding_mask`: Optional masks / positions.

`logits_to_keep`: If `0`, project all positions; otherwise project only the last `logits_to_keep` positions.

`output_hidden_states`: When set (single-process), carry final hidden states on the output.

`attn_kwargs`: Additional arguments forwarded to the base model.
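
The `logits_to_keep` rule for integer values can be illustrated with plain slicing. This is a minimal sketch that assumes the model follows the slicing convention commonly used by Hugging Face causal LM heads; the source does not confirm the exact implementation, and `select_positions` is a hypothetical helper operating on a Python list rather than a tensor.

```python
def select_positions(hidden_states: list, logits_to_keep: int = 0) -> list:
    """Return the sequence positions that would be projected to logits."""
    # slice(-0, None) is slice(0, None), so 0 keeps every position;
    # a positive n keeps only the last n positions.
    keep = slice(-logits_to_keep, None)
    return hidden_states[keep]


seq = ["h0", "h1", "h2", "h3"]
all_positions = select_positions(seq, 0)
last_two = select_positions(seq, 2)
```

Restricting the projection to the trailing positions avoids computing logits that the caller will not use, for example during generation where only the last position matters.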

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.from_config(
    config: transformers.models.glm_moe_dsa.configuration_glm_moe_dsa.GlmMoeDsaConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    **kwargs
)
```

classmethod

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.get_input_embeddings()
```

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.get_output_embeddings()
```

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.get_pipeline_stage_metas(
    is_first: bool,
    microbatch_size: int,
    seq_len: int,
    dtype: torch.dtype
) -> tuple[tuple[torch.Tensor, ...], tuple[torch.Tensor, ...]]
```

Declare PP inter-stage I/O metas, threading the IndexShare top-k as a carry tensor.

Non-first stages additionally receive the previous "full" layer's top-k selection, and
non-last stages emit the running selection, so a stage that begins with a "shared" layer
has the top-k it needs (correct at any sequence length).
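
As a rough sketch of the I/O layout this implies, the following pure-Python model describes each tensor by shape and dtype only. Everything beyond the `is_first`, `microbatch_size`, `seq_len`, and `dtype` arguments is an assumption made for illustration: the `Meta` class, the `is_last` flag, the `hidden_size`, `topk`, and `vocab_size` values, and the shape and integer dtype of the carried selection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Meta:
    """Hypothetical shape/dtype descriptor standing in for a meta tensor."""
    shape: tuple
    dtype: str


def stage_metas(is_first, is_last, microbatch_size, seq_len, dtype,
                hidden_size=8, topk=4, vocab_size=32):
    # Assumed shapes: hidden states are [B, S, H]; the carry is [B, S, topk].
    hidden = Meta((microbatch_size, seq_len, hidden_size), dtype)
    carry = Meta((microbatch_size, seq_len, topk), "int64")

    if is_first:
        # The first stage consumes token IDs only.
        inputs = (Meta((microbatch_size, seq_len), "int64"),)
    else:
        # Later stages also receive the running top-k selection.
        inputs = (hidden, carry)

    if is_last:
        # The last stage emits logits only.
        outputs = (Meta((microbatch_size, seq_len, vocab_size), dtype),)
    else:
        # Earlier stages emit hidden states plus the running selection.
        outputs = (hidden, carry)
    return inputs, outputs


first_in, first_out = stage_metas(True, False, 2, 16, "bfloat16")
mid_in, mid_out = stage_metas(False, False, 2, 16, "bfloat16")
last_in, last_out = stage_metas(False, True, 2, 16, "bfloat16")
```

In this layout the outputs of one stage line up with the inputs of the next, which is what allows a stage starting on a "shared" layer to find its selection among its inputs.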

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None
```

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.set_input_embeddings(
    value
)
```

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaForCausalLM.set_output_embeddings(
    new_embeddings
)
```

```python
class nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaModel(
    config: transformers.models.glm_moe_dsa.configuration_glm_moe_dsa.GlmMoeDsaConfig,
    backend: nemo_automodel.components.models.common.BackendConfig,
    moe_config: nemo_automodel.components.moe.layers.MoEConfig | None = None,
    moe_overrides: dict | None = None
)
```

**Bases:** `Module`

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaModel.forward(
    input_ids: torch.Tensor,
    position_ids: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    padding_mask: torch.Tensor | None = None,
    prev_topk_indices: torch.Tensor | None = None,
    **attn_kwargs: typing.Any
) -> tuple[torch.Tensor, torch.Tensor | None]
```

Run the decoder stack, returning `(hidden_states, topk_indices)`.

`prev_topk_indices` seeds the IndexShare running selection (used under pipeline
parallelism, where an earlier "full" layer lives on the previous stage); it is `None`
in the single-process path. The returned `topk_indices` is the running selection at the
end of this stage's layers, so it can be carried to the next pipeline stage.
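
The role of `prev_topk_indices` across stages can be checked with a small pure-Python simulation. `toy_layer` and `run_stage` are hypothetical helpers, and the layer pattern and selection rule are invented for illustration; the sketch only shows that seeding the second stage with the carried selection reproduces the single-process result.

```python
def toy_layer(kind, x, prev_topk, k=2):
    if kind == "full":
        # Fresh selection (toy rule: indices of the k largest values).
        ranked = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
        topk = sorted(ranked[:k])
    else:
        # A "shared" layer cannot run without a selection from a prior "full" layer.
        assert prev_topk is not None, "shared layer needs a carried selection"
        topk = prev_topk
    return [v + 10 if i in topk else v for i, v in enumerate(x)], topk


def run_stage(kinds, x, prev_topk_indices=None):
    # Mirrors the documented contract: returns (hidden_states, topk_indices).
    topk = prev_topk_indices
    for kind in kinds:
        x, topk = toy_layer(kind, x, topk)
    return x, topk


kinds = ["full", "shared", "shared", "full", "shared"]
x0 = [1, 9, 5, 3]

# Single-process path: the selection is seeded with None.
single_hidden, single_topk = run_stage(kinds, x0)

# Two-stage path: the second stage begins with a "shared" layer,
# so it must be seeded with the selection carried from the first stage.
stage0_hidden, carry = run_stage(kinds[:1], x0)
pp_hidden, pp_topk = run_stage(kinds[1:], stage0_hidden, prev_topk_indices=carry)
```

Calling `run_stage(kinds[1:], stage0_hidden)` without the carry fails in this toy, because the stage's first layer has no selection to reuse.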

```python
nemo_automodel.components.models.glm_moe_dsa.model.GlmMoeDsaModel.init_weights(
    buffer_device: torch.device | None = None
) -> None
```

```python
nemo_automodel.components.models.glm_moe_dsa.model.ModelClass = GlmMoeDsaForCausalLM
```