# nemo_automodel.components.models.step3p5.layers

## Module Contents

### Classes

| Name                                                                                                | Description                                                                                               |
| --------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------- |
| [`Step3p5Attention`](#nemo_automodel-components-models-step3p5-layers-Step3p5Attention)             | Step3p5 attention with Q/K per-head RMSNorm, optional head-wise gate, and alternating attention patterns. |
| [`Step3p5MLP`](#nemo_automodel-components-models-step3p5-layers-Step3p5MLP)                         | Step3p5 MLP with SwiGLU activation and optional clamping.                                                 |
| [`Step3p5RMSNorm`](#nemo_automodel-components-models-step3p5-layers-Step3p5RMSNorm)                 | RMSNorm with (weight + 1) scaling used by Step3p5.                                                        |
| [`Step3p5RotaryEmbedding`](#nemo_automodel-components-models-step3p5-layers-Step3p5RotaryEmbedding) | Rotary embedding for Step3p5 with per-layer theta and partial rotary factor support.                      |

### API

```python
class nemo_automodel.components.models.step3p5.layers.Step3p5Attention(
    config: typing.Any,
    layer_idx: int,
    backend: nemo_automodel.components.models.common.BackendConfig
)
```

**Bases:** `Module`

Step3p5 attention with Q/K per-head RMSNorm, optional head-wise gate, and alternating attention patterns.

Key features:

* Q/K per-head normalization using `Step3p5RMSNorm`
* Optional head-wise attention gate (`g_proj` + sigmoid)
* Per-layer RoPE theta and `partial_rotary_factors`
* Sliding window based on the `layer_types` config

```python
nemo_automodel.components.models.step3p5.layers.Step3p5Attention.forward(
    x: torch.Tensor,
    freqs_cis: torch.Tensor | None = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.Tensor | None = None,
    attn_kwargs: typing.Any = {}
) -> torch.Tensor
```

```python
nemo_automodel.components.models.step3p5.layers.Step3p5Attention.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
class nemo_automodel.components.models.step3p5.layers.Step3p5MLP(
    config: typing.Any,
    backend: nemo_automodel.components.models.common.BackendConfig,
    intermediate_size: int | None = None,
    swiglu_limit: float | None = None
)
```

**Bases:** `Module`

Step3p5 MLP with SwiGLU activation and optional clamping.

```python
nemo_automodel.components.models.step3p5.layers.Step3p5MLP.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.models.step3p5.layers.Step3p5MLP.init_weights(
    buffer_device: torch.device,
    init_std: float = 0.02
) -> None
```

```python
class nemo_automodel.components.models.step3p5.layers.Step3p5RMSNorm(
    hidden_size: int,
    eps: float = 1e-05
)
```

**Bases:** `Module`

RMSNorm with (weight + 1) scaling used by Step3p5.

Unlike standard RMSNorm, which computes `x_normed * weight`, Step3p5 computes
`x_normed * (weight + 1)`. The weight is initialized to zeros,
so the scaling factor is initially 1.

Note: Transformer Engine's (TE) fused RMSNorm cannot be used here because the
`(weight + 1)` adjustment cannot be intercepted.
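The `(weight + 1)` scaling can be sketched as a small module. This is a minimal illustration of the formula above, not the library's code; the class name and the choice to compute in float32 before casting back are assumptions.

```python
import torch
from torch import nn


class RMSNormPlusOneSketch(nn.Module):
    """RMSNorm variant that scales by (weight + 1); illustrative only."""

    def __init__(self, hidden_size: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Zero-initialized weight, so the effective scale starts at 1.
        self.weight = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_float = x.float()  # assumed: normalize in float32 for stability
        variance = x_float.pow(2).mean(dim=-1, keepdim=True)
        x_normed = x_float * torch.rsqrt(variance + self.eps)
        return (x_normed * (self.weight.float() + 1.0)).to(x.dtype)
```

Because the weight starts at zero, a freshly constructed module behaves like an unscaled RMS normalization.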

```python
nemo_automodel.components.models.step3p5.layers.Step3p5RMSNorm.forward(
    x: torch.Tensor
) -> torch.Tensor
```

```python
nemo_automodel.components.models.step3p5.layers.Step3p5RMSNorm.reset_parameters() -> None
```

Reset parameters to initial state (zeros).

```python
class nemo_automodel.components.models.step3p5.layers.Step3p5RotaryEmbedding(
    config: typing.Any,
    layer_idx: int
)
```

**Bases:** `Module`

Rotary embedding for Step3p5 with per-layer theta and partial rotary factor support.

```python
nemo_automodel.components.models.step3p5.layers.Step3p5RotaryEmbedding._apply(
    fn
)
```

```python
nemo_automodel.components.models.step3p5.layers.Step3p5RotaryEmbedding._compute_inv_freq(
    device: torch.device | None = None
) -> torch.Tensor
```

Compute inverse frequencies for rotary embeddings.

```python
nemo_automodel.components.models.step3p5.layers.Step3p5RotaryEmbedding.forward(
    x: torch.Tensor,
    position_ids: torch.Tensor
) -> tuple[torch.Tensor, torch.Tensor]
```

Compute cos and sin for rotary embeddings.

**Parameters:**

* `x`: Input tensor (used for dtype and device).
* `position_ids`: Position indices `[batch_size, seq_len]`.

**Returns:** `tuple[torch.Tensor, torch.Tensor]`

Tuple of (cos, sin) tensors.
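The cos/sin computation can be sketched as below. This assumes the common layout in which the frequency vector is duplicated across the two halves of the rotary dimension; the function name and the explicit `inv_freq` argument are hypothetical, and the library's layout may differ.

```python
import torch


def rotary_cos_sin_sketch(
    x: torch.Tensor,             # only used for dtype and device
    position_ids: torch.Tensor,  # [batch_size, seq_len]
    inv_freq: torch.Tensor,      # [rotary_dim // 2]
) -> tuple[torch.Tensor, torch.Tensor]:
    inv_freq = inv_freq.float().to(x.device)
    # Outer product of positions and frequencies: [batch, seq, rotary_dim // 2]
    freqs = position_ids.to(x.device)[..., None].float() * inv_freq
    # Assumed layout: duplicate the angles to cover the full rotary dimension.
    emb = torch.cat((freqs, freqs), dim=-1)
    return emb.cos().to(x.dtype), emb.sin().to(x.dtype)
```

At position 0 every angle is zero, so `cos` is all ones and `sin` is all zeros there.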