nemo_automodel.components.models.step3p5.layers
Module Contents
Classes
API
Bases: Module
Step3p5 attention with Q/K per-head RMSNorm, optional head-wise gate, and alternating attention patterns.
Key features:
- Q/K per-head normalization using Step3p5RMSNorm
- Optional head-wise attention gate (g_proj + sigmoid)
- Per-layer RoPE theta and partial_rotary_factors
- Sliding window based on layer_types config
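The head-wise gate and the per-layer sliding window listed above can be illustrated with a small sketch. It uses plain Python lists instead of tensors so it runs without dependencies. The `layer_types` values (`"full_attention"`, `"sliding_attention"`), the window size, and both helper names are assumptions made for illustration; they are not taken from the module.

```python
import math


def sliding_window_mask(seq_len, window=None):
    """Causal mask; with `window` set, a query sees only its last `window` keys."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q and (window is None or q - k < window)
            row.append(visible)
        mask.append(row)
    return mask


def apply_head_gate(head_outputs, gate_logits):
    """Scale each head's output vector by sigmoid(gate logit) for that head."""
    gated = []
    for out, logit in zip(head_outputs, gate_logits):
        g = 1.0 / (1.0 + math.exp(-logit))
        gated.append([g * v for v in out])
    return gated


# Hypothetical config values: one full-attention layer, two sliding-window layers.
layer_types = ["full_attention", "sliding_attention", "sliding_attention"]
sliding_window = 2

masks = [
    sliding_window_mask(4, sliding_window if t == "sliding_attention" else None)
    for t in layer_types
]
```

In this sketch a gate logit of 0 halves the head output, and a sliding-window layer hides every key further back than `sliding_window` positions from the query.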
Bases: Module
Step3p5 MLP with SwiGLU activation and optional clamping.
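As a rough illustration of SwiGLU with optional clamping, the sketch below works elementwise on already-projected `gate` and `up` values and leaves out the linear projections. The parameter name `limit` and the exact place where the clamp is applied (upper bound on the gate, symmetric bound on the up branch) are assumptions, not confirmed details of the module.

```python
import math


def silu(x):
    """SiLU / swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu(gate, up, limit=None):
    """Elementwise SwiGLU: silu(gate) * up, with an optional clamp.

    Assumption: when `limit` is given, the gate is capped from above and
    the up branch is clamped to [-limit, limit] before they are combined.
    """
    out = []
    for g, u in zip(gate, up):
        if limit is not None:
            g = min(g, limit)
            u = max(-limit, min(u, limit))
        out.append(silu(g) * u)
    return out
```

With `limit=None` the function reduces to plain SwiGLU; a full MLP would apply a down projection to the result.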
Bases: Module
RMSNorm with (weight + 1) scaling used by Step3p5.
Unlike standard RMSNorm, which computes x_normed * weight, Step3p5 computes
x_normed * (weight + 1). The weight is initialized to zeros,
so the scaling factor is initially 1.
Note: Transformer Engine (TE) fused RMSNorm cannot be used here because the (weight + 1) adjustment cannot be intercepted.
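The (weight + 1) scaling can be shown with a minimal, dependency-free sketch over a single vector. The function name and the `eps` default are assumptions for illustration; the real class operates on tensors.

```python
import math


def rms_norm_plus_one(x, weight, eps=1e-6):
    """Normalize x by its root mean square, then scale by (weight + 1)."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (w + 1.0) for v, w in zip(x, weight)]


x = [3.0, 4.0]
# Zero weights (the initial state) leave the normalized values unscaled.
y_init = rms_norm_plus_one(x, [0.0, 0.0])
# A weight of 1.0 on the first channel doubles that channel only.
y_scaled = rms_norm_plus_one(x, [1.0, 0.0])
```

Storing the weight as an offset from 1 means that a zero-initialized parameter corresponds to the identity scaling.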
Reset parameters to initial state (zeros).
Bases: Module
Rotary embedding for Step3p5 with per-layer theta and partial rotary factor support.
Compute inverse frequencies for rotary embeddings.
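A common way to compute rotary inverse frequencies with a partial rotary factor is sketched below. The function name, the argument names, and the rounding of the rotary dimension are assumptions; the per-layer values shown are hypothetical.

```python
def compute_inv_freq(head_dim, rope_theta, partial_rotary_factor=1.0):
    """Return inverse frequencies for the rotated part of each head.

    Only the first int(head_dim * partial_rotary_factor) channels are
    rotated; one frequency is produced per channel pair.
    """
    rotary_dim = int(head_dim * partial_rotary_factor)
    return [1.0 / (rope_theta ** (i / rotary_dim)) for i in range(0, rotary_dim, 2)]


# Hypothetical per-layer settings: each layer gets its own theta and factor.
per_layer = [(10000.0, 0.5), (1000000.0, 1.0)]
inv_freqs = [compute_inv_freq(8, theta, factor) for theta, factor in per_layer]
```

Here the first layer rotates 4 of 8 channels (two frequencies), while the second rotates all 8 (four frequencies).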
Compute cos and sin for rotary embeddings.
Parameters:
- Input tensor (used for dtype and device).
- Position indices, shape [batch_size, seq_len].
Returns: tuple[torch.Tensor, torch.Tensor]
Tuple of (cos, sin) tensors.
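The cos/sin computation can be sketched with nested lists in place of tensors. The function name is hypothetical, and duplicating the angles across both halves of the rotary dimension is an assumption based on the common rotate-half layout, not a confirmed detail of this class.

```python
import math


def rotary_cos_sin(position_ids, inv_freq):
    """Return (cos, sin) nested lists shaped [batch, seq_len, 2 * len(inv_freq)]."""
    cos, sin = [], []
    for seq in position_ids:
        cos_seq, sin_seq = [], []
        for pos in seq:
            # Angle for each frequency at this position.
            angles = [pos * f for f in inv_freq]
            # Assumed layout: repeat the angles for both halves of the rotary dim.
            angles = angles + angles
            cos_seq.append([math.cos(a) for a in angles])
            sin_seq.append([math.sin(a) for a in angles])
        cos.append(cos_seq)
        sin.append(sin_seq)
    return cos, sin


cos, sin = rotary_cos_sin([[0, 1]], [1.0, 0.01])
```

Position 0 always yields cos = 1 and sin = 0, so the first token is left unrotated.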