nemo_automodel.components.models.nemotron_v3.layers#

Module Contents#

Classes#

NemotronV3Attention

Multi-headed attention for NemotronV3 (Nano-v3).

NemotronV3MambaRMSNormGated

Gated RMSNorm for Mamba layers.

NemotronV3Mamba2Mixer

Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).

NemotronV3Block

NemotronV3 decoder block (training-only, simplified).

API#

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention(config)#

Bases: torch.nn.Module

Multi-headed attention for NemotronV3 (Nano-v3).

This is a standard GQA attention module following the NemotronH architecture. Uses PyTorch’s scaled_dot_product_attention (SDPA) for the attention computation. Note: RoPE is not applied in this module, matching the HF NemotronHAttention implementation.
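As a rough illustration of what GQA with SDPA looks like, here is a minimal sketch (shapes and group sizes are hypothetical, not taken from the NemotronV3 config): key/value heads are shared across groups of query heads by expanding them before calling `scaled_dot_product_attention`.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes illustrating grouped-query attention (GQA):
# 8 query heads share 2 key/value heads (group size 4).
batch, seq, n_q_heads, n_kv_heads, head_dim = 2, 16, 8, 2, 32

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Expand KV heads so each group of query heads attends to its shared KV head.
repeat = n_q_heads // n_kv_heads
k = k.repeat_interleave(repeat, dim=1)
v = v.repeat_interleave(repeat, dim=1)

# SDPA applies the causal mask internally when is_causal=True.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 32])
```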

Initialization

forward(
hidden_states: torch.Tensor,
attention_mask: torch.Tensor | None = None,
past_key_values=None,
layer_idx: int | None = None,
) → torch.Tensor#

init_weights(
num_hidden_layers: int,
rescale_prenorm_residual: bool = True,
buffer_device: torch.device | None = None,
) → None#

Initialize attention weights following NemotronV3 spec.
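`rescale_prenorm_residual` usually refers to the GPT-2-style scheme of down-scaling residual-branch output projections by a factor that grows with depth, so activations on the residual stream do not blow up as layers accumulate. A minimal sketch of that scheme, assuming the common `1 / sqrt(2 * num_hidden_layers)` factor (the exact constant used here is not shown in this page):

```python
import math
import torch
import torch.nn as nn

def rescale_out_proj(linear: nn.Linear, num_hidden_layers: int) -> None:
    # Assumed GPT-2-style prenorm residual rescaling: shrink the
    # residual-branch output projection so the residual stream's
    # variance stays bounded as depth increases.
    with torch.no_grad():
        linear.weight.div_(math.sqrt(2 * num_hidden_layers))

proj = nn.Linear(64, 64)
before = proj.weight.norm().item()
rescale_out_proj(proj, num_hidden_layers=8)
after = proj.weight.norm().item()
print(after < before)  # True
```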

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated(
hidden_size: int,
group_size: int,
eps: float = 1e-05,
)#

Bases: torch.nn.Module

Gated RMSNorm for Mamba layers.

Uses the fused triton kernel from mamba_ssm for efficiency.

Initialization

forward(
hidden_states: torch.Tensor,
gate: torch.Tensor | None = None,
) → torch.Tensor#
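The fused triton kernel is opaque, so an unfused reference form can help. A common gated-RMSNorm formulation for Mamba2 (assumed here; the kernel's exact ordering may differ) applies the gate via SiLU first, then RMS-normalizes per group of `group_size` channels:

```python
import torch
import torch.nn.functional as F

def gated_rms_norm(x, gate, weight, group_size, eps=1e-5):
    """Unfused sketch: gate via SiLU, then group-wise RMS normalization."""
    if gate is not None:
        x = x * F.silu(gate)
    b, s, h = x.shape
    # Normalize each contiguous group of `group_size` channels independently.
    xg = x.view(b, s, h // group_size, group_size)
    var = xg.pow(2).mean(dim=-1, keepdim=True)
    xg = xg * torch.rsqrt(var + eps)
    return xg.view(b, s, h) * weight

hidden = 64
x = torch.randn(2, 8, hidden)
gate = torch.randn(2, 8, hidden)
weight = torch.ones(hidden)
y = gated_rms_norm(x, gate, weight, group_size=16)
print(y.shape)  # torch.Size([2, 8, 64])
```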
class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer(config, layer_idx: int)#

Bases: torch.nn.Module

Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).

This implementation uses the fused mamba_split_conv1d_scan_combined kernel for maximum training efficiency. Does not support inference caching.

Requires mamba_ssm and causal_conv1d packages.
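Since both packages are optional, compiled dependencies, a guarded import is a reasonable pattern for checking availability up front (the kernel's module path below is an assumption based on typical `mamba_ssm` layouts; verify against your installed version):

```python
# Guarded imports: mamba_ssm and causal_conv1d are optional compiled
# dependencies, so record availability instead of failing at import time.
try:
    from mamba_ssm.ops.triton.ssd_combined import (  # noqa: F401
        mamba_split_conv1d_scan_combined,
    )
    import causal_conv1d  # noqa: F401
    HAVE_FUSED_KERNELS = True
except ImportError:
    HAVE_FUSED_KERNELS = False

print(type(HAVE_FUSED_KERNELS))  # <class 'bool'>
```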

Initialization

forward(
hidden_states: torch.Tensor,
attention_mask: torch.Tensor | None = None,
past_key_values=None,
cache_position: torch.LongTensor | None = None,
) → torch.Tensor#

Forward pass with three code paths.

Path A (training): past_key_values is None → fully-fused kernel.
Path B (prefill): past_key_values present, seq_len > 1 → unfused scan + cache init.
Path C (decode): past_key_values present, seq_len == 1, has_previous_state → single-step update.

Parameters:
  • hidden_states – Input tensor of shape (batch, seq_len, hidden_size)

  • attention_mask – Optional attention mask (applied to padding)

  • past_key_values – Optional NemotronHybridCache instance.

  • cache_position – Token positions for cache updates.

Returns:

Output tensor of shape (batch, seq_len, hidden_size)
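The three-path dispatch can be summarized as a small selector function (names and the fallback for a single-token call without previous state are illustrative, not lifted from the implementation):

```python
def select_mamba_path(past_key_values, seq_len, has_previous_state=False):
    """Mirror the three-path dispatch described above (illustrative only)."""
    if past_key_values is None:
        # Path A: pure training, no cache -> fully-fused kernel.
        return "A"
    if seq_len > 1:
        # Path B: prefill -> unfused scan, initializes the cache.
        return "B"
    if has_previous_state:
        # Path C: decode -> single-step state update.
        return "C"
    # Assumed fallback: single token but no prior state is treated as prefill.
    return "B"

print(select_mamba_path(None, 128))  # A
```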

init_weights(
num_hidden_layers: int,
rescale_prenorm_residual: bool = True,
buffer_device: torch.device | None = None,
) → None#

Initialize Mamba2Mixer weights following NemotronV3 spec.

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block(config, layer_idx: int, moe_config=None, backend=None)#

Bases: torch.nn.Module

NemotronV3 decoder block (training-only, simplified).

Pre-norm architecture: norm → mixer → residual add.
Supports hybrid layer types: Mamba, Attention, MLP, MoE.
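The pre-norm residual pattern is compact enough to sketch directly. This toy version uses `nn.LayerNorm` as a stand-in for the block's RMSNorm and an arbitrary mixer, just to show the norm → mixer → residual-add wiring:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Minimal pre-norm residual block: norm -> mixer -> residual add."""

    def __init__(self, hidden_size: int, mixer: nn.Module):
        super().__init__()
        # LayerNorm stands in for the RMSNorm used by the real block.
        self.norm = nn.LayerNorm(hidden_size)
        self.mixer = mixer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, mix, then add the unnormalized residual back.
        return x + self.mixer(self.norm(x))

block = PreNormBlock(32, nn.Linear(32, 32))
y_block = block(torch.randn(2, 4, 32))
print(y_block.shape)  # torch.Size([2, 4, 32])
```

The residual add on the raw input (rather than the normalized one) is what makes this "pre-norm": the residual stream bypasses both the norm and the mixer.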

Initialization

Initialize NemotronV3Block.

Parameters:
  • config – Model configuration with layers_block_type attribute

  • layer_idx – Index of this layer in the model

  • moe_config – MoE configuration (required for MoE layers)

  • backend – Backend configuration (optional)

property mlp#

Return mixer for MoE blocks for compatibility with parallelizer.

forward(
hidden_states: torch.Tensor,
attention_mask: torch.Tensor | None = None,
past_key_values=None,
cache_position: torch.LongTensor | None = None,
) → torch.Tensor#

Forward pass through the block.

Parameters:
  • hidden_states – Input tensor of shape (batch, seq_len, hidden_size)

  • attention_mask –

    Mask tensor - type depends on layer:

    • For attention: 4D causal mask [batch, 1, seq_len, seq_len]

    • For mamba: 2D padding mask [batch, seq_len]

    • For mlp/moe: None

  • past_key_values – Optional NemotronHybridCache for KV/SSM caching.

  • cache_position – Token position indices for cache updates.

Returns:

Output tensor of shape (batch, seq_len, hidden_size)

init_weights(buffer_device: torch.device | None = None) → None#

Initialize block weights following NemotronV3 spec.

Parameters:

buffer_device – Device for buffer initialization (used by MLP/MoE)