nemo_automodel.components.models.nemotron_v3.layers

Module Contents

Classes

Name	Description
`NemotronV3Attention`	GQA attention for NemotronV3 (no RoPE), compatible with TE/SDPA backends.
`NemotronV3Block`	NemotronV3 decoder block (training-only, simplified).
`NemotronV3Mamba2Mixer`	Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).
`NemotronV3MambaRMSNormGated`	Gated RMSNorm for Mamba layers.

API

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None
)

Bases: Module

GQA attention for NemotronV3 (no RoPE), compatible with TE/SDPA backends.

attention_bias

= getattr(config, 'attention_bias', False)

attention_dropout

= getattr(config, 'attention_dropout', 0.0)

backend

= backend or BackendConfig()

head_dim

= config.head_dim

hidden_size

= config.hidden_size

k_proj

num_attention_heads

= config.num_attention_heads

num_hidden_layers

= int(getattr(config, 'num_hidden_layers', 0))

num_key_value_heads

= config.num_key_value_heads

o_proj

q_proj

v_proj

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention.forward(
    hidden_states: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    past_key_values = None,
    layer_idx: int | None = None,
    attn_kwargs = {}
) -> torch.Tensor

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention.init_weights(
    num_hidden_layers: int,
    rescale_prenorm_residual: bool = True,
    buffer_device: torch.device | None = None
) -> None
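
The class summary describes grouped-query attention without rotary embeddings. As a rough illustration only, the core computation can be sketched in plain PyTorch. The function name `gqa_attention_sketch` and the tensor layout are assumptions for this example; the actual class also owns the `q_proj`/`k_proj`/`v_proj`/`o_proj` projections and may dispatch to a TE backend instead of SDPA.

```python
import torch
import torch.nn.functional as F


def gqa_attention_sketch(q, k, v, is_causal=True):
    """Hypothetical helper: grouped-query attention with no positional rotation.

    q:    (batch, num_attention_heads, seq_len, head_dim)
    k, v: (batch, num_key_value_heads, seq_len, head_dim)
    """
    # Each key/value head is shared by n_rep query heads.
    n_rep = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    # No RoPE is applied to q or k before the dot product.
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)


torch.manual_seed(0)
q = torch.randn(1, 4, 5, 8)  # 4 query heads
k = torch.randn(1, 2, 5, 8)  # 2 key/value heads
v = torch.randn(1, 2, 5, 8)
out = gqa_attention_sketch(q, k, v)
```

With a causal mask, the first position can only attend to itself, so its output equals the value vector of the shared key/value head at that position.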

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block(
    config,
    layer_idx: int,
    moe_config = None,
    backend = None,
    block_type: str | None = None
)

Bases: Module

NemotronV3 decoder block (training-only, simplified).

Pre-norm architecture: norm → mixer → residual add.

Supports hybrid layer types: Mamba, Attention, MLP, MoE.
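
The pre-norm data flow (norm, then mixer, then residual add) can be sketched as follows. This is a minimal illustration, not the library's implementation: `PreNormBlockSketch` is a hypothetical name, `nn.LayerNorm` stands in for whatever normalization the real block uses, and the handling of `residual_in_fp32` shown here is an assumption.

```python
import torch
from torch import nn


class PreNormBlockSketch(nn.Module):
    """Hypothetical stand-in: norm -> mixer -> residual add."""

    def __init__(self, hidden_size: int, mixer: nn.Module, residual_in_fp32: bool = False):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)  # stand-in norm (assumption)
        self.mixer = mixer                     # Mamba, attention, MLP, or MoE module
        self.residual_in_fp32 = residual_in_fp32

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        # Normalize first, then apply the mixer to the normalized input.
        hidden_states = self.mixer(self.norm(hidden_states))
        if self.residual_in_fp32:
            # Assumed behavior: keep the residual stream in float32.
            residual = residual.to(torch.float32)
        return residual + hidden_states


torch.manual_seed(0)
x = torch.randn(2, 4, 8)
block = PreNormBlockSketch(hidden_size=8, mixer=nn.Linear(8, 8))
out = block(x)
```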

block_type

layer_type

Map block_type to MoE parallelizer’s layer_type convention.

mixer

= NemotronV3Mamba2Mixer(config, layer_idx=layer_idx)

mlp

Return mixer for MoE blocks for compatibility with parallelizer.

norm

residual_in_fp32

= getattr(config, 'residual_in_fp32', False)

self_attn

Alias for mixer, for compatibility with MoE parallelizer.

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block.forward(
    hidden_states: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    past_key_values = None,
    cache_position: torch.LongTensor | None = None,
    attn_kwargs = {}
) -> torch.Tensor

Forward pass through the block.

Parameters:

hidden_states

torch.Tensor

Input tensor of shape (batch, seq_len, hidden_size)

attention_mask

torch.Tensor | None. Defaults to None

Mask tensor - type depends on layer:

For attention: 4D causal mask [batch, 1, seq_len, seq_len]
For mamba: 2D padding mask [batch, seq_len]
For mlp/moe: None

past_key_values

Defaults to None

Optional NemotronHybridCache for KV/SSM caching.

cache_position

torch.LongTensor | None. Defaults to None

Token position indices for cache updates.

**attn_kwargs

Defaults to {}

Additional keyword arguments forwarded to attention layers only (e.g. cu_seqlens, cp_size, cp_rank for Context Parallelism).

Returns: torch.Tensor

Output tensor of shape (batch, seq_len, hidden_size)
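
To make the two mask shapes listed for `attention_mask` concrete, the sketch below builds a 2D padding mask and a 4D causal mask from it. Whether the real layers expect boolean masks (as here) or additive float masks is not stated in this reference, so the boolean convention is an assumption.

```python
import torch

batch, seq_len = 2, 4

# 2D padding mask [batch, seq_len] for Mamba layers: True marks real tokens.
padding_mask = torch.tensor(
    [[True, True, True, True],
     [True, True, False, False]]
)

# 4D causal mask [batch, 1, seq_len, seq_len] for attention layers:
# True means the query position (row) may attend to the key position (column).
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_mask = causal[None, None, :, :] & padding_mask[:, None, None, :]

# MLP and MoE layers take no mask at all.
mlp_mask = None
```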

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block.init_weights(
    buffer_device: torch.device | None = None
) -> None

Initialize block weights following NemotronV3 spec.

Parameters:

buffer_device

torch.device | None. Defaults to None

Device for buffer initialization (used by MLP/MoE)

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer(
    config,
    layer_idx: int
)

Bases: Module

Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).

This implementation uses the fused mamba_split_conv1d_scan_combined kernel for maximum training efficiency. Does not support inference caching.

Requires mamba_ssm and causal_conv1d packages.

A_log

= nn.Parameter(torch.log(A))

D

= nn.Parameter(torch.ones(self.num_heads))

activation

= config.mamba_hidden_act

chunk_size

= config.chunk_size

conv1d

conv_dim

conv_kernel_size

= config.conv_kernel

dt_bias

= nn.Parameter(torch.ones(self.num_heads))

head_dim

= config.mamba_head_dim

hidden_size

= config.hidden_size

in_proj

intermediate_size

= self.num_heads * self.head_dim

n_groups

= config.n_groups

norm

num_heads

= config.mamba_num_heads

out_proj

ssm_state_size

= config.ssm_state_size

time_step_floor

= config.time_step_floor

time_step_limit

= config.time_step_limit

time_step_max

= config.time_step_max

time_step_min

= config.time_step_min

use_conv_bias

= config.use_conv_bias

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer.forward(
    hidden_states: torch.Tensor,
    attention_mask: torch.Tensor | None = None,
    past_key_values = None,
    cache_position: torch.LongTensor | None = None,
    kwargs = {}
) -> torch.Tensor

Forward pass with three code paths.

Path A (training): past_key_values is None → fully-fused kernel.
Path B (prefill): past_key_values present, seq_len > 1 → unfused scan + cache init.
Path C (decode): past_key_values present, seq_len == 1, has_previous_state → single-step update.

Parameters:

hidden_states

torch.Tensor

Input tensor of shape (batch, seq_len, hidden_size)

attention_mask

torch.Tensor | None. Defaults to None

Optional attention mask (applied to padding)

past_key_values

Defaults to None

Optional NemotronHybridCache instance.

cache_position

torch.LongTensor | None. Defaults to None

Token positions for cache updates.

Returns: torch.Tensor

Output tensor of shape (batch, seq_len, hidden_size)
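
The selection among the three code paths described for `forward` can be sketched as a small dispatch function. The name `select_mamba_path` and its boolean arguments are hypothetical. The reference does not say what happens when a cache is present with `seq_len == 1` but no previous state, so falling back to the prefill path in that case is an assumption.

```python
def select_mamba_path(has_cache: bool, seq_len: int, has_previous_state: bool = False) -> str:
    """Hypothetical helper returning "A", "B", or "C" for the documented paths."""
    if not has_cache:
        # Path A (training): no cache, fully-fused kernel.
        return "A"
    if seq_len == 1 and has_previous_state:
        # Path C (decode): single-step state update.
        return "C"
    # Path B (prefill): unfused scan plus cache initialization.
    # Also used here as the fallback for the undocumented case (assumption).
    return "B"


training = select_mamba_path(has_cache=False, seq_len=128)
prefill = select_mamba_path(has_cache=True, seq_len=128)
decode = select_mamba_path(has_cache=True, seq_len=1, has_previous_state=True)
```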

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer.init_weights(
    num_hidden_layers: int,
    rescale_prenorm_residual: bool = True,
    buffer_device: torch.device | None = None
) -> None

Initialize Mamba2Mixer weights following NemotronV3 spec.

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated(
    hidden_size: int,
    group_size: int,
    eps: float = 1e-05
)

Bases: Module

Gated RMSNorm for Mamba layers.

Uses the fused triton kernel from mamba_ssm for efficiency.

weight

= nn.Parameter(torch.ones(hidden_size))

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated.forward(
    hidden_states: torch.Tensor,
    gate: torch.Tensor | None = None
) -> torch.Tensor
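
For readers without the fused kernel at hand, a plain PyTorch reference of a gated, group-wise RMSNorm might look like the sketch below. It is not the mamba_ssm kernel: the function name is hypothetical, and both the SiLU gate and the choice to apply the gate before normalization are assumptions based on common Mamba2 implementations.

```python
import torch
import torch.nn.functional as F


def gated_rmsnorm_reference(hidden_states, weight, gate=None, group_size=None, eps=1e-5):
    """Hypothetical unfused reference: optional SiLU gate, then RMSNorm per group."""
    input_dtype = hidden_states.dtype
    x = hidden_states.float()
    if gate is not None:
        # Assumed order: gate first, normalize second.
        x = x * F.silu(gate.float())
    hidden_size = x.shape[-1]
    group_size = group_size or hidden_size
    # Split the last dimension into groups and normalize each group separately.
    grouped = x.reshape(*x.shape[:-1], hidden_size // group_size, group_size)
    variance = grouped.pow(2).mean(dim=-1, keepdim=True)
    grouped = grouped * torch.rsqrt(variance + eps)
    normed = grouped.reshape(*x.shape)
    return (weight.float() * normed).to(input_dtype)


torch.manual_seed(0)
x = torch.randn(2, 3, 8)
weight = torch.ones(8)
out = gated_rmsnorm_reference(x, weight, group_size=4)
gated = gated_rmsnorm_reference(x, weight, gate=torch.randn(2, 3, 8), group_size=4)
```

With a weight of all ones, each group of the output has a root mean square close to 1, which is a quick sanity check for any replacement implementation.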