nemo_automodel.components.models.nemotron_v3.layers

Module Contents

Classes

| Name | Description |
| --- | --- |
| NemotronV3Attention | GQA attention for NemotronV3 (no RoPE), compatible with TE/SDPA backends. |
| NemotronV3Block | NemotronV3 decoder block (training-only, simplified). |
| NemotronV3Mamba2Mixer | Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels). |
| NemotronV3MambaRMSNormGated | Gated RMSNorm for Mamba layers. |

API

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None
)

Bases: Module

GQA attention for NemotronV3 (no RoPE), compatible with TE/SDPA backends.

attention_bias
= getattr(config, 'attention_bias', False)
attention_dropout
= getattr(config, 'attention_dropout', 0.0)
backend
= backend or BackendConfig()
head_dim
= config.head_dim
hidden_size
= config.hidden_size
k_proj
num_attention_heads
= config.num_attention_heads
num_hidden_layers
= int(getattr(config, 'num_hidden_layers', 0))
num_key_value_heads
= config.num_key_value_heads
o_proj
q_proj
v_proj
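
The attributes above suggest a conventional grouped-query layout: optional config fields fall back to defaults through `getattr`, and the key/value projections are narrower than the query projection. The following is a minimal sketch under stated assumptions, not the library's implementation. The helper names (`resolve_attention_config`, `projection_shapes`, `kv_head_for_query_head`) are hypothetical, and the `(out_features, in_features)` shapes assume `nn.Linear`-style projections, which this page does not confirm.

```python
from types import SimpleNamespace


def resolve_attention_config(config):
    """Mirror the attribute lookups listed above; optional fields use defaults."""
    return {
        "hidden_size": config.hidden_size,
        "head_dim": config.head_dim,
        "num_attention_heads": config.num_attention_heads,
        "num_key_value_heads": config.num_key_value_heads,
        "attention_bias": getattr(config, "attention_bias", False),
        "attention_dropout": getattr(config, "attention_dropout", 0.0),
        "num_hidden_layers": int(getattr(config, "num_hidden_layers", 0)),
    }


def projection_shapes(attrs):
    """Assumed (out_features, in_features) shapes for the four projections."""
    q_out = attrs["num_attention_heads"] * attrs["head_dim"]
    kv_out = attrs["num_key_value_heads"] * attrs["head_dim"]
    hidden = attrs["hidden_size"]
    return {
        "q_proj": (q_out, hidden),
        "k_proj": (kv_out, hidden),
        "v_proj": (kv_out, hidden),
        "o_proj": (hidden, q_out),
    }


def kv_head_for_query_head(q_head, attrs):
    """Each run of consecutive query heads shares one key/value head."""
    group_size = attrs["num_attention_heads"] // attrs["num_key_value_heads"]
    return q_head // group_size


# Illustrative values only; not taken from any released checkpoint.
config = SimpleNamespace(
    hidden_size=1024, head_dim=64, num_attention_heads=16, num_key_value_heads=4
)
attrs = resolve_attention_config(config)
shapes = projection_shapes(attrs)
```

With these toy values, four query heads share each key/value head, so `k_proj` and `v_proj` produce a quarter of the features that `q_proj` does.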
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention.forward(
hidden_states: torch.Tensor,
attention_mask: torch.Tensor | None = None,
past_key_values = None,
layer_idx: int | None = None,
**attn_kwargs
) -> torch.Tensor
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention.init_weights(
num_hidden_layers: int,
rescale_prenorm_residual: bool = True,
buffer_device: torch.device | None = None
) -> None
class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block(
config,
layer_idx: int,
moe_config = None,
backend = None,
block_type: str | None = None
)

Bases: Module

NemotronV3 decoder block (training-only, simplified).

Pre-norm architecture: norm → mixer → residual add.

Supports hybrid layer types: Mamba, Attention, MLP, MoE.
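
The pre-norm ordering can be illustrated with a small pure-Python sketch that operates on a single token vector. It assumes the `norm` is an RMS normalization (this page does not state which norm the block uses), and `toy_mixer` is a hypothetical stand-in for whichever mixer the block type selects.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Root-mean-square normalization over one token's feature vector."""
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def pre_norm_block(x, norm_weight, mixer):
    """norm -> mixer -> residual add, applied to a single token vector."""
    residual = x                       # saved before normalization
    hidden = rms_norm(x, norm_weight)  # normalize the block input
    hidden = mixer(hidden)             # Mamba, attention, MLP, or MoE
    return [r + h for r, h in zip(residual, hidden)]


def toy_mixer(h):
    # Hypothetical stand-in for the real mixer: scales every feature by 0.5.
    return [0.5 * v for v in h]


x = [3.0, 4.0]
out = pre_norm_block(x, norm_weight=[1.0, 1.0], mixer=toy_mixer)
```

Because the residual is taken from the un-normalized input, a mixer that returns zeros leaves the token unchanged, which is the property that makes pre-norm stacks easy to train at depth.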

block_type
layer_type

Map block_type to MoE parallelizer’s layer_type convention.

mixer
= NemotronV3Mamba2Mixer(config, layer_idx=layer_idx)
mlp

Return mixer for MoE blocks for compatibility with parallelizer.

norm
residual_in_fp32
= getattr(config, 'residual_in_fp32', False)
self_attn

Alias for mixer, for compatibility with MoE parallelizer.

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block.forward(
hidden_states: torch.Tensor,
attention_mask: torch.Tensor | None = None,
past_key_values = None,
cache_position: torch.LongTensor | None = None,
**attn_kwargs
) -> torch.Tensor

Forward pass through the block.

Parameters:

hidden_states
torch.Tensor

Input tensor of shape (batch, seq_len, hidden_size)

attention_mask
torch.Tensor | None
Defaults to None

Mask tensor whose form depends on the layer type:

  • For attention: 4D causal mask [batch, 1, seq_len, seq_len]
  • For mamba: 2D padding mask [batch, seq_len]
  • For mlp/moe: None
past_key_values
Defaults to None

Optional NemotronHybridCache for KV/SSM caching.

cache_position
torch.LongTensor | None
Defaults to None

Token position indices for cache updates.

**attn_kwargs
Defaults to {}

Additional keyword arguments forwarded to attention layers only (e.g. cu_seqlens, cp_size, cp_rank for Context Parallelism).

Returns: torch.Tensor

Output tensor of shape (batch, seq_len, hidden_size)
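
The per-layer mask forms described for `attention_mask` can be sketched with nested lists standing in for tensors. This is an illustration only: it assumes a boolean causal mask where `True` means "may attend", right-padded sequences, and hypothetical `block_type` strings; the real module may use an additive mask or different type names.

```python
def causal_mask_4d(batch, seq_len):
    """Boolean mask shaped [batch, 1, seq_len, seq_len]; True means 'may attend'."""
    tri = [[k <= q for k in range(seq_len)] for q in range(seq_len)]
    return [[[row[:] for row in tri]] for _ in range(batch)]


def padding_mask_2d(lengths, seq_len):
    """Mask shaped [batch, seq_len]; 1 marks a real token, 0 marks padding."""
    return [[1 if t < n else 0 for t in range(seq_len)] for n in lengths]


def mask_for_block(block_type, lengths, seq_len):
    """Pick the mask form a block expects (block_type strings are hypothetical)."""
    if block_type == "attention":
        return causal_mask_4d(len(lengths), seq_len)
    if block_type == "mamba":
        return padding_mask_2d(lengths, seq_len)
    return None  # mlp / moe blocks take no mask


attn_mask = mask_for_block("attention", lengths=[3, 2], seq_len=3)
mamba_mask = mask_for_block("mamba", lengths=[3, 2], seq_len=3)
mlp_mask = mask_for_block("mlp", lengths=[3, 2], seq_len=3)
```

A hybrid model that interleaves these block types would therefore need to prepare both mask forms once per batch and hand each block the one that matches its type.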

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block.init_weights(
buffer_device: torch.device | None = None
) -> None

Initialize block weights following NemotronV3 spec.

Parameters:

buffer_device
torch.device | None
Defaults to None

Device for buffer initialization (used by MLP/MoE)

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer(
config,
layer_idx: int
)

Bases: Module

Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).

This implementation uses the fused mamba_split_conv1d_scan_combined kernel for maximum training efficiency. Does not support inference caching.

Requires mamba_ssm and causal_conv1d packages.

A_log
= nn.Parameter(torch.log(A))
D
= nn.Parameter(torch.ones(self.num_heads))
activation
= config.mamba_hidden_act
chunk_size
= config.chunk_size
conv1d
conv_dim
conv_kernel_size
= config.conv_kernel
dt_bias
= nn.Parameter(torch.ones(self.num_heads))
head_dim
= config.mamba_head_dim
hidden_size
= config.hidden_size
in_proj
intermediate_size
= self.num_heads * self.head_dim
n_groups
= config.n_groups
norm
num_heads
= config.mamba_num_heads
out_proj
ssm_state_size
= config.ssm_state_size
time_step_floor
= config.time_step_floor
time_step_limit
= config.time_step_limit
time_step_max
= config.time_step_max
time_step_min
= config.time_step_min
use_conv_bias
= config.use_conv_bias
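
The derived sizes in this attribute list can be worked through numerically. Only `intermediate_size = num_heads * head_dim` is stated on this page; the formulas for `conv_dim` and the `in_proj` output width below follow the layout commonly used by Mamba2 implementations and are assumptions here. `mamba2_dims` is a hypothetical helper, and the config values are illustrative.

```python
from types import SimpleNamespace


def mamba2_dims(config):
    """Compute derived Mamba2 sizes from config fields named on this page."""
    # Stated in the attribute list above.
    intermediate_size = config.mamba_num_heads * config.mamba_head_dim
    # Assumption: the depthwise conv runs over x, B, and C concatenated,
    # where B and C each have n_groups * ssm_state_size features.
    conv_dim = intermediate_size + 2 * config.n_groups * config.ssm_state_size
    # Assumption: in_proj emits the gate, the conv input, and one dt per head.
    in_proj_out = intermediate_size + conv_dim + config.mamba_num_heads
    return {
        "intermediate_size": intermediate_size,
        "conv_dim": conv_dim,
        "in_proj_out": in_proj_out,
    }


# Illustrative values only; not taken from any released checkpoint.
config = SimpleNamespace(
    mamba_num_heads=8, mamba_head_dim=64, n_groups=2, ssm_state_size=128
)
dims = mamba2_dims(config)
```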
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer.forward(
hidden_states: torch.Tensor,
attention_mask: torch.Tensor | None = None,
past_key_values = None,
cache_position: torch.LongTensor | None = None,
**kwargs
) -> torch.Tensor

Forward pass with three code paths.

  • Path A (training): past_key_values is None → fully-fused kernel.
  • Path B (prefill): past_key_values present, seq_len > 1 → unfused scan + cache init.
  • Path C (decode): past_key_values present, seq_len == 1, has_previous_state → single-step update.
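
The dispatch among the three paths can be written as a small decision function. This is a sketch of the documented conditions only; `select_mamba_path` is a hypothetical helper, and routing the undocumented case (cache present, `seq_len == 1`, no previous state) to the prefill path is an assumption.

```python
def select_mamba_path(past_key_values, seq_len, has_previous_state=False):
    """Return 'A', 'B', or 'C' following the conditions described above."""
    if past_key_values is None:
        return "A"  # training: fully fused kernel
    if seq_len == 1 and has_previous_state:
        return "C"  # decode: single-step state update
    # seq_len > 1 is the documented prefill case. Treating a single token
    # with no previous state as prefill is an assumption of this sketch.
    return "B"  # prefill: unfused scan + cache init


cache = object()  # stand-in for a NemotronHybridCache instance
paths = [
    select_mamba_path(None, seq_len=128),
    select_mamba_path(cache, seq_len=128),
    select_mamba_path(cache, seq_len=1, has_previous_state=True),
]
```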

Parameters:

hidden_states
torch.Tensor

Input tensor of shape (batch, seq_len, hidden_size)

attention_mask
torch.Tensor | None
Defaults to None

Optional 2D padding mask of shape [batch, seq_len]; padded positions are masked out.

past_key_values
Defaults to None

Optional NemotronHybridCache instance.

cache_position
torch.LongTensor | None
Defaults to None

Token positions for cache updates.

Returns: torch.Tensor

Output tensor of shape (batch, seq_len, hidden_size)
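
To give a feel for what a single-step update (Path C) computes, here is the standard Mamba2-style recurrence for one head and one channel in plain Python. It is a sketch under assumptions rather than the module's code: it omits the convolution, the clamping by `time_step_limit`, the gate, and the final norm, and `ssm_decode_step` is a hypothetical name. The parameter names `A_log`, `D`, and `dt_bias` match the attributes listed for the mixer.

```python
import math


def softplus(v):
    """Smooth positive mapping used to turn a raw value into a step size."""
    return math.log1p(math.exp(v))


def ssm_decode_step(state, x, dt_raw, dt_bias, A_log, B, C, D):
    """One recurrent step for a single head and channel.

    state, B, and C are lists of length ssm_state_size; x is a scalar input.
    """
    dt = softplus(dt_raw + dt_bias)  # positive step size
    A = -math.exp(A_log)             # negative, so the decay lies in (0, 1)
    decay = math.exp(dt * A)
    # Decay the old state and add the new input, weighted by B.
    new_state = [decay * h + dt * b * x for h, b in zip(state, B)]
    # Read out through C and add the skip connection D * x.
    y = sum(c * h for c, h in zip(C, new_state)) + D * x
    return y, new_state


# With dt = ln(2) and A = -1, the state halves at each step when x == 0.
y, state = ssm_decode_step(
    state=[1.0, 0.0], x=0.0, dt_raw=0.0, dt_bias=0.0,
    A_log=0.0, B=[1.0, 2.0], C=[1.0, 1.0], D=1.0,
)
```

Only the fixed-size `state` has to be carried between decode steps, which is why a state-space layer's cache does not grow with sequence length the way a KV cache does.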

nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer.init_weights(
num_hidden_layers: int,
rescale_prenorm_residual: bool = True,
buffer_device: torch.device | None = None
) -> None

Initialize Mamba2Mixer weights following NemotronV3 spec.
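
The `rescale_prenorm_residual` flag is not described on this page. In GPT-2-style initialization schemes, a flag with this name usually means that the weights of the projection feeding the residual stream are scaled down by the square root of the number of residual contributions. The sketch below shows that convention only; whether NemotronV3 follows it exactly is an assumption, and `rescaled_std` and `residuals_per_layer` are hypothetical names.

```python
import math


def rescaled_std(base_std, num_hidden_layers, residuals_per_layer=1,
                 rescale_prenorm_residual=True):
    """Std for an output projection under GPT-2-style residual rescaling."""
    if not rescale_prenorm_residual or num_hidden_layers <= 0:
        return base_std  # nothing to rescale
    return base_std / math.sqrt(residuals_per_layer * num_hidden_layers)


std_16_layers = rescaled_std(0.02, num_hidden_layers=16)
std_disabled = rescaled_std(0.02, num_hidden_layers=16, rescale_prenorm_residual=False)
```

The intent of such a rescaling is to keep the variance of the residual stream roughly constant as depth grows, since every layer adds its output to the same running sum.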

class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated(
hidden_size: int,
group_size: int,
eps: float = 1e-05
)

Bases: Module

Gated RMSNorm for Mamba layers.

Uses the fused triton kernel from mamba_ssm for efficiency.

weight
= nn.Parameter(torch.ones(hidden_size))
nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated.forward(
hidden_states: torch.Tensor,
gate: torch.Tensor | None = None
) -> torch.Tensor
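
For reference, the computation a gated, grouped RMSNorm performs can be sketched in plain Python on a single feature vector. This assumes the gate is passed through SiLU and applied before normalization, which is the ordering Mamba2 implementations commonly use; the fused Triton kernel mentioned above may differ in detail. `gated_rms_norm` is a hypothetical name.

```python
import math


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_rms_norm(hidden, weight, group_size, gate=None, eps=1e-5):
    """Gate with SiLU, then RMS-normalize each contiguous group of features."""
    if len(hidden) % group_size != 0:
        raise ValueError("hidden size must be divisible by group_size")
    if gate is not None:
        hidden = [h * silu(g) for h, g in zip(hidden, gate)]
    normed = []
    for start in range(0, len(hidden), group_size):
        group = hidden[start:start + group_size]
        mean_square = sum(v * v for v in group) / group_size
        inv_rms = 1.0 / math.sqrt(mean_square + eps)
        normed.extend(v * inv_rms for v in group)
    # Learned per-feature scale, initialized to ones in the module above.
    return [n * w for n, w in zip(normed, weight)]


ungated = gated_rms_norm([3.0, 4.0, 0.0, 2.0], [1.0] * 4, group_size=2)
gated_off = gated_rms_norm([3.0, 4.0, 0.0, 2.0], [1.0] * 4, group_size=2,
                           gate=[0.0, 0.0, 0.0, 0.0])
```

Each group is normalized by its own statistics, so the second group `[0.0, 2.0]` is scaled independently of the first.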