nemo_automodel.components.models.nemotron_v3.layers#
Module Contents#
Classes#
NemotronV3Attention – Multi-headed attention for NemotronV3 (Nano-v3).
NemotronV3MambaRMSNormGated – Gated RMSNorm for Mamba layers.
NemotronV3Mamba2Mixer – Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).
NemotronV3Block – NemotronV3 decoder block (training-only, simplified).
API#
- class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Attention(config)#
Bases: torch.nn.Module
Multi-headed attention for NemotronV3 (Nano-v3).
This is a standard GQA attention module following the NemotronH architecture. Uses PyTorch's scaled_dot_product_attention (SDPA) for the attention computation. Note: RoPE is not applied in this module, matching the HF NemotronHAttention implementation.
Initialization
- forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor | None = None)#
- init_weights(num_hidden_layers: int, rescale_prenorm_residual: bool = True, buffer_device: torch.device | None = None)#
Initialize attention weights following NemotronV3 spec.
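The GQA-over-SDPA mechanism described above can be illustrated with a minimal, self-contained sketch. This is illustrative only: the shapes, head counts, and key/value expansion below are assumptions, not the module's actual code.

```python
# Minimal GQA + SDPA sketch (illustrative; not NemotronV3Attention's actual code).
import torch
import torch.nn.functional as F

batch, seq_len, num_heads, num_kv_heads, head_dim = 2, 128, 16, 4, 64

q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Grouped-query attention: each group of query heads shares one K/V head,
# so K/V are repeated to match the query head count before calling SDPA.
k = k.repeat_interleave(num_heads // num_kv_heads, dim=1)
v = v.repeat_interleave(num_heads // num_kv_heads, dim=1)

# No RoPE is applied here, matching the note above.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 16, 128, 64]) -> (batch, num_heads, seq_len, head_dim)
```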
- class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3MambaRMSNormGated(hidden_size: int, group_size: int, eps: float = 1e-05)#
Bases: torch.nn.Module
Gated RMSNorm for Mamba layers.
Uses the fused triton kernel from mamba_ssm for efficiency.
Initialization
- forward(hidden_states: torch.Tensor, gate: torch.Tensor | None = None)#
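A call-pattern sketch based only on the constructor and forward signatures above. The gate shape and the CUDA placement are assumptions; the fused mamba_ssm kernel requires a GPU.

```python
# Usage sketch for NemotronV3MambaRMSNormGated (shapes/device are assumptions).
import torch
from nemo_automodel.components.models.nemotron_v3.layers import NemotronV3MambaRMSNormGated

norm = NemotronV3MambaRMSNormGated(hidden_size=1024, group_size=256, eps=1e-5).to("cuda")

x = torch.randn(2, 128, 1024, device="cuda")     # hidden states (batch, seq_len, hidden_size)
gate = torch.randn(2, 128, 1024, device="cuda")  # gate tensor; same shape as x is assumed

y = norm(x, gate=gate)  # gated RMSNorm via the fused mamba_ssm triton kernel
y_plain = norm(x)       # the gate is optional per the signature above
```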
- class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Mamba2Mixer(config, layer_idx: int)#
Bases: torch.nn.Module
Mamba2 mixer for NemotronV3 (training-only, uses CUDA kernels).
This implementation uses the fused mamba_split_conv1d_scan_combined kernel for maximum training efficiency. Does not support inference caching.
Requires mamba_ssm and causal_conv1d packages.
Initialization
- forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor | None = None)#
Forward pass using fused CUDA kernels (training only).
- Parameters:
hidden_states – Input tensor of shape (batch, seq_len, hidden_size)
attention_mask – Optional attention mask (applied to padding)
- Returns:
Output tensor of shape (batch, seq_len, hidden_size)
- init_weights(num_hidden_layers: int, rescale_prenorm_residual: bool = True, buffer_device: torch.device | None = None)#
Initialize Mamba2Mixer weights following NemotronV3 spec.
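A call-pattern sketch following the forward contract documented above. The config is supplied by the caller; the hidden_size attribute, the mask convention (1 = real token, 0 = padding), and the CUDA/bfloat16 placement are assumptions.

```python
# Call-pattern sketch for NemotronV3Mamba2Mixer (training-only; needs mamba_ssm + causal_conv1d).
import torch
from nemo_automodel.components.models.nemotron_v3.layers import NemotronV3Mamba2Mixer


def run_mixer_once(config, layer_idx: int = 0) -> torch.Tensor:
    """Build a mixer from an existing NemotronV3 config and run one forward pass."""
    mixer = NemotronV3Mamba2Mixer(config, layer_idx=layer_idx).to("cuda", torch.bfloat16)

    batch, seq_len = 2, 256
    hidden_states = torch.randn(
        batch, seq_len, config.hidden_size,  # config.hidden_size is an assumed attribute name
        device="cuda", dtype=torch.bfloat16,
    )
    padding_mask = torch.ones(batch, seq_len, device="cuda")  # 2D padding mask; 1 = real token (assumed)

    out = mixer(hidden_states, attention_mask=padding_mask)
    assert out.shape == hidden_states.shape  # (batch, seq_len, hidden_size), per the docs above
    return out
```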
- class nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block(config, layer_idx: int, moe_config=None, backend=None)#
Bases: torch.nn.Module
NemotronV3 decoder block (training-only, simplified).
Pre-norm architecture: norm → mixer → residual add. Supports hybrid layer types: Mamba, Attention, MLP, MoE.
Initialization
Initialize NemotronV3Block.
- Parameters:
config – Model configuration with layers_block_type attribute
layer_idx – Index of this layer in the model
moe_config – MoE configuration (required for MoE layers)
backend – Backend configuration (optional)
- property mlp#
Return the mixer for MoE blocks, for compatibility with the parallelizer.
- forward(hidden_states: torch.Tensor, attention_mask: torch.Tensor | None = None)#
Forward pass through the block.
- Parameters:
hidden_states – Input tensor of shape (batch, seq_len, hidden_size)
attention_mask – Mask tensor; the expected type depends on the layer type (see the sketch below):
For attention: 4D causal mask [batch, 1, seq_len, seq_len]
For mamba: 2D padding mask [batch, seq_len]
For mlp/moe: None
- Returns:
Output tensor of shape (batch, seq_len, hidden_size)
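The per-layer-type mask convention above can be sketched as a small helper. The boolean causal-mask layout and the layers_block_type string values ("attention", "mamba") are assumptions for illustration.

```python
# Sketch: pick the mask shape NemotronV3Block.forward expects for each layer type.
import torch


def mask_for_block(block_type: str, batch: int, seq_len: int, device: str = "cuda"):
    if block_type == "attention":
        # 4D causal mask [batch, 1, seq_len, seq_len]; a boolean "allowed" layout is assumed here.
        causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=device))
        return causal[None, None].expand(batch, 1, seq_len, seq_len)
    if block_type == "mamba":
        # 2D padding mask [batch, seq_len]; 1 = real token, 0 = padding (assumed convention).
        return torch.ones(batch, seq_len, device=device)
    return None  # mlp / moe blocks take no mask
```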
- init_weights(buffer_device: torch.device | None = None) → None#
Initialize block weights following NemotronV3 spec.
- Parameters:
buffer_device β Device for buffer initialization (used by MLP/MoE)