`nemo_automodel.components.models.nemotron_v3.mtp`#

NemotronV3-specific Multi-Token Prediction wiring.

Glue between the model-agnostic :mod:nemo_automodel.components.models.common.mtp scaffolding and the NemotronV3 decoder block. Each MTP sublayer is a :class:NemotronV3Block configured for the requested per-depth block type ("attention" or "moe") plus, when relevant, the depth-level fusion modules (enorm, hnorm, eh_proj) and final_layernorm.

The internal parameter naming mirrors HuggingFace’s flat mtp.layers.{global_idx}.* convention used by the released Super V3 checkpoint, so the state-dict adapter performs an effectively 1-to-1 mapping.

Module Contents#

Classes#

`NemotronV3MTPSublayer`	One MTP sublayer for NemotronV3.

Functions#

`parse_mtp_layer_pattern`	Parse a NemotronH MTP layer pattern (e.g. `"*E"`) into block types.
`build_nemotron_v3_mtp`	Construct the NemotronV3 MTP block.
`build_mtp_config_from_hf`	Construct an :class:`MTPConfig` from an HF NemotronH config.

Data#

_PATTERN_SYMBOL_TO_BLOCK_TYPE

API#

nemo_automodel.components.models.nemotron_v3.mtp._PATTERN_SYMBOL_TO_BLOCK_TYPE#: None

nemo_automodel.components.models.nemotron_v3.mtp.parse_mtp_layer_pattern(pattern: str) → list[str]#

Parse a NemotronH MTP layer pattern (e.g. "*E") into block types.

Parameters: pattern – Pattern string using the symbols M (mamba), * (attention), - (mlp), E (moe).
Returns: List of block-type names ("mamba", "attention", "mlp", "moe").
Raises: ValueError – If the pattern is empty or contains unknown symbols.
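The documented behavior can be illustrated with a small standalone sketch. It is not the module's implementation: the names `SYMBOL_TO_BLOCK_TYPE` and `parse_pattern_sketch` are hypothetical, and only the symbol table and the two error conditions are taken from the description above.

```python
# Symbol table copied from the documented parameter description.
# SYMBOL_TO_BLOCK_TYPE and parse_pattern_sketch are hypothetical names.
SYMBOL_TO_BLOCK_TYPE = {"M": "mamba", "*": "attention", "-": "mlp", "E": "moe"}


def parse_pattern_sketch(pattern: str) -> list[str]:
    """Map each pattern symbol to its block-type name."""
    if not pattern:
        raise ValueError("MTP layer pattern must not be empty")
    unknown = [symbol for symbol in pattern if symbol not in SYMBOL_TO_BLOCK_TYPE]
    if unknown:
        raise ValueError(f"Unknown symbols {unknown} in MTP layer pattern {pattern!r}")
    return [SYMBOL_TO_BLOCK_TYPE[symbol] for symbol in pattern]


print(parse_pattern_sketch("*E"))  # ['attention', 'moe']
```

With the pattern `"*E"`, each MTP depth would consist of one attention sublayer followed by one MoE sublayer.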

class nemo_automodel.components.models.nemotron_v3.mtp.NemotronV3MTPSublayer( config, layer_idx: int, block_type: str, moe_config=None, backend: nemo_automodel.components.models.common.BackendConfig | None = None, has_fusion: bool = False, has_final_norm: bool = False, dtype: torch.dtype = torch.bfloat16, )#

Bases: nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block

One MTP sublayer for NemotronV3.

Inherits :class:NemotronV3Block so it has the same norm + mixer residual structure as a main-backbone layer; optionally adds the fusion modules (enorm/hnorm/eh_proj) on the first sublayer of each depth and final_layernorm on the last sublayer of each depth.
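To make the placement rule concrete, the sketch below lays out which sublayers would carry the fusion modules and which would carry final_layernorm for a given per-depth pattern. `SublayerSpec` and `layout_sublayers` are hypothetical names, and the depth-major numbering of `global_idx` is an assumption suggested by the flat mtp.layers.{global_idx}.* naming, not something this page confirms.

```python
from dataclasses import dataclass


@dataclass
class SublayerSpec:
    """Hypothetical record describing one MTP sublayer."""
    global_idx: int       # position in the flat mtp.layers.{global_idx} list
    block_type: str       # "attention" or "moe"
    has_fusion: bool      # enorm / hnorm / eh_proj attached here
    has_final_norm: bool  # final_layernorm attached here


def layout_sublayers(block_types: list[str], num_depths: int) -> list[SublayerSpec]:
    """Repeat the per-depth pattern and flag the first/last sublayer of each depth."""
    per_depth = len(block_types)
    specs = []
    for depth in range(num_depths):
        for i, block_type in enumerate(block_types):
            specs.append(
                SublayerSpec(
                    global_idx=depth * per_depth + i,  # assumed depth-major numbering
                    block_type=block_type,
                    has_fusion=(i == 0),
                    has_final_norm=(i == per_depth - 1),
                )
            )
    return specs


for spec in layout_sublayers(["attention", "moe"], num_depths=2):
    print(spec)
```

A depth with a single sublayer would set both flags on that one sublayer, since it is both first and last.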

Initialization

Initialize NemotronV3Block.

Parameters:

config – Model configuration with layers_block_type attribute
layer_idx – Index of this layer in the model
moe_config – MoE configuration (required for MoE layers)
backend – Backend configuration (optional)
block_type – Optional override for the block type. When None (default) the type is read from config.layers_block_type[layer_idx]. Used by callers that build extra blocks outside the main backbone’s per-layer pattern (e.g. MTP sublayers at indices past num_hidden_layers).

forward(hidden_states: torch.Tensor, *, embed_input: torch.Tensor | None = None, **kwargs) → torch.Tensor#

Run optional fusion (first sublayer of a depth), the base block, and optional final_layernorm (last sublayer of a depth).

Keeping the fusion + final-norm calls inside the sublayer’s own forward ensures FSDP2’s pre-forward unshard hook fires for every parameter we touch, so children like enorm/hnorm/eh_proj/final_layernorm are never accessed while their weights are still sharded DTensors.
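The ordering described above can be sketched without torch. This is a control-flow illustration only: `SublayerFlowSketch` and the `fake_*` callables are hypothetical, plain floats stand in for tensors, and the arithmetic inside `fake_fusion` is a placeholder because this page does not specify how enorm, hnorm, and eh_proj combine their inputs.

```python
class SublayerFlowSketch:
    """Control-flow stand-in; the callables replace real torch modules."""

    def __init__(self, block, fusion=None, final_norm=None):
        self.block = block
        self.fusion = fusion          # present only on the first sublayer of a depth
        self.final_norm = final_norm  # present only on the last sublayer of a depth

    def forward(self, hidden_states, *, embed_input=None, **kwargs):
        # Every optional step runs inside this one forward call, which is
        # the property the FSDP2 note above relies on.
        if self.fusion is not None:
            hidden_states = self.fusion(embed_input, hidden_states)
        hidden_states = self.block(hidden_states, **kwargs)
        if self.final_norm is not None:
            hidden_states = self.final_norm(hidden_states)
        return hidden_states


trace = []


def fake_fusion(embed, hidden):
    trace.append("fusion")
    return embed + hidden  # placeholder arithmetic, not the real fusion


def fake_block(hidden):
    trace.append("block")
    return hidden * 2


def fake_final_norm(hidden):
    trace.append("final_norm")
    return hidden


sublayer = SublayerFlowSketch(fake_block, fusion=fake_fusion, final_norm=fake_final_norm)
result = sublayer.forward(1.0, embed_input=2.0)
print(trace, result)
```

A middle sublayer of a depth would be built with neither optional callable and would reduce to the base block alone.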

init_weights(buffer_device: torch.device | None = None) → None#: Initialize sublayer weights, including fusion modules when present.

nemo_automodel.components.models.nemotron_v3.mtp.build_nemotron_v3_mtp( config, mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig, backend: nemo_automodel.components.models.common.BackendConfig, moe_config, dtype: torch.dtype, ) → nemo_automodel.components.models.common.mtp.MTPModule#

Construct the NemotronV3 MTP block.

Parameters:

config – HF NemotronH config.
mtp_config – Parsed MTP runtime config.
backend – Backend configuration shared with the main backbone.
moe_config – MoE configuration shared with the main backbone (required when the MTP pattern contains MoE sublayers).
dtype – Target dtype for newly created linear modules.

Returns:

A configured :class:MTPModule. Caller should not invoke this when mtp_config.enabled is False.

nemo_automodel.components.models.nemotron_v3.mtp.build_mtp_config_from_hf( config, *, loss_scaling_factor: float = 0.1, num_nextn_predict_layers: int | None = None, use_repeated_layer: bool = False, ) → nemo_automodel.components.models.common.mtp.MTPConfig#

Construct an :class:MTPConfig from an HF NemotronH config.

Reads num_nextn_predict_layers and mtp_hybrid_override_pattern directly off the HF config object (both present on the released Super V3 config.json). Returns a disabled config (num_layers=0) when MTP is not configured.

Parameters:

config – HF NemotronH config.
loss_scaling_factor – Auxiliary-loss weight applied to the summed per-depth CE (default 0.1). Not stored on the HF config; override programmatically when constructing the model.
num_nextn_predict_layers – Optional override for the HF config’s num_nextn_predict_layers field. When None, uses the value from config. Set explicitly when the trained model used weight-tied MTP iterations (use_repeated_layer=True) and the HF export only retains the physical depth count.
use_repeated_layer – When True, build only one physical MTP depth and reuse it across all iterations. Mirrors Megatron’s --mtp-use-repeated-layer. Defaults to False.

Returns:

:class:MTPConfig.
