nemo_automodel.components.models.nemotron_v3.mtp#

NemotronV3-specific Multi-Token Prediction wiring.

Glue between the model-agnostic :mod:nemo_automodel.components.models.common.mtp scaffolding and the NemotronV3 decoder block. Each MTP sublayer is a :class:NemotronV3Block configured for the requested per-depth block type ("attention" or "moe") plus, when relevant, the depth-level fusion modules (enorm, hnorm, eh_proj) and final_layernorm.

The internal parameter naming mirrors HuggingFace’s flat mtp.layers.{global_idx}.* convention used by the released Super V3 checkpoint, so the state-dict adapter performs an effectively 1-to-1 mapping.

Module Contents#

Classes#

NemotronV3MTPSublayer

One MTP sublayer for NemotronV3.

Functions#

parse_mtp_layer_pattern

Parse a NemotronH MTP layer pattern (e.g. "*E") into block types.

build_nemotron_v3_mtp

Construct the NemotronV3 MTP block.

build_mtp_config_from_hf

Construct an :class:MTPConfig from an HF NemotronH config.

Data#

API#

nemo_automodel.components.models.nemotron_v3.mtp._PATTERN_SYMBOL_TO_BLOCK_TYPE#

None

nemo_automodel.components.models.nemotron_v3.mtp.parse_mtp_layer_pattern(pattern: str) list[str]#

Parse a NemotronH MTP layer pattern (e.g. "*E") into block types.

Parameters:

pattern – Pattern string using symbols M (mamba), * (attention), - (mlp), E (moe).

Returns:

List of block-type names ("mamba", "attention", "mlp", "moe").

Raises:

ValueError – If the pattern is empty or contains unknown symbols.
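The parsing contract above can be sketched in a few lines of plain Python. This is an illustrative re-implementation built only from the documented symbol table and error conditions, not the module's actual code; the names SYMBOL_TO_BLOCK_TYPE and parse_pattern_sketch are hypothetical.

```python
# Symbol table copied from the documented parameter description.
# It mirrors, but is not, the module's private _PATTERN_SYMBOL_TO_BLOCK_TYPE.
SYMBOL_TO_BLOCK_TYPE = {
    "M": "mamba",
    "*": "attention",
    "-": "mlp",
    "E": "moe",
}


def parse_pattern_sketch(pattern: str) -> list[str]:
    """Illustrative version of the documented parsing contract."""
    if not pattern:
        raise ValueError("MTP layer pattern must be non-empty")
    unknown = sorted(set(pattern) - set(SYMBOL_TO_BLOCK_TYPE))
    if unknown:
        raise ValueError(
            f"Unknown symbols in MTP layer pattern {pattern!r}: {unknown}"
        )
    # One block-type name per symbol, in pattern order.
    return [SYMBOL_TO_BLOCK_TYPE[symbol] for symbol in pattern]


print(parse_pattern_sketch("*E"))  # ['attention', 'moe']
```

With the documented example pattern "*E", each MTP depth would consist of an attention sublayer followed by an MoE sublayer.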

class nemo_automodel.components.models.nemotron_v3.mtp.NemotronV3MTPSublayer(
config,
layer_idx: int,
block_type: str,
moe_config=None,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
has_fusion: bool = False,
has_final_norm: bool = False,
dtype: torch.dtype = torch.bfloat16,
)#

Bases: nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block

One MTP sublayer for NemotronV3.

Inherits :class:NemotronV3Block so it has the same norm + mixer + residual structure as a main-backbone layer; optionally adds the fusion modules (enorm/hnorm/eh_proj) on the first sublayer of each depth and final_layernorm on the last sublayer of each depth.
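To make the first-sublayer and last-sublayer rule concrete, the following sketch plans the has_fusion and has_final_norm flags for every sublayer. SublayerSpec and plan_sublayers are hypothetical helpers, and the global-index formula assumes depths are laid out back to back in the flat mtp.layers.{global_idx} list, which this page does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SublayerSpec:
    """Hypothetical record describing one MTP sublayer (not a library type)."""

    global_idx: int
    depth: int
    block_type: str
    has_fusion: bool
    has_final_norm: bool


def plan_sublayers(block_types: list[str], num_depths: int) -> list[SublayerSpec]:
    """Assign fusion and final-norm flags per depth."""
    per_depth = len(block_types)
    specs = []
    for depth in range(num_depths):
        for local_idx, block_type in enumerate(block_types):
            specs.append(
                SublayerSpec(
                    # Assumption: depths are stored contiguously in the flat list.
                    global_idx=depth * per_depth + local_idx,
                    depth=depth,
                    block_type=block_type,
                    has_fusion=local_idx == 0,  # first sublayer of the depth
                    has_final_norm=local_idx == per_depth - 1,  # last sublayer
                )
            )
    return specs


for spec in plan_sublayers(["attention", "moe"], num_depths=2):
    print(f"mtp.layers.{spec.global_idx}", spec.block_type,
          spec.has_fusion, spec.has_final_norm)
```

When a depth contains a single sublayer, that sublayer is both first and last, so it carries the fusion modules and final_layernorm at once.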

Initialization

Initialize NemotronV3Block.

Parameters:
  • config – Model configuration with layers_block_type attribute

  • layer_idx – Index of this layer in the model

  • moe_config – MoE configuration (required for MoE layers)

  • backend – Backend configuration (optional)

  • block_type – Optional override for the block type. When None (default) the type is read from config.layers_block_type[layer_idx]. Used by callers that build extra blocks outside the main backbone’s per-layer pattern (e.g. MTP sublayers at indices past num_hidden_layers).

forward(
hidden_states: torch.Tensor,
*,
embed_input: torch.Tensor | None = None,
**kwargs,
) torch.Tensor#

Run optional fusion (first sublayer of a depth), the base block, and optional final_layernorm (last sublayer of a depth).

Keeping the fusion + final-norm calls inside the sublayer’s own forward ensures FSDP2’s pre-forward unshard hook fires for every parameter we touch, so children like enorm/hnorm/eh_proj/final_layernorm are never accessed while their weights are still sharded DTensors.
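The call order described above can be sketched with a toy PyTorch module. Everything here is a stand-in under stated assumptions: SublayerForwardSketch is hypothetical, nn.LayerNorm and nn.Linear replace the real norm and mixer, and the fusion step (normalize both streams, concatenate, project with eh_proj) is an assumed form that this page does not spell out.

```python
import torch
from torch import nn


class SublayerForwardSketch(nn.Module):
    """Toy stand-in for the documented forward order; not the library class."""

    def __init__(self, hidden_size: int, has_fusion: bool, has_final_norm: bool):
        super().__init__()
        self.has_fusion = has_fusion
        self.has_final_norm = has_final_norm
        # Stand-in for the inherited norm + mixer part of the base block.
        self.block = nn.Linear(hidden_size, hidden_size)
        if has_fusion:
            self.enorm = nn.LayerNorm(hidden_size)
            self.hnorm = nn.LayerNorm(hidden_size)
            self.eh_proj = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        if has_final_norm:
            self.final_layernorm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states, *, embed_input=None):
        if self.has_fusion:
            if embed_input is None:
                raise ValueError("embed_input is required on a fusion sublayer")
            # Assumed fusion: normalize both streams, concatenate, project back.
            fused = torch.cat(
                [self.enorm(embed_input), self.hnorm(hidden_states)], dim=-1
            )
            hidden_states = self.eh_proj(fused)
        # Base block with a residual connection.
        hidden_states = hidden_states + self.block(hidden_states)
        if self.has_final_norm:
            hidden_states = self.final_layernorm(hidden_states)
        return hidden_states


layer = SublayerForwardSketch(hidden_size=8, has_fusion=True, has_final_norm=True)
out = layer(torch.randn(2, 4, 8), embed_input=torch.randn(2, 4, 8))
print(out.shape)  # torch.Size([2, 4, 8])
```

Because enorm, hnorm, eh_proj, and final_layernorm are only ever called from inside this module's forward, any pre-forward hook registered on the sublayer runs before their parameters are read, which is the property the paragraph above relies on.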

init_weights(buffer_device: torch.device | None = None) None#

Initialize sublayer weights, including fusion modules when present.

nemo_automodel.components.models.nemotron_v3.mtp.build_nemotron_v3_mtp(
config,
mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
backend: nemo_automodel.components.models.common.BackendConfig,
moe_config,
dtype: torch.dtype,
) nemo_automodel.components.models.common.mtp.MTPModule#

Construct the NemotronV3 MTP block.

Parameters:
  • config – HF NemotronH config.

  • mtp_config – Parsed MTP runtime config.

  • backend – Backend configuration shared with the main backbone.

  • moe_config – MoE configuration shared with the main backbone (required when the MTP pattern contains MoE sublayers).

  • dtype – Target dtype for newly created linear modules.

Returns:

A configured :class:MTPModule. Callers should not invoke this function when mtp_config.enabled is False.

nemo_automodel.components.models.nemotron_v3.mtp.build_mtp_config_from_hf(
config,
*,
loss_scaling_factor: float = 0.1,
num_nextn_predict_layers: int | None = None,
use_repeated_layer: bool = False,
) nemo_automodel.components.models.common.mtp.MTPConfig#

Construct an :class:MTPConfig from an HF NemotronH config.

Reads num_nextn_predict_layers and mtp_hybrid_override_pattern directly off the HF config object (both present on the released Super V3 config.json). Returns a disabled config (num_layers=0) when MTP is not configured.

Parameters:
  • config – HF NemotronH config.

  • loss_scaling_factor – Auxiliary-loss weight applied to the summed per-depth CE (default 0.1). Not stored on the HF config; override programmatically when constructing the model.

  • num_nextn_predict_layers – Optional override for the HF config’s num_nextn_predict_layers field. When None, uses the value from config. Set explicitly when the trained model used weight-tied MTP iterations (use_repeated_layer=True) and the HF export only retains the physical depth count.

  • use_repeated_layer – When True, build only one physical MTP depth and reuse it across all iterations. Mirrors Megatron’s --mtp-use-repeated-layer. Defaults to False.

Returns:

:class:MTPConfig.