nemo_automodel.components.models.nemotron_v3.mtp
NemotronV3-specific Multi-Token Prediction wiring.
Glue between the model-agnostic
:mod:nemo_automodel.components.models.common.mtp scaffolding and the
NemotronV3 decoder block. Each MTP sublayer is a :class:NemotronV3Block
configured for the requested per-depth block type ("attention" or
"moe") plus, when relevant, the depth-level fusion modules (enorm,
hnorm, eh_proj) and final_layernorm.
The internal parameter naming mirrors HuggingFace’s flat
mtp.layers.{global_idx}.* convention used by the released Super V3
checkpoint, so the state-dict adapter performs an effectively 1-to-1 mapping.
Module Contents
Classes
Functions
Data
API
Bases: NemotronV3Block
One MTP sublayer for NemotronV3.
Inherits :class:NemotronV3Block so it has the same norm + mixer + residual
structure as a main-backbone layer; optionally adds the fusion modules
(enorm/hnorm/eh_proj) on the first sublayer of each depth and
final_layernorm on the last sublayer of each depth.
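The placement rule above can be sketched in plain Python. This is an illustrative layout helper, not the library's code: the function name `sublayer_layout_sketch` is hypothetical, and the formula `depth * sublayers_per_depth + inner` for the flat `global_idx` is an assumption about how the `mtp.layers.{global_idx}.*` numbering is derived.

```python
def sublayer_layout_sketch(num_depths, block_types):
    """Describe each MTP sublayer: flat name, block type, optional modules.

    Hypothetical helper for illustration only. The flat-index formula is an
    assumption; the source only states that names follow
    mtp.layers.{global_idx}.*.
    """
    per_depth = len(block_types)
    layout = []
    for depth in range(num_depths):
        for inner, block_type in enumerate(block_types):
            layout.append(
                {
                    "name": f"mtp.layers.{depth * per_depth + inner}",
                    "block_type": block_type,
                    # Fusion modules (enorm/hnorm/eh_proj) sit on the first
                    # sublayer of each depth.
                    "has_fusion": inner == 0,
                    # final_layernorm sits on the last sublayer of each depth.
                    "has_final_layernorm": inner == per_depth - 1,
                }
            )
    return layout


if __name__ == "__main__":
    for entry in sublayer_layout_sketch(2, ["attention", "moe"]):
        print(entry)
```

With a single-sublayer pattern, the same sublayer is both first and last in its depth, so it carries the fusion modules and `final_layernorm` together.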
Run optional fusion (first sublayer of a depth), the base block, and optional final_layernorm (last sublayer of a depth).
Keeping the fusion + final-norm calls inside the sublayer’s own forward
ensures FSDP2’s pre-forward unshard hook fires for every parameter we
touch, so children like enorm/hnorm/eh_proj/final_layernorm
are never accessed while their weights are still sharded DTensors.
Initialize sublayer weights, including fusion modules when present.
Resolve the per-depth MTP block-type list from either HF field.
Super-V3 ships mtp_hybrid_override_pattern (symbol-string form like
"*E"). Newer NemotronH variants ship mtp_layers_block_type
(list-of-strings form like ["attention", "moe"]). Either is
accepted.
Parameters:
HF NemotronH config.
Returns: list[str] | None
Parsed list of block-type names, or None when neither field is set.
Raises:
ValueError: If mtp_layers_block_type contains an unknown block type.
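A minimal sketch of this resolution logic, using a `SimpleNamespace` as a stand-in for the HF config. The function name `resolve_block_types_sketch` is hypothetical. Two details are assumptions not stated above: that the symbol-string field is checked before the list field, and that the list form accepts the same four block-type names the symbol form can produce.

```python
from types import SimpleNamespace

# Symbol table as documented for the NemotronH pattern string.
_SYMBOL_TO_BLOCK = {"M": "mamba", "*": "attention", "-": "mlp", "E": "moe"}
# Assumption: the list form accepts exactly these names.
_KNOWN_BLOCKS = set(_SYMBOL_TO_BLOCK.values())


def resolve_block_types_sketch(config):
    """Return a list of block-type names, or None when neither field is set."""
    # Assumption: the symbol-string field takes precedence when both are set.
    pattern = getattr(config, "mtp_hybrid_override_pattern", None)
    if pattern:
        unknown = [s for s in pattern if s not in _SYMBOL_TO_BLOCK]
        if unknown:
            raise ValueError(f"unknown pattern symbols: {unknown}")
        return [_SYMBOL_TO_BLOCK[s] for s in pattern]

    names = getattr(config, "mtp_layers_block_type", None)
    if names:
        unknown = [n for n in names if n not in _KNOWN_BLOCKS]
        if unknown:
            raise ValueError(f"unknown block types: {unknown}")
        return list(names)

    return None


if __name__ == "__main__":
    symbol_form = SimpleNamespace(mtp_hybrid_override_pattern="*E")
    list_form = SimpleNamespace(mtp_layers_block_type=["attention", "moe"])
    print(resolve_block_types_sketch(symbol_form))
    print(resolve_block_types_sketch(list_form))
    print(resolve_block_types_sketch(SimpleNamespace()))
```

Both config styles resolve to the same list here, which is what lets downstream code ignore which field the checkpoint shipped.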
Construct an :class:MTPConfig from an HF NemotronH config.
Reads num_nextn_predict_layers and resolves the per-depth pattern from
either mtp_hybrid_override_pattern (Super-V3 symbol-string form) or
mtp_layers_block_type (list-of-strings form). Returns a disabled
config (num_layers=0) when MTP is not configured.
When the pattern source is the list form, :attr:MTPConfig.layer_pattern
is set to a length-matching sentinel string of "X" characters — the
actual block-type names are carried separately into
:func:build_nemotron_v3_mtp via its block_types kwarg.
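The sentinel behaviour can be illustrated as follows. `MTPConfigSketch` is a hypothetical stand-in that carries only the two fields named in this page (`num_layers`, `layer_pattern`); the real :class:MTPConfig may have more. The condition used to decide that MTP is "not configured" is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class MTPConfigSketch:
    """Hypothetical stand-in for MTPConfig (two fields only)."""

    num_layers: int = 0
    layer_pattern: str = ""


def make_config_sketch(num_nextn_predict_layers, pattern=None, block_types=None):
    """Return (config, block_types_to_forward).

    The second element is what a caller would pass on as the block_types
    kwarg of the builder; it is None when the symbol string is authoritative.
    """
    # Assumption: MTP counts as disabled when the depth count is zero/None
    # or when no pattern source is available.
    if not num_nextn_predict_layers or (pattern is None and block_types is None):
        return MTPConfigSketch(), None

    if pattern is not None:
        # Symbol-string form: the pattern itself is parseable later.
        return MTPConfigSketch(num_nextn_predict_layers, pattern), None

    # List form: layer_pattern only records the length, via "X" characters;
    # the real names travel separately.
    sentinel = "X" * len(block_types)
    return MTPConfigSketch(num_nextn_predict_layers, sentinel), list(block_types)


if __name__ == "__main__":
    print(make_config_sketch(1, pattern="*E"))
    print(make_config_sketch(1, block_types=["attention", "moe"]))
    print(make_config_sketch(0))
```

The sentinel keeps any length-based logic on `layer_pattern` working, while signalling that the string must not be parsed as symbols.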
Parameters:
HF NemotronH config.
Auxiliary-loss weight applied to the summed
per-depth CE (default 0.1). Not stored on the HF config;
override programmatically when constructing the model.
Optional override for the HF config’s
num_nextn_predict_layers field. When None, uses the value
from config. Set explicitly when the trained model used
weight-tied MTP iterations (use_repeated_layer=True) and the
HF export only retains the physical depth count.
When True, build only one physical MTP depth
and reuse it across all iterations. Mirrors Megatron’s
--mtp-use-repeated-layer. Defaults to False.
Returns: MTPConfig
A :class:MTPConfig instance.
Construct the NemotronV3 MTP block.
Parameters:
HF NemotronH config.
Parsed MTP runtime config.
Backend configuration shared with the main backbone.
MoE configuration shared with the main backbone (required when the MTP pattern contains MoE sublayers).
Target dtype for newly created linear modules.
Optional pre-parsed list of block-type names (one per
inner sublayer). When supplied, bypasses
:func:parse_mtp_layer_pattern on mtp_config.layer_pattern.
Required when mtp_config.layer_pattern is a length-only
sentinel (e.g. produced from mtp_layers_block_type).
Returns: MTPModule
A configured :class:MTPModule. Caller should not invoke this when
Parse a NemotronH MTP layer pattern (e.g. "*E") into block types.
Parameters:
Pattern string using symbols M (mamba), * (attention), - (mlp), E (moe).
Returns: list[str]
List of block-type names ("mamba", "attention", "mlp", "moe").
Raises:
ValueError: If the pattern is empty or contains unknown symbols.
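For reference, the documented symbol mapping and error conditions can be sketched as below. This is a hypothetical re-implementation (`parse_pattern_sketch`), not the library function itself; the error messages are illustrative.

```python
# Symbol table taken from the parameter description above.
_SYMBOL_TO_BLOCK = {"M": "mamba", "*": "attention", "-": "mlp", "E": "moe"}


def parse_pattern_sketch(pattern):
    """Map each pattern symbol to its block-type name.

    Raises ValueError for an empty pattern or any unknown symbol.
    """
    if not pattern:
        raise ValueError("MTP layer pattern must be non-empty")
    unknown = sorted(set(pattern) - set(_SYMBOL_TO_BLOCK))
    if unknown:
        raise ValueError(f"unknown symbols in pattern {pattern!r}: {unknown}")
    return [_SYMBOL_TO_BLOCK[symbol] for symbol in pattern]


if __name__ == "__main__":
    print(parse_pattern_sketch("*E"))  # ['attention', 'moe']
```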