nemo_automodel.components.models.nemotron_v3.mtp
NemotronV3-specific Multi-Token Prediction wiring.
Glue between the model-agnostic nemo_automodel.components.models.common.mtp scaffolding and the NemotronV3 decoder block. Each MTP sublayer is a NemotronV3Block configured for the requested per-depth block type ("attention" or "moe") plus, when relevant, the depth-level fusion modules (enorm, hnorm, eh_proj) and final_layernorm.
The internal parameter naming mirrors HuggingFace’s flat
mtp.layers.{global_idx}.* convention used by the released Super V3
checkpoint, so the state-dict adapter performs an effectively 1-to-1 mapping.
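Module Contents
Classes
NemotronV3MTPSublayer | One MTP sublayer for NemotronV3.
Functions
parse_mtp_layer_pattern | Parse a NemotronH MTP layer pattern (e.g. "*E") into block types.
build_nemotron_v3_mtp | Construct the NemotronV3 MTP block.
build_mtp_config_from_hf | Construct an MTPConfig from an HF NemotronH config.
Data
_PATTERN_SYMBOL_TO_BLOCK_TYPE
API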
- nemo_automodel.components.models.nemotron_v3.mtp._PATTERN_SYMBOL_TO_BLOCK_TYPE
None
- nemo_automodel.components.models.nemotron_v3.mtp.parse_mtp_layer_pattern(pattern: str) → list[str]
Parse a NemotronH MTP layer pattern (e.g. "*E") into block types.
- Parameters:
pattern – Pattern string using symbols M (mamba), * (attention), - (mlp), E (moe).
- Returns:
List of block-type names ("mamba", "attention", "mlp", "moe").
- Raises:
ValueError – If the pattern is empty or contains unknown symbols.
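The symbol mapping and error cases described above can be illustrated with a minimal sketch. This is not the library's implementation: the names PATTERN_SYMBOL_TO_BLOCK_TYPE and parse_pattern_sketch are hypothetical stand-ins, and only the four symbols and the ValueError conditions are taken from the description.

```python
# Hypothetical stand-in for the module-level symbol table; the four entries
# follow the parameter description (M, *, -, E).
PATTERN_SYMBOL_TO_BLOCK_TYPE = {
    "M": "mamba",
    "*": "attention",
    "-": "mlp",
    "E": "moe",
}


def parse_pattern_sketch(pattern: str) -> list[str]:
    """Translate a layer pattern such as "*E" into block-type names."""
    if not pattern:
        raise ValueError("MTP layer pattern must not be empty")
    # Collect every symbol that has no entry in the table.
    unknown = sorted(set(pattern) - set(PATTERN_SYMBOL_TO_BLOCK_TYPE))
    if unknown:
        raise ValueError(
            f"Unknown symbols in MTP layer pattern {pattern!r}: {unknown}"
        )
    return [PATTERN_SYMBOL_TO_BLOCK_TYPE[symbol] for symbol in pattern]


print(parse_pattern_sketch("*E"))  # ['attention', 'moe']
```

Validating the whole string before translating it means a bad pattern fails with all offending symbols listed at once, rather than stopping at the first one.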
- class nemo_automodel.components.models.nemotron_v3.mtp.NemotronV3MTPSublayer(
- config,
- layer_idx: int,
- block_type: str,
- moe_config=None,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- has_fusion: bool = False,
- has_final_norm: bool = False,
- dtype: torch.dtype = torch.bfloat16,
)
Bases: nemo_automodel.components.models.nemotron_v3.layers.NemotronV3Block
One MTP sublayer for NemotronV3.
Inherits NemotronV3Block so it has the same norm + mixer residual structure as a main-backbone layer; optionally adds the fusion modules (enorm / hnorm / eh_proj) on the first sublayer of each depth and final_layernorm on the last sublayer of each depth.
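To make the placement rule concrete, the sketch below plans which sublayers would carry the fusion modules and the final norm for a given per-depth pattern. It is an illustration only: SublayerSpec and plan_sublayers are hypothetical helpers, and the flat numbering (global_idx counting sublayers depth by depth) is an assumption about how the mtp.layers.{global_idx}.* convention is indexed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SublayerSpec:
    """Hypothetical description of one MTP sublayer (not a library type)."""

    global_idx: int
    depth: int
    block_type: str
    has_fusion: bool
    has_final_norm: bool


def plan_sublayers(block_types: list[str], num_depths: int) -> list[SublayerSpec]:
    """Lay out sublayers for `num_depths` depths sharing one block pattern."""
    per_depth = len(block_types)
    specs = []
    for depth in range(num_depths):
        for local_idx, block_type in enumerate(block_types):
            specs.append(
                SublayerSpec(
                    # Assumed flat numbering across depths.
                    global_idx=depth * per_depth + local_idx,
                    depth=depth,
                    block_type=block_type,
                    # Fusion modules live on the first sublayer of a depth.
                    has_fusion=local_idx == 0,
                    # final_layernorm lives on the last sublayer of a depth.
                    has_final_norm=local_idx == per_depth - 1,
                )
            )
    return specs


for spec in plan_sublayers(["attention", "moe"], num_depths=2):
    print(f"mtp.layers.{spec.global_idx}", spec)
```

With a single-symbol pattern, the only sublayer of each depth is both first and last, so it would carry the fusion modules and the final norm together.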
Initialization
Initialize NemotronV3Block.
- Parameters:
config – Model configuration with layers_block_type attribute
layer_idx – Index of this layer in the model
moe_config – MoE configuration (required for MoE layers)
backend – Backend configuration (optional)
block_type – Optional override for the block type. When None (default) the type is read from config.layers_block_type[layer_idx]. Used by callers that build extra blocks outside the main backbone’s per-layer pattern (e.g. MTP sublayers at indices past num_hidden_layers).
- forward(
- hidden_states: torch.Tensor,
- *,
- embed_input: torch.Tensor | None = None,
- **kwargs,
)
Run optional fusion (first sublayer of a depth), the base block, and optional final_layernorm (last sublayer of a depth).
Keeping the fusion + final-norm calls inside the sublayer’s own forward ensures FSDP2’s pre-forward unshard hook fires for every parameter we touch, so children like enorm / hnorm / eh_proj / final_layernorm are never accessed while their weights are still sharded DTensors.
- init_weights(buffer_device: torch.device | None = None) → None
Initialize sublayer weights, including fusion modules when present.
- nemo_automodel.components.models.nemotron_v3.mtp.build_nemotron_v3_mtp(
- config,
- mtp_config: nemo_automodel.components.models.common.mtp.MTPConfig,
- backend: nemo_automodel.components.models.common.BackendConfig,
- moe_config,
- dtype: torch.dtype,
)
Construct the NemotronV3 MTP block.
- Parameters:
config – HF NemotronH config.
mtp_config – Parsed MTP runtime config.
backend – Backend configuration shared with the main backbone.
moe_config – MoE configuration shared with the main backbone (required when the MTP pattern contains MoE sublayers).
dtype – Target dtype for newly created linear modules.
- Returns:
A configured MTPModule. Callers should not invoke this when mtp_config.enabled is False.
- nemo_automodel.components.models.nemotron_v3.mtp.build_mtp_config_from_hf(
- config,
- *,
- loss_scaling_factor: float = 0.1,
- num_nextn_predict_layers: int | None = None,
- use_repeated_layer: bool = False,
)
Construct an MTPConfig from an HF NemotronH config.
Reads num_nextn_predict_layers and mtp_hybrid_override_pattern directly off the HF config object (both present on the released Super V3 config.json). Returns a disabled config (num_layers=0) when MTP is not configured.
- Parameters:
config – HF NemotronH config.
loss_scaling_factor – Auxiliary-loss weight applied to the summed per-depth CE (default 0.1). Not stored on the HF config; override programmatically when constructing the model.
num_nextn_predict_layers – Optional override for the HF config’s num_nextn_predict_layers field. When None, uses the value from config. Set explicitly when the trained model used weight-tied MTP iterations (use_repeated_layer=True) and the HF export only retains the physical depth count.
use_repeated_layer – When True, build only one physical MTP depth and reuse it across all iterations. Mirrors Megatron’s --mtp-use-repeated-layer. Defaults to False.
- Returns:
MTPConfig.