nemo_automodel.components.models.nemotron_v3.model

Module Contents

Classes

Name	Description
`NemotronHCausalLMOutputWithPast`	`CausalLMOutputWithPast` plus declared MTP fields.
`NemotronHForCausalLM`	NemotronV3 model with language modeling head.
`NemotronV3Model`	NemotronV3 base model (without LM head).

Data

ModelClass

API

class nemo_automodel.components.models.nemotron_v3.model.NemotronHCausalLMOutputWithPast(
    mtp_per_depth_h: typing.Optional[list[torch.Tensor]] = None,
    mtp_loss_scaling_factor: typing.Optional[float] = None
)

Dataclass

Bases: CausalLMOutputWithPast

CausalLMOutputWithPast plus declared MTP fields.

The MTP per-depth hidden states and scaling factor must be regular dataclass fields (rather than dynamically-set attributes) so they survive output-restructuring layers like FSDP2’s mixed-precision output cast, which rebuild ModelOutput instances from declared fields only.

mtp_loss_scaling_factor

Optional[float] = None

mtp_per_depth_h

Optional[list[Tensor]] = None
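
The reason for declaring the MTP values as dataclass fields can be illustrated with a small framework-free sketch. The classes and the `rebuild_from_declared_fields` helper below are hypothetical stand-ins (plain lists replace tensors, and the helper only imitates an output-restructuring layer); they are not the FSDP2 or `ModelOutput` implementation.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BaseOutput:
    # Hypothetical stand-in for CausalLMOutputWithPast.
    logits: Optional[list] = None


@dataclass
class OutputWithMTP(BaseOutput):
    # Declared fields: visible to dataclasses.fields().
    mtp_per_depth_h: Optional[list] = None
    mtp_loss_scaling_factor: Optional[float] = None


def rebuild_from_declared_fields(out):
    """Imitate a layer that reconstructs an output from its declared fields only."""
    return type(out)(**{f.name: getattr(out, f.name) for f in fields(out)})


declared = OutputWithMTP(logits=[0.1], mtp_per_depth_h=[[1.0]], mtp_loss_scaling_factor=0.1)

dynamic = BaseOutput(logits=[0.1])
dynamic.mtp_per_depth_h = [[1.0]]  # set dynamically, not a declared field

rebuilt_declared = rebuild_from_declared_fields(declared)
rebuilt_dynamic = rebuild_from_declared_fields(dynamic)
```

In this sketch `rebuilt_declared` still carries `mtp_per_depth_h`, while `rebuilt_dynamic` loses the dynamically attached attribute.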

class nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    mtp_loss_scaling_factor: float = 0.1,
    num_nextn_predict_layers: int | None = None,
    mtp_use_repeated_layer: bool = False,
    **kwargs
)

Bases: HFCheckpointingMixin, GenerationMixin, Module, MoEFSDPSyncMixin

NemotronV3 model with language modeling head.

Supports .generate() from transformers.generation.GenerationMixin with O(1) per-step KV caching for attention layers and recurrent state caching for Mamba2 layers.

_is_stateful

bool = True

_keep_in_fp32_modules_strict

= ['e_score_correction_bias']

_pp_keep_self_forward

bool = True

backend

= backend or BackendConfig()

device

Return the device of the first model parameter (required by GenerationMixin).

dtype

Return the dtype of the first model parameter (used by cache construction).

generation_config

= GenerationConfig()

lm_head

main_input_name

str = 'input_ids'

model

mtp

mtp_config

output_hidden_states

state_dict_adapter

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM._build_mtp_embed_inputs_for_pp(
    input_ids: torch.Tensor
) -> tuple[torch.Tensor, ...]

Build the per-depth rolled-token embeddings on the first PP stage.

The first PP stage owns embed_tokens and is the only rank that can produce the future-token embeddings consumed by the MTP head on the final stage. The tuple flows alongside hidden_states through every intermediate stage as additional positional outputs (see forward).

Parameters:

input_ids

torch.Tensor

Token ids [B, S] (int).

Returns: tuple[torch.Tensor, ...]

Tuple of length self.mtp_config.num_layers containing
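
As a rough illustration of "per-depth rolled-token embeddings", the sketch below rolls each token row left by the depth index and then embeds it. The shift of `depth` positions per depth, the wrap-around behaviour, and the names `roll_left`, `build_rolled_embeddings`, and `embed` are all assumptions made for this example; the actual method operates on tensors and may differ in these details.

```python
def roll_left(tokens, shift):
    """Roll every row left by `shift` positions, wrapping around."""
    rolled = []
    for row in tokens:
        s = shift % len(row) if row else 0
        rolled.append(row[s:] + row[:s])
    return rolled


def build_rolled_embeddings(input_ids, embed, num_depths):
    """Return one embedded copy of `input_ids` per depth, as a tuple."""
    per_depth = []
    for depth in range(1, num_depths + 1):
        rolled = roll_left(input_ids, depth)  # assumed: depth d looks d tokens ahead
        per_depth.append([[embed(tok) for tok in row] for row in rolled])
    return tuple(per_depth)


# Hypothetical one-dimensional "embedding" used only for the demo.
embed = lambda tok: [float(tok)]
rolled_embeds = build_rolled_embeddings([[10, 11, 12, 13]], embed, num_depths=2)
```

The wrapped-around positions at the end of each row do not correspond to real future tokens, so a training loop would normally have to mask them out of the loss.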

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM._is_pipeline_parallel_stage() -> bool

True when this module instance has been trimmed to a PP stage subset.

Detection mirrors DeepseekV4ForCausalLM._is_pipeline_parallel_stage. Any one of the following conditions is sufficient, because the PP splitter nulls or trims these attributes when it cuts the model into stages: (a) lm_head is None, (b) the inner embed_tokens is None, or (c) the number of entries in model.layers differs from config.num_hidden_layers.

The checks use hasattr to distinguish “splitter nulled the attribute” (attribute present, value is None) from “caller replaced self.model with a stub that doesn’t declare the attribute” (attribute absent). Tests that swap in stub inner modules should not be misclassified as PP stages.
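
The present-but-None versus absent distinction described above can be sketched as follows. The helper names `is_nulled` and `looks_like_pp_stage` are hypothetical, and `SimpleNamespace` objects stand in for real modules; this is a minimal illustration of the rule, not the method's actual body.

```python
from types import SimpleNamespace


def is_nulled(obj, name):
    """True only when the attribute exists and its value is None."""
    return hasattr(obj, name) and getattr(obj, name) is None


def looks_like_pp_stage(wrapper, expected_num_layers):
    inner = getattr(wrapper, "model", None)
    if is_nulled(wrapper, "lm_head"):
        return True  # (a) splitter removed the LM head
    if inner is not None and is_nulled(inner, "embed_tokens"):
        return True  # (b) splitter removed the embedding table
    if inner is not None and hasattr(inner, "layers"):
        return len(inner.layers) != expected_num_layers  # (c) layers were trimmed
    return False  # stub without the attributes: not a PP stage


full = SimpleNamespace(
    lm_head=object(),
    model=SimpleNamespace(embed_tokens=object(), layers=[0, 1, 2, 3]),
)
trimmed = SimpleNamespace(
    lm_head=None,
    model=SimpleNamespace(embed_tokens=None, layers=[0, 1]),
)
# A test stub whose inner module declares none of the attributes.
stubbed = SimpleNamespace(lm_head=object(), model=SimpleNamespace())
```

Here `trimmed` is classified as a stage, while `full` and `stubbed` are not.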

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM._make_causal_mask(
    query_len: int,
    kv_len: int,
    batch_size: int,
    dtype: torch.dtype,
    device: torch.device
) -> torch.Tensor

staticmethod

Build a 4D SDPA-compatible causal mask.

Prefill (query_len == kv_len): standard lower-triangular causal mask. Decode (query_len == 1): all-zeros row allowing attention to all cached positions.
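
The two cases can be reproduced with a small pure-Python sketch that builds an additive mask as nested lists shaped [batch_size, 1, query_len, kv_len]. The use of -inf as the blocking value and the `offset` generalisation to query lengths between 1 and kv_len are assumptions of this example; the real method returns a torch.Tensor in the requested dtype on the requested device.

```python
import math


def make_causal_mask(query_len, kv_len, batch_size, blocked=-math.inf):
    """Additive mask: 0.0 allows attention, `blocked` forbids it."""
    offset = kv_len - query_len  # cached positions that precede the queries
    rows = [
        [0.0 if j <= i + offset else blocked for j in range(kv_len)]
        for i in range(query_len)
    ]
    # One independent copy of the [query_len, kv_len] grid per batch element.
    return [[[row[:] for row in rows]] for _ in range(batch_size)]


prefill_mask = make_causal_mask(query_len=3, kv_len=3, batch_size=1)
decode_mask = make_causal_mask(query_len=1, kv_len=5, batch_size=2)
```

With these inputs, `prefill_mask` is lower-triangular and `decode_mask` holds a single all-zeros row per batch element.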

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.customize_pipeline_stage_modules(
    module_names_per_stage: list[list[str]],
    layers_prefix: str,
    text_model: torch.nn.Module | None = None
) -> list[list[str]]

Pin the MTP head to the last PP stage’s FQN list.

Called by split_model_into_stages (functional.py:494-502) after the default per-stage FQN auto-generation. The auto-generator includes embed_tokens on the first stage and norm/lm_head on the last stage but doesn’t know about model.mtp; this hook appends it.
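
A minimal sketch of such a hook is shown below. The function name `pin_to_last_stage`, the example FQN strings, and the choice to return a copy are assumptions for illustration; the real hook also receives `layers_prefix` and `text_model`, which this sketch ignores.

```python
def pin_to_last_stage(module_names_per_stage, fqn):
    """Return a copy of the per-stage FQN lists with `fqn` on the last stage only."""
    # Copy each stage and drop `fqn` wherever it might already appear.
    stages = [[name for name in stage if name != fqn] for stage in module_names_per_stage]
    if stages:
        stages[-1].append(fqn)
    return stages


# Hypothetical auto-generated FQN lists for a two-stage split.
auto_generated = [
    ["model.embed_tokens", "model.layers.0"],
    ["model.layers.1", "model.norm", "lm_head"],
]
with_mtp = pin_to_last_stage(auto_generated, "mtp")
```

Returning a new list keeps the auto-generated input untouched, which makes the hook safe to call more than once.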

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    *mtp_embed_inputs: torch.Tensor,
    attention_mask: typing.Optional[torch.Tensor] = None,
    causal_mask_mapping: typing.Optional[dict[str, torch.Tensor]] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[typing.Any] = None,
    use_cache: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    padding_mask: typing.Optional[torch.Tensor] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    **kwargs: typing.Any
) -> transformers.modeling_outputs.CausalLMOutputWithPast

Forward pass with optional loss computation.

Supports both BSHD format (input_ids shape [B, S]) and THD format (input_ids shape [T] after squeeze_input_for_thd). When kwargs["qkv_format"] == "thd" AND the attention backend is TE, inputs are squeezed to THD before the base-model forward and logits are unsqueezed back to [1, T, V] on exit. SDPA / flex stay in BSHD.

Pipeline-parallel awareness: when run as a PP stage, input_ids is the upstream stage’s hidden-state tensor on non-first stages, and *mtp_embed_inputs carries num_nextn_predict_layers future-token embeddings produced by the first stage. See the Returns section below for the per-stage tuple contract. The single-rank (no-PP) path returns NemotronHCausalLMOutputWithPast unchanged.

Parameters:

input_ids

Optional[torch.LongTensor]. Defaults to None

Input token IDs. BSHD: [B, S]; THD: [1, T] (squeezed internally). On non-first PP stages this slot instead carries the upstream stage’s hidden-state tensor.

*mtp_embed_inputs

torch.Tensor. Defaults to ()

Pre-computed future-token embeddings produced by the first PP stage and forwarded between stages as positional args. Empty on the single-rank (no-PP) path.

attention_mask

Optional[torch.Tensor]. Defaults to None

2D padding mask [B, S].

causal_mask_mapping

Optional[dict[str, torch.Tensor]]. Defaults to None

Dict with precomputed 4D causal masks (key "full_attention" is consumed).

inputs_embeds

Optional[torch.FloatTensor]. Defaults to None

Pre-computed input embeddings (optional).

labels

Optional[torch.LongTensor]. Defaults to None

Token IDs for loss computation [B, S] (optional; under PP, loss is computed by PipelineCausalLMLoss).

past_key_values

Optional[Any]. Defaults to None

Optional NemotronHybridCache for incremental decoding.

use_cache

Optional[bool]. Defaults to None

Whether to return past_key_values for subsequent steps.

cache_position

Optional[torch.LongTensor]. Defaults to None

Token position indices for cache updates.

position_ids

Optional[torch.LongTensor]. Defaults to None

Position IDs (forwarded into MTP sublayer kwargs).

padding_mask

Optional[torch.Tensor]. Defaults to None

Padding mask [B, S] used by the THD squeeze helper and as the MoE / mamba 2D mask source.

logits_to_keep

Union[int, torch.Tensor]. Defaults to 0

If > 0, only compute logits for the last logits_to_keep token positions.

output_hidden_states

Optional[bool]. Defaults to None

Whether to return hidden states.

return_dict

Optional[bool]. Defaults to None

Accepted for API compatibility (always returns a NemotronHCausalLMOutputWithPast off-PP).

**kwargs

Any. Defaults to {}

Additional arguments forwarded to the base model (e.g. qkv_format, cu_seqlens, cu_seqlens_padded, max_seqlen, seq_idx, cp_rank, cp_size, _packed_seq_ids).

Returns: CausalLMOutputWithPast

Off-PP: NemotronHCausalLMOutputWithPast with logits,
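
The effect of the integer form of logits_to_keep can be sketched without tensors. In the example below, `compute_logits` and the toy `lm_head` are hypothetical, a sequence is a plain list of per-position vectors, and the tensor-index form of the parameter is not covered.

```python
def compute_logits(hidden_states, lm_head, logits_to_keep=0):
    """Project only the last `logits_to_keep` positions, or all of them when it is 0."""
    if logits_to_keep > 0:
        hidden_states = hidden_states[-logits_to_keep:]
    return [lm_head(h) for h in hidden_states]


# Hypothetical one-dimensional head used only for the demo.
lm_head = lambda h: [2.0 * h[0]]
hidden = [[1.0], [2.0], [3.0], [4.0]]

all_logits = compute_logits(hidden, lm_head)  # every position
last_logits = compute_logits(hidden, lm_head, logits_to_keep=1)  # final position only
```

During decoding only the final position is needed to pick the next token, so skipping the projection for earlier positions avoids materialising a full [S, V] logits matrix.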

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.from_config(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    **kwargs
)

classmethod

Create model from config.

Parameters:

config

NemotronH config

backend

BackendConfig | None. Defaults to None

Backend configuration

**kwargs

Defaults to {}

Additional arguments

Returns:

NemotronHForCausalLM instance

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.from_pretrained(
    pretrained_model_name_or_path: str,
    *model_args,
    **kwargs
)

classmethod

Load pretrained model.

Parameters:

pretrained_model_name_or_path

str

Path or name of pretrained model

*model_args

Defaults to ()

Additional positional arguments

**kwargs

Defaults to {}

Additional keyword arguments

Returns:

NemotronHForCausalLM instance

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.get_input_embeddings()

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.get_output_embeddings()

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.get_pipeline_stage_metas(
    is_first: bool,
    microbatch_size: int,
    seq_len: int,
    dtype: torch.dtype
) -> tuple[tuple[torch.Tensor, ...], tuple[torch.Tensor, ...]]

Return analytical (inputs_meta, outputs_meta) for a PP stage.

Inter-stage tensors are plain [B, S, H] (no HC stream). With MTP enabled, every transfer carries 1 + D tensors so the variadic forward signature is exercised on every microbatch.
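
The "1 + D tensors" contract can be written down as shapes alone. The helper below is a hypothetical sketch that returns shape tuples rather than meta tensors, and it assumes each MTP embedding shares the [B, S, H] shape of the hidden state.

```python
def inter_stage_shapes(microbatch_size, seq_len, hidden_size, num_mtp_depths):
    """Shapes of the tensors passed between two adjacent PP stages."""
    hidden = (microbatch_size, seq_len, hidden_size)
    # One hidden-state tensor plus one future-token embedding per MTP depth.
    return (hidden,) + (hidden,) * num_mtp_depths


with_mtp_shapes = inter_stage_shapes(2, 8, 16, num_mtp_depths=2)
without_mtp_shapes = inter_stage_shapes(2, 8, 16, num_mtp_depths=0)
```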

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.gradient_checkpointing_disable()

Unwrap any checkpoint-wrapped blocks (inverse of gradient_checkpointing_enable).

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs = None
)

Enable activation checkpointing on each transformer (and MTP) block.

Wraps every decoder block (and MTP block, when present) with a non-reentrant checkpoint wrapper so that block activations are recomputed during the backward pass instead of being stored. This is the single-GPU entry point: FSDP2Manager.parallelize calls it when world_size == 1 (the expert-parallel path performs the equivalent wrapping inside the MoE parallelizer’s apply_ac). Without it, the hybrid Mamba2/Attention MoE keeps every block’s activations live, which pushes single-GPU LoRA SFT beyond the memory of a single 80GB device. The call is idempotent.

Parameters:

gradient_checkpointing_kwargs

Defaults to None

Accepted for HF API compatibility; currently unused (NO_REENTRANT wrapping is always used).
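
The idempotent enable/disable pairing can be illustrated with a toy wrapper. `CheckpointWrapped`, `enable_checkpointing`, and `disable_checkpointing` are hypothetical names, and the wrapper here only marks a block as wrapped; it does not recompute activations as a real checkpoint wrapper would.

```python
class CheckpointWrapped:
    """Marker wrapper; a real one would recompute activations in backward."""

    def __init__(self, block):
        self.block = block

    def __call__(self, *args, **kwargs):
        return self.block(*args, **kwargs)


def enable_checkpointing(blocks):
    """Wrap every block exactly once, even if called repeatedly."""
    for name, block in list(blocks.items()):
        if not isinstance(block, CheckpointWrapped):
            blocks[name] = CheckpointWrapped(block)


def disable_checkpointing(blocks):
    """Restore the original blocks (inverse of enable_checkpointing)."""
    for name, block in list(blocks.items()):
        if isinstance(block, CheckpointWrapped):
            blocks[name] = block.block


double = lambda x: 2 * x
blocks = {"0": double, "1": double}
enable_checkpointing(blocks)
enable_checkpointing(blocks)  # second call is a no-op
wrapped_once = not isinstance(blocks["0"].block, CheckpointWrapped)
wrapped_result = blocks["0"](21)
disable_checkpointing(blocks)
```

Checking the wrapper type before wrapping is what makes a repeated enable call harmless; without the check each call would add another wrapper layer.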

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.initialize_weights(
    buffer_device: torch.device | None = None,
    dtype: torch.dtype = torch.bfloat16
) -> None

Initialize model weights.

PP-aware: skips lm_head and mtp initialization when those have been trimmed to None on a non-owning stage. self.model itself also internally guards embed_tokens and norm.

Parameters:

buffer_device

torch.device | None. Defaults to None

Device to use for buffer initialization

dtype

torch.dtype. Defaults to torch.bfloat16

Target dtype for model weights

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.prepare_inputs_for_generation(
    input_ids: torch.LongTensor,
    attention_mask: typing.Optional[torch.Tensor] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    past_key_values: typing.Optional[typing.Any] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = True,
    **kwargs
) -> dict

Prepare model inputs for each generation step.

On the first call (prefill), creates a NemotronHybridCache and forwards the full prompt. On subsequent calls (decode), only the newly generated token is forwarded.

Parameters:

input_ids

torch.LongTensor

Accumulated token ids [batch_size, current_seq_len].

attention_mask

Optional[torch.Tensor]. Defaults to None

Padding mask [batch_size, current_seq_len].

inputs_embeds

Optional[torch.FloatTensor]. Defaults to None

Pre-computed embeddings for the first step (optional).

past_key_values

Optional[Any]. Defaults to None

NemotronHybridCache from the previous step (None on first call).

cache_position

Optional[torch.LongTensor]. Defaults to None

Token position indices.

use_cache

Optional[bool]. Defaults to True

Whether to use caching (default True).

**kwargs

Defaults to {}

Remaining model kwargs.

Returns: dict

Dict of keyword arguments to pass to forward.
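
The prefill/decode split can be sketched with plain lists. `DummyCache` is a hypothetical stand-in for NemotronHybridCache and `prepare_step_inputs` is an illustrative helper; attention masks, cache positions, and inputs_embeds handling are omitted.

```python
class DummyCache:
    """Hypothetical stand-in for NemotronHybridCache."""


def prepare_step_inputs(input_ids, past_key_values=None, use_cache=True):
    if past_key_values is None:
        # Prefill: create the cache and forward the whole prompt.
        cache = DummyCache() if use_cache else None
        step_ids = input_ids
    else:
        # Decode: reuse the cache and forward only the newest token per row.
        cache = past_key_values
        step_ids = [row[-1:] for row in input_ids]
    return {"input_ids": step_ids, "past_key_values": cache, "use_cache": use_cache}


prefill = prepare_step_inputs([[1, 2, 3]])
decode = prepare_step_inputs([[1, 2, 3, 4]], past_key_values=prefill["past_key_values"])
```

In this sketch the decode step forwards `[[4]]` and reuses the cache object created during prefill.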

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.set_input_embeddings(
    value
)

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.set_output_embeddings(
    new_embeddings
)

class nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model(
    config,
    backend: nemo_automodel.components.models.common.BackendConfig | None = None,
    moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
    moe_overrides: dict | None = None
)

Bases: Module

NemotronV3 base model (without LM head).

This is a hybrid architecture with Mamba2, Attention, MLP, and MoE layers.

_keep_in_fp32_modules_strict

= ['e_score_correction_bias']

backend

= backend or BackendConfig()

embed_tokens

layers

= nn.ModuleDict()

moe_config

= moe_config or MoEConfig(**moe_defaults)

norm

nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model.forward(
    input_ids: torch.LongTensor | None = None,
    attention_mask: torch.Tensor | None = None,
    causal_mask_mapping: dict[str, torch.Tensor] | None = None,
    inputs_embeds: torch.Tensor | None = None,
    past_key_values = None,
    cache_position: torch.LongTensor | None = None,
    **kwargs: typing.Any
) -> torch.Tensor

Forward pass through the model. Supports BSHD [B, S, H] and THD [T, H].

Pipeline-parallel awareness: when self.embed_tokens is None (non-first PP stage), input_ids is interpreted as the upstream hidden-state tensor and routed through the inputs_embeds branch. When self.norm is None (non-last PP stage), the final norm is skipped.
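
The stage-dependent routing can be sketched with a toy class. `StageSketch` and the lambda layers are hypothetical and operate on plain lists of floats; the sketch only shows how a missing embed_tokens or norm changes the data path.

```python
class StageSketch:
    def __init__(self, embed_tokens, layers, norm):
        self.embed_tokens = embed_tokens  # None on non-first stages
        self.layers = layers
        self.norm = norm  # None on non-last stages

    def forward(self, input_ids=None, inputs_embeds=None):
        if inputs_embeds is None:
            if self.embed_tokens is None:
                # Non-first stage: input_ids already holds upstream hidden states.
                inputs_embeds = input_ids
            else:
                inputs_embeds = self.embed_tokens(input_ids)
        hidden = inputs_embeds
        for layer in self.layers:
            hidden = layer(hidden)
        if self.norm is not None:
            hidden = self.norm(hidden)
        return hidden


embed = lambda ids: [float(i) for i in ids]
add_one = lambda h: [x + 1.0 for x in h]
scale = lambda h: [x * 2.0 for x in h]

first_stage = StageSketch(embed_tokens=embed, layers=[add_one], norm=None)
last_stage = StageSketch(embed_tokens=None, layers=[add_one], norm=scale)

handoff = first_stage.forward([1, 2])
final = last_stage.forward(handoff)
```

The first stage hands un-normalised hidden states to the next stage, and only the last stage applies the final norm.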

nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model.initialize_weights(
    buffer_device: torch.device | None = None
) -> None

Initialize model weights according to NemotronV3 spec.

After PP splitting, embed_tokens may be None on non-first stages and norm may be None on non-last stages; guard each.

Parameters:

buffer_device

torch.device | None. Defaults to None

Device to use for buffer initialization

nemo_automodel.components.models.nemotron_v3.model.ModelClass = NemotronHForCausalLM