nemo_automodel.components.models.nemotron_v3.model

Module Contents

Classes

| Name | Description |
| --- | --- |
| NemotronHCausalLMOutputWithPast | CausalLMOutputWithPast plus declared MTP fields. |
| NemotronHForCausalLM | NemotronV3 model with language modeling head. |
| NemotronV3Model | NemotronV3 base model (without LM head). |

Data

ModelClass

API

class nemo_automodel.components.models.nemotron_v3.model.NemotronHCausalLMOutputWithPast(
mtp_per_depth_h: typing.Optional[list[torch.Tensor]] = None,
mtp_loss_scaling_factor: typing.Optional[float] = None
)
Dataclass

Bases: CausalLMOutputWithPast

CausalLMOutputWithPast plus declared MTP fields.

The MTP per-depth hidden states and scaling factor must be regular dataclass fields (rather than dynamically-set attributes) so they survive output-restructuring layers like FSDP2’s mixed-precision output cast, which rebuild ModelOutput instances from declared fields only.
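
The difference between a declared field and a dynamically-set attribute can be illustrated with a minimal, framework-free sketch. `BaseOutput`, `OutputWithMTP`, and `rebuild_from_declared_fields` are hypothetical stand-ins, not the library's classes; the rebuild helper only imitates a layer that reconstructs an output from its declared fields.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class BaseOutput:
    """Hypothetical stand-in for a ModelOutput-style container."""
    logits: Optional[list] = None


@dataclass
class OutputWithMTP(BaseOutput):
    """Declares the extra MTP value as a real dataclass field."""
    mtp_loss_scaling_factor: Optional[float] = None


def rebuild_from_declared_fields(out):
    """Imitate a layer that reconstructs the output from declared fields only."""
    return type(out)(**{f.name: getattr(out, f.name) for f in fields(out)})


# Dynamically-set attribute: dropped when the output is rebuilt.
dynamic = BaseOutput(logits=[0.1, 0.9])
dynamic.mtp_loss_scaling_factor = 0.1
rebuilt_dynamic = rebuild_from_declared_fields(dynamic)

# Declared field: carried through the rebuild.
declared = OutputWithMTP(logits=[0.1, 0.9], mtp_loss_scaling_factor=0.1)
rebuilt_declared = rebuild_from_declared_fields(declared)
```

In this sketch `rebuilt_dynamic` no longer has the MTP attribute, while `rebuilt_declared` keeps it.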

mtp_loss_scaling_factor
Optional[float] = None
mtp_per_depth_h
Optional[list[Tensor]] = None
class nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
mtp_loss_scaling_factor: float = 0.1,
num_nextn_predict_layers: int | None = None,
mtp_use_repeated_layer: bool = False,
**kwargs
)

Bases: HFCheckpointingMixin, GenerationMixin, Module, MoEFSDPSyncMixin

NemotronV3 model with language modeling head.

Supports .generate() from transformers.generation.GenerationMixin with O(1) per-step KV caching for attention layers and recurrent state caching for Mamba2 layers.

_is_stateful
bool = True
_keep_in_fp32_modules_strict
= ['e_score_correction_bias']
_pp_keep_self_forward
bool = True
backend
= backend or BackendConfig()
device
device

Return the device of the first model parameter (required by GenerationMixin).

dtype
dtype

Return the dtype of the first model parameter (used by cache construction).

generation_config
= GenerationConfig()
lm_head
main_input_name
str = 'input_ids'
model
mtp
mtp_config
output_hidden_states
state_dict_adapter
nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM._build_mtp_embed_inputs_for_pp(
input_ids: torch.Tensor
) -> tuple[torch.Tensor, ...]

Build the per-depth rolled-token embeddings on the first PP stage.

The first PP stage owns embed_tokens and is the only rank that can produce the future-token embeddings consumed by the MTP head on the final stage. The tuple flows alongside hidden_states through every intermediate stage as additional positional outputs (see forward).

Parameters:

input_ids
torch.Tensor

Token ids [B, S] (int).

Returns: tuple[torch.Tensor, ...]

Tuple of length self.mtp_config.num_layers containing
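
As a rough illustration of "per-depth rolled tokens", the sketch below builds one rolled copy of the token ids per MTP depth using plain lists. The roll direction (left), the shift of k positions for depth k, and the wrap-around behaviour are assumptions made for illustration; the embedding lookup that the real method performs is omitted, and both function names are hypothetical.

```python
def roll_left(tokens, shift):
    """Rotate a token row left by `shift` positions, wrapping around."""
    if not tokens:
        return tokens
    shift %= len(tokens)
    return tokens[shift:] + tokens[:shift]


def build_rolled_token_ids(input_ids, num_depths):
    """Return one rolled copy of `input_ids` per depth (depth k rolls by k)."""
    return tuple(
        [roll_left(row, depth) for row in input_ids]
        for depth in range(1, num_depths + 1)
    )


batch = [[10, 11, 12, 13]]  # token ids with shape [B=1, S=4]
rolled = build_rolled_token_ids(batch, num_depths=2)
```

The result is a tuple with one entry per depth, matching the "tuple of length `num_layers`" shape of the return value.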

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM._is_pipeline_parallel_stage() -> bool

True when this module instance has been trimmed to a PP stage subset.

Detection mirrors DeepseekV4ForCausalLM._is_pipeline_parallel_stage. Any one of the following is sufficient, because the PP splitter nulls these attributes when trimming: (a) lm_head is None, (b) the inner embed_tokens is None, or (c) the model.layers count diverges from config.num_hidden_layers.

The checks use hasattr to distinguish “splitter nulled the attribute” (attribute present, value is None) from “caller replaced self.model with a stub that doesn’t declare the attribute” (attribute absent). Tests that swap in stub inner modules should not be misclassified as PP stages.
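
The hasattr-based distinction described above can be sketched as follows. This is a simplified illustration using `SimpleNamespace` objects rather than real modules; `looks_like_pp_stage` and its `expected_num_layers` argument are hypothetical names, not the method's actual implementation.

```python
from types import SimpleNamespace


def looks_like_pp_stage(module, expected_num_layers):
    """Return True if any trimmed-stage signal is present.

    hasattr() separates "attribute present but None" (nulled by a splitter)
    from "attribute absent" (a stub that never declared it).
    """
    if hasattr(module, "lm_head") and module.lm_head is None:
        return True
    inner = getattr(module, "model", None)
    if inner is None:
        return False
    if hasattr(inner, "embed_tokens") and inner.embed_tokens is None:
        return True
    if hasattr(inner, "layers") and len(inner.layers) != expected_num_layers:
        return True
    return False


full = SimpleNamespace(
    lm_head="head",
    model=SimpleNamespace(embed_tokens="emb", layers=[0, 1, 2, 3]),
)
trimmed = SimpleNamespace(
    lm_head=None,
    model=SimpleNamespace(embed_tokens="emb", layers=[2, 3]),
)
# Stub inner module that declares none of the attributes.
stub = SimpleNamespace(lm_head="head", model=SimpleNamespace())
```

Here `trimmed` is classified as a PP stage, while `full` and `stub` are not.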

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM._make_causal_mask(
query_len: int,
kv_len: int,
batch_size: int,
dtype: torch.dtype,
device: torch.device
) -> torch.Tensor
staticmethod

Build a 4D SDPA-compatible causal mask.

Prefill (query_len == kv_len): standard lower-triangular causal mask. Decode (query_len == 1): all-zeros row allowing attention to all cached positions.
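
The two cases can be covered by one rule, sketched below with nested lists instead of tensors. This assumes an additive mask in which 0.0 means "may attend" and negative infinity means "blocked", and a `[batch_size, 1, query_len, kv_len]` layout; the actual fill value, dtype handling, and layout of the method are not stated here, so treat them as assumptions.

```python
NEG_INF = float("-inf")


def make_causal_mask(query_len, kv_len, batch_size):
    """Return a nested-list mask of shape [batch_size, 1, query_len, kv_len].

    Query row i sits at absolute position (kv_len - query_len + i) and may
    attend to every key position at or before it.
    """
    offset = kv_len - query_len
    rows = [
        [0.0 if j <= i + offset else NEG_INF for j in range(kv_len)]
        for i in range(query_len)
    ]
    # Copy the rows per batch element so the entries do not alias each other.
    return [[[row[:] for row in rows]] for _ in range(batch_size)]


prefill = make_causal_mask(query_len=3, kv_len=3, batch_size=1)
decode = make_causal_mask(query_len=1, kv_len=4, batch_size=1)
```

With `query_len == kv_len` the rule yields the lower-triangular pattern, and with `query_len == 1` it yields a single all-zeros row.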

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.customize_pipeline_stage_modules(
module_names_per_stage: list[list[str]],
layers_prefix: str,
text_model: torch.nn.Module | None = None
) -> list[list[str]]

Pin the MTP head to the last PP stage’s FQN list.

Called by split_model_into_stages (functional.py:494-502) after the default per-stage FQN auto-generation. The auto-generator includes embed_tokens on the first stage and norm/lm_head on the last stage but doesn’t know about model.mtp; this hook appends it.
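
A minimal sketch of the "pin to the last stage" idea, operating on plain lists of names. The helper name `pin_module_to_last_stage` and the FQN strings used here (including `"mtp"`) are hypothetical; the real hook's FQNs depend on `layers_prefix` and the model layout.

```python
def pin_module_to_last_stage(module_names_per_stage, module_fqn):
    """Return a copy of the per-stage FQN lists with `module_fqn` on the last stage only."""
    stages = [
        [name for name in stage if name != module_fqn]
        for stage in module_names_per_stage
    ]
    if stages:
        stages[-1].append(module_fqn)
    return stages


# Hypothetical auto-generated layout for two stages.
auto_generated = [
    ["model.embed_tokens", "model.layers.0"],
    ["model.layers.1", "model.norm", "lm_head"],
]
pinned = pin_module_to_last_stage(auto_generated, "mtp")
```

Returning a copy keeps the auto-generated lists untouched, which makes the hook safe to call more than once.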

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.forward(
input_ids: typing.Optional[torch.LongTensor] = None,
*mtp_embed_inputs: torch.Tensor,
attention_mask: typing.Optional[torch.Tensor] = None,
causal_mask_mapping: typing.Optional[dict[str, torch.Tensor]] = None,
inputs_embeds: typing.Optional[torch.FloatTensor] = None,
labels: typing.Optional[torch.LongTensor] = None,
past_key_values: typing.Optional[typing.Any] = None,
use_cache: typing.Optional[bool] = None,
cache_position: typing.Optional[torch.LongTensor] = None,
position_ids: typing.Optional[torch.LongTensor] = None,
padding_mask: typing.Optional[torch.Tensor] = None,
logits_to_keep: typing.Union[int, torch.Tensor] = 0,
output_hidden_states: typing.Optional[bool] = None,
return_dict: typing.Optional[bool] = None,
**kwargs: typing.Any
) -> transformers.modeling_outputs.CausalLMOutputWithPast

Forward pass with optional loss computation.

Supports both BSHD format (input_ids shape [B, S]) and THD format (input_ids shape [T] after squeeze_input_for_thd). When kwargs["qkv_format"] == "thd" AND the attention backend is TE, inputs are squeezed to THD before the base-model forward and logits are unsqueezed back to [1, T, V] on exit. SDPA / flex stay in BSHD.

Pipeline-parallel awareness: when run as a PP stage, input_ids is the upstream stage’s hidden-state tensor on non-first stages, and *mtp_embed_inputs carries num_nextn_predict_layers future-token embeddings produced by the first stage. See the Returns section below for the per-stage tuple contract. The single-rank (no-PP) path returns NemotronHCausalLMOutputWithPast unchanged.

Parameters:

input_ids
Optional[torch.LongTensor]Defaults to None

Input token IDs. BSHD: [B, S]; THD: [1, T] (squeezed internally). On non-first PP stages this slot instead carries the upstream stage’s hidden-state tensor.

*mtp_embed_inputs
torch.TensorDefaults to ()

Pre-computed future-token embeddings produced by the first PP stage and forwarded between stages as positional args. Empty on the single-rank (no-PP) path.

attention_mask
Optional[torch.Tensor]Defaults to None

2D padding mask [B, S].

causal_mask_mapping
Optional[dict[str, torch.Tensor]]Defaults to None

Dict with precomputed 4D causal masks (key "full_attention" is consumed).

inputs_embeds
Optional[torch.FloatTensor]Defaults to None

Pre-computed input embeddings (optional).

labels
Optional[torch.LongTensor]Defaults to None

Token IDs for loss computation [B, S] (optional; under PP, loss is computed by PipelineCausalLMLoss).

past_key_values
Optional[Any]Defaults to None

Optional NemotronHybridCache for incremental decoding.

use_cache
Optional[bool]Defaults to None

Whether to return past_key_values for subsequent steps.

cache_position
Optional[torch.LongTensor]Defaults to None

Token position indices for cache updates.

position_ids
Optional[torch.LongTensor]Defaults to None

Position IDs (forwarded into MTP sublayer kwargs).

padding_mask
Optional[torch.Tensor]Defaults to None

Padding mask [B, S] used by the THD squeeze helper and as the MoE / mamba 2D mask source.

logits_to_keep
Union[int, torch.Tensor]Defaults to 0

If > 0, only compute logits for the last logits_to_keep token positions.

output_hidden_states
Optional[bool]Defaults to None

Whether to return hidden states.

return_dict
Optional[bool]Defaults to None

Accepted for API compatibility (always returns a NemotronHCausalLMOutputWithPast off-PP).

**kwargs
AnyDefaults to {}

Additional arguments forwarded to the base model (e.g. qkv_format, cu_seqlens, cu_seqlens_padded, max_seqlen, seq_idx, cp_rank, cp_size, _packed_seq_ids).
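
The integer form of the logits_to_keep parameter described above follows a common slicing idiom, sketched here on a plain list standing in for the sequence dimension. `select_positions` is a hypothetical helper, and the tensor form of the parameter is not covered.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Keep only the last `logits_to_keep` positions; 0 keeps everything.

    slice(-0, None) equals slice(0, None), so the default needs no special case.
    """
    return hidden_states[slice(-logits_to_keep, None)]


positions = ["h0", "h1", "h2", "h3"]
all_positions = select_positions(positions)           # default: keep all
last_only = select_positions(positions, logits_to_keep=1)
```

During decoding only the final position is needed to pick the next token, which is why keeping a single position avoids computing logits for the whole sequence.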

Returns: CausalLMOutputWithPast

Off-PP: NemotronHCausalLMOutputWithPast with logits,

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.from_config(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs
)
classmethod

Create model from config.

Parameters:

config

NemotronH config

backend
BackendConfig | NoneDefaults to None

Backend configuration

**kwargs
Defaults to {}

Additional arguments

Returns:

NemotronHForCausalLM instance

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs
)
classmethod

Load pretrained model.

Parameters:

pretrained_model_name_or_path
str

Path or name of pretrained model

*model_args
Defaults to ()

Additional positional arguments

**kwargs
Defaults to {}

Additional keyword arguments

Returns:

NemotronHForCausalLM instance

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.get_input_embeddings()
nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.get_output_embeddings()
nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.get_pipeline_stage_metas(
is_first: bool,
microbatch_size: int,
seq_len: int,
dtype: torch.dtype
) -> tuple[tuple[torch.Tensor, ...], tuple[torch.Tensor, ...]]

Return analytical (inputs_meta, outputs_meta) for a PP stage.

Inter-stage tensors are plain [B, S, H] (no HC stream). With MTP enabled, every transfer carries 1 + D tensors (the hidden states plus D per-depth future-token embeddings), so the variadic forward signature is exercised on every microbatch.
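
The size of each inter-stage transfer can be sketched with shape tuples only. This assumes the MTP embeddings share the `[B, S, H]` shape of the hidden states and takes `hidden_size` and `num_mtp_depths` as explicit arguments; the real method reads such values from the model and returns tensors, not tuples.

```python
def inter_stage_shapes(microbatch_size, seq_len, hidden_size, num_mtp_depths):
    """Shapes passed between two PP stages: 1 hidden state plus D embeddings."""
    shape = (microbatch_size, seq_len, hidden_size)
    return (shape,) * (1 + num_mtp_depths)


without_mtp = inter_stage_shapes(2, 128, 64, num_mtp_depths=0)
with_mtp = inter_stage_shapes(2, 128, 64, num_mtp_depths=2)
```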

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.gradient_checkpointing_disable()

Unwrap any checkpoint-wrapped blocks (inverse of gradient_checkpointing_enable).

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.gradient_checkpointing_enable(
gradient_checkpointing_kwargs = None
)

Enable activation checkpointing on each transformer (and MTP) block.

Wraps every decoder block (and MTP block, when present) with a non-reentrant checkpoint wrapper so that block activations are recomputed during the backward pass instead of being stored. This is the single-GPU entry point: FSDP2Manager.parallelize calls it when world_size == 1 (the expert-parallel path performs the equivalent wrapping inside the MoE parallelizer’s apply_ac). Without it, the hybrid Mamba2/Attention MoE keeps every block’s activations live, which is what pushes single-GPU LoRA SFT past the memory of a single 80GB device. The call is idempotent.

Parameters:

gradient_checkpointing_kwargs
Defaults to None

Accepted for HF API compatibility; currently unused (NO_REENTRANT wrapping is always used).
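
The idempotent wrap/unwrap behaviour can be illustrated without any framework. `CheckpointWrapper` below is a hypothetical marker class that only delegates calls; a real checkpoint wrapper would additionally recompute activations in the backward pass.

```python
class CheckpointWrapper:
    """Hypothetical marker wrapper; it only delegates to the wrapped block."""

    def __init__(self, block):
        self.block = block

    def __call__(self, x):
        return self.block(x)


def enable_checkpointing(blocks):
    """Wrap each block exactly once, so repeated calls are harmless."""
    for name, block in blocks.items():
        if not isinstance(block, CheckpointWrapper):
            blocks[name] = CheckpointWrapper(block)


def disable_checkpointing(blocks):
    """Unwrap any wrapped block (inverse of enable_checkpointing)."""
    for name, block in blocks.items():
        if isinstance(block, CheckpointWrapper):
            blocks[name] = block.block


blocks = {"0": lambda x: x + 1, "1": lambda x: x * 2}
enable_checkpointing(blocks)
enable_checkpointing(blocks)  # second call must not double-wrap
```

The `isinstance` guard is what makes the second call a no-op.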

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16
) -> None

Initialize model weights.

PP-aware: skips lm_head and mtp initialization when those have been trimmed to None on a non-owning stage. self.model itself also internally guards embed_tokens and norm.

Parameters:

buffer_device
torch.device | NoneDefaults to None

Device to use for buffer initialization

dtype
torch.dtypeDefaults to torch.bfloat16

Target dtype for model weights

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.prepare_inputs_for_generation(
input_ids: torch.LongTensor,
attention_mask: typing.Optional[torch.Tensor] = None,
inputs_embeds: typing.Optional[torch.FloatTensor] = None,
past_key_values: typing.Optional[typing.Any] = None,
cache_position: typing.Optional[torch.LongTensor] = None,
use_cache: typing.Optional[bool] = True,
**kwargs
) -> dict

Prepare model inputs for each generation step.

On the first call (prefill), creates a NemotronHybridCache and forwards the full prompt. On subsequent calls (decode), only the newly generated token is forwarded.

Parameters:

input_ids
torch.LongTensor

Accumulated token ids [batch_size, current_seq_len].

attention_mask
Optional[torch.Tensor]Defaults to None

Padding mask [batch_size, current_seq_len].

inputs_embeds
Optional[torch.FloatTensor]Defaults to None

Pre-computed embeddings for the first step (optional).

past_key_values
Optional[Any]Defaults to None

NemotronHybridCache from the previous step (None on first call).

cache_position
Optional[torch.LongTensor]Defaults to None

Token position indices.

use_cache
Optional[bool]Defaults to True

Whether to use caching (default True).

**kwargs
Defaults to {}

Remaining model kwargs.

Returns: dict

Dict of keyword arguments to pass to forward.
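
The prefill/decode split can be sketched with plain lists. `ToyCache` and `prepare_inputs` are hypothetical stand-ins; the real method also handles attention_mask, inputs_embeds, and cache_position, which are left out here.

```python
class ToyCache:
    """Hypothetical stand-in for a hybrid cache object."""


def prepare_inputs(input_ids, past_key_values=None, use_cache=True):
    """Prefill forwards the whole prompt; decode forwards only the newest token."""
    if past_key_values is None:
        # First call: forward the full prompt and create a cache if caching is on.
        cache = ToyCache() if use_cache else None
        return {"input_ids": input_ids, "past_key_values": cache, "use_cache": use_cache}
    # Later calls: the cache already covers the earlier tokens.
    last_tokens = [row[-1:] for row in input_ids]
    return {"input_ids": last_tokens, "past_key_values": past_key_values, "use_cache": use_cache}


prefill = prepare_inputs([[5, 6, 7]])
decode = prepare_inputs([[5, 6, 7, 8]], past_key_values=prefill["past_key_values"])
```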

nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.set_input_embeddings(
value
)
nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM.set_output_embeddings(
new_embeddings
)
class nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
moe_overrides: dict | None = None
)

Bases: Module

NemotronV3 base model (without LM head).

This is a hybrid architecture with Mamba2, Attention, MLP, and MoE layers.

_keep_in_fp32_modules_strict
= ['e_score_correction_bias']
backend
= backend or BackendConfig()
embed_tokens
layers
= nn.ModuleDict()
moe_config
= moe_config or MoEConfig(**moe_defaults)
norm
nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model.forward(
input_ids: torch.LongTensor | None = None,
attention_mask: torch.Tensor | None = None,
causal_mask_mapping: dict[str, torch.Tensor] | None = None,
inputs_embeds: torch.Tensor | None = None,
past_key_values = None,
cache_position: torch.LongTensor | None = None,
**kwargs: typing.Any
) -> torch.Tensor

Forward pass through the model. Supports BSHD [B, S, H] and THD [T, H].

Pipeline-parallel awareness: when self.embed_tokens is None (non-first PP stage), input_ids is interpreted as the upstream hidden-state tensor and routed through the inputs_embeds branch. When self.norm is None (non-last PP stage), the final norm is skipped.
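
The two None checks can be sketched with ordinary callables in place of modules. `stage_forward` is a hypothetical helper using numbers instead of tensors; it only illustrates the control flow, not the model's layers.

```python
def stage_forward(x, embed_tokens, layers, norm):
    """Run one stage; `embed_tokens` / `norm` are None on stages that do not own them."""
    # Non-first stage: x is already the upstream hidden state.
    hidden = embed_tokens(x) if embed_tokens is not None else x
    for layer in layers:
        hidden = layer(hidden)
    # Non-last stage: skip the final norm.
    return norm(hidden) if norm is not None else hidden


# First stage owns the embedding but not the norm.
first = stage_forward(3, embed_tokens=lambda t: t * 10, layers=[lambda h: h + 1], norm=None)
# Last stage receives the upstream hidden state and owns the norm.
last = stage_forward(first, embed_tokens=None, layers=[lambda h: h + 1], norm=lambda h: h / 2)
```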

nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model.initialize_weights(
buffer_device: torch.device | None = None
) -> None

Initialize model weights according to NemotronV3 spec.

After PP splitting, embed_tokens may be None on non-first stages and norm may be None on non-last stages; guard each.

Parameters:

buffer_device
torch.device | NoneDefaults to None

Device to use for buffer initialization

nemo_automodel.components.models.nemotron_v3.model.ModelClass = NemotronHForCausalLM