nemo_automodel.components.models.nemotron_v3.model#

Module Contents#

Classes#

NemotronV3Model

NemotronV3 base model (without LM head).

NemotronHForCausalLM

NemotronV3 model with language modeling head.

Data#

API#

class nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
*,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
)#

Bases: torch.nn.Module

NemotronV3 base model (without LM head).

This is a hybrid architecture with Mamba2, Attention, MLP, and MoE layers.

Initialization

Initialize NemotronV3Model.

Parameters:
  • config – NemotronH config with model parameters

  • backend – Backend configuration for MoE and other components

  • moe_config – MoE configuration (optional, will create default if None)

forward(
input_ids: torch.LongTensor | None = None,
*,
attention_mask: torch.Tensor | None = None,
causal_mask_mapping: dict[str, torch.Tensor] | None = None,
inputs_embeds: torch.Tensor | None = None,
past_key_values=None,
cache_position: torch.LongTensor | None = None,
**kwargs: Any,
) → torch.Tensor#

Forward pass through the model.

Parameters:
  • input_ids – Input token IDs [batch_size, seq_len] (optional)

  • attention_mask – 2D padding mask [batch_size, seq_len] (1=real, 0=padding)

  • causal_mask_mapping – Dict with precomputed 4D causal masks for attention layers

  • inputs_embeds – Input embeddings [batch_size, seq_len, hidden_size] (optional)

  • past_key_values – Optional NemotronHybridCache for incremental decoding.

  • cache_position – Token position indices for cache updates.

  • **kwargs – Additional arguments (ignored)

Returns:

Hidden states tensor [batch_size, seq_len, hidden_size]

initialize_weights(buffer_device: torch.device | None = None) → None#

Initialize model weights according to the NemotronV3 spec.

Parameters:

buffer_device – Device to use for buffer initialization

class nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.HFCheckpointingMixin, transformers.generation.GenerationMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

NemotronV3 model with language modeling head.

Supports .generate() from transformers.generation.GenerationMixin with O(1) per-step KV caching for attention layers and recurrent state caching for Mamba2 layers.

Initialization

Initialize NemotronHForCausalLM.

Parameters:
  • config – NemotronH config

  • backend – Backend configuration

  • **kwargs – Additional arguments

_is_stateful: bool#

True

main_input_name: str#

'input_ids'

classmethod from_config(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Create model from config.

Parameters:
  • config – NemotronH config

  • backend – Backend configuration

  • **kwargs – Additional arguments

Returns:

NemotronHForCausalLM instance

classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#

Load pretrained model.

Parameters:
  • pretrained_model_name_or_path – Path or name of pretrained model

  • *model_args – Additional positional arguments

  • **kwargs – Additional keyword arguments

Returns:

NemotronHForCausalLM instance

property device: torch.device#

Return the device of the first model parameter (required by GenerationMixin).

property dtype: torch.dtype#

Return the dtype of the first model parameter (used by cache construction).

get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
causal_mask_mapping: Optional[dict[str, torch.Tensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
past_key_values: Optional[Any] = None,
use_cache: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
**kwargs: Any,
) → transformers.modeling_outputs.CausalLMOutputWithPast#

Forward pass with optional loss computation.

Parameters:
  • input_ids – Input token IDs [batch_size, seq_len] (optional)

  • attention_mask – 2D padding mask [batch_size, seq_len]

  • causal_mask_mapping – Dict with precomputed 4D causal masks

  • inputs_embeds – Pre-computed input embeddings (optional)

  • labels – Token IDs for loss computation [batch_size, seq_len] (optional)

  • past_key_values – Optional NemotronHybridCache for incremental decoding.

  • use_cache – Whether to return past_key_values for subsequent steps.

  • cache_position – Token position indices for cache updates.

  • position_ids – Unused – accepted for API compatibility with GenerationMixin.

  • logits_to_keep – If > 0, only compute logits for the last logits_to_keep token positions (avoids materialising the full logit matrix during generation).

  • output_hidden_states – Whether to return hidden states

  • return_dict – Accepted for API compatibility (always returns CausalLMOutputWithPast)

  • **kwargs – Additional arguments forwarded to the base model.

Returns:

transformers.modeling_outputs.CausalLMOutputWithPast with logits (float32, [batch_size, seq_len, vocab_size]), optional loss, past_key_values, and hidden_states.
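The logits_to_keep parameter avoids materialising the full [batch_size, seq_len, vocab_size] logit matrix during decoding by slicing the hidden states before the LM-head projection. A minimal sketch of that slicing, assuming a standalone helper (slice_hidden_for_logits is illustrative, not part of this API):

```python
import torch


def slice_hidden_for_logits(hidden_states: torch.Tensor, logits_to_keep) -> torch.Tensor:
    # hidden_states: [batch, seq_len, hidden]. Keep only the positions whose
    # logits are actually needed, so the LM head projects a smaller tensor.
    if isinstance(logits_to_keep, int):
        if logits_to_keep == 0:
            return hidden_states  # 0 means "compute logits for all positions"
        return hidden_states[:, -logits_to_keep:, :]  # last k positions
    # A tensor of indices selects explicit positions instead.
    return hidden_states[:, logits_to_keep, :]
```

During single-token decoding, logits_to_keep=1 reduces the projection from seq_len rows to one row per sequence.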

static _make_causal_mask(
query_len: int,
kv_len: int,
batch_size: int,
dtype: torch.dtype,
device: torch.device,
) → torch.Tensor#

Build a 4D SDPA-compatible causal mask.

Prefill (query_len == kv_len): standard lower-triangular causal mask. Decode (query_len == 1): all-zeros row allowing attention to all cached positions.
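The prefill/decode split described above can be sketched in plain PyTorch. This is a simplified illustration of an additive SDPA mask, not the actual implementation:

```python
import torch


def make_causal_mask(query_len: int, kv_len: int, batch_size: int,
                     dtype: torch.dtype, device: torch.device) -> torch.Tensor:
    if query_len == kv_len:
        # Prefill: additive lower-triangular mask; positions above the
        # diagonal get -inf so a token cannot attend to future tokens.
        mask = torch.full((query_len, kv_len), float("-inf"),
                          dtype=dtype, device=device)
        mask = torch.triu(mask, diagonal=1)
    else:
        # Decode (query_len == 1): an all-zeros row lets the new token
        # attend to every cached position.
        mask = torch.zeros((query_len, kv_len), dtype=dtype, device=device)
    # Expand to the 4D shape [batch, 1, query_len, kv_len] that SDPA accepts.
    return mask[None, None].expand(batch_size, 1, query_len, kv_len)
```

An additive mask of this shape can be passed directly as attn_mask to torch.nn.functional.scaled_dot_product_attention.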

prepare_inputs_for_generation(
input_ids: torch.LongTensor,
attention_mask: Optional[torch.Tensor] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
past_key_values: Optional[Any] = None,
cache_position: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = True,
**kwargs,
) → dict#

Prepare model inputs for each generation step.

On the first call (prefill), creates a NemotronHybridCache and forwards the full prompt. On subsequent calls (decode), only the newly generated token is forwarded.

Parameters:
  • input_ids – Accumulated token ids [batch_size, current_seq_len].

  • attention_mask – Padding mask [batch_size, current_seq_len].

  • inputs_embeds – Pre-computed embeddings for the first step (optional).

  • past_key_values – NemotronHybridCache from the previous step (None on first call).

  • cache_position – Token position indices.

  • use_cache – Whether to use caching (default True).

  • **kwargs – Remaining model kwargs.

Returns:

Dict of keyword arguments to pass to forward().
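The prefill/decode dispatch can be sketched as follows. This is a simplified, hypothetical reimplementation that only shows input selection; the real method also manages the NemotronHybridCache and the 4D masks:

```python
import torch


def prepare_inputs_for_generation(input_ids: torch.LongTensor,
                                  past_key_values=None,
                                  attention_mask=None,
                                  cache_position=None,
                                  use_cache=True,
                                  **kwargs) -> dict:
    if past_key_values is None:
        # Prefill: forward the full prompt; positions start at 0.
        cache_position = torch.arange(input_ids.shape[1])
    else:
        # Decode: forward only the newest token; its position continues
        # from where the cache left off.
        input_ids = input_ids[:, -1:]
        if cache_position is not None:
            cache_position = cache_position[-1:] + 1
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "past_key_values": past_key_values,
        "cache_position": cache_position,
        "use_cache": use_cache,
        **kwargs,
    }
```

On the decode path only one token is forwarded per step, which is what makes generation O(1) per step given the cached attention and Mamba2 states.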

initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16,
) → None#

Initialize model weights.

Parameters:
  • buffer_device – Device to use for buffer initialization

  • dtype – Target dtype for model weights

nemo_automodel.components.models.nemotron_v3.model.ModelClass#

None