nemo_automodel.components.models.nemotron_v3.model#
Module Contents#
Classes#
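| NemotronHCausalLMOutputWithPast | CausalLMOutputWithPast plus declared MTP fields. |
| NemotronV3Model | NemotronV3 base model (without LM head). |
| NemotronHForCausalLM | NemotronV3 model with language modeling head. |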
Data#
API#
- class nemo_automodel.components.models.nemotron_v3.model.NemotronHCausalLMOutputWithPast#
Bases:
transformers.modeling_outputs.CausalLMOutputWithPast
CausalLMOutputWithPast plus declared MTP fields.
The MTP per-depth hidden states and scaling factor must be regular dataclass fields (rather than dynamically set attributes) so they survive output-restructuring layers like FSDP2's mixed-precision output cast, which rebuild ModelOutput instances from declared fields only.
- mtp_per_depth_h: Optional[list[torch.Tensor]]#
None
- mtp_loss_scaling_factor: Optional[float]#
None
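The declared-field requirement above can be illustrated with a minimal, dependency-free sketch. `ToyOutput` and `rebuild_from_declared_fields` are hypothetical stand-ins, not the library's classes or FSDP2's actual cast logic; the sketch only assumes that a wrapper rebuilds the output from its declared dataclass fields.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ToyOutput:
    """Hypothetical stand-in for a ModelOutput-style dataclass."""

    logits: list
    mtp_loss_scaling_factor: Optional[float] = None  # declared field


def rebuild_from_declared_fields(out):
    """Mimic a wrapper that reconstructs an output from declared fields only."""
    return type(out)(**{f.name: getattr(out, f.name) for f in fields(out)})


out = ToyOutput(logits=[0.1, 0.9], mtp_loss_scaling_factor=0.1)
out.extra_note = "set dynamically"  # attribute that is NOT a declared field

rebuilt = rebuild_from_declared_fields(out)
declared_survives = rebuilt.mtp_loss_scaling_factor == 0.1
dynamic_survives = hasattr(rebuilt, "extra_note")
```

In this toy setting the declared field is carried over to the rebuilt instance, while the dynamically set attribute is silently dropped.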
- class nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model(
- config,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- *,
- moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
- moe_overrides: dict | None = None,
Bases:
torch.nn.Module
NemotronV3 base model (without LM head).
This is a hybrid architecture with Mamba2, Attention, MLP, and MoE layers.
Initialization
Initialize NemotronV3Model.
- Parameters:
config – NemotronH config with model parameters
backend – Backend configuration for MoE and other components
moe_config – MoE configuration (optional; a default is created if None)
moe_overrides – Optional dict of overrides to apply to the default MoE config
- forward(
- input_ids: torch.LongTensor | None = None,
- *,
- attention_mask: torch.Tensor | None = None,
- causal_mask_mapping: dict[str, torch.Tensor] | None = None,
- inputs_embeds: torch.Tensor | None = None,
- past_key_values=None,
- cache_position: torch.LongTensor | None = None,
- **kwargs: Any,
Forward pass through the model. Supports BSHD [B, S, H] and THD [T, H] layouts.
- initialize_weights(buffer_device: torch.device | None = None) None#
Initialize model weights according to NemotronV3 spec.
- Parameters:
buffer_device – Device to use for buffer initialization
- class nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM(
- config,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- *,
- mtp_loss_scaling_factor: float = 0.1,
- num_nextn_predict_layers: int | None = None,
- mtp_use_repeated_layer: bool = False,
- **kwargs,
Bases:
nemo_automodel.components.models.common.HFCheckpointingMixin, transformers.generation.GenerationMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin
NemotronV3 model with language modeling head.
Supports .generate() from transformers.generation.GenerationMixin with O(1) per-step KV caching for attention layers and recurrent state caching for Mamba2 layers.
Initialization
Initialize NemotronHForCausalLM.
- Parameters:
config – NemotronH config.
backend – Backend configuration.
mtp_loss_scaling_factor – Auxiliary-loss weight for the MTP head (default 0.1). Programmatic override only; not exposed as a YAML knob, to keep recipe configs auto-detected.
num_nextn_predict_layers – Optional override for the HF config's num_nextn_predict_layers field (i.e. the MTP forward iteration count). When None, the value from config is used. Set explicitly when the trained model used weight-tied MTP (mtp_use_repeated_layer=True) and the HF export only retains the physical depth count.
mtp_use_repeated_layer – When True, build a single physical MTP depth and reuse it across all iterations. Mirrors Megatron's --mtp-use-repeated-layer. Defaults to False.
**kwargs – Additional arguments. Recognized keys: moe_config, moe_overrides.
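The relationship between the iteration count and the physical depth count under weight tying can be sketched as follows. This is an illustration under stated assumptions, not the model's construction code: `build_mtp_depths` and `ToyLayer` are hypothetical names, and the real MTP layers are far more involved.

```python
def build_mtp_depths(num_iterations, use_repeated_layer, make_layer):
    """Return one layer reference per MTP forward iteration.

    With use_repeated_layer=True, a single layer object is shared by every
    iteration (weight tying); otherwise each iteration gets its own layer.
    """
    if use_repeated_layer:
        shared = make_layer()
        return [shared] * num_iterations
    return [make_layer() for _ in range(num_iterations)]


class ToyLayer:
    """Hypothetical placeholder for one MTP depth."""


tied = build_mtp_depths(3, True, ToyLayer)
untied = build_mtp_depths(3, False, ToyLayer)

# Count distinct layer objects, i.e. the "physical" depth count.
num_physical_tied = len({id(layer) for layer in tied})
num_physical_untied = len({id(layer) for layer in untied})
```

In the tied case there are three forward iterations but only one physical layer, which is why a checkpoint that records only the physical depth count would need the iteration count supplied separately.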
- _is_stateful: bool#
True
- main_input_name: str#
"input_ids"
- classmethod from_config(
- config,
- backend: nemo_automodel.components.models.common.BackendConfig | None = None,
- **kwargs,
Create model from config.
- Parameters:
config – NemotronH config
backend – Backend configuration
**kwargs – Additional arguments
- Returns:
NemotronHForCausalLM instance
- classmethod from_pretrained(
- pretrained_model_name_or_path: str,
- *model_args,
- **kwargs,
Load pretrained model.
- Parameters:
pretrained_model_name_or_path – Path or name of pretrained model
*model_args – Additional positional arguments
**kwargs – Additional keyword arguments
- Returns:
NemotronHForCausalLM instance
- property device: torch.device#
Return the device of the first model parameter (required by GenerationMixin).
- property dtype: torch.dtype#
Return the dtype of the first model parameter (used by cache construction).
- get_input_embeddings()#
- set_input_embeddings(value)#
- get_output_embeddings()#
- set_output_embeddings(new_embeddings)#
- forward(
- input_ids: Optional[torch.LongTensor] = None,
- attention_mask: Optional[torch.Tensor] = None,
- causal_mask_mapping: Optional[dict[str, torch.Tensor]] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- labels: Optional[torch.LongTensor] = None,
- past_key_values: Optional[Any] = None,
- use_cache: Optional[bool] = None,
- cache_position: Optional[torch.LongTensor] = None,
- position_ids: Optional[torch.LongTensor] = None,
- padding_mask: Optional[torch.Tensor] = None,
- logits_to_keep: Union[int, torch.Tensor] = 0,
- output_hidden_states: Optional[bool] = None,
- return_dict: Optional[bool] = None,
- **kwargs: Any,
Forward pass with optional loss computation.
Supports both BSHD format (input_ids shape [B, S]) and THD format (input_ids shape [T] after squeeze_input_for_thd). When kwargs["qkv_format"] == "thd", inputs are squeezed to THD before the base-model forward and logits are unsqueezed back to [1, T, V] on exit.
- Parameters:
input_ids – Input token IDs. BSHD: [B, S]; THD: [1, T] (squeezed internally).
attention_mask – 2D padding mask [B, S].
causal_mask_mapping – Dict with precomputed 4D causal masks.
inputs_embeds – Pre-computed input embeddings (optional).
labels – Token IDs for loss computation [B, S] (optional).
past_key_values – Optional NemotronHybridCache for incremental decoding.
use_cache – Whether to return past_key_values for subsequent steps.
cache_position – Token position indices for cache updates.
position_ids – Unused; accepted for API compatibility with GenerationMixin.
padding_mask – Padding mask [B, S] used by the THD squeeze helper.
logits_to_keep – If > 0, only compute logits for the last logits_to_keep token positions (avoids materialising the full logit matrix during generation).
output_hidden_states – Whether to return hidden states.
return_dict – Accepted for API compatibility (always returns CausalLMOutputWithPast).
**kwargs – Additional arguments forwarded to the base model (e.g. seq_idx, cu_seqlens, qkv_format, CP kwargs).
- Returns:
transformers.modeling_outputs.CausalLMOutputWithPast with logits (float32), optional loss, past_key_values, and hidden_states.
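The logits_to_keep behaviour described in the parameter list can be sketched with plain Python lists. This is a simplified illustration, not the model's code: it covers only the integer case, works on a single sequence, and `select_positions` is a hypothetical helper name.

```python
def select_positions(hidden_states, logits_to_keep=0):
    """Pick the positions whose logits would be computed.

    hidden_states: list of per-position hidden vectors for one sequence.
    """
    # slice(-0, None) equals slice(0, None), so 0 keeps every position.
    keep = slice(-logits_to_keep, None)
    return hidden_states[keep]


hidden = ["h0", "h1", "h2", "h3"]  # placeholders for hidden vectors
all_positions = select_positions(hidden)               # default: keep all
last_only = select_positions(hidden, logits_to_keep=1)  # decode-style step
```

During generation only the final position is needed to pick the next token, so projecting just that slice through the LM head avoids building logits for the whole prompt.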
- static _make_causal_mask(
- query_len: int,
- kv_len: int,
- batch_size: int,
- dtype: torch.dtype,
- device: torch.device,
Build a 4D SDPA-compatible causal mask.
Prefill (query_len == kv_len): standard lower-triangular causal mask. Decode (query_len == 1): all-zeros row allowing attention to all cached positions.
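The two cases above can be sketched without any tensor library. The sketch assumes an additive mask convention (0.0 where attention is allowed, -inf where it is blocked) and omits the batch and head dimensions of the real 4D mask; `make_causal_mask` here is an illustrative function, not the static method itself.

```python
NEG_INF = float("-inf")


def make_causal_mask(query_len, kv_len):
    """Return a [query_len][kv_len] additive mask as nested lists."""
    # Queries are the last query_len positions of a kv_len-long sequence.
    offset = kv_len - query_len  # number of already-cached positions
    return [
        [0.0 if k <= q + offset else NEG_INF for k in range(kv_len)]
        for q in range(query_len)
    ]


prefill = make_causal_mask(query_len=3, kv_len=3)  # lower-triangular
decode = make_causal_mask(query_len=1, kv_len=5)   # single all-zeros row
```

With query_len == kv_len the offset is zero and the result is the usual lower-triangular pattern; with query_len == 1 every cached position is visible to the new token.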
- prepare_inputs_for_generation(
- input_ids: torch.LongTensor,
- attention_mask: Optional[torch.Tensor] = None,
- inputs_embeds: Optional[torch.FloatTensor] = None,
- past_key_values: Optional[Any] = None,
- cache_position: Optional[torch.LongTensor] = None,
- use_cache: Optional[bool] = True,
- **kwargs,
Prepare model inputs for each generation step.
On the first call (prefill), creates a NemotronHybridCache and forwards the full prompt. On subsequent calls (decode), only the newly generated token is forwarded.
- Parameters:
input_ids – Accumulated token ids [batch_size, current_seq_len].
attention_mask – Padding mask [batch_size, current_seq_len].
inputs_embeds – Pre-computed embeddings for the first step (optional).
past_key_values – NemotronHybridCache from the previous step (None on first call).
cache_position – Token position indices.
use_cache – Whether to use caching (default True).
**kwargs – Remaining model kwargs.
- Returns:
Dict of keyword arguments to pass to forward().
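The prefill/decode split described above can be sketched as follows. This is a simplified illustration with hypothetical names: `select_step_inputs` is not the library method, cache creation is left out, and a plain object stands in for NemotronHybridCache.

```python
def select_step_inputs(input_ids, past_key_values=None):
    """Choose which tokens to forward at one generation step.

    input_ids: list of token-id lists, one per batch row.
    """
    if past_key_values is None:
        # Prefill: no cache yet, so the full prompt is forwarded.
        return {"input_ids": input_ids, "use_cache": True}
    # Decode: earlier tokens live in the cache; forward only the newest one.
    return {
        "input_ids": [row[-1:] for row in input_ids],
        "past_key_values": past_key_values,
        "use_cache": True,
    }


toy_cache = object()  # hypothetical placeholder for a cache instance
prefill_inputs = select_step_inputs([[5, 6, 7]])
decode_inputs = select_step_inputs([[5, 6, 7, 8]], past_key_values=toy_cache)
```

Forwarding a single token per decode step is what keeps the per-step cost independent of how many tokens have already been generated.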
- initialize_weights(
- buffer_device: torch.device | None = None,
- dtype: torch.dtype = torch.bfloat16,
Initialize model weights.
- Parameters:
buffer_device – Device to use for buffer initialization
dtype – Target dtype for model weights
- nemo_automodel.components.models.nemotron_v3.model.ModelClass#
None