nemo_automodel.components.models.nemotron_v3.model#

Module Contents#

Classes#

NemotronV3Model

NemotronV3 base model (without LM head).

NemotronHForCausalLM

NemotronV3 model with language modeling head.

Data#

ModelClass

API#

class nemo_automodel.components.models.nemotron_v3.model.NemotronV3Model(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
*,
moe_config: nemo_automodel.components.moe.config.MoEConfig | None = None,
)#

Bases: torch.nn.Module

NemotronV3 base model (without LM head).

This is a hybrid architecture with Mamba2, Attention, MLP, and MoE layers.

Initialization

Initialize NemotronV3Model.

Parameters:
  • config – NemotronH config with model parameters

  • backend – Backend configuration for MoE and other components

  • moe_config – MoE configuration (optional, will create default if None)
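
A minimal construction sketch follows. It assumes a NemotronH-compatible config object (here called config) has already been obtained elsewhere, for example from a checkpoint's configuration file; passing backend=None and moe_config=None simply selects the documented defaults.

    from nemo_automodel.components.models.nemotron_v3.model import NemotronV3Model

    # `config` is an externally obtained NemotronH config object (assumption);
    # backend=None uses the default backend, moe_config=None lets the model
    # create a default MoE configuration internally.
    model = NemotronV3Model(config, backend=None, moe_config=None)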

forward(
input_ids: torch.LongTensor | None = None,
*,
attention_mask: torch.Tensor | None = None,
causal_mask_mapping: dict[str, torch.Tensor] | None = None,
inputs_embeds: torch.Tensor | None = None,
**kwargs: Any,
) → torch.Tensor#

Forward pass through the model.

Parameters:
  • input_ids – Input token IDs [batch_size, seq_len] (optional)

  • attention_mask – 2D padding mask [batch_size, seq_len] (1=real, 0=padding)

  • causal_mask_mapping – Dict with precomputed 4D causal masks for attention layers

  • inputs_embeds – Input embeddings [batch_size, seq_len, hidden_size] (optional)

  • **kwargs – Additional arguments (ignored)

Returns:

Hidden states tensor [batch_size, seq_len, hidden_size]
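
A hedged usage sketch: it assumes model is the NemotronV3Model constructed above, that the NemotronH config exposes vocab_size, and that pad_id holds the tokenizer's padding token id (both are assumptions, not part of this API).

    import torch

    batch_size, seq_len = 2, 16
    input_ids = torch.randint(0, config.vocab_size, (batch_size, seq_len))  # [batch_size, seq_len]
    attention_mask = (input_ids != pad_id).long()  # 2D padding mask: 1 = real token, 0 = padding

    # Returns hidden states of shape [batch_size, seq_len, hidden_size].
    hidden_states = model(input_ids, attention_mask=attention_mask)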

initialize_weights(buffer_device: torch.device | None = None) → None#

Initialize model weights according to NemotronV3 spec.

Parameters:

buffer_device – Device to use for buffer initialization
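
A short sketch of the call; placing buffers on the current accelerator is a typical-usage assumption, not a requirement of the API (buffer_device=None is also accepted).

    import torch

    # Initialize weights, keeping buffers on the GPU when one is available.
    buffer_device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    model.initialize_weights(buffer_device=buffer_device)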

class nemo_automodel.components.models.nemotron_v3.model.NemotronHForCausalLM(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Bases: nemo_automodel.components.models.common.HFCheckpointingMixin, torch.nn.Module, nemo_automodel.components.moe.fsdp_mixin.MoEFSDPSyncMixin

NemotronV3 model with language modeling head.

Initialization

Initialize NemotronHForCausalLM.

Parameters:
  • config – NemotronH config

  • backend – Backend configuration

  • **kwargs – Additional arguments
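
Direct construction mirrors the base model. The sketch below again assumes an externally obtained NemotronH config object; the from_config classmethod documented next provides the same result through a factory method.

    from nemo_automodel.components.models.nemotron_v3.model import NemotronHForCausalLM

    # `config` is an externally obtained NemotronH config object (assumption).
    lm = NemotronHForCausalLM(config, backend=None)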

classmethod from_config(
config,
backend: nemo_automodel.components.models.common.BackendConfig | None = None,
**kwargs,
)#

Create model from config.

Parameters:
  • config – NemotronH config

  • backend – Backend configuration

  • **kwargs – Additional arguments

Returns:

NemotronHForCausalLM instance
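
A minimal sketch of the factory route, under the same assumption that config is a NemotronH config obtained elsewhere; it returns a NemotronHForCausalLM instance equivalent to direct construction.

    from nemo_automodel.components.models.nemotron_v3.model import NemotronHForCausalLM

    lm = NemotronHForCausalLM.from_config(config, backend=None)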

classmethod from_pretrained(
pretrained_model_name_or_path: str,
*model_args,
**kwargs,
)#

Load pretrained model.

Parameters:
  • pretrained_model_name_or_path – Path or name of pretrained model

  • *model_args – Additional positional arguments

  • **kwargs – Additional keyword arguments

Returns:

NemotronHForCausalLM instance
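
A hedged loading sketch; the path below is a placeholder for a real NemotronH/NemotronV3 checkpoint directory or hub id, not a verified model name.

    from nemo_automodel.components.models.nemotron_v3.model import NemotronHForCausalLM

    # Placeholder checkpoint location (assumption); substitute a real checkpoint.
    lm = NemotronHForCausalLM.from_pretrained("/path/to/nemotron-checkpoint")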

forward(
input_ids: torch.LongTensor | None = None,
*,
attention_mask: torch.Tensor | None = None,
causal_mask_mapping: dict[str, torch.Tensor] | None = None,
**kwargs: Any,
) → torch.Tensor | dict[str, torch.Tensor]#

Forward pass with optional loss computation.

Parameters:
  • input_ids – Input token IDs [batch_size, seq_len] (optional)

  • attention_mask – 2D padding mask [batch_size, seq_len]

  • causal_mask_mapping – Dict with precomputed 4D causal masks

  • **kwargs – Additional arguments

Returns:

Logits tensor [batch_size, seq_len, vocab_size], or a dict of tensors when loss computation is requested
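
A usage sketch mirroring the base-model forward. It assumes lm was created as above, that the config exposes vocab_size, and that pad_id holds the tokenizer's padding token id; per the return annotation, a dict of tensors may come back instead of a plain logits tensor when loss computation is involved.

    import torch

    input_ids = torch.randint(0, config.vocab_size, (2, 16))  # [batch_size, seq_len]
    attention_mask = (input_ids != pad_id).long()  # 1 = real token, 0 = padding

    # Returns logits of shape [batch_size, seq_len, vocab_size]
    # (or a dict of tensors when loss computation is requested).
    logits = lm(input_ids, attention_mask=attention_mask)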

initialize_weights(
buffer_device: torch.device | None = None,
dtype: torch.dtype = torch.bfloat16,
) → None#

Initialize model weights.

Parameters:
  • buffer_device – Device to use for buffer initialization

  • dtype – Target dtype for model weights
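
A short initialization sketch using the documented default dtype; targeting the GPU for buffers is a typical-usage assumption.

    import torch

    device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    lm.initialize_weights(buffer_device=device, dtype=torch.bfloat16)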

nemo_automodel.components.models.nemotron_v3.model.ModelClass#

None