nemo_automodel.components.models.qwen2.model#

Custom Qwen2 model implementation for NeMo Automodel.

This module provides a self-contained Qwen2 implementation with combined QKV and gate_up projections, built on the shared fused-projection components in common/.

Example (YAML):

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-7B
  use_fused_qkv: true
  use_fused_gate_up: true
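The YAML block above is resolved by instantiating the `_target_` callable with the remaining keys as keyword arguments. A minimal sketch of that resolution (the exact resolver lives in NeMo Automodel's config machinery; the dict below mirrors the YAML and the call is commented out because it would download weights):

```python
import importlib

# Mirrors the YAML example; "_target_" names the callable, the rest are kwargs.
yaml_cfg = {
    "_target_": "nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained",
    "pretrained_model_name_or_path": "Qwen/Qwen2.5-7B",
    "use_fused_qkv": True,
    "use_fused_gate_up": True,
}
target = yaml_cfg.pop("_target_")
module_path, attr = target.rsplit(".", 1)
# model = getattr(importlib.import_module(module_path), attr)(**yaml_cfg)  # downloads weights
```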

Module Contents#

Classes#

Qwen2Attention

Multi-headed attention with combined QKV projection.

Qwen2DecoderLayer

Single Qwen2 decoder layer with RMSNorm, attention, and combined MLP.

Qwen2PreTrainedModel

Abstract class for Qwen2 pretrained models.

Qwen2Model

Qwen2 transformer model (embeddings + decoder layers + norm).

Qwen2ForCausalLM

Qwen2 model with causal language modeling head.

Data#

API#

nemo_automodel.components.models.qwen2.model.__all__#

['Qwen2ForCausalLM']

nemo_automodel.components.models.qwen2.model.check_model_inputs#

'get_check_model_inputs_decorator(…)'

class nemo_automodel.components.models.qwen2.model.Qwen2Attention(config: transformers.Qwen2Config, layer_idx: int)#

Bases: nemo_automodel.components.models.common.CombinedQKVAttentionMixin, torch.nn.Module

Multi-headed attention with combined QKV projection.

Uses CombinedQKVAttentionMixin for an efficient combined QKV projection. Combined projections are always enabled; they are the point of this custom implementation.
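A combined QKV projection replaces three separate linear layers with one GEMM whose output is split into query, key, and value. A minimal sketch with hypothetical dimensions (the real layout lives in CombinedQKVAttentionMixin and may differ in detail):

```python
import torch
import torch.nn as nn

# Hypothetical small config: 4 query heads, 2 KV heads (GQA), head_dim 16.
hidden, n_heads, n_kv_heads, head_dim = 64, 4, 2, 16
qkv_dim = (n_heads + 2 * n_kv_heads) * head_dim  # one GEMM instead of three
qkv_proj = nn.Linear(hidden, qkv_dim, bias=True)  # Qwen2 uses QKV biases

x = torch.randn(2, 8, hidden)  # (batch, seq, hidden)
qkv = qkv_proj(x)
# Split the fused output back into per-role tensors along the last dim.
q, k, v = qkv.split(
    [n_heads * head_dim, n_kv_heads * head_dim, n_kv_heads * head_dim], dim=-1
)
```

The single projection reduces kernel launch overhead and lets the weights be stored contiguously; checkpoints with separate q/k/v weights are concatenated on load.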

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor],
past_key_values: Optional[transformers.cache_utils.Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) tuple[torch.Tensor, torch.Tensor]#
class nemo_automodel.components.models.qwen2.model.Qwen2DecoderLayer(
config: transformers.Qwen2Config,
layer_idx: int,
backend: nemo_automodel.components.models.common.BackendConfig,
)#

Bases: transformers.modeling_layers.GradientCheckpointingLayer

Single Qwen2 decoder layer with RMSNorm, attention, and combined MLP.

Combined projections are always enabled; they are the point of this custom implementation.
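The combined MLP fuses the SwiGLU gate and up projections into one linear layer. A sketch under the assumption that the gate and up halves are concatenated along the output dimension (verify the exact layout against the shared components in common/):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical small config; real Qwen2 sizes come from the model config.
hidden, intermediate = 64, 128
gate_up_proj = nn.Linear(hidden, 2 * intermediate, bias=False)  # fused gate+up
down_proj = nn.Linear(intermediate, hidden, bias=False)

x = torch.randn(2, 8, hidden)
gate, up = gate_up_proj(x).chunk(2, dim=-1)  # split fused output in half
y = down_proj(F.silu(gate) * up)             # SwiGLU activation
```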

Initialization

forward(
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) torch.Tensor#
class nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

Abstract class for Qwen2 pretrained models.

config_class#

None

base_model_prefix#

'model'

supports_gradient_checkpointing#

True

_no_split_modules#

['Qwen2DecoderLayer']

_skip_keys_device_placement#

['past_key_values']

_supports_flash_attn#

True

_supports_sdpa#

True

_supports_flex_attn#

True

_can_compile_fullgraph#

True

_supports_attention_backend#

True

_can_record_outputs#

None

class nemo_automodel.components.models.qwen2.model.Qwen2Model(
config: transformers.Qwen2Config,
backend: nemo_automodel.components.models.common.BackendConfig,
)#

Bases: nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel

Qwen2 transformer model (embeddings + decoder layers + norm).

Combined projections are always enabled; they are the point of this custom implementation.

Initialization

forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) transformers.modeling_outputs.BaseModelOutputWithPast#
class nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM(
config: transformers.Qwen2Config,
backend: Optional[nemo_automodel.components.models.common.BackendConfig] = None,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel

Qwen2 model with causal language modeling head.

Combined projections are always enabled; they are the point of this custom implementation.

Initialization

_tied_weights_keys#

None

_tp_plan#

None

_pp_plan#

None

get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) transformers.modeling_outputs.CausalLMOutputWithPast#

Forward pass returning CausalLMOutputWithPast.
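The `logits_to_keep` parameter follows the Hugging Face convention (assumed here): an integer N keeps logits only for the last N positions, and 0 keeps all positions, which avoids materializing the full vocabulary projection during generation. A sketch of the slicing on a stand-in hidden-state tensor:

```python
import torch

# Stand-in for the final hidden states: (batch, seq, hidden).
hidden_states = torch.randn(2, 8, 64)

logits_to_keep = 4
# An int N > 0 selects the last N positions; 0 selects everything.
slice_indices = (
    slice(-logits_to_keep, None)
    if isinstance(logits_to_keep, int) and logits_to_keep
    else slice(None)
)
kept = hidden_states[:, slice_indices, :]  # only these rows hit the LM head
```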

nemo_automodel.components.models.qwen2.model.ModelClass#

None