nemo_automodel.components.models.qwen2.model

Custom Qwen2 model implementation for NeMo Automodel.

This module provides a self-contained Qwen2 implementation that keeps the q/k/v projections and the gate/up projections as separate linear layers, matching the HuggingFace weight layout.

Example (YAML):

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-7B
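
The `_target_` key in the YAML above follows the Hydra-style convention of naming a dotted path to a callable, with the remaining keys passed as keyword arguments. The sketch below shows that idea in isolation. It is an assumption about the general pattern, not NeMo Automodel's actual config loader; the `instantiate` helper is hypothetical, and a standard-library target stands in for `from_pretrained` so the snippet runs without NeMo installed.

```python
import importlib


def instantiate(node: dict):
    """Resolve a Hydra-style ``_target_`` entry and call it with the remaining keys."""
    kwargs = {k: v for k, v in node.items() if k != "_target_"}
    parts = node["_target_"].split(".")
    # Import the longest importable module prefix, then follow the rest of the
    # path as attributes (this covers targets such as ``module.Class.classmethod``).
    for split in range(len(parts) - 1, 0, -1):
        try:
            obj = importlib.import_module(".".join(parts[:split]))
        except ImportError:
            continue
        for attr in parts[split:]:
            obj = getattr(obj, attr)
        return obj(**kwargs)
    raise ImportError(f"cannot resolve {node['_target_']!r}")


# Stand-in target from the standard library (hypothetical example config).
delta = instantiate({"_target_": "datetime.timedelta", "hours": 2})
print(delta.total_seconds())
```

Under this reading, the YAML example would call `nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained` with `pretrained_model_name_or_path="Qwen/Qwen2.5-7B"`.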

Module Contents

Classes

Name	Description
`Qwen2Attention`	Multi-headed attention with separate QKV projections (the HuggingFace default layout).
`Qwen2DecoderLayer`	Single Qwen2 decoder layer with RMSNorm, attention, and MLP.
`Qwen2ForCausalLM`	Qwen2 model with causal language modeling head.
`Qwen2Model`	Qwen2 transformer model (embeddings + decoder layers + norm).
`Qwen2PreTrainedModel`	Abstract class for Qwen2 pretrained models.
`Qwen2SeparateMLP`	SwiGLU MLP with separate gate_proj and up_proj, identical to the HuggingFace default.

Data

ModelClass

__all__

check_model_inputs

API

class nemo_automodel.components.models.qwen2.model.Qwen2Attention(
    config: transformers.Qwen2Config,
    layer_idx: int,
    backend: typing.Optional['BackendConfig'] = None
)

Bases: Module

Multi-headed attention with separate QKV projections (the HuggingFace default layout).

attention_dropout

= config.attention_dropout

head_dim

k_proj

num_key_value_groups

o_proj

q_proj

rope_fusion

= getattr(backend, 'rope_fusion', False)

scaling

= self.head_dim ** -0.5

sliding_window

v_proj

nemo_automodel.components.models.qwen2.model.Qwen2Attention.forward(
    hidden_states: torch.Tensor,
    position_embeddings: tuple[torch.Tensor, torch.Tensor],
    attention_mask: typing.Optional[torch.Tensor],
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> tuple[torch.Tensor, torch.Tensor]
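
The attributes listed for `Qwen2Attention` are mostly derived from the config. The sketch below shows one plausible derivation, assuming the usual HuggingFace conventions (`head_dim = hidden_size // num_attention_heads` and grouped-query attention). Only `scaling = head_dim ** -0.5` is stated on this page; `ToyConfig`, `attention_dims`, and the numeric values are hypothetical and not taken from any released checkpoint.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Illustrative values only.
    hidden_size: int = 1024
    num_attention_heads: int = 16
    num_key_value_heads: int = 4


def attention_dims(cfg: ToyConfig) -> dict:
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    return {
        "head_dim": head_dim,
        # Number of query heads that share each key/value head.
        "num_key_value_groups": cfg.num_attention_heads // cfg.num_key_value_heads,
        # Matches the documented `scaling = self.head_dim ** -0.5`.
        "scaling": head_dim ** -0.5,
        # Output widths of the three separate projections.
        "q_out": cfg.num_attention_heads * head_dim,
        "k_out": cfg.num_key_value_heads * head_dim,
        "v_out": cfg.num_key_value_heads * head_dim,
    }


dims = attention_dims(ToyConfig())
print(dims)
```

With fewer key/value heads than query heads, `k_proj` and `v_proj` are narrower than `q_proj`, which is why keeping the three projections separate maps directly onto the HuggingFace checkpoint keys.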

class nemo_automodel.components.models.qwen2.model.Qwen2DecoderLayer(
    config: transformers.Qwen2Config,
    layer_idx: int,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: GradientCheckpointingLayer

Single Qwen2 decoder layer with RMSNorm, attention, and MLP.

attention_type

= config.layer_types[layer_idx]

hidden_size

= config.hidden_size

input_layernorm

mlp

= Qwen2SeparateMLP(config=config)

post_attention_layernorm

self_attn

nemo_automodel.components.models.qwen2.model.Qwen2DecoderLayer.forward(
    hidden_states: torch.Tensor,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    use_cache: typing.Optional[bool] = False,
    cache_position: typing.Optional[torch.LongTensor] = None,
    position_embeddings: typing.Optional[tuple[torch.Tensor, torch.Tensor]] = None,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> torch.Tensor
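
The description "RMSNorm, attention, and MLP" suggests the usual pre-norm residual arrangement. The sketch below illustrates that arrangement on plain Python lists, assuming the standard HuggingFace ordering (normalize, apply the sub-block, add the residual). The `rms_norm` and `decoder_layer` helpers are hypothetical stand-ins, and `attn` and `mlp` are placeholder callables rather than the real modules.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, weight: Vector, eps: float = 1e-6) -> Vector:
    """Scale x by the reciprocal of its root mean square, then by a learned weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def decoder_layer(
    x: Vector,
    attn: Callable[[Vector], Vector],
    mlp: Callable[[Vector], Vector],
    w_in: Vector,
    w_post: Vector,
) -> Vector:
    # Attention sub-block: normalize, attend, add the residual.
    h = [a + b for a, b in zip(x, attn(rms_norm(x, w_in)))]
    # MLP sub-block: normalize, transform, add the residual.
    return [a + b for a, b in zip(h, mlp(rms_norm(h, w_post)))]


# With sub-blocks that return zeros, only the residual path remains.
zero = lambda v: [0.0] * len(v)
print(decoder_layer([3.0, 4.0], zero, zero, [1.0, 1.0], [1.0, 1.0]))
```

In this sketch, `w_in` and `w_post` play the roles of the `input_layernorm` and `post_attention_layernorm` weights.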

class nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM(
    config: transformers.Qwen2Config,
    backend: typing.Optional[nemo_automodel.components.models.common.BackendConfig] = None
)

Bases: HFCheckpointingMixin, Qwen2PreTrainedModel

Qwen2 model with causal language modeling head.

Uses separate q/k/v and gate/up projections, following the HuggingFace layout.

_pp_plan

= {'lm_head': (['hidden_states'], ['logits'])}

_tied_weights_keys

= {'lm_head.weight': 'model.embed_tokens.weight'}

_tp_plan

= {'lm_head': 'colwise_rep'}

backend

= backend or BackendConfig()

lm_head

model

= Qwen2Model(config=config, backend=(self.backend))

state_dict_adapter

= Qwen2StateDictAdapter(config=(self.config))

vocab_size

= config.vocab_size

nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> transformers.modeling_outputs.CausalLMOutputWithPast

Forward pass returning CausalLMOutputWithPast.
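
The `logits_to_keep` parameter is not described on this page. In HuggingFace causal LM classes, an integer keeps the last N sequence positions before the LM head is applied (with `0` meaning all positions), and a tensor selects explicit indices. The sketch below assumes this module follows the same convention; `select_positions` is a hypothetical helper operating on a plain list instead of a hidden-state tensor.

```python
from typing import List, Sequence, Union


def select_positions(
    hidden: List[str], logits_to_keep: Union[int, Sequence[int]]
) -> List[str]:
    """Pick the sequence positions that would be passed to the LM head."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return hidden[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit position indices.
    return [hidden[i] for i in logits_to_keep]


positions = ["t0", "t1", "t2", "t3"]
print(select_positions(positions, 0))
print(select_positions(positions, 1))
print(select_positions(positions, [0, 2]))
```

Keeping only the last position is the common choice during generation, since only the next-token logits are needed.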

nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.get_input_embeddings()

nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.get_output_embeddings()

nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.set_input_embeddings(
    value
)

nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.set_output_embeddings(
    new_embeddings
)

class nemo_automodel.components.models.qwen2.model.Qwen2Model(
    config: transformers.Qwen2Config,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Qwen2PreTrainedModel

Qwen2 transformer model (embeddings + decoder layers + norm).

embed_tokens

has_sliding_layers

= 'sliding_attention' in self.config.layer_types

layers

norm

padding_idx

= config.pad_token_id

rotary_emb

vocab_size

= config.vocab_size

nemo_automodel.components.models.qwen2.model.Qwen2Model.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> transformers.modeling_outputs.BaseModelOutputWithPast
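
`has_sliding_layers` and each decoder layer's `attention_type` are both read from `config.layer_types`. The sketch below shows how such a per-layer list could be interpreted. The `describe_layers` and `may_attend` helpers are hypothetical, and the sliding-window rule (a query sees only the most recent `sliding_window` positions, itself included) is an assumption based on the HuggingFace convention rather than something this page states.

```python
from typing import Dict, List


def describe_layers(layer_types: List[str]) -> Dict[str, object]:
    return {
        # Mirrors the documented `'sliding_attention' in self.config.layer_types`.
        "has_sliding_layers": "sliding_attention" in layer_types,
        # Mirrors the documented `config.layer_types[layer_idx]` lookup.
        "attention_type_per_layer": dict(enumerate(layer_types)),
    }


def may_attend(q_pos: int, k_pos: int, attention_type: str, sliding_window: int) -> bool:
    """Whether the query at q_pos may attend to the key at k_pos."""
    causal = k_pos <= q_pos
    if attention_type == "sliding_attention":
        return causal and (q_pos - k_pos) < sliding_window
    return causal


info = describe_layers(["full_attention", "full_attention", "sliding_attention"])
print(info["has_sliding_layers"])
print(may_attend(5, 1, "full_attention", sliding_window=3))
print(may_attend(5, 1, "sliding_attention", sliding_window=3))
```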

class nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel()

Bases: PreTrainedModel

Abstract class for Qwen2 pretrained models.

_can_record_outputs

_no_split_modules

= ['Qwen2DecoderLayer']

_skip_keys_device_placement

= ['past_key_values']

base_model_prefix

= 'model'

class nemo_automodel.components.models.qwen2.model.Qwen2SeparateMLP(
    config: transformers.Qwen2Config
)

Bases: Module

SwiGLU MLP with separate gate_proj and up_proj, identical to the HuggingFace default.

act_fn

= ACT2FN[config.hidden_act]

down_proj

gate_proj

hidden_size

= config.hidden_size

intermediate_size

= config.intermediate_size

up_proj

nemo_automodel.components.models.qwen2.model.Qwen2SeparateMLP.forward(
    x: torch.Tensor
) -> torch.Tensor
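
As a rough illustration of what a SwiGLU MLP with separate gate and up projections computes, the sketch below evaluates `down_proj(act_fn(gate_proj(x)) * up_proj(x))` on plain Python lists. It assumes `config.hidden_act` resolves to SiLU and that the three projections are bias-free, as in the HuggingFace Qwen2 MLP; the `silu`, `linear`, and `swiglu_mlp` helpers and the toy weights are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def silu(v: float) -> float:
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def linear(weight: Matrix, x: Vector) -> Vector:
    """Bias-free linear layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def swiglu_mlp(x: Vector, gate_w: Matrix, up_w: Matrix, down_w: Matrix) -> Vector:
    gate = [silu(g) for g in linear(gate_w, x)]   # act_fn(gate_proj(x))
    up = linear(up_w, x)                          # up_proj(x)
    gated = [g * u for g, u in zip(gate, up)]     # elementwise product
    return linear(down_w, gated)                  # down_proj(...)


# Toy shapes: hidden_size = 2, intermediate_size = 3.
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
up_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
down_w = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
print(swiglu_mlp([1.0, 2.0], gate_w, up_w, down_w))
```

Because `gate_proj` and `up_proj` stay separate here, each maps one-to-one to a checkpoint tensor; a fused variant would concatenate the two weight matrices and split the result after a single matmul.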

nemo_automodel.components.models.qwen2.model.ModelClass = Qwen2ForCausalLM

nemo_automodel.components.models.qwen2.model.__all__ = ['Qwen2ForCausalLM']

nemo_automodel.components.models.qwen2.model.check_model_inputs = get_check_model_inputs_decorator()