nemo_automodel.components.models.qwen2.model


Custom Qwen2 model implementation for NeMo Automodel.

This module provides a self-contained Qwen2 implementation with separate HuggingFace-style q/k/v and gate/up projections.

Example (YAML):

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-7B
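
The `_target_` key names a callable by its dotted import path, and the remaining keys become its keyword arguments. As a rough illustration of that convention, the sketch below resolves such a path with the standard library. `resolve_target` and `instantiate` are hypothetical helpers written for this example, not NeMo Automodel's actual config loader, and a stdlib classmethod stands in for `from_pretrained` so the snippet runs without downloading a model.

```python
import importlib


def resolve_target(dotted_path: str):
    """Resolve a dotted path such as 'pkg.Class.method' to a Python object.

    Hypothetical helper for illustration only.
    """
    parts = dotted_path.split(".")
    # Try the longest importable module prefix first, then walk attributes.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            continue
        for attr in parts[split:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"cannot resolve {dotted_path!r}")


def instantiate(node: dict):
    """Call the `_target_` callable with the remaining keys as keyword arguments."""
    kwargs = {key: value for key, value in node.items() if key != "_target_"}
    return resolve_target(node["_target_"])(**kwargs)


# Stand-in config: `Fraction.from_float` plays the role of `from_pretrained`.
result = instantiate({"_target_": "fractions.Fraction.from_float", "f": 0.5})
print(result)
```

With the YAML above, the same pattern would call `NeMoAutoModelForCausalLM.from_pretrained(pretrained_model_name_or_path="Qwen/Qwen2.5-7B")`.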

Module Contents

Classes

| Name | Description |
| --- | --- |
| Qwen2Attention | Multi-headed attention with separate QKV projections, the HuggingFace default layout. |
| Qwen2DecoderLayer | Single Qwen2 decoder layer with RMSNorm, attention, and MLP. |
| Qwen2ForCausalLM | Qwen2 model with causal language modeling head. |
| Qwen2Model | Qwen2 transformer model (embeddings + decoder layers + norm). |
| Qwen2PreTrainedModel | Abstract class for Qwen2 pretrained models. |
| Qwen2SeparateMLP | SwiGLU MLP with separate gate_proj and up_proj, identical to the HuggingFace default. |

Data

ModelClass

__all__

check_model_inputs

API

class nemo_automodel.components.models.qwen2.model.Qwen2Attention(
    config: transformers.Qwen2Config,
    layer_idx: int,
    backend: typing.Optional['BackendConfig'] = None
)

Bases: Module

Multi-headed attention with separate QKV projections — HuggingFace default layout.

attention_dropout = config.attention_dropout
head_dim
k_proj
num_key_value_groups
o_proj
q_proj
rope_fusion = getattr(backend, 'rope_fusion', False)
scaling = self.head_dim ** -0.5
sliding_window
v_proj

nemo_automodel.components.models.qwen2.model.Qwen2Attention.forward(
    hidden_states: torch.Tensor,
    position_embeddings: tuple[torch.Tensor, torch.Tensor],
    attention_mask: typing.Optional[torch.Tensor],
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> tuple[torch.Tensor, torch.Tensor]
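
The attributes above are simple functions of the config. The sketch below shows how `head_dim`, `num_key_value_groups`, and `scaling` relate, assuming the usual HuggingFace conventions (`head_dim = hidden_size // num_attention_heads` and one key/value head shared by a group of query heads). `ToyConfig` and its values are illustrative stand-ins, not read from any checkpoint.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    """Illustrative stand-in for the Qwen2Config fields used here."""

    hidden_size: int = 3584
    num_attention_heads: int = 28
    num_key_value_heads: int = 4


def attention_geometry(cfg: ToyConfig) -> dict:
    # Assumed convention: head_dim defaults to hidden_size // num_attention_heads.
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    # Each key/value head is shared by this many query heads (grouped-query attention).
    num_key_value_groups = cfg.num_attention_heads // cfg.num_key_value_heads
    return {
        "head_dim": head_dim,
        "num_key_value_groups": num_key_value_groups,
        "scaling": head_dim ** -0.5,  # same formula as the `scaling` attribute
        "q_proj_out_features": cfg.num_attention_heads * head_dim,
        "kv_proj_out_features": cfg.num_key_value_heads * head_dim,
    }


geometry = attention_geometry(ToyConfig())
print(geometry)
```

Because the projections are separate, `k_proj` and `v_proj` can be narrower than `q_proj` whenever there are fewer key/value heads than query heads.
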
class nemo_automodel.components.models.qwen2.model.Qwen2DecoderLayer(
    config: transformers.Qwen2Config,
    layer_idx: int,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: GradientCheckpointingLayer

Single Qwen2 decoder layer with RMSNorm, attention, and MLP.

attention_type = config.layer_types[layer_idx]
hidden_size = config.hidden_size
input_layernorm
mlp = Qwen2SeparateMLP(config=config)
post_attention_layernorm
self_attn

nemo_automodel.components.models.qwen2.model.Qwen2DecoderLayer.forward(
    hidden_states: torch.Tensor,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    use_cache: typing.Optional[bool] = False,
    cache_position: typing.Optional[torch.LongTensor] = None,
    position_embeddings: typing.Optional[tuple[torch.Tensor, torch.Tensor]] = None,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> torch.Tensor
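
A decoder layer combines its two norms, attention, and MLP through residual connections. The pure-Python sketch below assumes the pre-norm ordering used by the upstream HuggingFace Qwen2 layer (normalize, apply the sub-block, add the residual). `rms_norm` and `decoder_layer` are illustrative names, and the attention and MLP callables are stubs rather than the real modules.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, weight: Vector, eps: float = 1e-6) -> Vector:
    """RMSNorm: divide x by its root mean square, then scale by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def decoder_layer(
    x: Vector,
    attn: Callable[[Vector], Vector],
    mlp: Callable[[Vector], Vector],
    input_ln_weight: Vector,
    post_attn_ln_weight: Vector,
) -> Vector:
    # Attention sub-block: normalize, attend, add the residual back.
    x = [r + a for r, a in zip(x, attn(rms_norm(x, input_ln_weight)))]
    # MLP sub-block: normalize, transform, add the residual back.
    x = [r + m for r, m in zip(x, mlp(rms_norm(x, post_attn_ln_weight)))]
    return x


def zero_block(h: Vector) -> Vector:
    """Stub sub-block that contributes nothing, so only the residual path remains."""
    return [0.0] * len(h)


hidden = [3.0, 4.0]
ones = [1.0, 1.0]
passthrough = decoder_layer(hidden, zero_block, zero_block, ones, ones)
normalized = rms_norm(hidden, ones)
print(passthrough, normalized)
```

With both sub-blocks stubbed to zero, the layer returns its input unchanged, which shows that the residual stream itself is never normalized in this arrangement.
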
class nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM(
    config: transformers.Qwen2Config,
    backend: typing.Optional[nemo_automodel.components.models.common.BackendConfig] = None
)

Bases: HFCheckpointingMixin, Qwen2PreTrainedModel

Qwen2 model with causal language modeling head.

Uses separate q/k/v and gate/up projections — HuggingFace layout.

_pp_plan = {'lm_head': (['hidden_states'], ['logits'])}
_tied_weights_keys = {'lm_head.weight': 'model.embed_tokens.weight'}
_tp_plan = {'lm_head': 'colwise_rep'}
backend = backend or BackendConfig()
lm_head
model = Qwen2Model(config=config, backend=self.backend)
state_dict_adapter = Qwen2StateDictAdapter(config=self.config)
vocab_size = config.vocab_size

nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    labels: typing.Optional[torch.LongTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    logits_to_keep: typing.Union[int, torch.Tensor] = 0,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> transformers.modeling_outputs.CausalLMOutputWithPast

Forward pass returning CausalLMOutputWithPast.
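
The `logits_to_keep` parameter controls which sequence positions are projected through `lm_head`. The sketch below assumes this module follows the upstream HuggingFace semantics: an integer keeps the last N positions (with 0 meaning all of them), while a tensor of indices selects positions explicitly. `select_positions` is a hypothetical helper, and plain lists stand in for tensors.

```python
from typing import List, Sequence, Union


def select_positions(
    hidden_states: List[str],
    logits_to_keep: Union[int, Sequence[int]] = 0,
) -> List[str]:
    """Pick the sequence positions that would be sent through lm_head."""
    if isinstance(logits_to_keep, int):
        # slice(-0, None) equals slice(0, None), so 0 keeps every position.
        return hidden_states[slice(-logits_to_keep, None)]
    # Otherwise treat the argument as explicit position indices.
    return [hidden_states[i] for i in logits_to_keep]


positions = ["t0", "t1", "t2", "t3", "t4"]
keep_all = select_positions(positions)           # default: every position
keep_last = select_positions(positions, 1)       # typical for step-wise generation
keep_some = select_positions(positions, [0, 2])  # explicit indices
print(keep_all, keep_last, keep_some)
```

Keeping only the last position during generation avoids materializing logits over the full vocabulary for every prompt token.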

nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.get_input_embeddings()
nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.get_output_embeddings()
nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.set_input_embeddings(value)
nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM.set_output_embeddings(new_embeddings)
class nemo_automodel.components.models.qwen2.model.Qwen2Model(
    config: transformers.Qwen2Config,
    backend: nemo_automodel.components.models.common.BackendConfig
)

Bases: Qwen2PreTrainedModel

Qwen2 transformer model (embeddings + decoder layers + norm).

embed_tokens
has_sliding_layers = 'sliding_attention' in self.config.layer_types
layers
norm
padding_idx = config.pad_token_id
rotary_emb
vocab_size = config.vocab_size

nemo_automodel.components.models.qwen2.model.Qwen2Model.forward(
    input_ids: typing.Optional[torch.LongTensor] = None,
    attention_mask: typing.Optional[torch.Tensor] = None,
    position_ids: typing.Optional[torch.LongTensor] = None,
    past_key_values: typing.Optional[transformers.cache_utils.Cache] = None,
    inputs_embeds: typing.Optional[torch.FloatTensor] = None,
    use_cache: typing.Optional[bool] = None,
    output_attentions: typing.Optional[bool] = None,
    output_hidden_states: typing.Optional[bool] = None,
    return_dict: typing.Optional[bool] = None,
    cache_position: typing.Optional[torch.LongTensor] = None,
    **kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs]
) -> transformers.modeling_outputs.BaseModelOutputWithPast
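
`has_sliding_layers` and each layer's `attention_type` are both derived from `config.layer_types`. The sketch below illustrates one plausible use of them, assuming the model builds one mask per attention type and each layer picks the mask matching its type, as the upstream HuggingFace model does. The `layer_types` list, the window size, and `causal_mask` are all illustrative.

```python
from typing import List, Optional

# Hypothetical per-layer schedule; real values come from `config.layer_types`.
layer_types = ["full_attention", "full_attention", "sliding_attention", "sliding_attention"]

# Same membership test as the `has_sliding_layers` attribute.
has_sliding_layers = "sliding_attention" in layer_types


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [
        [j <= i and (window is None or i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Build each mask once, then let every layer look up the one for its attention_type.
masks = {"full_attention": causal_mask(4)}
if has_sliding_layers:
    masks["sliding_attention"] = causal_mask(4, window=2)  # window size is illustrative

for layer_idx, attention_type in enumerate(layer_types):
    last_row = masks[attention_type][-1]
    print(layer_idx, attention_type, last_row)
```

Checking `has_sliding_layers` up front means a model with only full-attention layers never pays for building the sliding-window mask.
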
class nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel()

Bases: PreTrainedModel

Abstract class for Qwen2 pretrained models.

_can_record_outputs
_no_split_modules = ['Qwen2DecoderLayer']
_skip_keys_device_placement = ['past_key_values']
base_model_prefix = 'model'

class nemo_automodel.components.models.qwen2.model.Qwen2SeparateMLP(
    config: transformers.Qwen2Config
)

Bases: Module

SwiGLU MLP with separate gate_proj and up_proj — identical to HuggingFace default.

act_fn = ACT2FN[config.hidden_act]
down_proj
gate_proj
hidden_size = config.hidden_size
intermediate_size = config.intermediate_size
up_proj

nemo_automodel.components.models.qwen2.model.Qwen2SeparateMLP.forward(
    x: torch.Tensor
) -> torch.Tensor
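
A SwiGLU MLP with separate projections computes `down_proj(act_fn(gate_proj(x)) * up_proj(x))`. The pure-Python sketch below assumes `config.hidden_act` is `silu` and that the projections carry no bias, as in the upstream HuggingFace Qwen2 MLP. The helper names and the tiny weight matrices are made up for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weight: Matrix, x: Vector) -> Vector:
    """Bias-free linear layer; weight has shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_mlp(x: Vector, gate_w: Matrix, up_w: Matrix, down_w: Matrix) -> Vector:
    gate = [silu(g) for g in linear(gate_w, x)]  # act_fn(gate_proj(x))
    up = linear(up_w, x)                         # up_proj(x)
    gated = [g * u for g, u in zip(gate, up)]    # elementwise product
    return linear(down_w, gated)                 # down_proj(...)


# Toy sizes: hidden_size = 2, intermediate_size = 3.
x = [1.0, 2.0]
gate_w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
up_w = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
down_w = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]

out = swiglu_mlp(x, gate_w, up_w, down_w)
print(out)
```

Keeping `gate_proj` and `up_proj` as two matrices, rather than one fused matrix, matches the HuggingFace checkpoint layout.
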
nemo_automodel.components.models.qwen2.model.ModelClass = Qwen2ForCausalLM
nemo_automodel.components.models.qwen2.model.__all__ = ['Qwen2ForCausalLM']
nemo_automodel.components.models.qwen2.model.check_model_inputs = get_check_model_inputs_decorator()