nemo_automodel.components.models.qwen2.model#

Custom Qwen2 model implementation for NeMo Automodel.

This module provides a self-contained Qwen2 implementation with combined QKV and gate_up projections, built on the shared fused-projection components in common/.

Example (YAML):

model:
  _target_: nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained
  pretrained_model_name_or_path: Qwen/Qwen2.5-7B
  use_fused_qkv: true
  use_fused_gate_up: true
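The YAML block above is resolved by instantiating the `_target_` callable with the remaining keys as keyword arguments. A minimal sketch of that resolution (the exact resolver lives in NeMo Automodel's config machinery; the dict below mirrors the YAML and the call is commented out because it would download weights):

```python
import importlib

# Mirrors the YAML example; "_target_" names the callable, the rest are kwargs.
yaml_cfg = {
    "_target_": "nemo_automodel.NeMoAutoModelForCausalLM.from_pretrained",
    "pretrained_model_name_or_path": "Qwen/Qwen2.5-7B",
    "use_fused_qkv": True,
    "use_fused_gate_up": True,
}
target = yaml_cfg.pop("_target_")
module_path, attr = target.rsplit(".", 1)
# model = getattr(importlib.import_module(module_path), attr)(**yaml_cfg)  # downloads weights
```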

Module Contents#

Classes#

Qwen2Attention

Multi-headed attention with combined QKV projection.

Qwen2DecoderLayer

Single Qwen2 decoder layer with RMSNorm, attention, and combined MLP.

Qwen2PreTrainedModel

Abstract class for Qwen2 pretrained models.

Qwen2Model

Qwen2 transformer model (embeddings + decoder layers + norm).

Qwen2ForCausalLM

Qwen2 model with causal language modeling head.

Data#

API#

nemo_automodel.components.models.qwen2.model.__all__#

['Qwen2ForCausalLM']

nemo_automodel.components.models.qwen2.model.check_model_inputs#

'get_check_model_inputs_decorator(…)'

class nemo_automodel.components.models.qwen2.model.Qwen2Attention(config: transformers.Qwen2Config, layer_idx: int)#

Bases: nemo_automodel.components.models.common.CombinedQKVAttentionMixin, torch.nn.Module

Multi-headed attention with combined QKV projection.

Uses CombinedQKVAttentionMixin for an efficient combined QKV projection. Combined projections are always enabled; they are the point of this custom implementation.
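A combined QKV projection replaces three separate linear layers with one GEMM whose output is split into query, key, and value. A minimal sketch with hypothetical dimensions (the real layout lives in CombinedQKVAttentionMixin and may differ in detail):

```python
import torch
import torch.nn as nn

# Hypothetical small config: 4 query heads, 2 KV heads (GQA), head_dim 16.
hidden, n_heads, n_kv_heads, head_dim = 64, 4, 2, 16
qkv_dim = (n_heads + 2 * n_kv_heads) * head_dim  # one GEMM instead of three
qkv_proj = nn.Linear(hidden, qkv_dim, bias=True)  # Qwen2 uses QKV biases

x = torch.randn(2, 8, hidden)  # (batch, seq, hidden)
qkv = qkv_proj(x)
# Split the fused output back into per-role tensors along the last dim.
q, k, v = qkv.split(
    [n_heads * head_dim, n_kv_heads * head_dim, n_kv_heads * head_dim], dim=-1
)
```

The single projection reduces kernel launch overhead and lets the weights be stored contiguously; checkpoints with separate q/k/v weights are concatenated on load.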

Initialization

forward(
hidden_states: torch.Tensor,
position_embeddings: tuple[torch.Tensor, torch.Tensor],
attention_mask: Optional[torch.Tensor],
past_key_values: Optional[transformers.cache_utils.Cache] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) tuple[torch.Tensor, torch.Tensor]#
class nemo_automodel.components.models.qwen2.model.Qwen2DecoderLayer(
config: transformers.Qwen2Config,
layer_idx: int,
backend: nemo_automodel.components.models.common.BackendConfig,
)#

Bases: transformers.modeling_layers.GradientCheckpointingLayer

Single Qwen2 decoder layer with RMSNorm, attention, and combined MLP.

Combined projections are always enabled; they are the point of this custom implementation.
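The combined MLP fuses the SwiGLU gate and up projections into one linear layer. A sketch under the assumption that the gate and up halves are concatenated along the output dimension (verify the exact layout against the shared components in common/):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical small config; real Qwen2 sizes come from the model config.
hidden, intermediate = 64, 128
gate_up_proj = nn.Linear(hidden, 2 * intermediate, bias=False)  # fused gate+up
down_proj = nn.Linear(intermediate, hidden, bias=False)

x = torch.randn(2, 8, hidden)
gate, up = gate_up_proj(x).chunk(2, dim=-1)  # split fused output in half
y = down_proj(F.silu(gate) * up)             # SwiGLU activation
```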

Initialization

forward(
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
position_embeddings: Optional[tuple[torch.Tensor, torch.Tensor]] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) torch.Tensor#
class nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel#

Bases: transformers.modeling_utils.PreTrainedModel

Abstract class for Qwen2 pretrained models.

config_class#

None

base_model_prefix#

'model'

supports_gradient_checkpointing#

True

_no_split_modules#

['Qwen2DecoderLayer']

_skip_keys_device_placement#

['past_key_values']

_supports_flash_attn#

True

_supports_sdpa#

True

_supports_flex_attn#

True

_can_compile_fullgraph#

True

_supports_attention_backend#

True

_can_record_outputs#

None

class nemo_automodel.components.models.qwen2.model.Qwen2Model(
config: transformers.Qwen2Config,
backend: nemo_automodel.components.models.common.BackendConfig,
)#

Bases: nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel

Qwen2 transformer model (embeddings + decoder layers + norm).

Combined projections are always enabled; they are the point of this custom implementation.

Initialization

forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) transformers.modeling_outputs.BaseModelOutputWithPast#
class nemo_automodel.components.models.qwen2.model.Qwen2ForCausalLM(
config: transformers.Qwen2Config,
backend: Optional[nemo_automodel.components.models.common.BackendConfig] = None,
)#

Bases: nemo_automodel.components.models.common.hf_checkpointing_mixin.HFCheckpointingMixin, nemo_automodel.components.models.qwen2.model.Qwen2PreTrainedModel

Qwen2 model with causal language modeling head.

Combined projections are always enabled; they are the point of this custom implementation.

Initialization

_tied_weights_keys#

None

_tp_plan#

None

_pp_plan#

None

get_input_embeddings()#
set_input_embeddings(value)#
get_output_embeddings()#
set_output_embeddings(new_embeddings)#
forward(
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[transformers.cache_utils.Cache] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs: transformers.processing_utils.Unpack[transformers.utils.TransformersKwargs],
) transformers.modeling_outputs.CausalLMOutputWithPast#

Forward pass returning CausalLMOutputWithPast.
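The `logits_to_keep` parameter follows the Hugging Face convention (assumed here): an integer N keeps logits only for the last N positions, and 0 keeps all positions, which avoids materializing the full vocabulary projection during generation. A sketch of the slicing on a stand-in hidden-state tensor:

```python
import torch

# Stand-in for the final hidden states: (batch, seq, hidden).
hidden_states = torch.randn(2, 8, 64)

logits_to_keep = 4
# An int N > 0 selects the last N positions; 0 selects everything.
slice_indices = (
    slice(-logits_to_keep, None)
    if isinstance(logits_to_keep, int) and logits_to_keep
    else slice(None)
)
kept = hidden_states[:, slice_indices, :]  # only these rows hit the LM head
```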

nemo_automodel.components.models.qwen2.model.ModelClass#

None