core.models.common.embeddings.rotary_pos_embedding#

Module Contents#

Classes#

RotaryEmbedding

Rotary Embedding for language model.

MultimodalRotaryEmbedding

Multimodal Rotary Embedding for language model. Based on https://github.com/alibaba/Pai-Megatron-Patch/blob/efa5a752e845267936db9ae7df1b6aba92e9ff9a/megatron_patch/model/qwen2_vl/rotary_pos_embedding.py. Copyright (c) 2025 alibaba/Pai-Megatron-Patch. Apache 2.0 license.

Data#

API#

core.models.common.embeddings.rotary_pos_embedding.logger#

'getLogger(…)'

core.models.common.embeddings.rotary_pos_embedding.__all__#

['RotaryEmbedding', 'MultimodalRotaryEmbedding']

class core.models.common.embeddings.rotary_pos_embedding.RotaryEmbedding(
kv_channels: int,
rotary_percent: float,
rotary_interleaved: bool = False,
seq_len_interpolation_factor: float = None,
rotary_base: int = 10000,
rope_scaling: bool = False,
rope_scaling_factor: float = 8.0,
use_cpu_initialization: bool = False,
cp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: torch.nn.Module

Rotary Embedding for language model.

Parameters:
  • kv_channels (int) – Projection weights dimension in multi-head attention, obtained from the transformer config.

  • rotary_percent (float) – Percent of the rotary dimension to use for rotary position embeddings.

  • rotary_interleaved (bool, optional) – If True, use interleaved rotary position embeddings. Defaults to False.

  • seq_len_interpolation_factor (float, optional) – Scale factor for linearly interpolating RoPE to longer sequences. Must be a float larger than 1.0. Defaults to None.

  • rotary_base (int, optional) – Base period for rotary position embeddings. Defaults to 10000.

  • rope_scaling (bool, optional) – If True, apply RoPE scaling as used in Llama 3.x. Defaults to False.

  • rope_scaling_factor (float, optional) – RoPE scaling factor as used in Llama 3.x. Defaults to 8.0.

  • use_cpu_initialization (bool, optional) – If False, initialize inv_freq directly on the GPU. Defaults to False.

  • cp_group (torch.distributed.ProcessGroup, optional) – Process group for context parallelism. Defaults to None.

Initialization
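A minimal construction sketch is shown below. The `megatron.core` import path, the example `kv_channels` value, and the use of `use_cpu_initialization=True` for a CPU-only run are assumptions made for illustration, not taken from this page.

```python
# Hedged sketch: the import path assumes this module is exposed under
# megatron.core; adjust it to match your installation.
from megatron.core.models.common.embeddings.rotary_pos_embedding import RotaryEmbedding

rope = RotaryEmbedding(
    kv_channels=128,              # per-head projection dimension from the transformer config
    rotary_percent=1.0,           # apply RoPE to the full rotary dimension
    rotary_base=10000,            # default base period
    use_cpu_initialization=True,  # keep inv_freq on CPU for this small example
)
```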

_apply_scaling(
freqs,
factor=8,
low_freq_factor=1,
high_freq_factor=4,
original_max_position_embeddings=8192,
)#
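This method has no docstring, but its argument names match the Llama 3.x RoPE scaling recipe: frequencies whose wavelength exceeds original_max_position_embeddings / low_freq_factor are divided by factor, wavelengths below original_max_position_embeddings / high_freq_factor are left untouched, and the band in between is smoothly interpolated. The sketch below is a hedged re-derivation of that published recipe, not the library code.

```python
import math
import torch

def llama3_style_scaling(freqs, factor=8, low_freq_factor=1,
                         high_freq_factor=4, original_max_position_embeddings=8192):
    """Illustrative Llama 3.x-style RoPE frequency scaling (not the library implementation)."""
    low_freq_wavelen = original_max_position_embeddings / low_freq_factor
    high_freq_wavelen = original_max_position_embeddings / high_freq_factor
    wavelen = 2 * math.pi / freqs

    # Long wavelengths (low frequencies) are fully scaled down by `factor`.
    scaled = torch.where(wavelen > low_freq_wavelen, freqs / factor, freqs)

    # Wavelengths between the two thresholds are smoothly interpolated.
    smooth = (original_max_position_embeddings / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    mid_band = (wavelen <= low_freq_wavelen) & (wavelen >= high_freq_wavelen)
    return torch.where(mid_band, (1 - smooth) * freqs / factor + smooth * freqs, scaled)
```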
get_freqs_non_repeated(
max_seq_len: int,
offset: int = 0,
) torch.Tensor#

Generates a matrix of frequencies based on positions in the sequence, used to create the positional encodings.
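For reference, the frequency matrix in standard RoPE is the outer product of the position indices and the inverse frequencies. A small self-contained sketch follows; the variable names and sizes are illustrative, not the internal ones.

```python
import torch

dim, rotary_base = 128, 10000          # rotary dimension and base period (illustrative)
max_seq_len, offset = 16, 0

# inv_freq[i] = rotary_base ** (-2i / dim) for i = 0 .. dim/2 - 1
inv_freq = 1.0 / (rotary_base ** (torch.arange(0, dim, 2).float() / dim))
positions = torch.arange(offset, offset + max_seq_len, dtype=torch.float32)

# freqs[s, i] = position_s * inv_freq_i -> shape [max_seq_len, dim // 2]
freqs = torch.outer(positions, inv_freq)
```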

get_cos_sin(
max_seq_len: int,
offset: int = 0,
)#

Precomputes the cosine and sine values for RoPE for all positions up to the maximum sequence length.
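Building on the frequency matrix above, the cosine/sine tables are simply its elementwise cos/sin, which can then be used to rotate per-head activations. The following self-contained sketch assumes the non-interleaved half-split layout; it illustrates the math rather than the exact cached layout used here.

```python
import torch

dim, max_seq_len, rotary_base = 128, 16, 10000
inv_freq = 1.0 / (rotary_base ** (torch.arange(0, dim, 2).float() / dim))
freqs = torch.outer(torch.arange(max_seq_len, dtype=torch.float32), inv_freq)

cos, sin = freqs.cos(), freqs.sin()    # each [max_seq_len, dim // 2]

# Rotate a per-head activation x with a 2-D rotation per frequency pair.
x = torch.randn(max_seq_len, dim)
x1, x2 = x[..., : dim // 2], x[..., dim // 2:]
x_rotated = torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
```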

forward(
max_seq_len: int,
offset: int = 0,
packed_seq: bool = False,
) torch.Tensor#

Forward pass of RoPE embedding.

Parameters:
  • max_seq_len (int) – Maximum sequence length.

  • offset (int, optional) – RoPE offset. Defaults to 0.

  • packed_seq (bool, optional) – Whether the input uses packed sequences. Defaults to False.

Returns:

Embeddings after applying RoPE.

Return type:

Tensor
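A hedged usage sketch, continuing with the `rope` instance from the construction sketch above: `forward` returns the rotary frequency tensor covering `max_seq_len` positions, which downstream attention layers use to rotate their query and key projections (the helper that performs the rotation is not documented on this page). Running it end to end may require a GPU and Megatron's parallel state to be initialized.

```python
# `rope` is the RotaryEmbedding instance from the earlier construction sketch.
rotary_pos_emb = rope(max_seq_len=2048, offset=0)
print(rotary_pos_emb.shape)  # leading dimension should span the 2048 positions
```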

_load_from_state_dict(state_dict, prefix, *args, **kwargs)#
get_rotary_seq_len(
inference_context: megatron.core.inference.contexts.BaseInferenceContext,
transformer: megatron.core.transformer.transformer_block.TransformerBlock,
transformer_input: torch.Tensor,
transformer_config: megatron.core.transformer.transformer_config.TransformerConfig,
packed_seq_params: Optional[megatron.core.packed_seq_params.PackedSeqParams] = None,
*,
inference_params: Optional[megatron.core.inference.contexts.BaseInferenceContext] = None,
) int#

Function to get the rotary sequence length.

Parameters:
  • inference_context (BaseInferenceContext) – Inference context, used during inference.

  • transformer (TransformerBlock) – The transformer block (decoder/encoder) used by the model

  • transformer_input (Tensor) – Input tensor to the transformer

  • transformer_config (TransformerConfig) – Transformer config used by the model

  • packed_seq_params (PackedSeqParams) – Packed sequence params

Returns:

The rotary sequence length

Return type:

int
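The exact logic lives in the implementation, but conceptually the rotary sequence length comes from the inference context when one is present, otherwise from the transformer input, and is then scaled back up for sequence and context parallelism. Below is a hedged pseudo-version; the attribute names are assumptions, not taken from this page.

```python
def rotary_seq_len_sketch(inference_context, transformer_input, transformer_config):
    """Illustrative only; not the library implementation."""
    if inference_context is not None:
        seq_len = inference_context.max_sequence_length   # full decode length
    else:
        seq_len = transformer_input.size(0)               # assumes [seq, batch, hidden] layout
    # Each rank holds only a slice of the sequence under sequence/context
    # parallelism, so scale back to the full sequence length.
    if transformer_config.sequence_parallel:
        seq_len *= transformer_config.tensor_model_parallel_size
    return seq_len * transformer_config.context_parallel_size
```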

class core.models.common.embeddings.rotary_pos_embedding.MultimodalRotaryEmbedding(
kv_channels: int,
rotary_percent: float,
rotary_interleaved: bool = False,
seq_len_interpolation_factor: Optional[float] = None,
rotary_base: int = 10000,
cp_group: Optional[torch.distributed.ProcessGroup] = None,
)#

Bases: torch.nn.Module

Multimodal Rotary Embedding for language model. Based on https://github.com/alibaba/Pai-Megatron-Patch/blob/efa5a752e845267936db9ae7df1b6aba92e9ff9a/megatron_patch/model/qwen2_vl/rotary_pos_embedding.py. Copyright (c) 2025 alibaba/Pai-Megatron-Patch. Apache 2.0 license.

Parameters:
  • kv_channels (int) – Projection weights dimension in multi-head attention, obtained from the transformer config.

  • rotary_percent (float) – Percent of the rotary dimension to use for rotary position embeddings.

  • rotary_interleaved (bool, optional) – If True, use interleaved rotary position embeddings. Defaults to False.

  • seq_len_interpolation_factor (float, optional) – Scale factor for linearly interpolating RoPE to longer sequences. Must be a float larger than 1.0. Defaults to None.

  • rotary_base (int, optional) – Base period for rotary position embeddings. Defaults to 10000.

  • cp_group (torch.distributed.ProcessGroup, optional) – Process group for context parallelism. Defaults to None.

Initialization

forward(
position_ids: torch.Tensor,
mrope_section: List[int],
) torch.Tensor#

Forward pass of multimodal RoPE embedding.

Parameters:
  • position_ids (torch.Tensor) – Position id tensor with shape [3, batch_size, seq_len], holding the temporal, height, and width position indices.

  • mrope_section (list[int]) – Split of the rotary channel dimension among the temporal, height, and width axes in the RoPE calculation.

Returns:

Embeddings after applying RoPE.

Return type:

Tensor
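A hedged usage sketch: `position_ids` stacks the temporal, height, and width index planes into shape [3, batch, seq], and `mrope_section` splits the rotary half-dimension among those three axes. The import path and the [16, 24, 24] split (which mirrors Qwen2-VL's configuration for a 128-channel head) are assumptions; a GPU and an initialized Megatron environment may be required.

```python
import torch
# Import path is an assumption; adjust to match your installation.
from megatron.core.models.common.embeddings.rotary_pos_embedding import MultimodalRotaryEmbedding

mrope = MultimodalRotaryEmbedding(kv_channels=128, rotary_percent=1.0)

batch_size, seq_len = 2, 64
# Three planes of position ids (temporal, height, width) -> shape [3, batch, seq].
# For pure text the planes are usually identical; vision tokens carry 2-D grid indices.
position_ids = torch.arange(seq_len).view(1, 1, seq_len).expand(3, batch_size, seq_len).contiguous()

# Channel split of the rotary half-dimension among (temporal, height, width);
# 16 + 24 + 24 = 64 = (128 * 1.0) // 2 for this configuration.
mrope_section = [16, 24, 24]

rotary_pos_emb = mrope(position_ids, mrope_section)
```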