nemo_automodel.components.models.gpt_oss.rope_utils#

Module Contents#

Classes#

RotaryEmbedding

Functions#

apply_rotary_emb

Apply rotary embeddings to input tensor.

apply_rotary_emb_qk

Apply rotary embeddings to query and key tensors.

position_ids_to_freqs_cis

Compute the freqs_cis tensor for the given position IDs.

API#

nemo_automodel.components.models.gpt_oss.rope_utils.apply_rotary_emb(
x: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
) → torch.Tensor#

Apply rotary embeddings to input tensor.

If cos/sin cover fewer features than x's head dimension (because partial_rotary_factor < 1.0), only the first rotary_dim features of x are rotated; the rest pass through unchanged.

Parameters:
  • x – Input tensor (…, head_dim)

  • cos – Cosine tensor (…, rotary_dim // 2)

  • sin – Sine tensor (…, rotary_dim // 2)

Returns:

Tensor with rotary embeddings applied to the first rotary_dim features.
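For illustration, a minimal sketch of this behavior. The half-split pairing of features is an assumption; the library may pair features differently (e.g. interleaved):

    import torch

    def apply_rotary_emb_sketch(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
        # cos/sin each cover rotary_dim // 2 features, so the rotated span is:
        rotary_dim = 2 * cos.shape[-1]
        x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
        # Pair features (x1, x2) and rotate each pair by the per-position angle.
        x1, x2 = x_rot.chunk(2, dim=-1)
        rotated = torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
        # Features beyond rotary_dim pass through unchanged.
        return torch.cat((rotated, x_pass), dim=-1)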

class nemo_automodel.components.models.gpt_oss.rope_utils.RotaryEmbedding(
head_dim: int,
base: int,
dtype: torch.dtype,
initial_context_length: int = 4096,
scaling_factor: float = 1.0,
ntk_alpha: float = 1.0,
ntk_beta: float = 32.0,
partial_rotary_factor: float = 1.0,
device: torch.device | None = None,
)#

Bases: torch.nn.Module

Initialization

_compute_concentration_and_inv_freq() → torch.Tensor#

See YaRN paper: https://arxiv.org/abs/2309.00071

Uses rotary_dim instead of head_dim to support partial rotary embeddings.
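A simplified sketch of the YaRN-style quantities involved. The concentration (mscale) formula is from the YaRN paper; the linear ramp stands in for YaRN's NTK-by-parts correction driven by ntk_alpha/ntk_beta, so this is an illustration, not the library's implementation:

    import math
    import torch

    def yarn_inv_freq_sketch(rotary_dim: int, base: float, scaling_factor: float):
        # Attention "concentration" (mscale) from the YaRN paper.
        concentration = 0.1 * math.log(scaling_factor) + 1.0 if scaling_factor > 1.0 else 1.0
        # Standard RoPE inverse frequencies over rotary_dim (not head_dim).
        inv_freq = 1.0 / base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim)
        # Blend extrapolated (original) and interpolated (scaled) frequencies.
        ramp = torch.linspace(0.0, 1.0, inv_freq.numel())
        blended = inv_freq * (1.0 - ramp) + (inv_freq / scaling_factor) * ramp
        return concentration, blended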

_compute_cos_sin(num_tokens: int)#

forward(
query: torch.Tensor,
key: torch.Tensor,
) → tuple[torch.Tensor, torch.Tensor]#
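A minimal usage sketch; the (batch, seq, heads, head_dim) layout for query/key is an assumption, not stated above:

    import torch
    from nemo_automodel.components.models.gpt_oss.rope_utils import RotaryEmbedding

    rope = RotaryEmbedding(head_dim=64, base=10000, dtype=torch.float32)
    query = torch.randn(1, 16, 8, 64)  # assumed (batch, seq, heads, head_dim)
    key = torch.randn(1, 16, 8, 64)
    query, key = rope(query, key)  # returns the pair with rotary embeddings applied
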
nemo_automodel.components.models.gpt_oss.rope_utils.apply_rotary_emb_qk(
q: torch.Tensor,
k: torch.Tensor,
freqs_cis: torch.Tensor,
format: str = 'bshd',
rope_fusion: bool = True,
cu_seqlens: torch.Tensor | None = None,
concentration: float | None = None,
cp_size: int = 1,
cp_rank: int = 0,
) → tuple[torch.Tensor, torch.Tensor]#

Apply rotary embeddings to query and key tensors.

Parameters:
  • q – Query tensor.

  • k – Key tensor.

  • freqs_cis –

    Frequency tensor. Format depends on rope_fusion:

    • If rope_fusion=True: [angles, angles] for TE fused rope

    • If rope_fusion=False: [cos, sin] with concentration applied

  • format – QKV format ("bshd" or "thd").

  • rope_fusion – If True, use TE fused rope. If False, use non-fused rope.

  • cu_seqlens – Cumulative sequence lengths for variable-length sequences.

  • concentration – Optional concentration factor from YaRN scaling (see RotaryEmbedding._compute_concentration_and_inv_freq).

  • cp_size – Context parallelism size.

  • cp_rank – Context parallelism rank.

Returns:

Tuple of (q, k) with rotary embeddings applied.
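A sketch of what the non-fused path amounts to, assuming freqs_cis packs cos and sin along its last dimension (the packing and the rotate-half convention are assumptions; see the format note above):

    import torch

    def apply_non_fused_sketch(q: torch.Tensor, k: torch.Tensor, freqs_cis: torch.Tensor):
        # Assumed packing: cos and sin concatenated along the last dimension,
        # with any concentration factor already folded in.
        cos, sin = freqs_cis.chunk(2, dim=-1)
        def rotate(x: torch.Tensor) -> torch.Tensor:
            # Same rotate-half convention as the apply_rotary_emb sketch above.
            x1, x2 = x.chunk(2, dim=-1)
            return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
        return rotate(q), rotate(k)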

nemo_automodel.components.models.gpt_oss.rope_utils.position_ids_to_freqs_cis(
rotary_emb: nemo_automodel.components.models.gpt_oss.rope_utils.RotaryEmbedding,
position_ids: torch.Tensor,
qkv_format: str = 'bshd',
for_fused_rope: bool = True,
cp_size: int = 1,
) → torch.Tensor#

Compute the freqs_cis tensor for the given position IDs.
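An end-to-end sketch tying these utilities together; shapes assume the "bshd" layout of (batch, seq, heads, head_dim), and the position_ids shape is an assumption:

    import torch
    from nemo_automodel.components.models.gpt_oss.rope_utils import (
        RotaryEmbedding,
        apply_rotary_emb_qk,
        position_ids_to_freqs_cis,
    )

    rope = RotaryEmbedding(head_dim=64, base=10000, dtype=torch.float32)
    batch, seq, heads = 2, 128, 8
    position_ids = torch.arange(seq).unsqueeze(0).expand(batch, -1)  # assumed (batch, seq)
    # Build non-fused cos/sin frequencies for these positions.
    freqs_cis = position_ids_to_freqs_cis(rope, position_ids, qkv_format="bshd", for_fused_rope=False)
    q = torch.randn(batch, seq, heads, 64)
    k = torch.randn(batch, seq, heads, 64)
    q, k = apply_rotary_emb_qk(q, k, freqs_cis, format="bshd", rope_fusion=False)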