Important

NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Positional embeddings

Positional embeddings are used to give the model information about the position of each element in a sequence. Megatron LLM supports the following positional embedding types:

GPT

Supported positional embeddings in GPT models

Parameter value

How to use

Description

learned_absolute

model.position_embedding_type='learned_absolute'

Absolute position encodings [pos-emb8] are added to the input embeddings of Transformer-based models and have the same dimension as those embeddings. With learned_absolute, each position index maps to a trainable vector that is learned together with the rest of the model. This differs from the fixed sinusoidal variant in [pos-emb8], where each dimension is a sine or cosine function and the wavelengths form a geometric progression.

rope

model.position_embedding_type='rope'
model.rotary_percentage=1.0

Rotary Position Embedding (RoPE) [pos-emb6] encodes the absolute position of each token with a rotation matrix applied to the query and key vectors. Because the dot product of two rotated vectors depends only on the difference between their rotation angles, the self-attention scores retain relative positional information. The rotation angle for each pair of dimensions is the token position multiplied by a preset non-zero constant.

alibi

model.position_embedding_type='alibi'

Attention with Linear Biases (ALiBi) [pos-emb4] modifies how attention scores are computed in the attention sublayer. After the query-key dot product, ALiBi adds a static, non-learned bias that is proportional to the distance between the query and the key. The proportionality constant is a head-specific slope fixed before training, and the slopes form a geometric sequence across the heads of the model. The method has an inductive bias towards recency: the penalty on a query-key pair grows with their distance, and heads with larger slopes apply the penalty at a faster rate.

kerple

model.position_embedding_type='kerple'

Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) [pos-emb2] generalizes relative positional embeddings (RPE) by kernelizing positional differences with conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. A CPD kernel is turned into a positive definite (PD) kernel by adding a constant offset, and this offset is absorbed by the softmax normalization in self-attention. The approach yields a family of RPEs that support length extrapolation in a principled manner.

xpos

model.position_embedding_type='xpos'

Extrapolatable Position Embedding (xPos) [pos-emb7]

sandwich

model.position_embedding_type='sandwich'

Sandwich [pos-emb3]
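The rotation described in the rope entry above can be illustrated with a small, framework-independent sketch. This is not NeMo's implementation: the function name `rope_rotate`, the base of 10000, and the pairing of adjacent dimensions are assumptions chosen for illustration. The sketch shows that the score between a rotated query and a rotated key depends only on their relative offset.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (illustrative only)."""
    dim = len(vec)
    assert dim % 2 == 0, "this sketch needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        # Each pair has its own frequency; the frequencies form a geometric progression.
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# The same relative offset (3) at different absolute positions gives the same score.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Here `score_a` and `score_b` agree up to floating-point error, which is the relative-position property the description refers to. A rotation also leaves the norm of each vector unchanged.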

T5

Supported positional embeddings in T5 models

Parameter value

How to use

Description

learned_absolute

model.encoder.position_embedding_type='learned_absolute'
model.decoder.position_embedding_type='learned_absolute'

Absolute position encodings [pos-emb8] are added to the input embeddings in the encoder and decoder of Transformer-based models and have the same dimension as those embeddings. With learned_absolute, each position index maps to a trainable vector that is learned together with the rest of the model. This differs from the fixed sinusoidal variant in [pos-emb8], where each dimension is a sine or cosine function and the wavelengths form a geometric progression.

relative

model.encoder.position_embedding_type='relative'
model.decoder.position_embedding_type='relative'

Relative Position Representations [pos-emb5]

alibi

model.encoder.position_embedding_type='alibi'
model.decoder.position_embedding_type='alibi'

Attention with Linear Biases (ALiBi) [pos-emb4] modifies how attention scores are computed in the attention sublayer. After the query-key dot product, ALiBi adds a static, non-learned bias that is proportional to the distance between the query and the key. The proportionality constant is a head-specific slope fixed before training, and the slopes form a geometric sequence across the heads of the model. The method has an inductive bias towards recency: the penalty on a query-key pair grows with their distance, and heads with larger slopes apply the penalty at a faster rate.

kerple

model.encoder.position_embedding_type='kerple'
model.decoder.position_embedding_type='kerple'

Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) [pos-emb2] generalizes relative positional embeddings (RPE) by kernelizing positional differences with conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. A CPD kernel is turned into a positive definite (PD) kernel by adding a constant offset, and this offset is absorbed by the softmax normalization in self-attention. The approach yields a family of RPEs that support length extrapolation in a principled manner.
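The ALiBi bias listed in the tables above can be sketched in a few lines of plain Python. The slope formula follows the ALiBi paper [pos-emb4] for head counts that are powers of two. The function names `alibi_slopes` and `alibi_bias` are hypothetical, and the sketch uses the absolute distance between positions and leaves causal masking out, so it is an illustration rather than the code used in NeMo.

```python
def alibi_slopes(num_heads):
    """Geometric sequence of per-head slopes (sketch for power-of-two head counts)."""
    assert num_heads > 0 and num_heads & (num_heads - 1) == 0, "sketch assumes a power of two"
    start = 2.0 ** (-8.0 / num_heads)
    # Head h gets start**(h + 1), so the ratio between neighbouring heads is `start`.
    return [start ** (h + 1) for h in range(num_heads)]

def alibi_bias(slope, seq_len):
    """Bias added to the attention scores: -slope * |query_pos - key_pos|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
bias = alibi_bias(slopes[0], 4)   # 4x4 bias matrix for the first head
```

Heads with a large slope concentrate on nearby tokens, while heads with a small slope can still attend to distant ones.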

Positional interpolation

Position Interpolation (PI) [pos-emb1] extends the context window of pretrained large language models (LLMs) that use Rotary Position Embedding (RoPE). Instead of extrapolating to position indices that were never seen during training, PI linearly scales down the position indices of a longer input so that they fall within the original context window.

PI is supported in Megatron GPT SFT models. To enable it, set the RoPE interpolation factor for sequence length, seq_len_interpolation_factor:

model.position_embedding_type='rope'
model.rotary_percentage=1.0
model.seq_len_interpolation_factor=2
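The effect of the interpolation factor can be sketched as follows. The sketch assumes, following the PI paper [pos-emb1], that the position indices are divided by the factor before the RoPE angles are computed. The function name `interpolated_positions` is hypothetical and is not part of NeMo.

```python
def interpolated_positions(seq_len, interpolation_factor=1.0):
    """Scale position indices down by the interpolation factor (illustrative only)."""
    return [pos / interpolation_factor for pos in range(seq_len)]

original_context = 4  # context length assumed for the pretrained model
# A factor of 2 maps 8 positions into the range covered by the original 4.
extended = interpolated_positions(8, interpolation_factor=2.0)
```

With a factor of 2, the eight positions become 0, 0.5, 1, ..., 3.5, so every index stays below the original context length of 4.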

References

pos-emb1

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. 2023. arXiv:2306.15595.

pos-emb2

Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, and Alexander I. Rudnicky. Kerple: kernelized relative positional embedding for length extrapolation. 2022. arXiv:2205.09921.

pos-emb3

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, and Peter J. Ramadge. Dissecting transformer length extrapolation via the lens of receptive field analysis. 2023. arXiv:2212.10356.

pos-emb4

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: attention with linear biases enables input length extrapolation. 2022. arXiv:2108.12409.

pos-emb5

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. 2018. arXiv:1803.02155.

pos-emb6

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: enhanced transformer with rotary position embedding. 2022. arXiv:2104.09864.

pos-emb7

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. 2022. arXiv:2212.10554.

pos-emb8

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2023. arXiv:1706.03762.