Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Refer to the Migration Guide for information on getting started.
Positional embeddings
Positional embeddings are used to give the model information about the position of each element in a sequence. Megatron LLM supports the following positional embedding types:
GPT
Parameter value | How to use | Description |
---|---|---|
learned_absolute | model.position_embedding_type='learned_absolute' | Absolute Position Encodings [pos-emb8] are position embeddings used in Transformer-based models, added to the input embeddings in the encoder and decoder sections. They match the dimension of the embeddings. The original formulation builds them from sine and cosine functions of different frequencies, where each dimension corresponds to a sinusoid and the wavelengths form a geometric progression; as the parameter name indicates, the learned_absolute variant instead learns one vector per position during training. |
rope | model.position_embedding_type='rope' model.rotary_percentage=1.0 | Rotary Position Embedding (RoPE) [pos-emb6] encodes the absolute position of each token with a rotation matrix while preserving relative positional relationships in the self-attention formulation. Leveraging the geometric properties of vectors and complex numbers, it rotates the word embeddings by an angle determined by a preset non-zero constant and the token position, so that attention scores depend on the relative positions of the tokens. |
alibi | model.position_embedding_type='alibi' | Attention with Linear Biases (ALiBi) [pos-emb4] modifies the way attention scores are computed in the attention sublayer of the network. ALiBi adds a static, non-learned bias after the query-key dot product during the computation of attention scores. The bias is scaled by a head-specific slope that is fixed before training, and the slopes form a geometric sequence across the heads of the model. The method has an inductive bias towards recency: it penalizes attention scores between distant query-key pairs, the penalty increases as the distance grows, and the rate of increase differs across heads according to the slope magnitude. |
kerple | model.position_embedding_type='kerple' | Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) [pos-emb2] generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels, which are known for generalizing distance metrics. CPD kernels are transformed into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner. |
xpos | model.position_embedding_type='xpos' | Extrapolatable Position Embedding (xPos) [pos-emb7] |
sandwich | model.position_embedding_type='sandwich' | Sandwich [pos-emb3] |
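
The relative-position property described for rope can be checked numerically. The following is a minimal sketch in plain Python, not the Megatron or NeMo implementation: the function names rope_rotate and dot, the base of 10000, and the choice to rotate adjacent pairs of dimensions are assumptions made for illustration, following the formulation in [pos-emb6].

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by position * base**(-i/d).

    Illustrative only: pairing adjacent dimensions and base=10000 are assumptions.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        # Pair index is i/2, so the frequency base**(-2*(i/2)/d) equals base**(-i/d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for one query-key attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, the vector norm is unchanged, and the score between a rotated query and key depends on the difference of their positions rather than on the positions themselves.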
T5
Parameter value | How to use | Description |
---|---|---|
learned_absolute | model.encoder.position_embedding_type='learned_absolute' model.decoder.position_embedding_type='learned_absolute' | Absolute Position Encodings [pos-emb8] are position embeddings used in Transformer-based models, added to the input embeddings in the encoder and decoder sections. They match the dimension of the embeddings. The original formulation builds them from sine and cosine functions of different frequencies, where each dimension corresponds to a sinusoid and the wavelengths form a geometric progression; as the parameter name indicates, the learned_absolute variant instead learns one vector per position during training. |
relative | model.encoder.position_embedding_type='relative' model.decoder.position_embedding_type='relative' | Relative Position Representations [pos-emb5] |
alibi | model.encoder.position_embedding_type='alibi' model.decoder.position_embedding_type='alibi' | Attention with Linear Biases (ALiBi) [pos-emb4] modifies the way attention scores are computed in the attention sublayer of the network. ALiBi adds a static, non-learned bias after the query-key dot product during the computation of attention scores. The bias is scaled by a head-specific slope that is fixed before training, and the slopes form a geometric sequence across the heads of the model. The method has an inductive bias towards recency: it penalizes attention scores between distant query-key pairs, the penalty increases as the distance grows, and the rate of increase differs across heads according to the slope magnitude. |
kerple | model.encoder.position_embedding_type='kerple' model.decoder.position_embedding_type='kerple' | Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) [pos-emb2] generalizes relative positional embeddings (RPE) by kernelizing positional differences using conditionally positive definite (CPD) kernels, which are known for generalizing distance metrics. CPD kernels are transformed into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach allows for a variety of RPEs that facilitate length extrapolation in a principled manner. |
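
To make the alibi description concrete, here is a minimal sketch of the head-specific slopes and the resulting linear bias, following the recipe in [pos-emb4] for a causal model whose head count is a power of two. The function names alibi_slopes and alibi_bias_row are hypothetical, and this is not the NeMo implementation; the bias handling for a bidirectional encoder may differ.

```python
def alibi_slopes(num_heads):
    """Geometric sequence of slopes, one per head.

    Assumes num_heads is a power of two: the first slope is 2**(-8/num_heads)
    and each following slope is multiplied by that same ratio.
    """
    ratio = 2 ** (-8 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias_row(slope, query_pos):
    """Bias added to the scores of query `query_pos` against keys 0..query_pos.

    The penalty grows linearly with the query-key distance; keys after the
    query are left to the causal mask and are not included here.
    """
    return [-slope * (query_pos - key_pos) for key_pos in range(query_pos + 1)]


slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
steep = alibi_bias_row(slopes[0], 3)   # strong recency bias
gentle = alibi_bias_row(slopes[-1], 3)  # weak recency bias
print(slopes)
print(steep)
print(gentle)
```

Heads with a large slope attend almost only to nearby tokens, while heads with a small slope can still reach distant ones, which is the per-head difference in penalty rate described in the table.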
Positional interpolation
Position Interpolation (PI) [pos-emb1] is a method introduced to extend the context window sizes of Rotary Position Embedding (RoPE)-based pretrained large language models (LLMs). The central principle of PI is to reduce the position indices so that they align with the initial context window size through interpolation.
Position Interpolation is supported in Megatron GPT SFT models. To enable it, set the RoPE interpolation factor for sequence length, seq_len_interpolation_factor:
model.position_embedding_type='rope'
model.rotary_percentage=1.0
model.seq_len_interpolation_factor=2
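
The effect of the interpolation factor can be sketched as follows. This is a simplified illustration of the idea in [pos-emb1], assuming the factor simply divides the position indices before the rotary angles are computed; the helper names and the trained context length of 2048 are hypothetical and are not taken from the NeMo code.

```python
def interpolate_positions(seq_len, factor):
    """Scale position indices down by `factor` so they fit the original range."""
    return [p / factor for p in range(seq_len)]


def rope_angles(position, dim, base=10000.0):
    """Rotary angles for one (possibly fractional) position; base is an assumption."""
    return [position * base ** (-i / dim) for i in range(0, dim, 2)]


trained_context = 2048  # hypothetical context window used during pretraining
factor = 2              # plays the role of seq_len_interpolation_factor

# A sequence twice as long as the trained context...
positions = interpolate_positions(trained_context * factor, factor)

# ...is mapped onto indices that stay inside the trained range.
print(len(positions), positions[:4], max(positions))

# Fractional positions feed into the rotary angles like any other position.
angles = rope_angles(positions[3], dim=8)
```

With a factor of 2, the 4096 positions land on 0, 0.5, 1.0, and so on, so the largest index remains below the 2048 positions seen during pretraining.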
References
- pos-emb1
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. 2023. arXiv:2306.15595.
- pos-emb2
Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, and Alexander I. Rudnicky. Kerple: kernelized relative positional embedding for length extrapolation. 2022. arXiv:2205.09921.
- pos-emb3
Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, and Peter J. Ramadge. Dissecting transformer length extrapolation via the lens of receptive field analysis. 2023. arXiv:2212.10356.
- pos-emb4
Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: attention with linear biases enables input length extrapolation. 2022. arXiv:2108.12409.
- pos-emb5
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. 2018. arXiv:1803.02155.
- pos-emb6
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: enhanced transformer with rotary position embedding. 2022. arXiv:2104.09864.
- pos-emb7
Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. 2022. arXiv:2212.10554.
- pos-emb8
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2023. arXiv:1706.03762.