Positional embeddings

Positional embeddings are used to give the model information about the position of each element in a sequence. Megatron LLM supports the following positional embedding types:

GPT

Supported positional embeddings in GPT models:

learned_absolute

    model.position_embedding_type='learned_absolute'

Absolute Position Encodings [nlp-megatron10] are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression.
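
For illustration only (this is not the NeMo implementation), the sinusoidal encoding described above can be sketched as follows; max_len and d_model are hypothetical parameters:

    import torch

    def sinusoidal_position_encoding(max_len: int, d_model: int) -> torch.Tensor:
        """Sketch of sinusoidal absolute position encodings [nlp-megatron10].

        Each dimension pair (2i, 2i+1) uses a sinusoid whose wavelength grows
        geometrically from 2*pi to 10000*2*pi.
        """
        position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2, dtype=torch.float32)
            * (-torch.log(torch.tensor(10000.0)) / d_model)
        )                                                                    # (d_model/2,)
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
        return pe                                      # added to the input embeddings

    # pos_enc = sinusoidal_position_encoding(max_len=2048, d_model=768)
    # hidden_states = token_embeddings + pos_enc[:seq_len]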

rope

    model.position_embedding_type='rope'
    model.rotary_percentage=1.0

Rotary Position Embedding (RoPE) [nlp-megatron8] incorporates positional information by using a rotation matrix to encode the absolute positions of tokens while maintaining relative positional relationships in the self-attention formulation. It leverages the geometric properties of vectors and complex numbers, applying to the word embeddings a rotation determined by a preset non-zero constant and the relative positions of the tokens.
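
A minimal sketch of the rotation (illustrative only, not the Megatron implementation; the interleaved channel pairing and the base of 10000 follow the RoFormer paper):

    import torch

    def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
        """Rotate channel pairs of x by an angle proportional to position.

        x: (seq_len, dim) query or key vectors, with dim even.
        """
        seq_len, dim = x.shape
        # One frequency per channel pair; wavelengths form a geometric progression.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
        angles = torch.arange(seq_len).float()[:, None] * inv_freq[None, :]  # (seq_len, dim/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[:, 0::2], x[:, 1::2]
        # 2-D rotation applied to each (x1, x2) pair.
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return rotated.flatten(start_dim=-2)

    # q, k = apply_rope(q), apply_rope(k)
    # The q·k dot products then depend only on the relative positions of the tokens.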

alibi

    model.position_embedding_type='alibi'

Attention with Linear Biases (ALiBi) [nlp-megatron5] modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude.
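
The bias can be sketched as follows (a simplified, symmetric illustration, not NeMo's implementation; the slope schedule assumes num_heads is a power of two, as in the ALiBi paper):

    import torch

    def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
        """Static, non-learned bias added after the query-key dot product.

        Head h uses slope 2**(-8 * (h + 1) / num_heads), giving a geometric
        sequence of slopes across heads. Returns (num_heads, seq_len, seq_len).
        """
        slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
        positions = torch.arange(seq_len)
        # Penalty grows linearly with query-key distance (inductive bias towards recency).
        distance = (positions[None, :] - positions[:, None]).abs()   # (seq_len, seq_len)
        return -slopes[:, None, None] * distance[None, :, :]

    # scores = q @ k.transpose(-1, -2) / dim ** 0.5 + alibi_bias(num_heads, seq_len)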

kerple

    model.position_embedding_type='kerple'

Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) [nlp-megatron2] generalizes relative positional embeddings (RPE) by kernelizing positional differences with conditionally positive definite (CPD) kernels, a class of kernels known for generalizing distance metrics. KERPLE turns CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach yields a variety of RPEs that facilitate length extrapolation in a principled way.
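
To illustrate the idea (not NeMo's implementation), the logarithmic variant of KERPLE biases the attention logits with a learnable kernel of the positional difference; r1 and r2 below stand in for the per-head learnable positive scalars:

    import torch

    def kerple_log_bias(seq_len: int, r1: torch.Tensor, r2: torch.Tensor) -> torch.Tensor:
        """Sketch of KERPLE's logarithmic kernel: -r1 * log(1 + r2 * |i - j|).

        r1, r2: learnable positive scalars (one pair per attention head in practice).
        The constant offset that makes the kernel positive definite cancels out
        under softmax normalization.
        """
        positions = torch.arange(seq_len)
        distance = (positions[None, :] - positions[:, None]).abs().float()   # |i - j|
        return -r1 * torch.log1p(r2 * distance)   # added to the attention logits

    # bias = kerple_log_bias(seq_len=8, r1=torch.tensor(1.0), r2=torch.tensor(1.0))
    # scores = q @ k.transpose(-1, -2) / dim ** 0.5 + bias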

xpos

    model.position_embedding_type='xpos'

Extrapolatable Position Embedding (xPos) [nlp-megatron9] extends rotary embeddings with an exponential decay factor on the rotation, attenuating attention between distant query-key pairs to improve length extrapolation.

sandwich

    model.position_embedding_type='sandwich'

Sandwich [nlp-megatron3]

T5

Supported positional embeddings in T5 models:

learned_absolute

    model.encoder.position_embedding_type='learned_absolute'
    model.decoder.position_embedding_type='learned_absolute'

Absolute Position Encodings [nlp-megatron10] are position embeddings used in Transformer-based models, added to input embeddings in the encoder and decoder sections. These encodings match the dimension of embeddings and are created using sine and cosine functions of various frequencies. Each dimension in the encoding corresponds to a sinusoid with wavelengths forming a geometric progression.

relative

    model.encoder.position_embedding_type='relative'
    model.decoder.position_embedding_type='relative'

Relative Position Representations [nlp-megatron6] extend self-attention to take into account the relative distances between sequence elements: learned embeddings of the (clipped) relative positions are added to the keys, and optionally the values, when computing attention.
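
A minimal sketch of the key-side variant (illustrative only; the shapes and the max_dist clipping parameter are assumptions, not NeMo's API):

    import torch

    def relative_attention_scores(q: torch.Tensor, k: torch.Tensor,
                                  rel_emb: torch.Tensor, max_dist: int) -> torch.Tensor:
        """Attention logits with Shaw-style relative position representations (keys only).

        q, k: (seq_len, dim); rel_emb: (2 * max_dist + 1, dim) learned embeddings,
        one per clipped relative distance in [-max_dist, max_dist].
        """
        seq_len, dim = q.shape
        positions = torch.arange(seq_len)
        # Clip relative distances and shift them into [0, 2 * max_dist] for indexing.
        rel = (positions[None, :] - positions[:, None]).clamp(-max_dist, max_dist) + max_dist
        a_k = rel_emb[rel]                                     # (seq_len, seq_len, dim)
        scores = q @ k.transpose(0, 1)                         # content-content term
        scores = scores + torch.einsum('id,ijd->ij', q, a_k)   # content-position term
        return scores / dim ** 0.5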

alibi

    model.encoder.position_embedding_type='alibi'
    model.decoder.position_embedding_type='alibi'

Attention with Linear Biases (ALiBi) [nlp-megatron5] modifies the way attention scores are computed in the attention sublayer of the network. ALiBi introduces a static, non-learned bias after the query-key dot product during the computation of attention scores. This bias is added in the form of a head-specific slope that is determined before training, creating a geometric sequence of slopes for the different heads in the model. The method has an inductive bias towards recency, penalizing attention scores between distant query-key pairs with the penalty increasing as the distance grows, and it leverages different rates of penalty increase across different heads based on the slope magnitude.

kerple

    model.encoder.position_embedding_type='kerple'
    model.decoder.position_embedding_type='kerple'

Kernelized Relative Positional Embedding for Length Extrapolation (KERPLE) [nlp-megatron2] generalizes relative positional embeddings (RPE) by kernelizing positional differences with conditionally positive definite (CPD) kernels, a class of kernels known for generalizing distance metrics. KERPLE turns CPD kernels into positive definite (PD) kernels by adding a constant offset, which is absorbed during softmax normalization in the self-attention mechanism of transformers. This approach yields a variety of RPEs that facilitate length extrapolation in a principled way.

Positional interpolation

Position Interpolation (PI) [nlp-megatron1] is a method for extending the context window sizes of Rotary Position Embedding (RoPE)-based pretrained large language models (LLMs). The central idea of PI is to down-scale the position indices through interpolation so that they fall within the initial context window size.

Positional Interpolation is supported in Megatron GPT SFT models. Set the RoPE sequence-length interpolation factor, seq_len_interpolation_factor, to enable it:

    model.position_embedding_type='rope'
    model.rotary_percentage=1.0
    model.seq_len_interpolation_factor=2
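
Conceptually, PI divides the position index fed to RoPE by the interpolation factor; a minimal sketch (illustrative only, reusing the rotary frequency convention from the RoPE sketch above, not the NeMo implementation):

    import torch

    def rope_angles(seq_len: int, dim: int,
                    seq_len_interpolation_factor: float = 1.0,
                    base: float = 10000.0) -> torch.Tensor:
        """Rotation angles for RoPE with Position Interpolation.

        Dividing the position indices by the interpolation factor squeezes an
        extended sequence back into the position range seen during pretraining.
        """
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        positions = torch.arange(seq_len).float() / seq_len_interpolation_factor
        return positions[:, None] * inv_freq[None, :]   # (seq_len, dim/2)

    # With seq_len_interpolation_factor=2, positions 0..4095 map to 0..2047.5,
    # matching a model pretrained with a 2048-token context window.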

References

[nlp-megatron1] Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. 2023. arXiv:2306.15595.

[nlp-megatron2] Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, and Alexander I. Rudnicky. KERPLE: kernelized relative positional embedding for length extrapolation. 2022. arXiv:2205.09921.

[nlp-megatron3] Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, and Peter J. Ramadge. Dissecting transformer length extrapolation via the lens of receptive field analysis. 2023. arXiv:2212.10356.

[nlp-megatron4] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: fast and memory-efficient exact attention with IO-awareness. 2022. arXiv:2205.14135.

[nlp-megatron5] Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: attention with linear biases enables input length extrapolation. 2022. arXiv:2108.12409.

[nlp-megatron6] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. 2018. arXiv:1803.02155.

[nlp-megatron7] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: training multi-billion parameter language models using model parallelism. 2019. arXiv:1909.08053.

[nlp-megatron8] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: enhanced transformer with rotary position embedding. 2022. arXiv:2104.09864.

[nlp-megatron9] Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. 2022. arXiv:2212.10554.

[nlp-megatron10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2023. arXiv:1706.03762.
