Flash attention

Flash Attention [nlp-megatron4] is a method designed to improve the efficiency of Transformer models, which are widely used in applications such as natural language processing. Standard Transformers are slow and memory-intensive on long sequences because self-attention has quadratic time and memory complexity in the sequence length. FlashAttention is an IO-aware exact attention algorithm that uses tiling to minimize the number of memory reads and writes between the GPU’s high-bandwidth memory (HBM) and on-chip SRAM. As a result, it has lower IO complexity than standard attention implementations.

GPT

To enable Flash Attention during Megatron GPT model training or fine-tuning, set the following configuration:

model.use_flash_attention=True
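
This is a Hydra-style dot-path override. Equivalently, the flag can be set in the model’s YAML configuration file; the fragment below is a minimal sketch, assuming the standard layout implied by the override path, where the option sits directly under the model section:

# Illustrative fragment of a Megatron GPT training config.
# Enables the FlashAttention kernel for self-attention.
model:
  use_flash_attention: True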

T5

To enable Flash Attention during Megatron T5 model training, set the following configuration:

model.encoder.use_flash_attention=True
model.decoder.use_flash_attention=True
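
As in the GPT case, these overrides map directly onto the YAML configuration; the fragment below is a minimal sketch, assuming the encoder and decoder attention options are nested under the model section as the dotted paths above suggest:

# Illustrative fragment of a Megatron T5 training config.
# Enables the FlashAttention kernel in both the encoder and the decoder.
model:
  encoder:
    use_flash_attention: True
  decoder:
    use_flash_attention: True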

[nlp-megatron1]

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. 2023. arXiv:2306.15595.

[nlp-megatron2]

Ta-Chung Chi, Ting-Han Fan, Peter J. Ramadge, and Alexander I. Rudnicky. Kerple: kernelized relative positional embedding for length extrapolation. 2022. arXiv:2205.09921.

[nlp-megatron3]

Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, and Peter J. Ramadge. Dissecting transformer length extrapolation via the lens of receptive field analysis. 2023. arXiv:2212.10356.

[nlp-megatron4]

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: fast and memory-efficient exact attention with io-awareness. 2022. arXiv:2205.14135.

[nlp-megatron5]

Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: attention with linear biases enables input length extrapolation. 2022. arXiv:2108.12409.

[nlp-megatron6]

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. 2018. arXiv:1803.02155.

[nlp-megatron7]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: training multi-billion parameter language models using model parallelism. 2019. arXiv:1909.08053.

[nlp-megatron8]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: enhanced transformer with rotary position embedding. 2022. arXiv:2104.09864.

[nlp-megatron9]

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. 2022. arXiv:2212.10554.

[nlp-megatron10]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2023. arXiv:1706.03762.
