
Multi-Latent Attention

Multi-Latent Attention ("MLA") is an attention mechanism introduced by the DeepSeek team that improves the efficiency of attention computation by compressing queries, keys, and values into low-rank latent spaces. This approach is particularly beneficial for large language models (LLMs), as it reduces the computational and memory burden associated with traditional attention mechanisms. According to the DeepSeek-V2 technical report, MLA achieves better performance than Multi-Head Attention (MHA) while requiring a smaller KV cache.

To enable MLA in Megatron-LM, set the following on the command line:

- Add --multi-latent-attention to enable MLA in the attention layers.
- Set MLATransformerConfig to configure MLA (see the sketch below).
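As a concrete illustration, the sketch below builds an MLA configuration programmatically rather than through command-line flags. It assumes the MLATransformerConfig dataclass exposed by Megatron Core and the MLA-specific field names (q_lora_rank, kv_lora_rank, qk_head_dim, qk_pos_emb_head_dim, v_head_dim) found in recent releases; exact names and defaults may differ between versions, so treat this as a sketch rather than an official example.

```python
# Minimal sketch (assumed field names): an MLA-enabled transformer config.
# MLATransformerConfig is defined in Megatron Core
# (megatron/core/transformer/transformer_config.py); check your installed
# release for the exact MLA fields and their defaults.
from megatron.core.transformer.transformer_config import MLATransformerConfig

mla_config = MLATransformerConfig(
    # Standard transformer sizing (illustrative values).
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    # MLA-specific low-rank latent dimensions: queries and the shared
    # key/value representation are projected into small latent spaces,
    # which is what shrinks the KV cache relative to MHA.
    q_lora_rank=1536,
    kv_lora_rank=512,
    qk_head_dim=128,
    qk_pos_emb_head_dim=64,
    v_head_dim=128,
)
```

The same configuration is produced when the training script is launched with --multi-latent-attention together with the corresponding MLA size arguments; the programmatic form above is mainly useful when constructing models directly from Megatron Core.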
