Multi-Latent Attention#

Multi-Latent Attention Overview#

Multi-Latent Attention (MLA) is an attention variant introduced by the DeepSeek team. Instead of caching full per-head keys and values, MLA compresses them into a low-rank latent vector and reconstructs the keys and values from that latent at attention time. This substantially shrinks the KV cache and often lowers inference cost for large language models (LLMs) compared with standard Multi-Head Attention (MHA). The DeepSeek-V2 technical report compares MLA with MHA on both model quality and cache size.
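The core idea can be illustrated with a minimal, single-head sketch. This is not Megatron-LM's implementation: it omits details such as multi-head projections, query compression, and the decoupled rotary position embedding, and the weight names (`W_dkv`, `W_uk`, `W_uv`, `W_q`) are illustrative only.

```python
import torch
import torch.nn.functional as F

# Illustrative dimensions, not tied to any real model.
d_model, d_latent, d_head = 1024, 64, 128

# Down-projection to the latent, and up-projections back to K and V.
W_dkv = torch.randn(d_model, d_latent) / d_model**0.5   # compress hidden -> latent
W_uk = torch.randn(d_latent, d_head) / d_latent**0.5    # latent -> keys
W_uv = torch.randn(d_latent, d_head) / d_latent**0.5    # latent -> values
W_q = torch.randn(d_model, d_head) / d_model**0.5       # queries (uncompressed here)

h = torch.randn(8, d_model)  # hidden states for 8 tokens

# Only the latent is cached: shape (seq, d_latent) instead of the
# (seq, 2 * d_head) a standard single-head KV cache would need.
kv_latent = h @ W_dkv

# Keys and values are reconstructed from the latent at attention time.
q = h @ W_q
k = kv_latent @ W_uk
v = kv_latent @ W_uv

scores = (q @ k.T) / d_head**0.5
out = F.softmax(scores, dim=-1) @ v
```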

Enabling Multi-Latent Attention#

To enable MLA in Megatron-LM:

  • Pass --multi-latent-attention on the command line to turn on MLA.

  • Use MLATransformerConfig in place of TransformerConfig to set MLA-specific options when you build the model configuration programmatically, as sketched below.
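The following is a minimal sketch of building an MLA configuration. The import path and the MLA-specific field names (q_lora_rank, kv_lora_rank, qk_head_dim, qk_pos_emb_head_dim, v_head_dim) follow recent Megatron-Core releases, but exact names and defaults may differ in your installed version, so treat the values as placeholders and verify them against the MLATransformerConfig dataclass.

```python
from megatron.core.transformer import MLATransformerConfig

# A minimal sketch; sizes are placeholders, not a tuned recipe.
config = MLATransformerConfig(
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    # MLA-specific: low-rank compression dimensions for queries and KV.
    q_lora_rank=512,
    kv_lora_rank=512,
    # Per-head dimensions for the non-positional and rotary (RoPE) parts
    # of the query/key, and for the values.
    qk_head_dim=128,
    qk_pos_emb_head_dim=64,
    v_head_dim=128,
)
```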