Multi-Latent Attention
Multi-Latent Attention (“MLA”) is an attention mechanism introduced by the DeepSeek team that improves the efficiency of attention computation by compressing the keys and values into a shared low-rank latent space. This is particularly beneficial for large language models (LLMs), because it reduces the memory and computational burden associated with conventional attention mechanisms. According to the DeepSeek-V2 technical report, MLA achieves better performance than Multi-Head Attention (MHA) while requiring a significantly smaller KV cache.
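For intuition, here is a brief sketch of the core idea, paraphrasing the formulation in the DeepSeek-V2 report: instead of caching full per-head keys and values, the hidden state $h_t$ of each token is down-projected into a small latent vector $c_t^{KV}$, from which the keys and values are reconstructed on the fly (the projection matrices $W^{DKV}$, $W^{UK}$, $W^{UV}$ follow the paper's notation):

$$
c_t^{KV} = W^{DKV} h_t, \qquad k_t^{C} = W^{UK} c_t^{KV}, \qquad v_t^{C} = W^{UV} c_t^{KV}.
$$

Only $c_t^{KV}$ (together with a small decoupled rotary-embedding key) needs to be cached during inference, which is what shrinks the KV cache relative to MHA.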
To enable MLA in Megatron-LM:

- Pass `--multi-latent-attention` on the command line to enable MLA in the attention layers.
- Use `MLATransformerConfig` to configure the MLA-specific hyperparameters (see the sketch after this list).
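As a rough illustration, the snippet below sketches how MLA hyperparameters might be set through `MLATransformerConfig` when building a model programmatically. The field names shown (`q_lora_rank`, `kv_lora_rank`, `qk_head_dim`, `qk_pos_emb_head_dim`, `v_head_dim`) and their values are assumptions based on Megatron-Core's config dataclass and may differ across versions; treat this as a sketch rather than a verified recipe.

```python
# Sketch only: the MLA-specific field names below are assumed from
# Megatron-Core's MLATransformerConfig and may vary between releases.
from megatron.core.transformer.transformer_config import MLATransformerConfig

config = MLATransformerConfig(
    num_layers=24,
    hidden_size=2048,
    num_attention_heads=16,
    # MLA-specific knobs (assumed names):
    q_lora_rank=512,         # low-rank dimension for the query down-projection
    kv_lora_rank=512,        # low-rank latent dimension shared by keys and values
    qk_head_dim=128,         # per-head dimension of the "content" part of Q/K
    qk_pos_emb_head_dim=64,  # per-head dimension of the decoupled RoPE part
    v_head_dim=128,          # per-head value dimension
)
```

When launching through the standard pretraining scripts instead, adding `--multi-latent-attention` to the command line selects the MLA code path, and the corresponding MLA arguments (where exposed by your version's argument parser) populate this configuration automatically.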