A new technical paper titled "Hardware-Centric Analysis of DeepSeek's Multi-Head Latent Attention" was published by researchers at KU Leuven.
Abstract
"Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, improves the efficiency of large language models by projecting query, key, and value tensors into a compact latent space. This architectural change reduces the KV-cache size and s...
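To make the KV-cache saving the abstract describes concrete, here is a minimal sketch of latent KV compression in PyTorch. All names, dimensions, and layer shapes below are illustrative assumptions, not the paper's or DeepSeek-V2's actual implementation; in particular, DeepSeek's MLA also compresses queries and uses a decoupled RoPE path, both omitted here for brevity.

```python
# Minimal sketch of latent KV-cache compression (assumed shapes/names,
# not DeepSeek's implementation; causal masking and RoPE omitted).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Toy multi-head attention that caches a compact per-token latent
    instead of full per-head keys and values."""
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        # Down-projection: one small latent vector per token.
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections reconstruct per-head keys/values from the cached latent.
        self.k_up = nn.Linear(d_latent, d_model, bias=False)
        self.v_up = nn.Linear(d_latent, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, kv_cache=None):
        # x: (batch, seq, d_model); kv_cache: (batch, past_seq, d_latent) or None
        b, t, _ = x.shape
        latent = self.kv_down(x)                      # (b, t, d_latent)
        if kv_cache is not None:                      # extend the running cache
            latent = torch.cat([kv_cache, latent], dim=1)
        k = self.k_up(latent)                         # (b, T, d_model)
        v = self.v_up(latent)
        q = self.q_proj(x)

        def split(z):  # (b, T, d_model) -> (b, heads, T, d_head)
            return z.view(b, z.shape[1], self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        att = torch.matmul(q, k.transpose(-2, -1)) / self.d_head ** 0.5
        out = torch.matmul(F.softmax(att, dim=-1), v)
        out = out.transpose(1, 2).reshape(b, t, -1)
        # Only `latent` needs to be cached: d_latent floats per token rather
        # than 2 * d_model for conventional K and V caching.
        return self.out_proj(out), latent
```

In this toy configuration the per-token cache shrinks from 2 x 512 floats (full K and V) to 64 floats (the latent), which is the kind of KV-cache reduction the abstract refers to; the trade-off is the extra up-projection work at attention time, which is what a hardware-centric analysis would weigh against the memory savings.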