A HW-Aware Scalable Exact-Attention Execution Mechanism For GPUs (Microsoft)


A technical paper titled “Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers” was published by researchers at Microsoft.

Abstract: "Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has in...
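
For context, the decode phase of a transformer computes exact attention for a single new query token against the keys and values cached for the full context so far. The sketch below is only a minimal illustration of that baseline computation, with assumed shapes and function names; it is not the paper's LeanAttention partitioning or reduction scheme.

```python
# Minimal sketch (not LeanAttention): exact attention at one decode step,
# where a single query token attends over the cached keys/values.
# Shapes and names are illustrative assumptions, not from the paper.
import numpy as np

def decode_step_attention(q, k_cache, v_cache):
    """q: (d,) query for the new token; k_cache, v_cache: (n_ctx, d) KV cache."""
    d = q.shape[-1]
    scores = k_cache @ q / np.sqrt(d)   # (n_ctx,) scaled dot-product scores
    scores -= scores.max()              # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()            # softmax over the cached context
    return weights @ v_cache            # (d,) attention output for the token

# Example: one new token attending over a 1024-token cached context
rng = np.random.default_rng(0)
d, n_ctx = 64, 1024
out = decode_step_attention(rng.standard_normal(d),
                            rng.standard_normal((n_ctx, d)),
                            rng.standard_normal((n_ctx, d)))
print(out.shape)  # (64,)
```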