A HW-Aware Scalable Exact-Attention Execution Mechanism For GPUs (Microsoft)


A technical paper titled “Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers” was published by researchers at Microsoft. Abstract: "Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has in... » read more

Designing AI Hardware To Deal With Increasingly Challenging Memory Wall (UC Berkeley)


A new technical paper titled "AI and Memory Wall" was published by researchers at UC Berkeley, ICSI, and LBNL. Abstract "The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memo... » read more