Home

TECHNICAL PAPERS

A HW-Aware Scalable Exact-Attention Execution Mechanism For GPUs (Microsoft)

May 21st, 2024 - By: Technical Paper Link

A technical paper titled “Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers” was published by researchers at Microsoft.

Abstract:

“Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching billions of parameters. These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in terms of the total context length, i.e., prompt and output tokens. Thus, several optimizations such as key-value tensor caching and FlashAttention computation have been proposed to deliver the low latency demands of applications relying on such large models. However, these techniques do not cater to the computationally distinct nature of different phases during inference.
To that end, we propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase (decode-phase) of decoder-only transformer models. LeanAttention enables scaling the attention mechanism implementation for the challenging case of long context lengths by re-designing the execution flow for the decode-phase. We identify that the associative property of online softmax can be treated as a reduction operation thus allowing us to parallelize the attention computation over these large context lengths. We extend the “stream-K” style reduction of tiled calculation to self-attention to enable parallel computation resulting in an average of 2.6x attention execution speedup over FlashAttention-2 and up to 8.33x speedup for 512k context lengths.”

Find the technical paper here. Published May 2024.

Sanovar, Rya, Srikant Bharadwaj, Renée St. Amant, Victor Ruhle and S. Rajmohan. “Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers.” (2024). arXiv:2405.10480v1.

Related Reading
AI Transformer Models Enable Machine Vision Object Detection
A system-level view of machine vision will be essential to move the technology forward.
Processor Tradeoffs For AI Workloads
Gaps are widening between technology advances and demands, and closing them is becoming more difficult.

Knowledge Centers
Entities, people and technologies explored

Startup Funding: Q1 2025

AI chips and data center communications see big funding; 75 startups raise $2 billion.

by Jesse Allen

Advanced Packaging Fundamentals for Semiconductor Engineers

New SE eBook examines the next phase of semiconductor design, testing, and manufacturing.

by Bryon Moyer

Chip Industry Week in Review

AI export rule to be scrapped; SEMI, EU request; Cadence, Nvidia supercomputer; AI co-processor; Imagination's new GPU; semi sales up; imec, TNO photonics lab; NSF key to national security; flexible packaging control system; SiConic test engineering; USB 4 support; SiC JFETS; magnetic behavior in hematite.

by The SE Staff

Chip Industry Week in Review

EDA export controls; Synopsys-Ansys divest requirements; SIA Factbook; McKinsey effects of tariffs; ASE's fan-out bridge; earnings; TSMC's design center; China's legacy chips play; AMD's optical acquisition.

by The SE Staff

What Exactly Are Chiplets And Heterogeneous Integration?

New technologies drive new terminology, but the early days for those new approaches can be very confusing.

by Bryon Moyer

Big Changes Ahead For Interposers And Substrates

New materials and processes will help with power distribution and thermal dissipation in advanced packages.

by Gregory Haley

RISC-V’s Increasing Influence

Does the world need another CPU architecture when that no longer reflects the typical workload? Perhaps not, but it may need a bridge to get to where it needs to be.

by Brian Bailey

Chip Industry Week in Review

IC, AI global ranking; China's fully automated IC design system; Micron goes bigger; PCIe 7.0 spec; TSMC-Tokyo joint lab; panel-level packaging win; first neuromorphic compute system; GAA forksheets; AMD's new GPUs.

by The SE Staff

A HW-Aware Scalable Exact-Attention Execution Mechanism For GPUs (Microsoft)

Abstract:

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers
Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2025

Advanced Packaging Fundamentals for Semiconductor Engineers

Chip Industry Week in Review

Chip Industry Week in Review

What Exactly Are Chiplets And Heterogeneous Integration?

Big Changes Ahead For Interposers And Substrates

RISC-V’s Increasing Influence

Chip Industry Week in Review

Sponsors

Recent Comments

About

Navigation

Connect With Us

A HW-Aware Scalable Exact-Attention Execution Mechanism For GPUs (Microsoft)

Abstract:

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2025

Advanced Packaging Fundamentals for Semiconductor Engineers

Chip Industry Week in Review

Chip Industry Week in Review

What Exactly Are Chiplets And Heterogeneous Integration?

Big Changes Ahead For Interposers And Substrates

RISC-V’s Increasing Influence

Chip Industry Week in Review

Sponsors

Newsletter Signup

Popular Tags

Recent Comments

About

Navigation

Connect With Us

Knowledge Centers
Entities, people and technologies explored