Home

TECHNICAL PAPERS

Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW)

May 20th, 2026 - By: Technical Paper Link

A new technical paper, “Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning,” was published by researchers at USC and University of Wisconsin-Madison.

Abstract

“Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response — permanently evicting low-importance tokens — is catastrophic for reasoning: accuracy collapses to 0-2.5% when half the cache is removed. We ask a different question: must every token live in HBM, or can some live elsewhere? We introduce a semantics-aware memory hierarchy that sorts tokens into four tiers — HBM, DDR, compressed, and evicted — using cumulative attention scoring. Low-importance tokens are moved to CPU memory rather than destroyed; before each attention step they are prefetched back at full precision, contributing exactly the same terms as if they had never left the GPU. We formalize this as zero-approximation-error offloading and derive our central finding: accuracy depends solely on how many tokens are permanently discarded (the eviction ratio), not on how many remain in HBM. A controlled 3×3 grid over HBM and eviction ratios confirms this across three model scales (7B-32B) and four benchmarks. With only 3% eviction, the hierarchy retains 91% of full-cache accuracy on GSM8K and 71% on MATH-500 (n=200); at 14B scale it matches the uncompressed baseline (90% vs. 86%) while halving HBM occupancy. A head-to-head reproduction of R-KV — the current SOTA eviction method — on our setup achieves only 0-32% at comparable budgets. A system prototype with real GPU-CPU data movement shows that the price of this preservation is modest — 5-7% transfer overhead — and scaling analysis projects 2-48 GB HBM savings at production batch sizes.”

Find the technical paper here. May 2026. Preprint.

Yuan, Aojie, Tianqi Shen, and Dajun Zhang. “Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning.” arXiv preprint arXiv:2605.09490 (2026).

Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW)

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers
Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2026

All AI Data Center Interconnects Will Be Optical Within 5 Years

The Sub-2nm Paradox

When Semiconductor Materials Misbehave

TSMC Tech Symposium 2026, By The Numbers

Silicon Photonics Lights The Way To More Efficient Data Centers

Memory Wall Gets Higher

TSV Complexity Leads To Manufacturing Bottleneck

Sponsors

Recent Comments

About

Navigation

Connect With Us

Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW)

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2026

All AI Data Center Interconnects Will Be Optical Within 5 Years

The Sub-2nm Paradox

When Semiconductor Materials Misbehave

TSMC Tech Symposium 2026, By The Numbers

Silicon Photonics Lights The Way To More Efficient Data Centers

Memory Wall Gets Higher

TSV Complexity Leads To Manufacturing Bottleneck

Sponsors

Newsletter Signup

Popular Tags

Recent Comments

About

Navigation

Connect With Us

Knowledge Centers
Entities, people and technologies explored