A new technical paper, "Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning," was published by researchers at USC and University of Wisconsin-Madison.
Abstract
"Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoni...
» read more