Characterization of GPU-based Inference for Reasoning-Centric LLMs (Micron, Argonne)


Researchers from Micron Technology and Argonne National Laboratory have released “Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles”. Abstract “The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift i... » read more

Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW)


A new technical paper, "Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning," was published by researchers at USC and University of Wisconsin-Madison. Abstract "Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoni... » read more

What Do LLMs Want from Hardware


Figure 1: Noam Shazeer, Google Gemini vice president, presented this in his Hot Chips 2025 talk. Noam Shazeer is Google’s vice president of engineering for Gemini, their LLM competitor to ChatGPT. He talked recently at Hot Chips: “Predictions for the Next Phase of AI." He has worked on LLMs for a decade since inventing the transformer model in 2017. As his slide says, LLMs can take adv... » read more

Dynamic KV Cache Scheduling in Heterogeneous Memory Systems for LLM Inference (Rensselaer Polytechnic Institute, IBM)


A new technical paper titled "Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System" was published by researchers at Rensselaer Polytechnic Institute and IBM. Abstract "Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity red... » read more

Cache Side-Channel Attacks On LLMs (MITRE, WPI)


A new technical paper titled "Spill The Beans: Exploiting CPU Cache Side-Channels to Leak Tokens from Large Language Models" was published by researchers at MITRE and Worcester Polytechnic Institute. Abstract "Side-channel attacks on shared hardware resources increasingly threaten confidentiality, especially with the rise of Large Language Models (LLMs). In this work, we introduce Spill The... » read more