Four-Tier Memory Hierarchy for LLM Reasoning (USC, UW-Madison)


A new technical paper titled "Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning" was published by researchers at USC and the University of Wisconsin-Madison. Abstract: "Reasoning LLMs produce thousands of chain-of-thought tokens whose KV cache must reside in scarce GPU HBM. The dominant response -- permanently evicting low-importance tokens -- is catastrophic for reasoni...

Pooling CPU Memory for LLM Inference With Lower Latency and Higher Throughput (UC Berkeley)


A new technical paper titled "Pie: Pooling CPU Memory for LLM Inference" was published by researchers at UC Berkeley. Abstract: "The rapid growth of LLMs has revolutionized natural language processing and AI analysis, but their increasing size and memory demands present significant challenges. A common solution is to spill over to CPU memory; however, traditional GPU-CPU memory swapping ofte...