Heterogeneous System With Specialized HW For Disaggregated LLM Inference (Princeton Univ., Univ. of Washington)


A new technical paper titled "SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference" was published by researchers at Princeton University and University of Washington. Abstract "Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-boun... » read more

Interconnect Innovations In High Bandwidth Memory: Part 2


By Damon Tsai, Woo Young Han, and Tim Kryman Interconnect technology in high bandwidth memory (HBM) is at a fork in the road. One direction leads to tried-and-true microbump technology, and the other leads to a compelling alternative, hybrid bonding. Both technologies are evolving to address the stringent requirements of next generation HBM in pursuit of increased I/O density supporting high... » read more

On-Package Memory With UCIe To Improve Bandwidth Density And Power Efficiency (AMD, Intel Corp.)


A new technical paper titled "On-Package Memory with Universal Chiplet Interconnect Express (UCIe): A Low Power, High Bandwidth, Low Latency and Low Cost Approach" was published by researchers at Intel Corporation and AMD. Abstract "Emerging computing applications such as Artificial Intelligence (AI) are facing a memory wall with existing on-package memory solutions that are unable to meet ... » read more

HBM4 Memory: Break Through to Greater Bandwidth


Delivering unrivaled memory bandwidth in a compact, high-capacity footprint has made HBM the memory of choice for AI training. HBM4 is the fourth major generation of the HBM standard, with new power management and RAS features. The Rambus HBM4 Controller provides industry-leading performance to 10.0 Gb/s, enabling a memory throughput of over 2.5 TB/s for training systems, generative AI and oth... » read more

3D-Stacked HBM Architecture Susceptibility To Thermal Attacks (NC A&T State, New Mexico State)


A new technical paper titled "On the Thermal Vulnerability of 3D-Stacked High-Bandwidth Memory Architectures" was published by researchers at North Carolina A&T State University and New Mexico State University. Abstract "3D-stacked High Bandwidth Memory (HBM) architectures provide high-performance memory interactions to address the well-known performance challenge, namely the memory wal... » read more

Interconnect Innovations In High Bandwidth Memory: Part 1


By Damon Tsai, Woo Young Han, and Tim Kryman The demand for high bandwidth memory (HBM) is accelerating across the semiconductor industry, driven by boundary-pushing artificial intelligence, high-performance computing, and advanced graphics. These technologies require access to vast datasets, which in turn increases the need for memory solutions that combine speed, density, and power efficie... » read more

The Evolution of DRAM


DRAM has been around since 1966, but today it's still the same basic 1T 1C bit cell architecture. Yet changes are coming as DRAM is called upon to store and retrieve more data faster. Steve Woo, distinguished inventor and fellow at Rambus, talks about how DRAM works, why there are different flavors, the impact of cooling new solutions in denser configurations, and ongoing issues involving the s... » read more

What Do LLMs Want from Hardware


Figure 1: Noam Shazeer, Google Gemini vice president, presented this in his Hot Chips 2025 talk. Noam Shazeer is Google’s vice president of engineering for Gemini, its LLM competitor to ChatGPT. He spoke recently at Hot Chips on “Predictions for the Next Phase of AI.” He has worked on LLMs for a decade and co-invented the transformer model in 2017. As his slide says, LLMs can take adv... » read more

Challenges In Stacking HBM


AI data centers are pushing for higher density in high-bandwidth memory. Today, the maximum number of layers that can be stacked is 8, but that is expected to increase to as many as 24 layers by 2030. The big challenge will be in the interconnects, and making sure the microbumps align. At 16 layers, the bump pitch will be less than 10 microns, and the dies will be thinner. Damon Tsai, head of product marketi... » read more

Dynamic KV Cache Scheduling in Heterogeneous Memory Systems for LLM Inference (Rensselaer Polytechnic Institute, IBM)


A new technical paper titled "Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System" was published by researchers at Rensselaer Polytechnic Institute and IBM. Abstract "Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity red... » read more
