Four Architectural Opportunities for LLM Inference Hardware (Google)


A new technical paper titled "Challenges and Research Directions for Large Language Model Inference Hardware" was published by Google. Abstract "Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and in... » read more

Expanding Server Memory Capabilities With Multiplexed Rank DIMM (MRDIMM) Technology


The scaling of computational power within a single, packaged semiconductor component continues to rise following a Moore’s law type curve enabling new and more capable applications including machine learning (ML), generative artificial intelligence (AI), and training and deployment of large language models (LLM). On-demand lifestyle applications like language translation, direction finding, a... » read more

Cracking The Memory Wall


Processor performance continues to improve exponentially, with more processor cores, parallel instructions, and specialized processing elements, but it is far outpacing improvements in bandwidth and memory. That gap, the so-called memory wall, has persisted throughout most of this century, but now it is becoming more pronounced. SRAM scaling is slowing at advanced nodes, which means SRAM takes ... » read more

Redefining XPU Memory For AI Data Centers Through Custom HBM4: Part 1


This is the first of a three-part series on HBM4 and gives an overview of the HBM standard. Part 2 will provide insights on HBM implementation challenges, and part 3 will introduce the concept of a custom HBM implementation. Relentless growth in data consumption Recent advances in deep learning have had a transformative effect on artificial intelligence (AI) and the ever-increasing volume of ... » read more

UMI: Extending Chiplet Interconnect Standards To Deal With The Memory Wall


With the Open Compute Project (OCP) Summit upon us, it’s an appropriate time to talk about chiplet interconnect (in fact the 2024 OCP Summit has a whole day dedicated to the multi-die topic, on October 17). Of particular interest is the Bunch of Wires (BoW) interconnect specification that continues to evolve. At OCP there will be an update and working group looking at version 2.1 of BoW. (... » read more

DRAM Cache for GPUs With SCM And High Bandwidth


A new technical paper titled "Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory" was published by researchers at POSTECH and Songsil University. Abstract "We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction o... » read more

Designing AI Hardware To Deal With Increasingly Challenging Memory Wall (UC Berkeley)


A new technical paper titled "AI and Memory Wall" was published by researchers at UC Berkeley, ICSI, and LBNL. Abstract "The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in an unprecedented surge in model size and compute requirements for serving/training LLMs. However, the main performance bottleneck is increasingly shifting to memo... » read more

Balancing Memory And Coherence: Navigating Modern Chip Architectures


In the intricate world of modern chip architectures, the "memory wall" – the limitations posed by external DRAM accesses on performance and power consumption growing slower than the ability to compute data – has emerged as a pivotal challenge. Architects must strike a delicate balance between leveraging local data reuse and managing external memory accesses. While caches are critical for op... » read more

CNN Hardware Architecture With Weights Generator Module That Alleviates Impact Of The Memory Wall


A technical paper titled “Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights Generation” was published by researchers at Samsung AI Center and University of Cambridge. Abstract: "The unprecedented accuracy of convolutional neural networks (CNNs) across a broad range of AI tasks has led to their widespread deployment in mobile and embedded settings. In a pursuit for high... » read more

Moving Data And Computing Closer Together


The speed of processors has increased to the point where they often are no longer the performance bottleneck for many systems. It's now about data access. Moving data around costs both time and power, and developers are looking for ways to reduce the distances that data has to move. That means bringing data and memory nearer to each other. “Hard drives didn't have enough data flow to cr... » read more

← Older posts