Moving AI Workloads To The Edge


Experts At The Table: Semiconductor Engineering gathered a group of experts to discuss how some AI workloads are better suited for on-device processing to achieve consistent performance, avoid network connectivity issues, reduce cloud computing costs, and ensure privacy. The panel included Frank Ferro, group director in the Silicon Solutions Group at Cadence; Eduardo Montanez, vice president an... » read more

Heterogeneous System With Specialized HW For Disaggregated LLM Inference (Princeton Univ., Univ. of Washington)


A new technical paper titled "SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference" was published by researchers at Princeton University and University of Washington. Abstract "Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-boun... » read more

A Fundamental Rethinking Of Memory Hierarchy Design (Stanford University)


A new technical paper titled "The Future of Memory: Limits and Opportunities" was published by researchers at Stanford University and an independent researcher. Abstract "Memory latency, bandwidth, capacity, and energy increasingly limit performance. In this paper, we reconsider proposed system architectures that consist of huge (many-terabyte to petabyte scale) memories shared among large ... » read more

Scaling DRAM Technology To Meet Future Demands: Challenges And Opportunities


Since the invention of the 1T1C bit cell more than 50 years ago, DRAMs have become the main memory of choice for processors in computer systems and many consumer electronics devices. As new computing paradigms have been created, including 3D graphics, cloud computing, smartphones, and AI processing, specialized processors and DRAM memories have been developed that are optimized for these u... » read more

What’s Different About HBM4


Memory bandwidth is limiting the flow of huge datasets that are needed to train AI models. There is much more data to process, store, and retrieve, but the speed at which that data moves through high-bandwidth memory (HBM) stacks is significantly lower than the speed at which data can be processed. Frank Ferro, group director for product management at Cadence, talks about the new HBM4 standard,... » read more

Expanding Server Memory Capabilities With Multiplexed Rank DIMM (MRDIMM) Technology


Computational power within a single packaged semiconductor component continues to rise along a Moore’s law-type curve, enabling new and more capable applications, including machine learning (ML), generative artificial intelligence (AI), and the training and deployment of large language models (LLMs). On-demand lifestyle applications like language translation, direction finding, a... » read more

Detailed Study of Performance Modeling For LLM Implementations At Scale (imec)


A new technical paper titled "System-performance and cost modeling of Large Language Model training and inference" was published by researchers at imec. Abstract "Large language models (LLMs), based on transformer architectures, have revolutionized numerous domains within artificial intelligence, science, and engineering due to their exceptional scalability and adaptability. However, the ex... » read more

Hardware-Oriented Analysis of Multi-Head Latent Attention (MLA) in DeepSeek-V3 (KU Leuven)


A new technical paper titled "Hardware-Centric Analysis of DeepSeek's Multi-Head Latent Attention" was published by researchers at KU Leuven. Abstract "Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, improves the efficiency of large language models by projecting query, key, and value tensors into a compact latent space. This architectural change reduces the KV-cache size and s... » read more

Arithmetic Intensity In Decoding: A Hardware-Efficient Perspective (Princeton University)


A new technical paper titled "Hardware-Efficient Attention for Fast Decoding" was published by researchers at Princeton University. Abstract "LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay amo... » read more

Lines Blurring Between Supercomputing And HPC


Supercomputers and high-performance computers are becoming increasingly difficult to differentiate due to the proliferation of AI, which is driving huge performance increases in commercial and scientific applications and raising similar challenges for both. While the goals of supercomputing and high-performance computing (HPC) have always been similar — blazing fast processing — the mark... » read more
