Stacking Persistent Embedded Memories Based On Oxide Transistors Upon GPGPU Platforms (Georgia Tech)


A new technical paper titled "CMOS+X: Stacking Persistent Embedded Memories based on Oxide Transistors upon GPGPU Platforms" was published by Georgia Tech. Abstract "In contemporary general-purpose graphics processing units (GPGPUs), the continued increase in raw arithmetic throughput is constrained by the capabilities of the register file (single-cycle) and last-level cache (high bandwidth... » read more

Review Paper: Wafer-Scale Accelerators Versus GPUs (UC Riverside)


A new technical paper titled "Performance, efficiency, and cost analysis of wafer-scale AI accelerators vs. single-chip GPUs" was published by researchers at UC Riverside. "This review compares wafer-scale AI accelerators and single-chip GPUs, examining performance, energy efficiency, and cost in high-performance AI applications. It highlights enabling technologies like TSMC’s chip-on-wafe... » read more

Arithmetic Intensity In Decoding: A Hardware-Efficient Perspective (Princeton University)


A new technical paper titled "Hardware-Efficient Attention for Fast Decoding" was published by researchers at Princeton University. Abstract "LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay amo... » read more

Embarrassingly Parallel Problems: Definitions, Challenges And Solutions


One of the reasons GPUs are regularly discussed in the same breath as AI is that AI shares the same fundamental class of problems as 3D graphics. They are both embarrassingly parallel. Embarrassingly parallel problems refer to computational tasks that:

- Exhibit independence: Subtasks do not rely on intermediate results from other tasks.
- Require minimal interaction: Parallel task... » read more
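As a minimal illustration of the definition above (a generic sketch, not code from the article; the `shade_pixel` helper and its workload are hypothetical stand-ins), each subtask below depends only on its own input and the workers never exchange intermediate results:

```python
# Minimal embarrassingly parallel example: every item is computed
# independently, so the work can be split across processes with no
# communication between subtasks. Illustrative sketch only.
from multiprocessing import Pool

def shade_pixel(index):
    # Stand-in for an independent subtask (one pixel, one ray, one inference)
    x = (index * 2654435761) % 2**32      # cheap hash as fake work
    return (index, x / 2**32)

if __name__ == "__main__":
    with Pool() as pool:
        # map() distributes independent subtasks; no task waits on another's
        # intermediate result, and outputs are only gathered at the end
        results = pool.map(shade_pixel, range(100_000))
    print(len(results), results[:3])
```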

TCAD For GPUs And GPUs For TCAD


It is well known that many steps in chip development become exponentially harder as feature sizes shrink and instance counts balloon. Billions of transistors are now commonplace, and wafer-scale devices with trillions are on the horizon. Such massive chips put pressure on every electronic design automation (EDA) tool in the development flow, from front-end architectural modeling to signoff and ... » read more

Memory Wall Problem Grows With LLMs


The growing imbalance between the amount of data that needs to be processed to train large language models (LLMs) and the inability to move that data back and forth fast enough between memories and processors has set off a massive global search for a better and more energy- and cost-efficient solution. Much of this is evident in the numbers. The GPU market is forecast to reach $190 billion in ... » read more
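To make that imbalance concrete, a rough illustrative calculation (the model size and bandwidth figures are assumptions, not numbers from the article) of the time needed just to stream a large model's weights from high-bandwidth memory for each generated token:

```python
# Rough illustration of the memory wall for LLM inference: the time just to
# read the weights from HBM once per generated token. Numbers are assumed
# for illustration, not taken from the article.

params        = 70e9        # hypothetical 70B-parameter model
bytes_per_w   = 2           # fp16 weights
hbm_bandwidth = 3.35e12     # ~3.35 TB/s, roughly an H100-class GPU

weight_bytes = params * bytes_per_w
t_stream = weight_bytes / hbm_bandwidth   # seconds per token if bandwidth-bound
print(f"{t_stream*1e3:.1f} ms/token just to read weights "
      f"(~{1/t_stream:.0f} tokens/s upper bound at batch size 1)")
```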

Striking A Balance On Efficiency, Performance, And Cost


Experts at the Table: Semiconductor Engineering sat down to discuss power-related issues such as voltage droop, application-specific processing elements, the impact of physical effects in advanced packaging, and the benefits of backside power delivery, with Hans Yeager, senior principal engineer, architecture, at Tenstorrent; Joe Davis, senior director for Calibre interfaces and EM/IR product m... » read more

Managing kW Power Budgets


Experts at the Table: Semiconductor Engineering sat down to discuss increasing power demands and how to address them, with Hans Yeager, senior principal engineer, architecture, at Tenstorrent; Joe Davis, senior director for Calibre interfaces and EM/IR product management at Siemens EDA; Mo Faisal, CEO of Movellus; Trey Roessig, CTO and senior vice president of engineering at Empower Semiconductor.... » read more

Opportunities Grow For GPU Acceleration


Experts at the Table: Semiconductor Engineering sat down to discuss the impact of GPU acceleration on mask design and production, as well as other process technologies, with Aki Fujimura, CEO of D2S; Youping Zhang, head of ASML Brion; Yalin Xiong, senior vice president and general manager of the BBP and reticle products division at KLA; and Kostas Adam, vice president of engineering at Synopsys. W... » read more

A HW-Aware Scalable Exact-Attention Execution Mechanism For GPUs (Microsoft)


A technical paper titled “Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers” was published by researchers at Microsoft. Abstract: "Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has in... » read more
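For context on what the decode phase computes, here is a minimal single-query attention step over a cached K/V, written as a generic reference sketch with assumed shapes; it is not the paper's Lean Attention algorithm:

```python
# One decode step of standard attention for a single new token: the query
# attends over the full cached K/V. Generic reference sketch with assumed
# shapes; not the paper's Lean Attention implementation.
import numpy as np

def decode_attention_step(q, k_cache, v_cache):
    # q: (n_heads, head_dim); k_cache, v_cache: (n_heads, seq_len, head_dim)
    d = q.shape[-1]
    scores = np.einsum("hd,hsd->hs", q, k_cache) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hs,hsd->hd", weights, v_cache)   # (n_heads, head_dim)

# Example with assumed sizes: 8 heads, 4k cached tokens, head_dim 128
rng = np.random.default_rng(0)
out = decode_attention_step(rng.standard_normal((8, 128)),
                            rng.standard_normal((8, 4096, 128)),
                            rng.standard_normal((8, 4096, 128)))
print(out.shape)  # (8, 128)
```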
