Large-scale, SRAM-based LLM Inference Deployment (Groq)


A new technical paper, "SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving," was published by researchers at Nvidia, with work done while at Groq. Abstract: "The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth b...

Process Variation In The Era Of Scaling: Improving Uniformity With Dummy Fill


As semiconductor patterning continues to scale, even small layout nonuniformities can lead to noticeably different process outcomes. Real chip layouts contain a mix of dense regions, large open regions, and isolated features. As a result, the etch process encounters different “local environments” across the wafer. Even with the same process settings (or recipe), some areas may etch mo...

Memory Wall Gets Higher


Key Takeaways: An increasing percentage of the chip area is consumed by the same amount of SRAM for each node shrink. The problem is not limited to leading-edge AI, as it will eventually impact even small MCUs and MPUs. Architectural changes may be required. Stacking SRAM chiplets on logic is possible but expensive. SRAM is a vital piece of all computing systems, but its fail...

Optimizing In-Memory AI Accelerators Across Multiple Workloads (KAUST, Compumacy)


Researchers from KAUST and Compumacy for Artificial Intelligence Solutions have released “Joint Hardware-Workload Co-Optimization for In-Memory Computing Accelerators.” Abstract: “Software-hardware co-design is essential for optimizing in-memory computing (IMC) hardware accelerators for neural networks. However, most existing optimization frameworks target a single workload, lea...

Optimal Heterogeneous Memory Configs for AI Tasks Under Specified Performance Metrics (Stanford, UCSC)


Researchers from Stanford University and the University of California, Santa Cruz have released “Heterogeneous Memory Design Exploration for AI Accelerators with a Gain Cell Memory Compiler.” Abstract: “As memory increasingly dominates system cost and energy, heterogeneous on-chip memory systems that combine technologies with complementary characteristics are becoming essential. Gain ...

Electrical Model of the Bitflip in SRAM Under Laser Illumination Simulating Laser Fault Injection


A new technical paper, "Electrical modelisation of a bitflip in SRAM cell memory induced by laser fault injection," was published by researchers at Univ Rennes, CNRS, IETR. Abstract: "An electrical model of the bitflip in SRAM under laser illumination simulating laser fault injection is proposed. This model is based on a bipolar phototransistor responsible of the amplified induced photocur...

Router-in-a-Package Design Combining HBM4, Chiplets and In-Package Optics (Technion, Berkeley, UCSD)


A new technical paper, "Scaling Routers with In-Package Optics and High-Bandwidth Memories," was posted by researchers at Technion, UC Berkeley and UC San Diego. Abstract: "This paper aims to apply two major scaling transformations from the computing packaging industry to internet routers: the heterogeneous integration of high-bandwidth memories (HBMs) and chiplets, as well as in-package optic...

Impact Of On-Chip SRAM Size And Frequency On Energy Efficiency And Performance of LLM Inference (Uppsala Univ.)


A new technical paper titled "Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling" was published by researchers at Uppsala University. Abstract: "Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and per...

The Impact Of DRAM Writes On DDR5-Based Systems (Georgia Tech)


A new technical paper titled "BARD: Reducing Write Latency of DDR5 Memory by Exploiting Bank-Parallelism" was published by researchers at Georgia Tech. Abstract: "This paper studies the impact of DRAM writes on DDR5-based system. To efficiently perform DRAM writes, modern systems buffer write requests and try to complete multiple write operations whenever the DRAM mode is switched from read to write. Whe...

Multi-Core Architecture Optimized For Time-Predictable Neural Network Inference (FZI, KIT)


A new technical paper titled "MultiVic: A Time-Predictable RISC-V Multi-Core Processor Optimized for Neural Network Inference" was published by researchers at FZI Research Center for Information Technology and Karlsruhe Institute of Technology (KIT). Abstract: "Real-time systems, particularly those used in domains like automated driving, are increasingly adopting neural network...
