Characterization of GPU-based Inference for Reasoning-Centric LLMs (Micron, Argonne)


Researchers from Micron Technology and Argonne National Laboratory have released “Understanding Inference Scaling for LLMs: Bottlenecks, Trade-offs, and Performance Principles”. Abstract: “The transition from standard generative AI to reasoning-centric architectures, exemplified by models capable of extensive Chain-of-Thought (CoT) processing, marks a fundamental paradigm shift i...

Large-scale, SRAM-based LLM Inference Deployment (Groq)


A new technical paper, "SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving," was published by researchers at Nvidia, with work done while at Groq. Abstract: "The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth b...

A Detailed Evaluation of A Production Server With High-End MRDIMM Main Memory (BSC, Micron, Intel, UPC)


A new technical paper, "Performance and Energy Benefits of MRDIMMs," was published by researchers at Barcelona Supercomputing Center, Universitat Politecnica de Catalunya, Micron and Intel Corporation. Abstract: "Multiplexed Rank DIMMs (MRDIMMs) have recently emerged as memory devices that enable higher bandwidth without increasing DRAM chip frequencies. This paper presents a detailed perf...

Microarchitecture Tailored to 3D-Stacked Near-Memory Processing LLM Decoding (U. of Edinburgh, Peking U., Cambridge et al.)


A new technical paper, "Rethinking Compute Substrates for 3D-Stacked Near-Memory LLM Decoding: Microarchitecture-Scheduling Co-Design," was published by researchers at University of Edinburgh, Peking University, University of Cambridge, University of Chinese Academy of Sciences, and the Hong Kong University of Science and Technology. Abstract: "Large language model (LLM) decoding is a majo...

New Automotive Architectures Are Shaking Up Processor And Memory Choices


Key Takeaways: Assisted and autonomous driving require more data from more sensors, and much faster processing of some of that data. The shift to software-defined vehicles and centralized intelligence makes it easier to identify where the most advanced processors and memories are required, and where older and less expensive technologies can be deployed. Technologies that were largely ...

An FPGA-based Accelerator Addressing Bottlenecks in GNN Preprocessing (KAIST et al.)


A new technical paper, "AutoGNN: End-to-End Hardware-Driven Graph Preprocessing for Enhanced GNN Performance," was published by researchers at KAIST, Panmnesia, Peking University, Hanyang University, and Pennsylvania State University. Abstract: "Graph neural network (GNN) inference faces significant bottlenecks in preprocessing, which often dominate overall inference latency. We introduce Au...

Impact Of On-Chip SRAM Size And Frequency On Energy Efficiency And Performance of LLM Inference (Uppsala Univ.)


A new technical paper titled "Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling" was published by researchers at Uppsala University. Abstract: "Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and per...

AI Workloads at the Edge: Ensuring Performance, Privacy, and Security


Experts At The Table: Semiconductor Engineering gathered a group of experts to discuss why some AI workloads are better suited for on-device processing to achieve consistent performance, avoid network connectivity issues, reduce cloud computing costs, and ensure privacy. The panel included Frank Ferro, group director in the Silicon Solutions Group at Cadence; Eduardo Montanez, vice president a...

Boosting Memory Bandwidth Availability By Salvaging Idle I/O Bandwidth Resources (Georgia Tech)


A new technical paper titled "Pushing the Memory Bandwidth Wall with CXL-enabled Idle I/O Bandwidth Harvesting" was published by researchers at Georgia Institute of Technology. Abstract: "The continual increase of cores on server-grade CPUs raises demands on memory systems, which are constrained by limited off-chip pin and data transfer rate scalability. As a result, high-end processors ty...

Optimizing AI Workloads For Edge Computing


Experts At The Table: Semiconductor Engineering gathered a group of experts to discuss how some AI workloads are better suited for on-device processing to achieve consistent performance, avoid network connectivity issues, reduce cloud computing costs, and ensure privacy. The panel included Frank Ferro, group director in the Silicon Solutions Group at Cadence; Eduardo Montanez, vice president an...