Large-scale, SRAM-based LLM Inference Deployment (Groq)


A new technical paper, "SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving," was published by researchers at Nvidia, with work done while at Groq. Abstract "The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth b... » read more

Memory For AI At The Edge


Inferencing at the edge has very different needs than training large language models or large-scale inferencing in AI data centers. Many edge devices run on a battery. They're price-sensitive, and they are constrained by the physical area of the device. As a result, the amount of memory that can be packed into these devices is also limited. Steve Woo, Rambus fellow and distinguished inventor, t... » read more

Software-Defined Systems


Using high-level software languages to define semiconductors is faster, easier, and allows for more changes long before the RTL stage. This is especially useful for chiplets and embedded accelerators, which are narrower in scope and more targeted at different workloads and specific domains. But there are some caveats for engineers working in this space. Russell Klein, program director for Sieme... » read more

Four Architectural Opportunities for LLM Inference Hardware (Google)


A new technical paper titled "Challenges and Research Directions for Large Language Model Inference Hardware" was published by Google. Abstract "Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and in... » read more

Why Arm For Cloud: At A Glance


This report explores the performance and cost-efficiency benefits of Arm processors on AWS, specifically examining Arm Neoverse-powered AWS Graviton4 processors in comparison to the latest available generation AMD- and Intel-based AWS EC2 alternatives. As detailed in this Lab Insight Report, Signal65 conducted hands-on performance testing and cost-efficiency analysis across four distinct workloa... » read more

How Grinn And Synaptics Are Accelerating Edge AI Adoption And Innovation


From smart cities to industrial automation, organizations are rethinking how and where data is processed. The answer, increasingly, is at the Edge—where devices can analyze information in real time without sending it to the cloud. This approach improves responsiveness, enhances security, and reduces reliance on network connectivity. Recognizing this shift, Grinn Global and Synaptics have p... » read more

HW-SW Co-Designed System With 3 Core Optimization Pathways For Long-Context Agentic LLM Inference (Cambridge, ICL)


A new technical paper titled "Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference" was published by researchers at the University of Cambridge, Imperial College London, and the University of Edinburgh. Abstract "LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. The... » read more

Report: The AI Efficiency Boom


Artificial Intelligence (AI) is undergoing a fundamental transformation. While early AI models were large, compute-heavy, and dependent on cloud processing, a new wave of efficiency-driven innovations is moving AI inference—the generation of model results—to the edge. Smaller models, improved memory and compute performance, and the need for privacy, low latency, and energy efficiency are dr... » read more

Scaling GenAI Training And Inference Chips With Runtime Monitoring


GenAI’s rapid growth is pushing the limits of semiconductor technology, demanding breakthroughs in performance, power efficiency, and reliability. Training and inference workloads for models like GPT-4 and GPT-5 require massive computational resources, leading to skyrocketing costs, energy consumption, and hardware failures. Traditional optimization methods, such as static guard bands and per... » read more

System-Level Approach To Reducing HBM Cost For AI Inference (RPI, IBM)


A new technical paper titled "Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure" was published by researchers at Rensselaer Polytechnic Institute and IBM. Abstract "High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, pose... » read more
