Large-scale, SRAM-based LLM Inference Deployment (Groq)


A new technical paper, "SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving," was published by researchers at Nvidia, with work done while at Groq. Abstract: "The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth b...

Study Of HW Acceleration for Neural Networks (Arizona State Univ.)


A new technical paper titled "Hardware Acceleration for Neural Networks: A Comprehensive Survey" was published by researchers at Arizona State University. Abstract: "Neural networks have become a dominant computational workload across cloud and edge platforms, but their rapid growth in model size and deployment diversity has exposed hardware bottlenecks that are increasingly dominated by mem...