Large-scale, SRAM-based LLM Inference Deployment (Groq)


A new technical paper, "SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving," was published by researchers at Nvidia, with work done while at Groq. Abstract "The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth b... » read more

Alleviating the DRAM Capacity Bottleneck in Consumer Devices with NVMs


A new technical paper titled "Extending Memory Capacity in Modern Consumer Systems With Emerging Non-Volatile Memory: Experimental Analysis and Characterization Using the Intel Optane SSD" was published by researchers at ETH Zurich, University of Illinois Urbana-Champaign, Google, and Rivos. Abstract Excerpt "DRAM scalability is becoming a limiting factor to the available memory capacity in... » read more