A new technical paper, “SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving,” was published by researchers now at Nvidia; the work was done while they were at Groq.
Abstract
“The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth bottleneck during decode. To break through this bottleneck, we present the first large-scale, SRAM-based LLM inference deployment—Groq’s public cloud—serving hundreds of billions of tokens daily. This paper reviews Groq’s first-generation SRAM-based Huge Inference Pipelines (SHIP), highlighting: (1) a synchronous, low-diameter interconnect enabling low-latency scaling across thousands of chips; (2) optimizations for LLM serving under limited memory capacity; and (3) a large pipeline design that sustains efficiency and latency under varying prefill-to-decode ratios and context lengths. Together, these yield state-of-the-art latency while maintaining efficiency across diverse traffic scenarios—key to real-world LLM serving.”
Find the technical paper at the link in the citation below. Published March 2026; last modified May 2026.
Bitar, Andrew, Aravind Vellora Vayalapra, Baorui Zhou, Matt Boyd, Charlie Wang, Sahil Parmar, Eugene Sha, et al. 2026. “SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving.” In Proceedings of the Ninth Conference on Machine Learning and Systems. https://openreview.net/forum?id=IZaXDwDtL1.
