Home

TECHNICAL PAPERS

Large-scale, SRAM-based LLM Inference Deployment (Groq)

May 21st, 2026 - By: Technical Paper Link

A new technical paper, “SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving,” was published by researchers at Nvidia, with work done while at Groq.

Abstract

“The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth bottleneck during decode. To break through this bottleneck, we present the first large-scale, SRAM-based LLM inference deployment—Groq’s public cloud—serving hundreds of billions of tokens daily. This paper reviews Groq’s first-generation SRAM-based Huge Inference Pipelines (SHIP), highlighting: (1) a synchronous, low-diameter interconnect enabling low-latency scaling across thousands of chips; (2) optimizations for LLM serving under limited memory capacity; and (3) a large pipeline design that sustains efficiency and latency under varying prefill-to-decode ratios and context lengths. Together, these yield state-of-the-art latency while maintaining efficiency across diverse traffic scenarios—key to real-world LLM serving.”

Find the technical paper here. March 2026 and last modified May 2026.

Bitar, Andrew, Aravind Vellora Vayalapra, Baorui Zhou, Matt Boyd, Charlie Wang, Sahil Parmar, Eugene Sha, et al. 2026. “SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving.” In Proceedings of the Ninth Conference on Machine Learning and Systems. https://openreview.net/forum?id=IZaXDwDtL1.

Large-scale, SRAM-based LLM Inference Deployment (Groq)

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers
Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2026

All AI Data Center Interconnects Will Be Optical Within 5 Years

The Sub-2nm Paradox

When Semiconductor Materials Misbehave

TSMC Tech Symposium 2026, By The Numbers

Silicon Photonics Lights The Way To More Efficient Data Centers

Memory Wall Gets Higher

TSV Complexity Leads To Manufacturing Bottleneck

Sponsors

Recent Comments

About

Navigation

Connect With Us

Large-scale, SRAM-based LLM Inference Deployment (Groq)

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2026

All AI Data Center Interconnects Will Be Optical Within 5 Years

The Sub-2nm Paradox

When Semiconductor Materials Misbehave

TSMC Tech Symposium 2026, By The Numbers

Silicon Photonics Lights The Way To More Efficient Data Centers

Memory Wall Gets Higher

TSV Complexity Leads To Manufacturing Bottleneck

Sponsors

Newsletter Signup

Popular Tags

Recent Comments

About

Navigation

Connect With Us

Knowledge Centers
Entities, people and technologies explored