System-Level Approach to Reducing HBM Cost for AI Inference (RPI, IBM)

A new technical paper titled “Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure” was published by researchers at Rensselaer Polytechnic Institute and IBM.

Abstract
“High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a domain-specific ECC framework combining large-codeword Reed–Solomon (RS) correction with lightweight fine-grained CRC detection, differential parity updates to mitigate write amplification, and tunable protection based on data importance. Our evaluation using LLM inference workloads shows that, even under raw HBM bit error rates up to 10^-3, the system retains over 78% of throughput and 97% of model accuracy compared with systems equipped with ideal error-free HBM. By treating reliability as a tunable system parameter rather than a fixed hardware constraint, our design opens a new path toward low-cost, high-performance HBM deployment in AI infrastructure.”
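
To make two of the ideas in the abstract concrete, the following Python sketch illustrates differential parity updates and fine-grained CRC detection on a toy codeword. It is not the authors' implementation: a single XOR parity chunk stands in for the paper's Reed–Solomon code (both are linear codes, which is the property the differential update relies on), and the chunk size, function names, and the use of CRC-32 are assumptions made only for illustration.

```python
import zlib
from functools import reduce

CHUNK = 32  # bytes per fine-grained chunk (hypothetical size, not from the paper)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(chunks):
    """Recompute parity from scratch: requires reading every chunk."""
    return reduce(xor_bytes, chunks)


def differential_parity(old_parity, old_chunk, new_chunk):
    """Update parity from the written chunk alone.

    Because the code is linear, new_parity = old_parity XOR (old XOR new),
    so a small write does not force a read of the whole codeword.
    """
    return xor_bytes(old_parity, xor_bytes(old_chunk, new_chunk))


def recover(chunks, crcs, parity):
    """Use per-chunk CRCs to locate a bad chunk, then rebuild it from parity.

    A single XOR parity chunk can repair only one located chunk; a real
    Reed-Solomon code would tolerate more.
    """
    bad = [i for i, c in enumerate(chunks) if zlib.crc32(c) != crcs[i]]
    if not bad:
        return list(chunks)
    if len(bad) > 1:
        raise ValueError("more corrupted chunks than this toy parity can repair")
    others = [c for i, c in enumerate(chunks) if i != bad[0]]
    fixed = list(chunks)
    fixed[bad[0]] = reduce(xor_bytes, others, parity)
    return fixed


if __name__ == "__main__":
    data = [bytes([i] * CHUNK) for i in range(1, 5)]
    parity = full_parity(data)

    # Small write to chunk 2: update parity without touching the other chunks.
    new_chunk = bytes([9] * CHUNK)
    parity = differential_parity(parity, data[2], new_chunk)
    data[2] = new_chunk
    print("differential update matches full recompute:", parity == full_parity(data))

    # Flip one bit in chunk 1; the CRC locates it and the parity repairs it.
    crcs = [zlib.crc32(c) for c in data]
    corrupted = list(data)
    corrupted[1] = bytes([corrupted[1][0] ^ 0x01]) + corrupted[1][1:]
    print("recovered:", recover(corrupted, crcs, parity) == data)
```

In this sketch the CRC turns an error at an unknown position into an erasure at a known position, which is cheaper to repair than a blind correction. How the paper sizes its codewords, chunks, and protection levels is not stated in the abstract and is not modeled here.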

Find the technical paper on arXiv (arXiv:2507.02654). Published July 2025.

Xie, Rui, Asad Ul Haq, Yunhua Fang, Linsen Ma, Sanchari Sen, Swagath Venkataramani, Liu Liu, and Tong Zhang. “Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure.” arXiv preprint arXiv:2507.02654 (2025).


