Home

TECHNICAL PAPERS

Scheduling Architecture Integrated With M3D BEOL Memories For LLM Inference (Georgia Tech, Samsung)

August 19th, 2025 - By: Technical Paper Link

A new technical paper titled “Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories” was published by researchers at Georgia Institute of Technology and Samsung.

Abstract
“Long-context Large Language Model (LLM) inference faces increasing compute bottlenecks as attention calculations scale with context length, primarily due to the growing KV-cache transfer overhead that saturates High Bandwidth Memory (HBM). While prefetching techniques mitigate cache misses by fetching KV data in advance, their spatial and temporal benefits present new opportunities to exploit. This work proposes a packing-prefetch scheduling architecture with monolithic 3D (M3D) back-end-of-line (BEOL) compatible embedded memories with ultra-large on-chip capacity to accelerate long-context LLM inference. Our optimizations demonstrate 8.06x decode speedup and 1.83x overall latency reduction on Llama3.1-8B using TPUv6e-like hardware with additional 512MB BEOL memories over the serial execution. Evaluations of multi-request workloads on TPU-like architectures show 1.7x-2.4x throughput improvement and 1.5x-2.4x HBM bandwidth reduction compared to packing-only methods on Llama3.1-8B and Llama3.1-70B models. With the co-design of packing, prefetching, and BEOL memories, our approach alleviates HBM constraints and enables efficient long-context LLM inference.”

Find the technical paper here. August 2025.

Lee, Ming-Yen, Faaiq Waqar, Hanchen Yang, Muhammed Ahosan Ul Karim, Harsono Simka, and Shimeng Yu. “Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories.” arXiv preprint arXiv:2508.08457 (2025).

Scheduling Architecture Integrated With M3D BEOL Memories For LLM Inference (Georgia Tech, Samsung)

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers
Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2026

Advanced Packaging Limits Come Into Focus

All AI Data Center Interconnects Will Be Optical Within 5 Years

The Sub-2nm Paradox

When Semiconductor Materials Misbehave

TSMC Tech Symposium 2026, By The Numbers

Silicon Photonics Lights The Way To More Efficient Data Centers

Memory Wall Gets Higher

Sponsors

Recent Comments

About

Navigation

Connect With Us

Scheduling Architecture Integrated With M3D BEOL Memories For LLM Inference (Georgia Tech, Samsung)

Leave a Reply Cancel reply

Technical Papers

Knowledge Centers Entities, people and technologies explored

Related Articles

Startup Funding: Q1 2026

Advanced Packaging Limits Come Into Focus

All AI Data Center Interconnects Will Be Optical Within 5 Years

The Sub-2nm Paradox

When Semiconductor Materials Misbehave

TSMC Tech Symposium 2026, By The Numbers

Silicon Photonics Lights The Way To More Efficient Data Centers

Memory Wall Gets Higher

Sponsors

Newsletter Signup

Popular Tags

Recent Comments

About

Navigation

Connect With Us

Knowledge Centers
Entities, people and technologies explored