Systematic Analysis of CPU-Induced Slowdowns in Multi-GPU LLM Inference (Georgia Tech)


A new technical paper, "Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference," was published by the Georgia Institute of Technology. Abstract "Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and servin... » read more

Ultra-low-bit LLM Inference Allows AI-PC CPUs And Discrete Client GPUs To Approach High-end GPU-Level (Intel)


A new technical paper titled "Pushing the Envelope of LLM Inference on AI-PC and Intel GPUs" was published by researchers at Intel. Abstract "The advent of ultra-low-bit LLM models (1/1.58/2-bit), which match the perplexity and end-task performance of their full-precision counterparts using the same model size, is ushering in a new era of LLM inference for resource-constrained environments... » read more

Impact Of On-Chip SRAM Size And Frequency On Energy Efficiency And Performance of LLM Inference (Uppsala Univ.)


A new technical paper titled "Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling" was published by researchers at Uppsala University. Abstract "Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and per... » read more

Study Of HW Acceleration for Neural Networks (Arizona State Univ.)


A new technical paper titled "Hardware Acceleration for Neural Networks: A Comprehensive Survey" was published by researchers at Arizona State University. Abstract "Neural networks have become a dominant computational workload across cloud and edge platforms, but their rapid growth in model size and deployment diversity has exposed hardware bottlenecks that are increasingly dominated by mem... » read more

Heterogeneous System With Specialized HW For Disaggregated LLM Inference (Princeton Univ., Univ. of Washington)


A new technical paper titled "SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference" was published by researchers at Princeton University and University of Washington. Abstract "Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-boun... » read more

HW-SW Co-Designed System With 3 Core Optimization Pathways For Long-Context Agentic LLM Inference (Cambridge, ICL)


A new technical paper titled "Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference" was published by researchers at University of Cambridge, Imperial College London and University of Edinburgh. Abstract "LLMs now form the backbone of AI agents for a diverse array of applications, including tool use, command-line agents, and web or computer use agents. The... » read more

Dynamic KV Cache Scheduling in Heterogeneous Memory Systems for LLM Inference (Rensselaer Polytechnic Institute, IBM)


A new technical paper titled "Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System" was published by researchers at Rensselaer Polytechnic Institute and IBM. Abstract "Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity red... » read more

Scheduling Architecture Integrated With M3D BEOL Memories For LLM Inference (Georgia Tech, Samsung)


A new technical paper titled "Architecting Long-Context LLM Acceleration with Packing-Prefetch Scheduler and Ultra-Large Capacity On-Chip Memories" was published by researchers at Georgia Institute of Technology and Samsung. Abstract "Long-context Large Language Model (LLM) inference faces increasing compute bottlenecks as attention calculations scale with context length, primarily due to t... » read more

LLM Inference: Core Bottlenecks Imposed By Memory, Compute Capacity, Synchronization Overheads (NVIDIA)


A new technical paper titled "Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need" was published by NVIDIA. Abstract "This paper presents a limit study of transformer-based large language model (LLM) inference, focusing on the fundamental performance bottlenecks imposed by memory bandwidth, memory capacity, and synchronization overhead in distributed ... » read more

Wafer-Scale Computing for LLMs (U. of Edinburgh, Microsoft)


A new technical paper titled "WaferLLM: A Wafer-Scale LLM Inference System" was published by researchers at University of Edinburgh and Microsoft Research. Abstract "Emerging AI accelerators increasingly adopt wafer-scale manufacturing technologies, integrating hundreds of thousands of AI cores in a mesh-based architecture with large distributed on-chip memory (tens of GB in total) and ultr... » read more
