A new technical paper titled "SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference" was published by researchers at Princeton University and University of Washington.
Abstract
"Large Language Models (LLMs) have gained popularity in recent years, driving up the demand for inference. LLM inference is composed of two phases with distinct characteristics: a compute-boun...
» read more