A new technical paper titled "Hardware-Efficient Attention for Fast Decoding" was published by researchers at Princeton University.
Abstract
"LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decoding limits parallelism. We analyze the interplay amo...