Complex model architectures, demanding runtime computations, and transformer-specific operations introduce unique challenges.
Since the groundbreaking 2017 publication of “Attention Is All You Need,” the transformer architecture has fundamentally reshaped artificial intelligence research and development. This innovation laid the foundation for Large Language Models (LLMs) and Vision Language Models (VLMs), fueling a wave of productization across the industry. A defining milestone was the public launch of ChatGPT in November 2022, which brought transformer-powered AI into mainstream use. Since then, LLMs have enabled a broad spectrum of applications, from conversational agents to advancements in medical research.
However, running these LLMs efficiently presents substantial challenges, particularly on edge computing devices and legacy hardware architectures that were designed before the widespread adoption of large language models.
One of the significant difficulties facing AI processors is the sheer size of LLMs compared to prior state-of-the-art CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), and other network types.
Among those CNNs and RNNs, an 85-million-parameter model would have been considered large. In comparison, even a modestly sized LLM might have 1B parameters, while models with 8B parameters and larger are commonplace. Put plainly, there is no mass-market, cost-effective way to hold that many parameters on a single chip, so pre-existing solutions may not be effective.
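To put those parameter counts in perspective, here is a back-of-the-envelope sketch of the weight footprint alone, assuming 2 bytes per parameter for FP16 storage and 1 byte for INT8 (the model sizes mirror the examples above; activations and KV-cache memory are not counted):

```python
# Rough weight-memory estimate: bytes = parameter count x bytes per parameter.
# Assumes FP16 (2 bytes) and INT8 (1 byte) storage; activations and KV cache
# are deliberately ignored.

MODELS = {
    "85M CNN/RNN": 85_000_000,
    "1B LLM": 1_000_000_000,
    "8B LLM": 8_000_000_000,
}

def weight_gb(num_params: int, bytes_per_param: int) -> float:
    """Weight footprint in gigabytes."""
    return num_params * bytes_per_param / 1e9

for name, params in MODELS.items():
    print(f"{name:>12}: {weight_gb(params, 2):6.2f} GB @ FP16, "
          f"{weight_gb(params, 1):6.2f} GB @ INT8")

# Output:
#  85M CNN/RNN:   0.17 GB @ FP16,   0.09 GB @ INT8
#       1B LLM:   2.00 GB @ FP16,   1.00 GB @ INT8
#       8B LLM:  16.00 GB @ FP16,   8.00 GB @ INT8
```

Even quantized to INT8, the larger models' weights run to gigabytes, far beyond the on-chip SRAM of a cost-effective edge accelerator, so weights must stream in from external memory.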
Consider Llama 3.2, Meta’s latest generation of LLMs, which introduced significant advancements in both text and multimodal (text + vision) AI capabilities. This release expands on previous versions with new model variants and features designed for both enterprise and edge-device deployment; its smallest text model contains 1 billion parameters. Furthermore, the cost of attention grows with context length (n). During prefill, where tokens are processed largely in parallel, the attention compute load scales with the square of the context length, O(n²). The prefill phase is compute-bound, meaning its speed is limited by the raw computational power of the hardware. Decode, in contrast, is predominantly sequential: each new token attends only to the cached context, so its attention compute is roughly O(n) per generated token. The decode phase, however, is dominated by memory access speed rather than compute power, because the model weights and KV cache must be re-read for every generated token; the per-token memory traffic is therefore orders of magnitude higher than in prefill, where those fetches are amortized across the whole prompt.
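The scaling difference between the two phases can be made concrete with a toy operation count. This is a minimal sketch that ignores model width and head count (constant factors for a fixed model) and looks only at how attention-score work grows with context length n:

```python
# Toy count of attention-score operations per layer, ignoring model width
# and head count (both are constant factors for a fixed model).
# Prefill: every one of n prompt tokens attends to every token -> ~n^2 ops.
# Decode:  the single new token attends to the n cached tokens  -> ~n ops.

def prefill_attention_ops(n_prompt_tokens: int) -> int:
    return n_prompt_tokens * n_prompt_tokens

def decode_attention_ops_per_token(n_cached_tokens: int) -> int:
    return n_cached_tokens

for n in (128, 1024, 8192):
    print(f"context {n:>5}: prefill ~{prefill_attention_ops(n):>12,} ops, "
          f"decode ~{decode_attention_ops_per_token(n):>6,} ops/token")
```

The asymmetry is the crux: decode does far less arithmetic per token, yet it must re-fetch the full weight set and KV cache for every single token, which is what pushes it into the memory-bound regime.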
The LLM inference flow begins with a user prompt, a sentence typed or spoken by the user. In Figure 1 below, we use the example of “May the force.” The prompt is first converted into tokens, numerical representations of pieces of the text, by a processing mechanism aptly named the “tokenizer.” The tokens are then sent through the inference pipeline, which is divided into two phases: the prefill phase and the decode phase.
In the prefill phase, all the prompt tokens are sent at once through a series of transformer blocks. At each transformer layer, the Key (K) and Value (V) vectors for each token are stored in a cache, called the KV cache or attention cache. This cache is then used to make subsequent generation faster. In the decode phase, one token at a time is generated, as this part is sequential in nature.
In Figure 1, examples of this are the generated tokens “be,” “with,” and “you.”
Fig. 1: LLM inference flow.
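To make the two phases in Figure 1 concrete, below is a deliberately tiny NumPy sketch of a single attention layer with a KV cache. Everything in it is a stand-in: the word-hash “tokenizer,” the random weights, and the nearest-embedding “sampling” are hypothetical and bear no relation to Llama or any production runtime; the point is only to show where the cache is built (prefill) and where it is reused and extended (decode).

```python
import numpy as np

# Toy single-layer, single-head illustration of the prefill/decode split.
# Random weights and a fake tokenizer stand in for a real trained model.

rng = np.random.default_rng(0)
D = 16                                        # toy embedding width
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
embed = rng.standard_normal((100, D)) * 0.1   # toy 100-entry vocabulary

def tokenize(prompt: str) -> list[int]:
    # Hypothetical tokenizer: map each word to a toy token id.
    return [sum(ord(c) for c in word) % 100 for word in prompt.split()]

def attention(q, K, V):
    scores = q @ K.T / np.sqrt(D)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# --- Prefill: process all prompt tokens at once and build the KV cache. ---
prompt_ids = tokenize("May the force")
X = embed[prompt_ids]            # (n, D) embeddings for the whole prompt
K_cache = X @ Wk                 # keys for every prompt token
V_cache = X @ Wv                 # values for every prompt token

# --- Decode: generate one token at a time, reusing and extending the cache. ---
x = X[-1]                        # start from the last prompt token
for _ in range(3):               # e.g. "be", "with", "you" in Figure 1
    q = x @ Wq
    context = attention(q, K_cache, V_cache)
    next_id = int(np.argmax(embed @ context))   # toy "sampling"
    x = embed[next_id]
    # Append this token's K/V so later steps need no recomputation.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    print("decoded toy token id:", next_id)
```

The cache is what keeps decode at roughly O(n) attention work per token: keys and values for earlier tokens are looked up rather than recomputed.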
During the prefill stage, the model computes attention over all of the prompt tokens. During the decode stage, however, the processor only needs to compute attention for the single new token against the cached keys and values from prefill. This enables efficient autoregressive decoding: the processor does not recompute everything from scratch at each step. The prefill stage is compute-heavy because it processes the entire input prompt at once.
Prefill compute cost grows rapidly with prompt length: a prompt of N tokens must be pushed through all L transformer layers in a single pass, and the attention computation at each layer grows with the square of N. The decode stage is comparatively fast after that because the system processes only one token at a time, reusing the cache. This raises an issue, though: the two phases have very different compute, memory, and power profiles, so how can a single solution be optimal for both?
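One way to see why decode is memory-bound is to compare, per generated token, the bytes that must be fetched (all weights plus the KV cache) against the arithmetic performed. The numbers below are purely illustrative assumptions for a 1B-parameter model, not measurements of any specific chip or of Llama 3.2:

```python
# Rough per-token traffic vs. compute during decode.
# All values are illustrative assumptions, not measured figures.

PARAMS         = 1_000_000_000   # weights, all re-read for every generated token
LAYERS         = 16              # illustrative layer count
KV_WIDTH       = 2048            # illustrative per-layer K/V width (elements)
CONTEXT_TOKENS = 4096            # tokens already sitting in the KV cache

weight_bytes = PARAMS * 1                                  # INT8 weights
kv_entries   = LAYERS * CONTEXT_TOKENS * 2 * KV_WIDTH      # K and V per layer
kv_bytes     = kv_entries * 2                              # FP16 cache entries
ops          = PARAMS * 2                                  # ~1 multiply-add per weight

bytes_per_token = weight_bytes + kv_bytes
print(f"bytes moved per token : {bytes_per_token / 1e9:.2f} GB")
print(f"ops per token         : {ops / 1e9:.2f} GOPs")
print(f"arithmetic intensity  : {ops / bytes_per_token:.1f} ops/byte")
```

At roughly one operation per byte moved, decode throughput is set by memory bandwidth rather than by how many MACs the processor can issue per cycle; prefill, which amortizes each weight fetch across every prompt token, runs at a far higher arithmetic intensity and stresses compute instead.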
Fig. 2: LLM inference runtime.
Let’s also examine how the runtime differs between traditional AI networks and LLMs, as illustrated in Figure 2.
Traditional CNNs have a simple, monolithic runtime with only two phases. LLMs, by contrast, introduce a multi-phase runtime with five distinct phases, each with different computational and memory requirements:
Prefill Phase
Decode Phase
Inactive Phase
Follow-up Prefill
Retired Phase
This multi-phase complexity significantly exacerbates deployment difficulties for LLMs compared to traditional AI networks. Managing multiple phases simultaneously, maintaining expensive KV caches, and handling dynamic transitions between phases create substantial challenges for efficient LLM deployment and resource management.
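How a runtime might track a session through these phases can be sketched as a small state machine. The transition rules below are an assumption inferred from the phase names (for example, that “Inactive” means a session idling with its KV cache retained and “Retired” means the cache has been released); a real scheduler would additionally manage many concurrent sessions and their memory budgets.

```python
from enum import Enum, auto

# Hypothetical per-session phase tracker. The phase names follow Figure 2;
# the allowed transitions are an assumption, not a documented specification.

class Phase(Enum):
    PREFILL           = auto()  # processing the initial prompt, building the KV cache
    DECODE            = auto()  # generating tokens one at a time from the cache
    INACTIVE          = auto()  # waiting for the next user turn, cache retained
    FOLLOW_UP_PREFILL = auto()  # prefilling the follow-up prompt onto the cache
    RETIRED           = auto()  # session ended, KV cache released

ALLOWED = {
    Phase.PREFILL:           {Phase.DECODE},
    Phase.DECODE:            {Phase.INACTIVE, Phase.RETIRED},
    Phase.INACTIVE:          {Phase.FOLLOW_UP_PREFILL, Phase.RETIRED},
    Phase.FOLLOW_UP_PREFILL: {Phase.DECODE},
    Phase.RETIRED:           set(),
}

class Session:
    def __init__(self) -> None:
        self.phase = Phase.PREFILL

    def advance(self, next_phase: Phase) -> None:
        if next_phase not in ALLOWED[self.phase]:
            raise ValueError(f"illegal transition {self.phase.name} -> {next_phase.name}")
        self.phase = next_phase

# Example lifetime of one chat session:
s = Session()
for p in (Phase.DECODE, Phase.INACTIVE, Phase.FOLLOW_UP_PREFILL,
          Phase.DECODE, Phase.RETIRED):
    s.advance(p)
    print("now in", s.phase.name)
```

Even this toy version hints at the resource question: every session sitting in the inactive or follow-up states is holding a KV cache that occupies memory another session might need.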
Large Language Models introduce distinct challenges for inference hardware and software systems, including complex model architectures, demanding runtime computations, transformer-specific operations, and implementation considerations. To address these requirements, AI processing platforms must advance to accommodate diverse data representations, support multiple precision formats, deliver enhanced computational throughput, and enable efficient multi-core processing architectures.
Expedera explores this further in our white paper at: https://www.expedera.com/transformers-at-the-edge/