Transformers At The Edge: Efficient LLM Deployment

Complex model architectures, demanding runtime computations, and transformer-specific operations introduce unique challenges.


Since the groundbreaking 2017 publication of “Attention Is All You Need,” the transformer architecture has fundamentally reshaped artificial intelligence research and development. This innovation laid the foundation for Large Language Models (LLMs) and Vision Language Models (VLMs), fueling a wave of productization across the industry. A defining milestone was the public launch of ChatGPT in November 2022, which brought transformer-powered AI into mainstream use. Since then, LLMs have enabled a broad spectrum of applications, from conversational agents to advancements in medical research.

However, running these LLMs efficiently presents substantial challenges, particularly on edge computing devices and legacy hardware architectures that were designed before the widespread adoption of large language models.

One of the significant difficulties facing AI processors is the sheer size of LLMs compared to prior state-of-the-art CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), and other network types.

Among those CNNs and RNNs, an 85-million-parameter model would have been considered large. In comparison, even a modestly sized LLM might have 1B parameters, while models with 8B parameters and larger are commonplace. Said plainly, there is no mass-market, cost-effective method to load that many parameters on a single chip; thus, pre-existing solutions may not be effective.
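
To put that size in perspective, a quick back-of-the-envelope calculation (a sketch; the parameter counts and bytes-per-parameter values are the only assumptions, and real deployments also need memory for activations and the KV cache) shows the raw weight storage involved:

```python
# Rough weight-storage estimate for different model sizes and precisions.
# Assumes parameters are stored densely with no overhead; real deployments
# also need memory for the KV cache, activations, and runtime buffers.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Raw weight storage in gigabytes at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for params, label in [(85e6, "85M (large CNN/RNN)"),
                      (1e9, "1B (small LLM)"),
                      (8e9, "8B (common LLM)")]:
    row = ", ".join(f"{p}: {weight_footprint_gb(params, p):6.2f} GB"
                    for p in BYTES_PER_PARAM)
    print(f"{label:22s} {row}")
```

Even at aggressive 4-bit precision, an 8B-parameter model needs roughly 4 GB just for its weights, far beyond typical on-chip memory budgets.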

Consider Llama 3.2, Meta’s latest generation of LLMs, which introduced significant advancements in both text and multimodal (text + vision) AI capabilities. This release expands on previous versions with new model variants and features designed for both enterprise and edge-device deployment. Even the smallest Llama 3.2 text model contains 1 billion parameters. Furthermore, the cost of attention grows with context size (n). During prefill, where operations are predominantly done in parallel, the compute load scales with the square of the context size, O(n²). The prefill phase is compute-bound, meaning its speed is limited by the raw computational power of the hardware. In contrast, decode is predominantly sequential, and its compute load per generated token is far smaller, O(n). However, the decode phase is dominated by memory access speed rather than compute power, so its per-token cost, measured in latency, can be orders of magnitude higher than during prefill.
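
As a rough illustration of those growth rates (a toy sketch that counts only attention score operations; the constants are arbitrary and no real model dimensions are assumed):

```python
# Toy scaling model: prefill attention touches ~n^2 token pairs for an
# n-token prompt, while each decode step attends over the tokens seen so
# far (~n operations per new token). Only the growth rates matter here.

def prefill_attention_ops(n_prompt: int) -> int:
    return n_prompt * n_prompt                    # O(n^2), done once, in parallel

def decode_attention_ops(n_prompt: int, n_new: int) -> int:
    # Token i attends over the prompt plus the i tokens generated before it.
    return sum(n_prompt + i for i in range(n_new))

for n in (128, 1024, 8192):
    print(f"prompt={n:5d}  prefill~{prefill_attention_ops(n):>12,}"
          f"  decode of 256 new tokens~{decode_attention_ops(n, 256):>12,}")
```

The point is the asymmetry: prefill work grows quadratically with prompt length but parallelizes well, while decode adds a comparatively small, strictly sequential amount of work per generated token.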

Exploring LLM inference flow

The LLM inference flow begins with a user prompt, a sentence typed or spoken by the user. In Figure 1 below, we use the example of “May the force.” The prompt is first translated into tokens, numerical representations of the text, by a processing mechanism aptly named the “tokenizer.” The tokens are then sent through the inference processing steps, which are divided into two phases: the prefill phase and the decode phase.
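
As a minimal illustration of the tokenizer step (a toy sketch with a made-up vocabulary and whitespace splitting; production tokenizers use learned subword vocabularies, but the interface is the same: text in, integer token IDs out):

```python
# Toy tokenizer: maps the example prompt "May the force" to integer IDs
# using a hypothetical vocabulary. Real LLM tokenizers use learned subword
# vocabularies, but the interface is identical: text in, token IDs out.

VOCAB = {"May": 17, "the": 3, "force": 942, "be": 58, "with": 24, "you": 7}
INV_VOCAB = {v: k for k, v in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    return [VOCAB[word] for word in text.split()]

def detokenize(token_ids: list[int]) -> str:
    return " ".join(INV_VOCAB[t] for t in token_ids)

prompt_ids = tokenize("May the force")
print(prompt_ids)              # e.g. [17, 3, 942]
print(detokenize(prompt_ids))  # "May the force"
```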

In the prefill phase, all the tokens are sent at once through a series of transformer blocks. At each transformer layer, the Key (K) and Value (V) vectors for each token are stored in a cache, called the KV cache or attention cache. This cache is then used to make subsequent generation steps faster. In the decode phase, output tokens are generated one at a time, as this part is sequential in nature.

In Figure 1, examples of this are the generated tokens “be,” “with,” and “you.”

Fig. 1: LLM inference flow.

During the prefill stage, the model must compute attention across all prompt tokens. During the decode stage, however, the processor only computes attention for the newly generated token against the cached keys and values from prefill. This enables efficient autoregressive decoding: the processor does not need to recompute everything from scratch at each step. The prefill stage is compute-heavy because it processes the entire input prompt at once.
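
The sketch below shows this mechanism for a single attention head (a simplified example with random weights and NumPy; real models add multiple heads, positional encodings, normalization, and many stacked layers). Prefill computes and caches K and V for every prompt token, while each decode step computes Q, K, and V only for the newest token, appends to the cache, and attends against everything cached so far:

```python
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_embeddings):
    """Process all prompt tokens at once and build the KV cache."""
    q, k, v = prompt_embeddings @ Wq, prompt_embeddings @ Wk, prompt_embeddings @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))      # (n, n): quadratic in prompt length
    return scores @ v, {"k": k, "v": v}         # outputs plus KV cache

def decode_step(new_embedding, cache):
    """Process one new token, reusing cached K/V from earlier tokens."""
    q, k, v = new_embedding @ Wq, new_embedding @ Wk, new_embedding @ Wv
    cache["k"] = np.vstack([cache["k"], k])     # append, don't recompute
    cache["v"] = np.vstack([cache["v"], v])
    scores = softmax(q @ cache["k"].T / np.sqrt(d))   # (1, n+1): linear per token
    return scores @ cache["v"], cache

prompt = rng.standard_normal((3, d))            # e.g. "May the force" -> 3 tokens
_, cache = prefill(prompt)
out, cache = decode_step(rng.standard_normal((1, d)), cache)
print(out.shape, cache["k"].shape)              # (1, 64) (4, 64)
```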

Compute cost grows with prompt length: if the prompt is N tokens and the model has L layers, the processor must push all N tokens through all L layers. The decode stage is relatively fast after that because the system processes only one new token at a time, reusing the cache. This raises an issue, though: the two phases differ greatly in compute, memory, and power, so how can a single solution be optimal for both?
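
One way to see why a single design struggles to serve both phases is to compare how much arithmetic each phase performs per byte of weights read from memory (a back-of-the-envelope sketch; the 1B parameter count, FP16 weights, 1,024-token prompt, and the 2-FLOPs-per-weight rule of thumb are assumptions, and KV-cache traffic is ignored):

```python
# Back-of-the-envelope: why prefill is compute-bound and decode is
# memory-bound. In prefill, each weight read is reused across all N prompt
# tokens; in decode, roughly the full weight set is streamed from memory
# for every single generated token.

PARAMS = 1e9          # e.g. a 1B-parameter model (assumption)
BYTES_PER_WEIGHT = 2  # FP16 weights (assumption)
N_PROMPT = 1024       # prompt tokens handled in one prefill pass (assumption)

flops_per_token = 2 * PARAMS                  # ~2 FLOPs per weight per token

# Prefill: one weight read is amortized over all prompt tokens.
prefill_intensity = (flops_per_token * N_PROMPT) / (PARAMS * BYTES_PER_WEIGHT)

# Decode: one weight read serves a single generated token.
decode_intensity = flops_per_token / (PARAMS * BYTES_PER_WEIGHT)

print(f"prefill: ~{prefill_intensity:.0f} FLOPs per byte of weights moved")
print(f"decode:  ~{decode_intensity:.0f} FLOPs per byte of weights moved")
```

Under these assumptions, prefill performs on the order of a thousand operations per byte of weight traffic while decode performs roughly one, which is why prefill tends to saturate compute and decode tends to saturate memory bandwidth.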

LLM inference runtime

Fig. 2: LLM inference runtime.

Let’s also examine how the runtime differs between traditional AI networks and LLMs, as illustrated in Figure 2.

Traditional CNNs have a simple, monolithic runtime with only two phases:

  • Data loading phase
  • Inference phase

LLMs introduce a multi-phase runtime system with five distinct phases, each with different computational and memory requirements. The phases are described below, with a sketch of the resulting lifecycle following the descriptions.

Prefill Phase

  • Processes the initial user prompt by tokenizing and embedding the entire input sequence
  • Runs all transformer layers in full sequence mode (dense computation)
  • Initializes and populates the Key-Value (KV) cache with attention values
  • Generally incurs higher latency because the entire input sequence is processed at once
  • May use microbatching for lengthy inputs

Decode Phase

  • Generates output tokens one at a time autoregressively
  • Only processes the last generated token per step
  • Retrieves past tokens from the KV cache for efficiency
  • Computes self-attention only against past tokens
  • Highly optimized with KV caching and batching

Inactive Phase

  • No computation occurs, but the sequence remains “alive” in memory
  • Occurs when waiting for new user input in streaming/chat interfaces
  • KV cache remains in memory (costly resource usage)
  • Can become a bottleneck in high-throughput systems with many cached sequences

Follow-up Prefill

  • Triggered when new user input is appended to a partially generated sequence (multi-turn conversations)
  • Processes new input as a short prefill segment, appending to existing cached context
  • Updates KV cache with new tokens
  • Distinct from initial prefill as it operates on shorter, appended segments

Retired Phase

  • Sequence is terminated and removed from the active pool
  • KV cache is freed and resources are released
  • Triggered by conversation completion, user cancellation, or timeouts
  • Frees memory and scheduling capacity for other sequences
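
A minimal sketch of how a runtime might track these phases per sequence (the enum names, transition table, and triggers below are illustrative, not the API of any particular scheduler):

```python
from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # initial prompt processing, builds the KV cache
    DECODE = auto()            # autoregressive token generation
    INACTIVE = auto()          # waiting for user input, KV cache kept alive
    FOLLOWUP_PREFILL = auto()  # new user input appended to cached context
    RETIRED = auto()           # KV cache freed, sequence removed from the pool

# Allowed transitions for one sequence in a multi-turn conversation.
TRANSITIONS = {
    Phase.PREFILL:          {Phase.DECODE},
    Phase.DECODE:           {Phase.INACTIVE, Phase.RETIRED},
    Phase.INACTIVE:         {Phase.FOLLOWUP_PREFILL, Phase.RETIRED},
    Phase.FOLLOWUP_PREFILL: {Phase.DECODE},
    Phase.RETIRED:          set(),
}

class Sequence:
    def __init__(self, seq_id: int):
        self.seq_id = seq_id
        self.phase = Phase.PREFILL
        self.kv_cache_tokens = 0          # proxy for memory held by this sequence

    def advance(self, next_phase: Phase):
        if next_phase not in TRANSITIONS[self.phase]:
            raise ValueError(f"illegal transition {self.phase} -> {next_phase}")
        if next_phase is Phase.RETIRED:
            self.kv_cache_tokens = 0      # retiring releases the KV cache
        self.phase = next_phase

# Example: a two-turn chat for one sequence.
seq = Sequence(seq_id=42)
for step in (Phase.DECODE, Phase.INACTIVE, Phase.FOLLOWUP_PREFILL,
             Phase.DECODE, Phase.RETIRED):
    seq.advance(step)
    print(seq.phase.name)
```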

This multi-phase complexity significantly exacerbates deployment difficulties for LLMs compared to traditional AI networks. Managing multiple phases simultaneously, maintaining expensive KV caches, and handling dynamic transitions between phases create substantial challenges for efficient LLM deployment and resource management.

New architectures for LLMs

Large Language Models introduce distinct challenges for inference hardware and software systems, including complex model architectures, demanding runtime computations, transformer-specific operations, and implementation considerations. To address these requirements, AI processing platforms must advance to accommodate diverse data representations, support multiple precision formats, deliver enhanced computational throughput, and enable efficient multi-core processing architectures.

Expedera explores this further in our white paper at: https://www.expedera.com/transformers-at-the-edge/


