Transformers At The Edge: Efficient LLM Deployment

Complex model architectures, demanding runtime computations, and transformer-specific operations introduce unique challenges.


Since the groundbreaking 2017 publication of “Attention Is All You Need,” the transformer architecture has fundamentally reshaped artificial intelligence research and development. This innovation laid the foundation for Large Language Models (LLMs) and Vision Language Models (VLMs), fueling a wave of productization across the industry. A defining milestone was the public launch of ChatGPT in November 2022, which brought transformer-powered AI into mainstream use. Since then, LLMs have enabled a broad spectrum of applications, from conversational agents to advancements in medical research.

However, running these LLMs efficiently presents substantial challenges, particularly on edge computing devices and legacy hardware architectures that were designed before the widespread adoption of large language models.

One of the significant difficulties facing AI processors is the sheer size of LLMs compared to prior state-of-the-art CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), and other network types.

Among those CNNs and RNNs, an 85-million-parameter model would have been considered large. In comparison, even a modestly sized LLM might have 1B parameters, while models with 8B parameters and larger are commonplace. Said plainly, there is no mass-market, cost-effective method to load that many parameters on a single chip; thus, pre-existing solutions may not be effective.
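
To put that size in perspective, a quick back-of-the-envelope calculation (a sketch; the parameter counts and bytes-per-parameter values are the only assumptions, and real deployments also need memory for activations and the KV cache) shows the raw weight storage involved:

```python
# Rough weight-storage estimate for different model sizes and precisions.
# Assumes parameters are stored densely with no overhead; real deployments
# also need memory for the KV cache, activations, and runtime buffers.

BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Raw weight storage in gigabytes at the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

for params, label in [(85e6, "85M (large CNN/RNN)"),
                      (1e9, "1B (small LLM)"),
                      (8e9, "8B (common LLM)")]:
    row = ", ".join(f"{p}: {weight_footprint_gb(params, p):6.2f} GB"
                    for p in BYTES_PER_PARAM)
    print(f"{label:22s} {row}")
```

Even at aggressive 4-bit precision, an 8B-parameter model needs roughly 4 GB just for its weights, far beyond typical on-chip memory budgets.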

Consider Llama 3.2, Meta’s latest generation of LLMs, which introduced significant advancements in both text and multimodal (text + vision) AI capabilities. This release expands on previous versions with new model variants and features designed for both enterprise and edge-device deployment. Even the smallest Llama 3.2 text model contains 1 billion parameters. Furthermore, the cost of attention grows with context size (n). During prefill, where operations are predominantly done in parallel, the compute load scales with the square of the context size, O(n²). The prefill phase is compute-bound, meaning its speed is limited by the raw computational power of the hardware. In contrast, decode is predominantly sequential, and its compute load per generated token is far smaller, O(n). However, the decode phase is dominated by memory access speed rather than compute power, so its per-token cost, measured in latency, can be orders of magnitude higher than during prefill.
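
As a rough illustration of those growth rates (a toy sketch that counts only attention score operations; the constants are arbitrary and no real model dimensions are assumed):

```python
# Toy scaling model: prefill attention touches ~n^2 token pairs for an
# n-token prompt, while each decode step attends over the tokens seen so
# far (~n operations per new token). Only the growth rates matter here.

def prefill_attention_ops(n_prompt: int) -> int:
    return n_prompt * n_prompt                    # O(n^2), done once, in parallel

def decode_attention_ops(n_prompt: int, n_new: int) -> int:
    # Token i attends over the prompt plus the i tokens generated before it.
    return sum(n_prompt + i for i in range(n_new))

for n in (128, 1024, 8192):
    print(f"prompt={n:5d}  prefill~{prefill_attention_ops(n):>12,}"
          f"  decode of 256 new tokens~{decode_attention_ops(n, 256):>12,}")
```

The point is the asymmetry: prefill work grows quadratically with prompt length but parallelizes well, while decode adds a comparatively small, strictly sequential amount of work per generated token.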

Exploring LLM inference flow

The LLM inference flow begins with a user prompt, a sentence typed or spoken by the user. In Figure 1 below, we use the example of “May the force.” The prompt is first translated into tokens, numerical representations of the text, by a processing mechanism aptly named the “tokenizer.” The tokens are then sent through the inference processing steps, which are divided into two phases: the prefill phase and the decode phase.
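
As a minimal illustration of the tokenizer step (a toy sketch with a made-up vocabulary and whitespace splitting; production tokenizers use learned subword vocabularies, but the interface is the same: text in, integer token IDs out):

```python
# Toy tokenizer: maps the example prompt "May the force" to integer IDs
# using a hypothetical vocabulary. Real LLM tokenizers use learned subword
# vocabularies, but the interface is identical: text in, token IDs out.

VOCAB = {"May": 17, "the": 3, "force": 942, "be": 58, "with": 24, "you": 7}
INV_VOCAB = {v: k for k, v in VOCAB.items()}

def tokenize(text: str) -> list[int]:
    return [VOCAB[word] for word in text.split()]

def detokenize(token_ids: list[int]) -> str:
    return " ".join(INV_VOCAB[t] for t in token_ids)

prompt_ids = tokenize("May the force")
print(prompt_ids)              # e.g. [17, 3, 942]
print(detokenize(prompt_ids))  # "May the force"
```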

In the prefill phase, all the tokens are sent at once through a series of transformer blocks. At each transformer layer, the Key (K) and Value (V) vectors for each token are stored in a cache, called the KV cache or attention cache. This cache is then used to make subsequent generation steps faster. In the decode phase, output tokens are generated one at a time, as this part is sequential in nature.

In Figure 1, examples of this are the generated tokens “be,” “with,” and “you.”

Fig. 1: LLM inference flow.

During the prefill stage, the model must compute attention across all prompt tokens. During the decode stage, however, the processor only computes attention for the newly generated token against the cached keys and values from prefill. This enables efficient autoregressive decoding: the processor does not need to recompute everything from scratch at each step. The prefill stage is compute-heavy because it processes the entire input prompt at once.
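
The sketch below shows this mechanism for a single attention head (a simplified example with random weights and NumPy; real models add multiple heads, positional encodings, normalization, and many stacked layers). Prefill computes and caches K and V for every prompt token, while each decode step computes Q, K, and V only for the newest token, appends to the cache, and attends against everything cached so far:

```python
import numpy as np

d = 64                                   # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prefill(prompt_embeddings):
    """Process all prompt tokens at once and build the KV cache."""
    q, k, v = prompt_embeddings @ Wq, prompt_embeddings @ Wk, prompt_embeddings @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))      # (n, n): quadratic in prompt length
    return scores @ v, {"k": k, "v": v}         # outputs plus KV cache

def decode_step(new_embedding, cache):
    """Process one new token, reusing cached K/V from earlier tokens."""
    q, k, v = new_embedding @ Wq, new_embedding @ Wk, new_embedding @ Wv
    cache["k"] = np.vstack([cache["k"], k])     # append, don't recompute
    cache["v"] = np.vstack([cache["v"], v])
    scores = softmax(q @ cache["k"].T / np.sqrt(d))   # (1, n+1): linear per token
    return scores @ cache["v"], cache

prompt = rng.standard_normal((3, d))            # e.g. "May the force" -> 3 tokens
_, cache = prefill(prompt)
out, cache = decode_step(rng.standard_normal((1, d)), cache)
print(out.shape, cache["k"].shape)              # (1, 64) (4, 64)
```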

Compute cost grows with prompt length: if the prompt is N tokens and the model has L layers, the processor must push all N tokens through all L layers. The decode stage is relatively fast after that because the system processes only one new token at a time, reusing the cache. This raises an issue, though: the two phases differ greatly in compute, memory, and power, so how can a single solution be optimal for both?
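
One way to see why a single design struggles to serve both phases is to compare how much arithmetic each phase performs per byte of weights read from memory (a back-of-the-envelope sketch; the 1B parameter count, FP16 weights, 1,024-token prompt, and the 2-FLOPs-per-weight rule of thumb are assumptions, and KV-cache traffic is ignored):

```python
# Back-of-the-envelope: why prefill is compute-bound and decode is
# memory-bound. In prefill, each weight read is reused across all N prompt
# tokens; in decode, roughly the full weight set is streamed from memory
# for every single generated token.

PARAMS = 1e9          # e.g. a 1B-parameter model (assumption)
BYTES_PER_WEIGHT = 2  # FP16 weights (assumption)
N_PROMPT = 1024       # prompt tokens handled in one prefill pass (assumption)

flops_per_token = 2 * PARAMS                  # ~2 FLOPs per weight per token

# Prefill: one weight read is amortized over all prompt tokens.
prefill_intensity = (flops_per_token * N_PROMPT) / (PARAMS * BYTES_PER_WEIGHT)

# Decode: one weight read serves a single generated token.
decode_intensity = flops_per_token / (PARAMS * BYTES_PER_WEIGHT)

print(f"prefill: ~{prefill_intensity:.0f} FLOPs per byte of weights moved")
print(f"decode:  ~{decode_intensity:.0f} FLOPs per byte of weights moved")
```

Under these assumptions, prefill performs on the order of a thousand operations per byte of weight traffic while decode performs roughly one, which is why prefill tends to saturate compute and decode tends to saturate memory bandwidth.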

LLM inference runtime

Fig. 2: LLM inference runtime.

Let’s also examine how the runtime differs between traditional AI networks and LLMs, as illustrated in Figure 2.

Traditional CNNs have a simple, monolithic runtime with only two phases:

  • Data loading phase
  • Inference phase

LLMs introduce a multi-phase runtime system with five distinct phases, each with different computational and memory requirements. The phases are described below, with a sketch of the resulting lifecycle following the descriptions.

Prefill Phase

  • Processes the initial user prompt by tokenizing and embedding the entire input sequence
  • Runs all transformer layers in full sequence mode (dense computation)
  • Initializes and populates the Key-Value (KV) cache with attention values
  • Generally incurs higher latency because the entire input sequence is processed at once
  • May use microbatching for lengthy inputs

Decode Phase

  • Generates output tokens one at a time autoregressively
  • Only processes the last generated token per step
  • Retrieves past tokens from the KV cache for efficiency
  • Computes self-attention only against past tokens
  • Highly optimized with KV caching and batching

Inactive Phase

  • No computation occurs, but the sequence remains “alive” in memory
  • Occurs when waiting for new user input in streaming/chat interfaces
  • KV cache remains in memory (costly resource usage)
  • Can become a bottleneck in high-throughput systems with many cached sequences

Follow-up Prefill

  • Triggered when new user input is appended to a partially generated sequence (multi-turn conversations)
  • Processes new input as a short prefill segment, appending to existing cached context
  • Updates KV cache with new tokens
  • Distinct from initial prefill as it operates on shorter, appended segments

Retired Phase

  • Sequence is terminated and removed from the active pool
  • KV cache is freed and resources are released
  • Triggered by conversation completion, user cancellation, or timeouts
  • Frees memory and scheduling capacity for other sequences
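
A minimal sketch of how a runtime might track these phases per sequence (the enum names, transition table, and triggers below are illustrative, not the API of any particular scheduler):

```python
from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # initial prompt processing, builds the KV cache
    DECODE = auto()            # autoregressive token generation
    INACTIVE = auto()          # waiting for user input, KV cache kept alive
    FOLLOWUP_PREFILL = auto()  # new user input appended to cached context
    RETIRED = auto()           # KV cache freed, sequence removed from the pool

# Allowed transitions for one sequence in a multi-turn conversation.
TRANSITIONS = {
    Phase.PREFILL:          {Phase.DECODE},
    Phase.DECODE:           {Phase.INACTIVE, Phase.RETIRED},
    Phase.INACTIVE:         {Phase.FOLLOWUP_PREFILL, Phase.RETIRED},
    Phase.FOLLOWUP_PREFILL: {Phase.DECODE},
    Phase.RETIRED:          set(),
}

class Sequence:
    def __init__(self, seq_id: int):
        self.seq_id = seq_id
        self.phase = Phase.PREFILL
        self.kv_cache_tokens = 0          # proxy for memory held by this sequence

    def advance(self, next_phase: Phase):
        if next_phase not in TRANSITIONS[self.phase]:
            raise ValueError(f"illegal transition {self.phase} -> {next_phase}")
        if next_phase is Phase.RETIRED:
            self.kv_cache_tokens = 0      # retiring releases the KV cache
        self.phase = next_phase

# Example: a two-turn chat for one sequence.
seq = Sequence(seq_id=42)
for step in (Phase.DECODE, Phase.INACTIVE, Phase.FOLLOWUP_PREFILL,
             Phase.DECODE, Phase.RETIRED):
    seq.advance(step)
    print(seq.phase.name)
```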

This multi-phase complexity significantly exacerbates deployment difficulties for LLMs compared to traditional AI networks. Managing multiple phases simultaneously, maintaining expensive KV caches, and handling dynamic transitions between phases create substantial challenges for efficient LLM deployment and resource management.

New architectures for LLMs

Large Language Models introduce distinct challenges for inference hardware and software systems, including complex model architectures, demanding runtime computations, transformer-specific operations, and implementation considerations. To address these requirements, AI processing platforms must advance to accommodate diverse data representations, support multiple precision formats, deliver enhanced computational throughput, and enable efficient multi-core processing architectures.

Expedera explores this further in our white paper at: https://www.expedera.com/transformers-at-the-edge/


