Heterogeneous NPU Data Movement: What The Execution Flow Shows

A concrete example of how architectural boundaries influence system behavior.


Heterogeneous NPU designs bring together multiple specialized compute engines to support the range of operators required by modern AI models. This approach enables coverage across diverse workloads, but it also introduces a structural consequence: intermediate data must move between those engines. That movement consumes power, adds latency, and requires additional silicon resources, with effects that grow alongside model complexity.

A 2024 presentation from Intel describing its Gen 4 NPU architecture provides a detailed view into how this plays out in practice. The material walks through the execution of a transformer graph step by step, making the underlying data flow explicit. In doing so, it offers a concrete example of how architectural boundaries influence system behavior.

The architecture: Two compute domains

The Gen 4 NPU is organized around two distinct types of compute engines. Matrix operations are handled by MAC arrays, which are optimized for dense linear algebra and form the core of transformer inference. Alongside them sit SHAVE DSPs, which provide general-purpose vector processing for operations that fall outside the capabilities of fixed-function matrix units.

These engines are connected through shared infrastructure, including scratchpad memory, cache, an MMU, and DMA. This structure allows the system to support a broad set of operators, but it also creates clear boundaries between compute domains. Whenever execution shifts from one type of engine to another, intermediate data must pass through this shared memory system.

Fig. 1: Intel Gen 4 NPU block diagram from the 2024 Tech Tour. Six Neural Compute Engines, each pairing a MAC Array with SHAVE DSPs, sit beneath shared MMU, DMA, and Scratchpad RAM infrastructure.

Multi-head attention as a case study

The implications of this design become more apparent when examining Multi-Head Attention, a central operation in transformer models. Its formulation is straightforward, combining matrix multiplications with a SoftMax normalization step to produce attention-weighted outputs. In practice, however, executing this sequence requires multiple distinct operations, each of which must be mapped onto available compute resources.

The process begins with linear projections to generate query, key, and value vectors, followed by a matrix multiplication that computes similarity scores. These scores are then normalized using SoftMax before a second matrix multiplication applies the resulting attention weights. The sequence concludes with concatenation and a final projection. While each step is well understood in isolation, the transitions between them reveal how data moves through the system.
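The core of that chain (MatMul, SoftMax, MatMul) can be sketched for a single head in plain Python. This is an illustration only, not code from the Intel presentation: the function names and toy inputs are made up here, the division by the square root of the head dimension follows the standard scaled dot-product formulation rather than anything stated in the slides, and the Linear projections, Concat, and final Linear are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise SoftMax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(q, k, v):
    """One attention head: MatMul -> SoftMax -> MatMul."""
    d_k = len(q[0])
    scores = matmul(q, transpose(k))                         # first MatMul: similarity scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = softmax_rows(scaled)                           # SoftMax: normalize each row
    return weights, matmul(weights, v)                       # second MatMul: apply the weights

# Toy inputs (arbitrary values): 2 tokens, head dimension 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
weights, output = attention_head(q, k, v)
```

The two intermediate matrices, `scores` and `weights`, are the activations whose placement matters in the sections that follow.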

Fig. 2: Intel’s MHA computation graph, showing the full operator chain: three Linear projections, two MatMuls, SoftMax, Concat, and a final Linear.

Following the data

In the execution flow described in the presentation, the first matrix multiplication runs on the MAC arrays, producing intermediate activations within that subsystem. The next step, SoftMax, cannot be executed on the same hardware, so the data must move out of the MAC array domain, traverse the shared memory hierarchy, and arrive at the SHAVE DSPs for processing.

Once SoftMax completes, the resulting data must again move through shared memory, this time returning to the MAC arrays to perform the second matrix multiplication. The pattern that emerges is a repeated transfer of activations between compute domains, with shared memory acting as the intermediary.

This sequence, from MAC arrays to DSPs and back again, occurs within a single attention block. When repeated across multiple heads and layers in a transformer model, it becomes a recurring aspect of execution rather than an isolated event.
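A minimal sketch of how those crossings add up is given here, assuming each head's MatMul, SoftMax, MatMul chain is scheduled separately. The engine labels follow the flow described above; the helper names and the 12-head, 12-layer model size are hypothetical. The count is of logical domain crossings, not DMA transactions, since a real scheduler may batch several heads into one transfer.

```python
# Core attention chain and the engine type each operator runs on,
# as described in the execution flow above.
CORE_CHAIN = [("MatMul_1", "MAC"), ("SoftMax", "DSP"), ("MatMul_2", "MAC")]

def count_crossings(chain):
    """Count adjacent operator pairs that run on different engine types."""
    return sum(1 for (_, a), (_, b) in zip(chain, chain[1:]) if a != b)

def crossings_per_inference(chain, heads, layers):
    """Scale the per-head count across all heads and layers (assumes no batching)."""
    return count_crossings(chain) * heads * layers

per_head = count_crossings(CORE_CHAIN)
# Hypothetical model size: 12 heads per layer, 12 layers.
total = crossings_per_inference(CORE_CHAIN, heads=12, layers=12)
print(per_head, total)
```

With these assumed sizes, the two crossings per head become 288 logical crossings for one pass through the model.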

Fig. 3: Step 1: The first MatMul (word vector similarity dot-product) executes on the MAC Arrays. Intermediate activations are produced inside the MAC Array subsystem.

System-level implications

The effects of this data movement extend beyond the transfers themselves. Each exchange of intermediate data consumes memory bandwidth, as values must be written out and read back in. This activity places sustained demand on the memory subsystem, even though it does not directly contribute to new computation.
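A back-of-the-envelope estimate makes the bandwidth cost concrete. The sketch below assumes the full score matrix for every head is written to shared memory and read back at each of the two crossings, with 2-byte (FP16) elements and no tiling or compression. The function name and the sequence length and head count are hypothetical, and none of the figures come from the Intel presentation.

```python
def attention_round_trip_bytes(seq_len, heads, bytes_per_element=2):
    """Estimate shared-memory traffic for one layer's MAC -> DSP -> MAC round trip.

    Assumes a seq_len x seq_len matrix per head and that each crossing
    costs one full write plus one full read of that matrix.
    """
    matrix_bytes = heads * seq_len * seq_len * bytes_per_element
    crossings = 2               # MAC -> DSP, then DSP -> MAC
    accesses_per_crossing = 2   # one write out, one read back in
    return matrix_bytes * crossings * accesses_per_crossing

# Hypothetical workload: 1,024 tokens, 12 heads, FP16 activations.
traffic = attention_round_trip_bytes(seq_len=1024, heads=12)
print(traffic / 2**20, "MiB per layer")
```

Under these assumptions the round trip moves 96 MiB per layer, and the figure grows with the square of the sequence length.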

At the same time, coordination between engines introduces additional overhead. Control logic and DMA mechanisms are required to manage data transfers and ensure that operations occur in the correct sequence. These elements occupy silicon area and consume power as part of the system’s normal operation.

Execution timing is also affected. Because each stage depends on the completion of the previous one and the availability of data in shared memory, transitions between engines can introduce idle periods. As models grow deeper and incorporate more complex attention variants, these effects accumulate, making data movement an increasingly significant component of overall execution.
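The timing effect can be illustrated with a deliberately simple model: stages run strictly one after another, and every change of engine type adds a fixed transfer delay. The stage durations and the transfer time below are arbitrary units chosen for illustration, not measurements, and the model ignores the tiling and overlap a real scheduler may use to hide part of this cost.

```python
def serialized_timeline(stages, transfer_time):
    """Return (total_time, busy_time_per_engine) for strictly serialized stages.

    stages: list of (engine, compute_time) in execution order.
    transfer_time: delay added whenever consecutive stages use different engines.
    """
    total = 0.0
    busy = {}
    prev_engine = None
    for engine, compute_time in stages:
        if prev_engine is not None and engine != prev_engine:
            total += transfer_time          # data crosses through shared memory
        total += compute_time
        busy[engine] = busy.get(engine, 0.0) + compute_time
        prev_engine = engine
    return total, busy

# Hypothetical durations in arbitrary time units.
stages = [("MAC", 4.0), ("DSP", 2.0), ("MAC", 4.0)]
total, busy = serialized_timeline(stages, transfer_time=1.0)
mac_utilization = busy["MAC"] / total
print(total, round(mac_utilization, 3))
```

In this toy case the MAC arrays are busy for 8 of 12 time units; the remainder is spent waiting on the DSP stage and on the two transfers.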

Fig. 4: Step 2: SoftMax (similarity normalization) executes on the SHAVE DSPs. Intermediate activations have crossed from the MAC Array subsystem into DSP territory via shared memory.

An alternative approach

A different architectural approach is to use a unified, programmable compute fabric capable of executing multiple operator types within the same subsystem. In this model, matrix operations, normalization functions, and other elements of the workload are handled without requiring data to move between distinct engine types.

By keeping intermediate activations within a single compute domain, this approach reduces reliance on shared memory for inter-stage communication. It also changes the role of control logic and data movement, shifting more of the system’s activity toward computation rather than coordination. As model architectures evolve, programmability allows new operator patterns to be supported without introducing additional execution boundaries.

Architectural tradeoffs

The contrast between these approaches reflects a broader architectural tradeoff. Heterogeneous designs rely on specialized engines and explicit data movement to handle different parts of a workload, while unified designs emphasize flexibility within a single programmable fabric.

Each approach has implications for how data flows through the system, how resources are allocated, and how performance scales with model complexity. The differences become most visible in workloads like transformer inference, where multiple operator types are tightly coupled within a single execution graph.

Fig. 5: Step 3: The second MatMul (attention score calculation) executes on the MAC Arrays. Activations have completed a full round-trip: MAC Arrays → SHAVE DSPs → MAC Arrays.

Conclusion

The execution flow of transformer models provides a clear lens through which to examine NPU architecture. In heterogeneous designs, movement of intermediate data between compute engines is a recurring and necessary part of operation, with measurable effects on power, latency, and system utilization.

For SoC designers, the key consideration is not simply how individual operators are accelerated, but how the system handles the data between them. Architectural choices made at that level shape the overall efficiency of the processor as models continue to evolve.

Source: Intel Tech Tour 2024 presentation, “Transformer Architecture on Intel’s NPU / Multi-Head Attention Flowchart.” All diagrams reproduced from public presentation for editorial commentary purposes.


