(Vision) Transformers: Rise Of The Chimera

The problem with over-optimizing hardware accelerators for the current state-of-the-art AI models.


It’s 2023 and transformers are having a moment. No, I’m not talking about the latest installment of the Transformers movie franchise, “Transformers: Rise of the Beasts”; I’m talking about the deep learning model architecture class, transformers, that is fueling anticipation, excitement, fear, and investment in AI.

Transformers are not so new in the world of AI anymore; they were first introduced by the team at Google Brain in 2017 in their paper, “Attention is All You Need”. Since their introduction, transformers have inspired a flurry of investment and research that has produced some of the most impactful model architectures and AI products to date, including ChatGPT, whose name stands for Chat Generative Pre-trained Transformer.

These products, and the transformers they’re built with, solve Natural Language Processing (NLP) problems, i.e., they consume “language” inputs in the form of text prompts and produce “language” outputs in the form of strings of words and punctuation that are (hopefully) human-readable. In order to produce meaningful outputs, these transformers are trained on truly mind-boggling amounts of textual data on a scale that few companies can afford to implement. The size and scale of the transformers used in these NLP problem domains have contributed to the new moniker for this class of models: Large Language Models (LLMs).

Equally exciting is the adaptation of these transformer architectures to Computer Vision (CV) applications. A new class of models, broadly referred to as Vision Transformers (ViT), was empirically shown to be a viable alternative to more traditional Convolutional Neural Networks (CNN) in the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, also published by the team at Google Brain in 2021.

Unlike the NLP problem space, CV models like Vision Transformers are of a size and scale that are approachable for SoC designers targeting the high-performance, edge AI market. There’s just one problem: Vision Transformers are not CNNs and many of the assumptions made by the designers of first-generation Neural Processing Unit (NPU) and AI hardware accelerators found in today’s SoCs do not translate well to this new class of models.

In this article, we’ll explain:

  1. What makes Vision Transformers (ViT) so special in comparison to their CNN counterparts,
  2. why these unique architectural features of ViTs are “breaking” almost all NPU and AI hardware accelerators targeting the high-performance edge market, and
  3. how Quadric’s Chimera GPNPU architecture is able to run ViTs today with real-time throughput at ~1.2W.

Lastly, we’ll make and defend a simple prediction: The System-on-Chip (SoC) for AI applications that most easily adapts to new model architectures, like Vision Transformers (ViT), will win in the market long-term.

If you’re a SoC designer looking to enable state-of-the-art ViT models for your developers today and whichever model architectures become state-of-the-art tomorrow, this article is for you.

What makes Vision Transformers so special?

ViTs garnered a lot of hype because the team at Google Brain proved that they were viable alternatives to CNNs. To understand what makes ViTs so special, let’s compare them to their CNN counterparts.

CNNs, as their name suggests, are built using convolutional filters. Let’s briefly refresh ourselves on what convolutional filters actually do.

In figure 1 below, we have a 6×6 input matrix on the left, a 3×3 convolutional filter in the middle, and a 4×4 output tensor on the right. Each value in the output tensor is calculated by multiplying one 3×3 section of the input matrix element-wise with the 3×3 convolutional filter and summing the nine products. Sliding the filter across every 3×3 section of the 6×6 input yields the 4×4 output.

This particular convolutional filter, with positive 1 values in its left column, 0 values in its middle column, and negative 1 values in its right column, produces positive output values where vertical edges are found in the original matrix, i.e. in the middle of the example input matrix.

Fig. 1: Convolutional filter for detecting vertical edges.
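To make the arithmetic concrete, here is a minimal sketch of that sliding-window calculation in plain Python. The input values (a bright left half and a dark right half) are a hypothetical example chosen for illustration, not necessarily the values shown in figure 1, and `conv2d_valid` is a name made up for this sketch.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output grid."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            # Multiply the k x k window element-wise with the filter, then sum.
            total = sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        output.append(out_row)
    return output


# Hypothetical 6x6 input: bright (10) left half, dark (0) right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# Vertical edge filter: +1 in the left column, 0 in the middle, -1 on the right.
kernel = [[1, 0, -1] for _ in range(3)]

edges = conv2d_valid(image, kernel)
for row in edges:
    print(row)  # each row is [0, 30, 30, 0]
```

The non-zero values land in the middle two columns of the 4×4 output, which is where the bright-to-dark vertical edge sits in the input.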

The important thing to note from the above example is that convolutional filters, like this vertical edge detection filter, learn local features within an image. CNNs have many of these filters and each filter learns what values will extract the most meaningful information from the input image, but each filter only considers information in a localized window of the input image, e.g., a 3×3 crop of the image.

By stacking layers of these convolutional filters on top of one another, i.e., creating deep neural networks (DNN), these local filters gradually come to respond to more abstract patterns that exist in larger sections of the image, because each one consumes as inputs the outputs of a collection of adjacent local filters in the layer before it. We can see this progression of learned abstractions (from edges, to textures, to patterns, to object parts, to full objects) by inspecting the intermediate layers of DNNs at different depths as depicted below in figure 2.

Fig. 2: Progression of learned abstractions by visualizing features of a pre-trained DNN at increasingly deep layers. Original image available in this blog post by Google Research team.
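A quick way to see how slowly that window grows is to compute how much of the input a single output value can "see" after several stacked layers. The sketch below assumes 3×3 filters with stride 1 and no pooling or dilation, which is a simplification of real CNNs; `receptive_field` is a helper name invented here.

```python
def receptive_field(num_layers, kernel_size=3):
    """Width of the input window seen by one output value after stacking
    `num_layers` stride-1 convolutional layers."""
    field = 1
    for _ in range(num_layers):
        # Each stride-1 layer widens the window by (kernel_size - 1) pixels.
        field += kernel_size - 1
    return field


for depth in (1, 2, 5, 10):
    print(depth, receptive_field(depth))  # 3, 5, 11, 21
```

Under these assumptions, even ten stacked layers only cover a 21×21 window of the input, which is why CNNs need depth (or pooling and strides) before any single value reflects the whole image.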

Vision Transformers are a revolutionary idea in the world of Computer Vision (CV) because they employ global attention at each layer.

Attention, as introduced in the paper “Attention is All You Need”, is a dense mapping of weights between different elements in a sequence. These weights represent the relative importance of each element in the sequence to all other elements in the sequence.

To more intuitively understand attention, let’s look at the sentence below:

“I poured water from the bottle into the cup until it was full.”

As seasoned communicators, we might infer from context that the “it” pronoun is referring to the “cup” noun in this sentence because of the adjective “full”; however, by changing the word “full” to “empty”, the reference object for “it” changes from “cup” to “bottle”. We change this inference without much thought because of the innate knowledge we have about how the verb “pour” works, i.e., the act of pouring implies that the bottle is losing water and the cup is gaining water. This example demonstrates the relative importance of the word “full” to the context of the word “it” in this sentence.

Notice that in this example we did not consider groups of three words at a time, but instead considered the entire sentence at the same time. Conceptually, this is what it means to have global attention, and it can be very useful in comparison to local attention in inferring context within a problem space.
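For readers who prefer code to prose, here is a toy sketch of how such a map of weights can be computed. It follows the scaled dot-product form from “Attention is All You Need”, but leaves out the learned query, key, and value projections, and the three two-dimensional word vectors are made up purely for illustration. Unlike the counting later in this article, the sketch also keeps each element's weight to itself, so the map is N×N.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors):
    """Return an N x N map: row i holds how strongly element i attends
    to every element in the sequence."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    return weights


# Hypothetical 2-D embeddings; a real model learns these during training.
tokens = {"cup": [1.0, 0.2], "bottle": [0.1, 1.0], "full": [0.9, 0.3]}
attn = attention_weights(list(tokens.values()))

for name, row in zip(tokens, attn):
    print(name, [round(w, 2) for w in row])
```

With these made-up vectors, the row for "full" puts more weight on "cup" than on "bottle", and every row is computed against the entire sequence at once rather than a local window.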

There’s just one significant problem with the concept of global attention employed by transformers: it’s a dense mapping and dense mappings scale quadratically.

In the above sentence, there are 13 words, i.e. N=13 elements in the sequence. To achieve global attention on this sequence, we need W=N*(N-1) or W=13*12=156 weights to represent the relative importance of each element to each other element (excluding ground truth class labels and patch delineators).

Fig. 3: Visualization of an attention map.

This operation is expensive, but feasible for this two-dimensional data. Unfortunately, global attention becomes untenable when we try to adapt it to higher-dimensional data like the RGB images used in CV applications, i.e., when there are N=224×224=50,176 pixels in an image and we need W=50,176*50,175=2,517,580,800 weights for global attention.
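The two counts above come from the same formula, which a few lines of Python can confirm. `pairwise_weights` is simply a name chosen for this sketch.

```python
def pairwise_weights(num_elements):
    """Weights needed to relate every element to every other element."""
    return num_elements * (num_elements - 1)


sentence_weights = pairwise_weights(13)       # the 13-word sentence
pixel_weights = pairwise_weights(224 * 224)   # one element per pixel

print(sentence_weights)  # 156
print(pixel_weights)     # 2517580800
```

Growing the sequence from 13 elements to 50,176 elements multiplies the sequence length by roughly 3,860, but multiplies the number of weights by roughly 16 million.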

To solve this problem, ViTs preprocess the three-dimensional image data into a two-dimensional representation. They accomplish this by:

  1. Splitting the inputs into patches,
  2. creating a linear projection or two-dimensional “embedding” of each patch, and
  3. linearly combining a constant positional vector with the patch embedding to retain the patch’s position within the original image.

Fig. 4: Depiction of image preprocessing required for Vision Transformers (ViT). Original image pulled from this paper.
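The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the reference implementation: the 16×16 patch size is taken from the paper title, while the all-zero image, the tiny 4-value embedding, the constant projection weights, and the positional vectors are placeholders standing in for real pixels and learned parameters.

```python
def split_into_patches(image, patch_size):
    """Step 1: cut an H x W x C image (nested lists) into flattened,
    non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ])
    return patches


def project(patch, weights):
    """Step 2: linear projection, one dot product per embedding value."""
    return [sum(p * w for p, w in zip(patch, column)) for column in weights]


# Placeholder 224x224 RGB image filled with zeros.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
patches = split_into_patches(image, patch_size=16)

num_patches = len(patches)          # 14 * 14 = 196
values_per_patch = len(patches[0])  # 16 * 16 * 3 = 768

EMBED_DIM = 4  # hypothetical and deliberately tiny
projection = [[0.01] * values_per_patch for _ in range(EMBED_DIM)]
positions = [[float(i)] * EMBED_DIM for i in range(num_patches)]

# Step 3: add each patch's positional vector to its embedding.
embedded = [
    [e + p for e, p in zip(project(patch, projection), pos)]
    for patch, pos in zip(patches, positions)
]

weights_needed = num_patches * (num_patches - 1)
print(num_patches, values_per_patch, weights_needed)  # 196 768 38220
```

Treating each 16×16 patch as one element shrinks the sequence from 50,176 pixels to 196 patches, so the same counting formula used earlier now gives 38,220 weights instead of roughly 2.5 billion.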

To put it more simply, in traditional NLP transformers, the input sequences of data are sentences composed of words. Analogously in Vision Transformers, each image is a “sentence” and each patch embedding is a “word”.

At face value, these concepts do not seem to be so revolutionary. Dense or fully-connected layers were implemented as a part of Multilayer Perceptrons (MLP), the earliest proof-of-concept of neural networks. Similarly, the preprocessing needed for image vectorization is, fundamentally, just a form of mathematical embedding learned by a neural network.

Why then should ViT be challenging to run on my SoC’s NPU or AI accelerator?

Why is it so hard to get ViT to run on my AI accelerator?

To understand why it’s so hard to run ViT on most AI accelerators, we need to understand:

  1. The sequence and type of operators that make up a transformer encoder, and
  2. the architectural assumptions made by NPU and AI accelerator designers.

Transformer Encoder vs. CNN

Earlier in figure 4, we looked at an image focusing on the pre-processing needed to adapt three-dimensional image data to work with a transformer architecture. Below in figure 5, we zoom out to see what happens after the image data is preprocessed:

Fig. 5: Entire Vision Transformer (ViT) architecture. Original image pulled from this paper.

Specifically, we want to look at the Transformer Encoder block on the right side of the image above. These encoder blocks are stacked L times for different sizes of ViT models, just like how ResNet-18 and ResNet-50 models are the same architecture with different numbers of stacked residual blocks.

The key difference to note between the ViT Transformer Encoder block and most CNN blocks is that it has Normalization (represented as Norm layers in figure 5) and Softmax layers (applied to the attention scores inside the Multi-Head Attention block in figure 5) in the middle of the network. In almost all CNN architectures, input normalization is performed once at the beginning of inference and Softmax is performed once at the end of the network.

Normalization and softmax are simple enough mathematical operations, but in the context of DNNs they operate on large tensors. The challenge they pose to many AI SoCs targeting the edge is that they cannot be accelerated by linear algebra accelerators and, in heterogeneous compute platforms, need to be processed by a DSP, GPU, or CPU.
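To see why these operators fall outside a multiply-accumulate engine, it helps to write them out. The sketch below is a plain-Python illustration operating on a single vector; it omits the learned scale and shift parameters that a real normalization layer typically carries.

```python
import math


def softmax(values):
    """Exponentiate, then divide by the sum: one exp() and one division
    per element, plus a reduction over the whole vector."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation: two
    reductions, a square root, and a division per element."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]


scores = [2.0, 1.0, 0.1]
print([round(p, 3) for p in softmax(scores)])
print([round(n, 3) for n in layer_norm(scores)])
```

Neither function is a chain of multiply-accumulates: both need reductions across the vector followed by exponentials, square roots, or divisions, which is the kind of work that gets pushed to a DSP, GPU, or CPU.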

Architectural assumptions made by NPU designers

Heterogeneous compute nodes are computing devices with different architectures optimized for specific tasks, e.g., an AI SoC might include a CPU, a DSP, and a Neural Processing Unit (NPU), like the design on the left in figure 6 below.

Fig. 6: A heterogeneous AI SoC design with a dedicated NPU, DSP and CPU (left) compared with a homogeneous SoC design with a single, Chimera general-purpose NPU (GPNPU) processor core (right). Original image pulled from Quadric website.

Heterogeneous computing, as a design principle for AI, requires that programs be segmented into their component tasks and that each task target the compute node best suited to run it. If a task is programmed or compiled to target an inefficient compute node, e.g., the CPU instead of the AI accelerator, the runtime performance of the program can suffer greatly.

Heterogeneous computing platforms, and the NPU cores used within them, have been optimized for performance on most CNNs. Since most CNNs do not have any softmax or normalization operators in the middle of the network, most NPUs have been designed to optimize for only the convolutional compute which is just basic linear algebra.

NPUs have optimized for the multiply-accumulate (MAC) operations that constitute linear algebra math with great success. Heterogeneous computing platforms that use these NPUs have excelled at running CNNs because there is little, if any, data movement between compute nodes during inference. The entire inference program can be easily pipelined into three stages:

  1. Input data is color converted, reshaped, formatted, and normalized by a GPU or DSP,
  2. formatted data is off-loaded to the NPU for the multiply-accumulate (MAC), linear algebra operations like convolutions and fully-connected layers, and
  3. convolutional outputs are off-loaded to the GPU or DSP for softmax activation.

Heterogeneous computing platforms can hide most of the expensive memory-movement operations in these types of programs by pipelining the compute. Latency, or the time it takes to run the first inference, may be long, but throughput, the number of inferences completed per unit of time, is limited only by the slowest stage in this pipeline.
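A small calculation shows the difference between the two metrics. The stage times below are hypothetical numbers picked for illustration, and the model is idealized: it assumes the stages overlap perfectly and ignores the cost of moving data between them.

```python
def pipeline_timing(stage_ms, num_frames):
    """Return (first-frame latency, total time) in milliseconds for an
    idealized, fully overlapped pipeline."""
    latency = sum(stage_ms)      # the first frame passes through every stage
    bottleneck = max(stage_ms)   # afterwards, one frame finishes per bottleneck period
    total = latency + (num_frames - 1) * bottleneck
    return latency, total


# Hypothetical stage times: pre-processing, NPU compute, softmax post-processing.
stages = [4.0, 10.0, 1.0]
latency, total = pipeline_timing(stages, num_frames=100)
frames_per_second = 1000.0 / max(stages)

print(latency)            # 15.0
print(total)              # 1005.0
print(frames_per_second)  # 100.0
```

In this example the first result takes 15 ms, but the steady-state rate is set entirely by the 10 ms NPU stage; speeding up the other two stages would not improve throughput at all.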

This runtime strategy, when applied to ViT architectures, creates a pipeline that requires frequent data movement between the different compute nodes:

  1. Input data is color converted, reshaped, formatted, and normalized by a GPU or DSP,
  2. formatted data is off-loaded to the NPU for the linear projection of image patches into two dimensions,
  3. image patches are sent back to the GPU or DSP for normalization,
  4. normalized patches are sent back to the NPU for the matrix multiplications that produce the attention scores,
  5. back to the GPU or DSP for the Softmax that turns those scores into attention weights,
  6. back to the NPU to apply the attention weights,
  7. back to the GPU or DSP for normalization,
  8. back to the NPU for the MLP layer, and
  9. repeat steps 3-8 for L stacked transformer encoder blocks. (In the original ViT paper, the “Base” model has L=12, the “Large” has L=24, and the “Huge” has L=32.)
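To get a feel for how much more often data changes hands, we can count the transitions between compute nodes in each schedule. The sketch below is a rough model, not a measurement: it assumes the operator order of the original ViT encoder block (Norm, attention-score matrix multiplications, Softmax, applying the attention weights, Norm, MLP), assumes every non-MAC operator runs on a DSP, uses 12 encoder blocks as in the Base model from the ViT paper, and ignores the MLP's activation function and the final classification head.

```python
def count_handoffs(devices):
    """Count how often consecutive operators run on different compute nodes."""
    return sum(1 for prev, cur in zip(devices, devices[1:]) if prev != cur)


def cnn_schedule():
    # Pre-processing, all convolutional and dense layers, final softmax.
    return ["DSP", "NPU", "DSP"]


def vit_schedule(num_blocks):
    schedule = ["DSP", "NPU"]  # pre-processing, then patch projection
    encoder_block = [
        "DSP",  # Norm
        "NPU",  # attention-score matrix multiplications
        "DSP",  # Softmax over the scores
        "NPU",  # apply the attention weights
        "DSP",  # Norm
        "NPU",  # MLP
    ]
    for _ in range(num_blocks):
        schedule.extend(encoder_block)
    return schedule


cnn_handoffs = count_handoffs(cnn_schedule())
vit_handoffs = count_handoffs(vit_schedule(num_blocks=12))

print(cnn_handoffs)  # 2
print(vit_handoffs)  # 73
```

Under these assumptions, a CNN hands tensors between compute nodes twice per inference, while a 12-block ViT does so 73 times, and each of those handoffs moves a full intermediate tensor through shared memory.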

This frequent movement of intermediate tensors between different compute nodes results in complex scheduling algorithms and significant overhead. This overhead of moving data between compute nodes substantially reduces the runtime efficiency of a model and burns excessive power. In an AI SoC targeting power-sensitive edge applications, those extra memory-movement operations may render the system unviable.

Optimizing AI SoCs for performance on CNNs has encouraged a lack of curiosity about how to accelerate inference more broadly. Heterogeneous computing platforms are using existing hardware IP and optimizing it for performance on AI tasks using complex software tricks. The only new hardware block that has been invented to address AI applications is the NPU, and it was assumed that the only operations it would need to accelerate were the MAC operations that make up the convolutional and dense layers in the middle of CNNs. The pervasiveness of this mindset can be seen in some NPU developers reporting model complexity as a number of MAC operations. If MAC counts alone were indicative of a model’s complexity, ViTs would not be so challenging to run on AI SoCs that are optimized with these assumptions.


After reading this article, hopefully you’ve come to appreciate three things:

  1. ViTs are a clever adaptation of the popular transformer architecture that works for Computer Vision applications,
  2. ViTs employ a unique permutation of common ML operators in comparison to the previously most popular CV architectures, Convolutional Neural Networks (CNN), and
  3. ViTs are problematic for heterogeneous AI SoCs with NPUs that were designed to only accelerate multiply-accumulate (MAC) operations.

One piece of context we have yet to add is that the original ViT is already outdated. The hard truth is that the original ViT introduced to the world in 2021 will be remembered the way AlexNet is remembered for CNNs: as the proof-of-concept. AlexNet got everyone excited about CNNs’ potential and it was quickly improved upon.

Fig. 7: Timeline of Convolutional Neural Networks (CNN). Original image pulled from this article.

Similarly, there are already numerous variants of ViT that have improved upon the original architecture in the two short years since transformers were proven to be viable for CV applications.

Fig. 8: Timeline of Vision Transformers (ViT). Original image pulled from this article.

One of the greatest challenges facing hardware designers today is how to make your AI SoC future-proof; however, no one can predict what artificial neural network architectures will become the most popular among developers in the future.

To ensure that your AI hardware remains relevant, you need hardware that is generic to the compute problem-space of AI and not over-optimized for the current state-of-the-art solutions, i.e. Convolutional Neural Networks (CNN). If you’re worried about your NPU being rendered extinct by the next wave of DNN architectures, ask your in-house NPU team or third-party provider:

  • Can you run Vision Transformers (ViT) or similar model architectures with non-MAC compute layers like normalization, softmax, and patch creation interleaved with convolutional and dense layers?
  • Are your tensor transformation operators, like resize, transpose, etc., easily programmable and easily parallelized?
  • If tensor transformation operations are found in the middle of a model, like those proposed in Swin transformers, does that significantly hurt compute efficiency?
  • Can you easily run models that are quantized asymmetrically to maximize the effective range of your lower-precision datatypes?
  • Can developers easily program these algorithms for our NPU or system in a user-friendly language like C++? Or must they write machine assembly or use specific intrinsics to optimize for our hardware?
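The asymmetric quantization question above is easier to evaluate with an example. The sketch below shows one common way to derive a scale and zero point for int8, next to a symmetric scale for comparison. The value range of 0.0 to 6.0 and the function names are hypothetical, and real toolchains differ in details such as rounding and per-channel handling.

```python
def asymmetric_params(min_val, max_val, qmin=-128, qmax=127):
    """Pick a scale and zero point that map [min_val, max_val] onto [qmin, qmax]."""
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = round(qmin - min_val / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to an integer code, clamped to the int8 range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer code back to an approximate real value."""
    return (q - zero_point) * scale


# Hypothetical tensor whose values all fall between 0.0 and 6.0.
scale, zero_point = asymmetric_params(0.0, 6.0)

# Symmetric quantization centers the range on zero, so half of the
# int8 codes would go unused for this all-positive tensor.
symmetric_scale = 6.0 / 127

print(quantize(0.0, scale, zero_point))  # -128
print(quantize(6.0, scale, zero_point))  # 127
print(scale < symmetric_scale)           # True
```

For this all-positive range, the asymmetric mapping spreads the values over all 256 integer codes, giving a step size about half that of the symmetric mapping. The price is a zero-point offset that the hardware has to account for in its arithmetic, which is why it is worth asking whether your NPU handles it efficiently.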

Quadric is solving this problem by defining a new, hybrid architecture capable of running scalar, vector, and matrix instructions. Our Chimera General-Purpose NPU (GPNPU) processors are designed to be a single-processor solution for all AI/ML compute. They can handle the image pre-processing, inference, and post-processing all in the same core. Because all compute is handled in a single core with a shared memory hierarchy, no data movement is needed between compute nodes for different types of ML operators.

Always having intermediate tensor data in local memory gives our Chimera Graph Compiler (CGC) an enormous amount of flexibility for operator fusion, which further reduces memory movement overhead and improves program efficiency. In short, GPNPUs deliver the matrix-optimized performance you expect from a CNN-optimized compute engine and the ability to compute non-MAC operations in a single processor architecture. Further, these hardware capabilities are easily accessible to software developers via C++ libraries.

This design approach has enabled us to run an int8 quantized version of ViT-base-patch16-224 in real time at ~1.2W.

If you’re a hardware designer looking to enable ViTs for your developers today and the DL architectures of the future tomorrow, consider signing up for a Quadric DevStudio account to learn more about the Chimera GPNPU processor IP from Quadric.
