Achieving Greater Accuracy In Real-Time Vision Processing With Transformers

A new class of neural network models is opening the door to full visual perception.


Transformers, first proposed in a Google research paper in 2017, were initially designed for natural language processing (NLP) tasks. Recently, researchers applied transformers to vision applications and got interesting results. Although vision tasks had previously been dominated by convolutional neural networks (CNNs), transformers have proven surprisingly adaptable to vision tasks like image classification and object detection. These results have earned transformers a place next to CNNs in vision work that aims to improve machines’ understanding of the world for future applications like context-aware video inference.

In 2012, a CNN called AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), an annual computer vision competition. The task was to have your machine learn to ‘classify’ images into 1,000 different categories (based on the ImageNet dataset). AlexNet achieved a top-5 error rate of 15.3%. Previous winners, based on a traditional programming model, had top-5 error rates around 26% (see figure 1). Subsequent years were dominated by CNNs. In 2016 and 2017, the winning CNNs achieved better than human accuracy and the majority of participants achieved over 95% accuracy, prompting ImageNet to roll out a new, more difficult challenge in 2018. The dominance of CNNs in ILSVRC drove a flurry of research applying CNNs to real-time vision applications. While accuracy continued to improve, efficiency also improved 10x from ResNet in 2015 to EfficientNet in 2020. Real-time vision applications require not just accuracy, but also higher performance (inferences/sec or frames per second (fps)), a smaller model size (reducing bandwidth requirements), and power and area efficiency.
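For readers unfamiliar with the metric, the top-5 error rate quoted above can be sketched as follows: a prediction counts as correct if the true label is among the five classes the model scores highest. The function name and the toy scores below are made up for illustration and are not ILSVRC data.

```python
def top5_error(scores_per_image, true_labels):
    """Fraction of images whose true label is NOT among the 5 highest-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices ordered from highest to lowest score; keep the first five.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:5]:
            misses += 1
    return misses / len(true_labels)


# Hypothetical toy data: 4 images, 8 classes each.
scores = [
    [0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.0, 0.05],  # top 5 classes: 0, 5, 4, 3, 2
    [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],  # top 5 classes: 7, 6, 5, 4, 3
    [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],  # top 5 classes: 7, 6, 5, 4, 3
    [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],   # top 5 classes: 0, 1, 2, 3, 4
]
labels = [0, 3, 0, 5]  # the last two labels fall outside their image's top 5

error = top5_error(scores, labels)  # 2 misses out of 4 images
```

With 1,000 categories, a top-5 error rate of 15.3% means the correct label was missing from the model's five best guesses for roughly one image in seven.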

Fig. 1: ILSVRC results highlight the significant improvements in accuracy for vision classification introduced by AlexNet, a convolutional neural network.

Classification is a building block for more complicated, and more useful, vision applications like object detection (finding the location of the object in the two-dimensional image), semantic segmentation (grouping/labeling every pixel in an image) and panoptic segmentation (both identifying object locations and labeling/grouping every pixel in every object).

Transformers, as first introduced in Google Brain’s 2017 paper, were designed to improve upon recurrent neural networks (RNNs) and long short-term memory (LSTM) networks for NLP tasks like translation, question answering and conversational AI. RNNs and LSTMs have been used to process sequential data (e.g. digitized language and speech), but their architectures are not easily parallelizable, and thus are typically very bandwidth-limited and difficult to train. The structure of a transformer has several advantages over RNNs and LSTMs. Unlike RNNs and LSTMs, which must read a string of text sequentially, transformers are significantly more parallelizable and can read in a complete sequence of words at once, allowing them to better learn contextual relationships between words in a text string.

A popular transformer for NLP, released in late 2018 by Google, is Bidirectional Encoder Representations from Transformers (BERT). BERT significantly improved results for a variety of NLP tasks and is popular enough to be included in MLCommons’ MLPerf neural network inference benchmark suite. In addition to high accuracy, transformers are much easier to train, making huge transformers possible. MTM, GPT-3, T5, ALBERT, RoBERTa and Switch are just some of the large transformers tackling NLP tasks. Generative Pre-trained Transformer 3 (GPT-3), introduced in 2020 by OpenAI, uses deep learning to produce human-like text and does this so accurately that it can be difficult to determine whether the text was written by a human.

Transformers like BERT can be successfully applied in other application domains with promising results for embedded use. AI models that can be trained on broad data and applied to a wide range of applications have been dubbed foundation models. Vision is one of the domains in which transformers have had surprising success.

Transformers applied to vision

Something remarkable happened in 2021. The Google Brain team applied their transformer model to image classification. There is a big difference between a sequence of words and a two-dimensional image, but the Google Brain team cut the image into small patches, flattened the pixels of each patch into a vector, and fed the resulting sequence of vectors into the transformer. The results were surprising. Without any modification to the model, the transformer beat the then state-of-the-art CNNs in classification accuracy. While accuracy isn’t the only metric for real-time vision applications (power, cost (area) and inferences/sec are also important), it was a significant result in the vision world.
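The patch step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Google Brain implementation: the function name, the nested-list image layout (height x width x channels) and the 2x2 patch size are assumptions chosen to keep the example small.

```python
def image_to_patches(image, patch):
    """Cut an H x W x C image (nested lists) into patch x patch tiles and
    flatten each tile into one vector. Patches are returned in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    vectors = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append every channel of this pixel
            vectors.append(vec)
    return vectors


# Hypothetical 4x4 image with 3 channels; each pixel stores [row, column, 0].
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
patches = image_to_patches(image, patch=2)

# 4 patches, each a vector of 2 * 2 * 3 = 12 values.
num_patches, vector_length = len(patches), len(patches[0])
```

The sequence of patch vectors plays the role that a sequence of word embeddings plays in NLP. Published vision transformers typically also pass each vector through a learned linear projection and add position information before the first transformer layer; both are omitted here.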

Fig. 2: Comparing Transformer and CNN structures.

When comparing CNNs and transformers, it’s helpful to understand their similar structures. In figure 2, a transformer’s structure consists of the boxes on the left side of the image. For comparison, we draw a similar structure for CNNs using typical CNN constructs like those found in ResNet – a 1×1 convolution with element-wise addition. We find the feed forward portion of the transformer is functionally identical to the 1×1 convolution of the CNN. Both are matrix-matrix multiplies that apply a linear transformation to every point in the feature map.
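The equivalence claimed above can be checked on a toy example. The sketch below compares a 1×1 convolution, written as explicit loops over a feature map, with a single matrix-matrix multiply applied to the same pixels treated as a list of tokens. It models only one linear layer of the feed forward block, with no bias and no activation function; the helper names and the numbers are illustrative assumptions.

```python
def conv1x1(feature_map, kernel):
    """1x1 convolution over an H x W x C_in feature map.
    kernel[o][c] is the single weight linking input channel c to output channel o."""
    height, width = len(feature_map), len(feature_map[0])
    out_ch, in_ch = len(kernel), len(kernel[0])
    out = [[[0.0] * out_ch for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for o in range(out_ch):
                for c in range(in_ch):
                    out[y][x][o] += kernel[o][c] * feature_map[y][x][c]
    return out


def matmul(a, b):
    """Plain matrix-matrix multiply of a (n x k) by b (k x m)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical 2x2 feature map with 2 input channels, mapped to 3 output channels.
feature_map = [[[1.0, 2.0], [3.0, 4.0]],
               [[5.0, 6.0], [7.0, 8.0]]]
kernel = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# CNN view: run the 1x1 convolution, then list the output pixels in row-major order.
conv_out = [pixel for row in conv1x1(feature_map, kernel) for pixel in row]

# Transformer view: every pixel is a token; one matmul applies the same weights to all.
tokens = [pixel for row in feature_map for pixel in row]
weights_t = [list(col) for col in zip(*kernel)]  # transpose to C_in x C_out
ff_out = matmul(tokens, weights_t)
```

Both paths produce the same numbers because neither mixes information between positions; each output depends only on the channels of one pixel or token.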

The difference between transformers and CNNs is in how each mixes information from neighboring pixels. This happens in the transformer’s multi-head attention and the convolutional network’s 3×3 convolution. For CNNs, the information that is mixed in is based on the fixed spatial location of each pixel, as we see in figure 3. For a 3×3 convolution, a weighted sum is calculated over nine pixels: the center pixel and its eight neighbors.
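As a minimal sketch of that weighted sum, the function below computes one output value of a 3×3 convolution on a single-channel image. The function name and both example kernels are assumptions for illustration; the point is that each weight is tied to a fixed offset from the center pixel, regardless of what the image contains.

```python
def conv3x3_at(image, kernel, y, x):
    """Weighted sum over the 3x3 neighborhood centered on (y, x).
    kernel[dy + 1][dx + 1] is the weight for the pixel at offset (dy, dx)."""
    total = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            total += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
    return total


# Hypothetical 3x3 single-channel image.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# Two illustrative kernels: a plain sum, and a Laplacian-style edge detector.
sum_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
edge_kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

summed = conv3x3_at(image, sum_kernel, 1, 1)  # 1 + 2 + ... + 9 = 45.0
edge = conv3x3_at(image, edge_kernel, 1, 1)   # 2 + 4 + 6 + 8 - 4 * 5 = 0.0
```

After training, the nine weights are the same for every image and every position, which is what "fixed spatial location" means here.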

Fig. 3: Illustrating the difference between how a CNN’s convolution and a transformer’s attention networks mix in features of other tokens/pixels.

The transformer’s attention mechanism mixes in data not just based on location but based on learned properties. Transformers – during training – can learn to pay attention to other pixels. Attention networks have greater ability to learn and express more complex relationships.
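By contrast, attention computes its mixing weights from the data itself. The sketch below is a single-head, scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the tokens are used directly as queries, keys and values, whereas a real transformer first applies learned projection matrices and runs several heads in parallel. The function names and the toy tokens are made up.

```python
import math


def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over a list of token vectors.
    Returns the mixed outputs and the weights each token gave to the others."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this token's query to every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical tokens: the first and third are identical, the second is different.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Here the first token gives more weight to the third token than to the second, even though the second is its nearest neighbor in the sequence. The weights follow content similarity rather than position, which is the contrast with the fixed 3×3 kernel of a convolution.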

Introducing vision transformers and shifted windows transformers

New transformers are emerging specifically for vision tasks. Vision Transformers (ViTs), specializing in image classification, are now beating CNNs in accuracy (although to achieve this accuracy, ViTs need to be trained with very large data sets). ViTs also require many more computations, which lowers their fps performance.

Transformers are also being applied to object detection and semantic segmentation. Swin (shifted window) Transformers provide state-of-the-art accuracy for object detection (COCO) and semantic segmentation (ADE20K). While CNNs are typically applied to still images – with no knowledge of previous or future frames – transformers can be applied across video frames. Variants of Swin can be directly applied to video for uses like action classification. Applying transformers’ attention separately on time and on space has given state-of-the-art results on the Kinetics-400 and Kinetics-600 action classification benchmarks.

MobileViT (figure 4), introduced in early 2022 by Apple, provides an interesting mix of transformers and convolutions. MobileViT combines transformer and CNN features to create a lightweight model for vision classification targeting mobile applications. This combination of transformer and convolution, when compared to the CNN-only MobileNet, has 3% higher accuracy for the same size model (6M coefficients). Although MobileViT outperforms MobileNet in accuracy, it is still slower than CNN implementations on today’s mobile phones, which support CNNs but are not optimized for transformers. To take advantage of the benefits of transformers, future AI accelerators for vision will need better transformer support.

Fig. 4: MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer.

Despite the demonstrated successes of transformers for vision tasks, it is unlikely that convolutional networks are going to go away any time soon. There are still trade-offs between the two approaches: transformers bring higher accuracy, but at much lower fps performance and with many more computations and much more data movement. Combining transformers and CNNs to avoid the weaknesses of each can produce flexible solutions that show great promise.

Implementing transformers

Although there are architectural similarities, it would be unrealistic to hope that an accelerator designed specifically for CNNs will be efficient at executing transformers. At a minimum, architectural enhancements need to be considered to handle the attention mechanism.

An example of an AI accelerator that was designed to handle both CNNs and transformers efficiently is the ARC NPX6 NPU IP from Synopsys. The NPX6’s computation units (figure 5) include a convolution accelerator which is designed to handle matrix-matrix multiplications critical to both CNNs and transformers. The tensor accelerator is also critical, as it was designed to handle all other non-convolution Tensor Operator Set Architecture (TOSA) operations including transformer operations.

Fig. 5: Synopsys ARC NPX6 NPU IP.


Transformers for vision have made rapid advancements and are here to stay. These attention-based networks outperform CNN-only networks in accuracy. Models that combine vision transformers with convolutions, like MobileViT, are more efficient at inference. This new class of neural network models is opening the door to future AI tasks like full visual perception, which requires knowledge that may not easily be acquired by vision alone. Transformers combined with CNNs are leading the way to next-generation AI. Choosing architectures that support both CNNs and transformers will be critical to SoC success for emerging AI applications.
