What makes vision transformers so special?
Transformers were first introduced by researchers at Google in 2017 in the paper “Attention Is All You Need”. Since then, transformers have inspired a flurry of investment and research that has produced some of the most impactful model architectures and AI products to date, including ChatGPT, whose “GPT” stands for Generative Pre-trained Transformer.
Transformers are also being employed for vision applications, where they are known as vision transformers (ViTs). This new class of models was shown empirically to be a viable alternative to more traditional Convolutional Neural Networks (CNNs) in the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, published by a team at Google Brain in 2021.
Vision transformers are of a size and scale that is approachable for SoC designers targeting the high-performance edge AI market. There’s just one problem: vision transformers are not CNNs, and many of the assumptions made by the designers of the first-generation Neural Processing Units (NPUs) and AI hardware accelerators found in today’s SoCs do not translate well to this new class of models.