Rapid innovation in transformer networks has rendered the simple offload NPU obsolete.
In the 1986 hit comedy “Crocodile Dundee,” the title character – a rough-and-tumble Australian transported to the mean streets of New York City – is confronted by street thugs brandishing a switchblade who demand his wallet. He coolly smirks, pulls a knife from his belt that is ten times the size of the would-be assailants’ weapon, and delivers the signature line of the film: “That’s not a knife. THAT’S a knife!”
Regular readers of this blog are likely familiar with the author’s penchant for obscure analogies, so you’re probably wondering: how does Crocodile Dundee relate to SoC design? Just as Mick Dundee won the knife fight by bringing a much bigger knife, many would-be AI NPU licensors tout their DSPs (or the vector extensions to their CPUs) as the magical elixir for the programmability problem of matrix accelerators – and their pitch amounts to the same logic: the bigger the DSP, the better.
Modern AI or machine learning workloads are both computationally intensive (billions of calculations) and complex (hundreds or thousands of distinct operations). A full decade ago – 2015 – the ResNet family of CNNs was first published. The workhorse ResNet-50 network was far too computationally intensive to run on the CPUs or DSPs of the day, but the highly repetitive nature of ResNets – ResNet-18, for instance, consists of the same 8 operators repeated 18 times in the “backbone” of the graph – made them ideal candidates for acceleration with fixed-function matrix offload engines attached to legacy CPUs or DSPs.
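To see just how repetitive that era of networks was, here is a toy sketch (an illustration of the structure described above, not the exact published graph) that expands a ResNet-style backbone into its flat operator list. The operator names and block structure are simplified for the example:

```python
# A simplified residual-block operator sequence (illustrative only):
# a small, fixed set of ops repeated over and over through the backbone.
BASIC_BLOCK_OPS = ["conv2d", "batch_norm", "relu",
                   "conv2d", "batch_norm", "add", "relu"]

def backbone_ops(num_blocks: int) -> list:
    """Expand the repeated block structure into a flat operator list."""
    return BASIC_BLOCK_OPS * num_blocks

ops = backbone_ops(8)        # a ResNet-18-style backbone stacks 8 basic blocks
print(len(ops))              # dozens of operator instances...
print(len(set(ops)))         # ...but only 4 distinct operator types
```

A graph this uniform is easy to carve up: hand every `conv2d` to a fixed-function MAC engine and let the host CPU or DSP mop up the rest.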
But the intervening decade of rapid innovation has rendered the simple offload NPU obsolete. Today’s leading AI training framework, PyTorch, has more than 2,300 different graph operators. Modern transformer networks can have 300 or more operator variants intertwined with one another – matrix operations followed by mathematically complex arithmetic operations (e.g., root-mean-square calculations, sigmoid calculations, and others), followed by still more matrix multiplication. The common approach to implementing these varied operations is to build efficient multiply-accumulate engines (MAC arrays) alongside more flexible, general-purpose arithmetic logic units (ALUs) that run the non-MAC functions.
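The interleaving is easy to see in a simplified transformer-style feed-forward block. The sketch below (a NumPy simplification, not any vendor’s kernel code) alternates between matrix multiplies – the work a MAC array handles – and the element-wise root-mean-square and sigmoid math that falls to the general-purpose ALUs:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # ALU-style work: square, mean, reciprocal square root.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def sigmoid(x):
    # ALU-style work: exponential and divide.
    return 1.0 / (1.0 + np.exp(-x))

def ffn_block(x, w1, w2):
    h = rms_norm(x)        # ALU: root-mean-square normalization
    h = h @ w1             # MAC array: matrix multiply
    h = h * sigmoid(h)     # ALU: sigmoid (SiLU-style activation)
    return h @ w2          # MAC array: matrix multiply

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
w1 = rng.standard_normal((64, 256))
w2 = rng.standard_normal((256, 64))
y = ffn_block(x, w1, w2)
print(y.shape)  # (4, 64)
```

Every pass through a block like this ping-pongs between the two kinds of compute, which is exactly why a MAC engine bolted onto a distant host processor spends so much time shuttling data back and forth.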
Legacy CPU and DSP vendors tout the added vector-processing width in their newest, biggest DSPs and CPU vector extensions as the perfect “future proofing” answer to the next unexpected AI algorithm breakthrough. Countless RISC-V startups highlight their novel 512-bit or 1024-bit vector extensions to the CPU instruction set. DSP vendors have pushed VLIW machine widths from 256 bits up to 512 bits and now 1024 bits, with some even supporting as many as four simultaneous VLIW bundles operating 4 x 256 = 1024 bits in parallel. Thus, these aspiring NPU licensing companies claim their 1024-bit “knives” are the weapon of choice for the AI knife fight – provided they can magically eliminate the data-transfer penalties of physically separate NPU and DSP engines, a problem their marketing conveniently ignores.
Is a 1024-bit DSP really the solution you want in your next chip as you fight the evolving AI battle? Or would you rather have a Crocodile Dundee-sized solution? Quadric’s Chimera architecture is unique: each building block places a small tile of MACs immediately adjacent to a full-fledged, programmable 32-bit ALU and local SRAM. That building block (MACs, ALU, SRAM) is tiled in a two-dimensional array. The largest of our three general-purpose NPU (GPNPU) processors – the Chimera QC Ultra – contains 1024 of these tiles, all operating in unison executing the same instruction. That’s 1024 ALUs, each 32 bits wide: 32,768 bits of fully C++-programmable parallelism, thirty-two times the datapath width of any competing AI processor offering.
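The arithmetic behind that claim is simple enough to check on the back of an envelope (a width comparison only – real-world throughput also depends on clock speed, datapath design, and memory bandwidth):

```python
# Back-of-the-envelope comparison of parallel datapath width.
TILE_COUNT = 1024      # building-block tiles in the Chimera QC Ultra
ALU_WIDTH_BITS = 32    # each tile carries a 32-bit programmable ALU
DSP_WIDTH_BITS = 1024  # the widest VLIW/vector machines cited above

chimera_bits = TILE_COUNT * ALU_WIDTH_BITS
print(chimera_bits)                    # 32768 bits of ALU parallelism
print(chimera_bits // DSP_WIDTH_BITS)  # 32x the width of a 1024-bit DSP
```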
As Mick Dundee would say, “Now, THAT’S a DSP!” Want to find out more? Visit www.quadric.io