ConvNext Runs 28X Faster Than Fallback

Graph compiler enables rapid support for emerging AI models.


Two months ago on our blog we highlighted the fallacy of pairing a conventional NPU accelerator with a DSP or CPU for “fallback” operations (Fallback Fails Spectacularly, May 2024). In that blog we calculated the expected performance of a system whose DSP had to run the new operations found in one of today’s leading ML networks – ConvNext. The result was dismal: less than one inference per second running the ConvNext backbone if the novel operators – GELU and LayerNorm – are absent from the hardwired NPU accelerator and those ops instead run on the largest and most powerful DSP core on the market today.
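The arithmetic behind that conclusion is a classic Amdahl’s-law effect: once a few layers drop off the accelerator onto a far slower core, they dominate total latency no matter how fast the NPU runs the rest. The sketch below illustrates the effect with purely hypothetical timings (none of these numbers come from the original analysis):

```python
# Illustrative only: hypothetical per-frame timings showing why DSP fallback
# craters throughput. Only the Amdahl's-law reasoning is from the blog; the
# numbers below are made up for demonstration.

npu_time_ms = 30.0         # hypothetical: layers the hardwired NPU accelerates
fallback_time_ms = 1200.0  # hypothetical: GELU + LayerNorm layers run on the DSP

total_ms = npu_time_ms + fallback_time_ms
fps = 1000.0 / total_ms
print(f"Effective throughput: {fps:.2f} inferences/sec")  # ~0.81 - under 1 FPS
```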

On July 16, Quadric unveiled its first implementations of ConvNext networks. Compared to the DSP+NPU fallback performance, Quadric’s Chimera QC Ultra GPNPU is 28 times faster (2,800%!) than the DSP+NPU combo for the same large input image size: 1600×928. Not a toy network, but a real, useful image size. And Quadric has posted results for not just one but three different variants of ConvNext – both the Tiny and Small variants – with input image sizes of 640×640, 800×800, and 928×1600. The results are all available for inspection online inside the Quadric Developers’ Studio.

Compiled code, not hand-crafted

Why did we deliver three different variants of ConvNext all at once? And why were those three variants added to the online DevStudio environment alongside more than 50 other new network examples a mere three months (July 2024) after our previous release (April)? Is it because Quadric has an army of super-productive data scientists and low-level software coders who sat chained to their computers 24 hours a day, 7 days a week in a forced-labor camp while being fed a steady diet of candy bars and energy drinks? Or did the sudden burst of hundreds of new networks come from major new feature improvements in our Chimera Graph Compiler (CGC)? It turns out to be far more productive – and ethical – to build graph compilation technology than to violate labor laws.

This latest version of the CGC compiler fully handles all of the graph operators found in ConvNext, meaning we can take model sources directly from a model repo on the web and automatically compile them to target our Chimera GPNPU. In the case of ConvNext, that source was an OpenMMLab repo on GitHub containing a PyTorch model. Log in to DevStudio to see the links and sources yourself! Not hand-crafted. No operator swapping. No time-consuming surgical alteration of the models. Just a world-class compilation stack targeting a fully programmable GPNPU that can run any machine learning model.
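The post doesn’t show the ingest flow itself, but the starting point it describes – a stock PyTorch model pulled straight from a public repo – typically looks like the sketch below. Here torchvision’s ConvNeXt-Tiny stands in for the OpenMMLab source, exported to ONNX as a common compiler-friendly interchange format; the model source, input size, and export path are all assumptions for illustration, not Quadric’s actual pipeline:

```python
# A minimal sketch of the "take the model source directly" starting point.
# torchvision's ConvNeXt-Tiny stands in for the OpenMMLab PyTorch source;
# the 640x640 input matches one of the posted variants. How CGC actually
# ingests models is not described in this post.
import torch
from torchvision.models import convnext_tiny

model = convnext_tiny(weights="DEFAULT").eval()
dummy = torch.randn(1, 3, 640, 640)  # NCHW input, one of the posted sizes

# Export to ONNX; opset 17 carries LayerNorm as a native graph operator,
# so GELU and LayerNorm survive export intact rather than being decomposed.
torch.onnx.export(model, dummy, "convnext_tiny_640.onnx", opset_version=17)
```

The point of the example is what is absent: no operator swapping, no manual rewrites – the exported graph, novel operators and all, is what a graph compiler has to consume.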

Explosive growth in the number of supported models

Perhaps more impressive than our ability to support ConvNext – at a high 28 FPS frame rate – is the startling increase in the sheer number of working models that compile and run on the Chimera processor as our toolchain rapidly matures (see chart).

More models, faster code

Not only did we take a huge leap forward in the breadth of model support in this latest software release, we also continue to deliver incremental performance improvements on previously released models as the optimization passes inside the CGC compiler expand and improve. Some networks saw improvements of over 50% in just the past three months. We know our ConvNext numbers will rapidly improve. And the performance gains will keep flowing in the coming months and years as our compiler technology evolves and matures, because unlike a hardwired accelerator – whose feature set and performance are fixed the day you tape out – a processor can keep improving as the optimizations and algorithms in its toolchain evolve.

Programmable, general purpose NPUs (GPNPU)

By merging the best attributes of dataflow array accelerators with full programmability, Quadric delivers high ML performance on a processor that remains endlessly programmable. The basic Chimera architecture pairs a block of multiply-accumulate (MAC) units with a full-function 32-bit ALU and a small local memory. This MAC-ALU-memory building block – called a processing element (PE) – is then tiled into a matrix of PEs to form a family of GPNPUs spanning 1 TOPS to 864 TOPS. Part systolic array and part DSP, the Chimera GPNPU delivers both efficiency and complete future-proofing. See more at quadric.io.
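To make the tiling idea concrete, here is a back-of-envelope model of how replicating one PE scales peak compute across a family. The PE counts, MACs per PE, and clock rates below are hypothetical placeholders, not published Chimera specifications; only the tiled-array scaling concept is from the post:

```python
# Hypothetical sketch: peak throughput of a rows x cols tile of identical PEs.
# All parameter values are invented for illustration, not Chimera specs.

def peak_tops(rows: int, cols: int, macs_per_pe: int, clock_ghz: float) -> float:
    """Peak TOPS for a tile of PEs (counting 1 MAC as 2 ops: multiply + add)."""
    ops_per_cycle = rows * cols * macs_per_pe * 2
    return ops_per_cycle * clock_ghz * 1e9 / 1e12

# Scaling the same PE from a small tile to a large one:
print(peak_tops(8, 8, 8, 1.0))     # ~1 TOPS-class configuration
print(peak_tops(32, 32, 64, 1.3))  # ~170 TOPS from a much larger tile
```

The design point worth noting is that every PE carries its own ALU and local memory, so scaling the matrix scales general-purpose compute alongside MAC throughput rather than just widening a fixed-function array.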


