Fallback Fails Spectacularly

Shifting inference workloads off the NPU causes more than just a little slowdown.


Conventional AI/ML inference silicon designs pair a dedicated, hardwired matrix engine – typically called an “NPU” – with a legacy programmable processor: a CPU, DSP, or GPU.

The common theory behind these two-core (or even three-core) architectures is that most of the matrix-heavy machine learning workload runs on the dedicated accelerator for maximum efficiency, while the programmable core serves as the backup engine – commonly known as the fallback core – for running new ML operators as the state of the art evolves.

In previous blogs, we’ve highlighted the conceptual failings of fallback.

ConvNext – a specific, current-day example

One of the more recent ML networks is ConvNext, boasting top-1 accuracy as high as 87.8% – nearly 10 full percentage points higher than the ResNet architectures of just a few years ago. Two key layers in the ConvNext family of networks are typically not present in fixed-function NPUs: the LayerNorm normalization function and the GeLU activation, used in place of the previously more common ReLU activation. In the conventional architectures promoted by other IP vendors, the GeLU and LayerNorm operations would run on the fallback DSP, not the accelerator. The results – shown in the calculations in the illustration below – are shocking.

ConvNext performance – fallback is unusable.
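The two operators that trigger fallback are simple to state mathematically, yet neither maps onto a fixed-function MAC array. A minimal Python sketch of both – reference math only, not any vendor’s optimized kernel – shows why: GeLU needs a transcendental (erf) per element, and LayerNorm needs a mean, a variance, and a reciprocal square root across each feature vector.

```python
import math

def gelu(x):
    # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2))).
    # The erf call is the part a pure MAC array cannot execute.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(vec, gamma=None, beta=None, eps=1e-5):
    # Normalize one feature vector to zero mean / unit variance,
    # then optionally apply learned scale (gamma) and shift (beta).
    n = len(vec)
    mean = sum(vec) / n
    var = sum((v - mean) ** 2 for v in vec) / n
    out = [(v - mean) / math.sqrt(var + eps) for v in vec]
    if gamma is not None:
        out = [g * o + b for g, o, b in zip(gamma, out, beta)]
    return out
```

On a two-core design, every tensor must round-trip between the NPU and the DSP each time one of these functions appears in the graph.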

Our analysis of ConvNext on an NPU+DSP architecture suggests a throughput of less than 1 inference per second. Note that these numbers for the fallback solution assume perfect 100% utilization of every available ALU in an extremely wide 1024-bit VLIW DSP. Reality would undoubtedly fall below that speed-of-light 100% mark, and the FPS would suffer even more. In short, fallback is unusable. Far from being a futureproof safety net, the DSP or CPU fallback mechanism is a death trap for your SoC.

Is ConvNext an outlier?

A skeptic might be thinking: “Yeah, but Quadric probably picked an extreme outlier for that illustration. I’m sure there are other networks where fallback works perfectly OK!” In fact, ConvNext is not an outlier – it has only two layer types that require fallback.

An extreme outlier would be the Swin (shifted-window) Transformer, which, when executed on an AI hardware accelerator designed for older CNN architectures, could see 77% of its workload running on the fallback processor. At that rate, why bother having an NPU taking up area on the die at all? Just run the entire network – very, very slowly – on a bank of DSPs. Or throw away your chip design and start over – if you can convince management to give you the $250M budget for a new tapeout in an advanced node!

Programmable, General-Purpose NPUs (GPNPU)

There is a new architectural alternative to fallback: the Chimera GPNPU from Quadric. By merging the best attributes of dataflow array accelerators with full programmability (see last month’s blog for details), Quadric delivers high ML performance while remaining fully programmable. The basic Chimera architecture pairs a block of multiply-accumulate (MAC) units with a full-function 32-bit ALU and a small local memory. This MAC-ALU-memory building block – called a processing element (PE) – is then tiled into a matrix of 64, 256, or 1024 PEs to form a family of GPNPUs offering from 1 TOPS to 16 TOPS each, with multicore configurations reaching hundreds of TOPS. Part systolic array and part DSP, the Chimera GPNPU delivers what fallback fails to deliver: the promise of efficiency plus complete futureproofing. See more at quadric.io.
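The 1-to-16 TOPS range quoted above follows directly from tiling. A back-of-envelope sketch makes the scaling visible; the MACs-per-PE count and clock frequency here are illustrative assumptions for the arithmetic, not published Chimera specifications.

```python
def peak_tops(num_pes, macs_per_pe=8, clock_ghz=1.0):
    # Peak throughput in TOPS. Each MAC counts as 2 ops
    # (one multiply + one accumulate) per cycle, industry convention.
    # macs_per_pe and clock_ghz are assumed values for illustration.
    ops_per_cycle = num_pes * macs_per_pe * 2
    return ops_per_cycle * clock_ghz * 1e9 / 1e12

# Tiling 64 -> 256 -> 1024 PEs scales peak compute linearly:
for pes in (64, 256, 1024):
    print(f"{pes:5d} PEs -> {peak_tops(pes):.2f} TOPS")
```

Under these assumed parameters, the three tile counts land at roughly 1, 4, and 16 TOPS – matching the family range described above, with multicore arrays multiplying from there.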
