Don’t Let Your ML Accelerator Vendor Tell You The ‘F-Word’

Why fallback is a dirty word.


Machine learning (ML) inference in devices is all the rage. Nearly every new system-on-chip (SoC) design start for mobile phones, tablets, smart security cameras, automotive applications, wireless systems, and more carries a requirement for a hefty amount of on-chip ML capability. That has silicon design teams scrambling for ML processing power to add to the existing menu of processing engines – CPUs, DSPs, GPUs – in their bag of design tricks.

The reason chip design teams are looking for new solutions is that ML workloads are far different from the workloads the existing building blocks were originally optimized for. CPUs are designed to run many simultaneous threads of random control code with random memory accesses. GPUs were designed to draw polygons in graphics applications. And DSPs were designed to tackle vector mathematics on 1-D and 2-D arrays of data. ML inference workloads, by contrast, are dominated by matrix computations (convolutions) on N-dimensional tensor data. The new compute challenge doesn’t fit neatly onto the old compute architectures.

The approach most design teams – and most commercial IP processor vendors – have taken to solve the new ML matrix compute problem is to force-fit the new compute workload onto the old platforms. These IP vendors analyzed existing ML benchmarks to identify the most frequently occurring computation operators in ML workloads and built offload engines (accelerators) that efficiently execute those select compute building blocks. The underlying theory of this strategy: if the 10 or 20 most common ML graph operators represent 95-98% of the computation workload, offloading those operators from the pre-existing CPU or DSP allows the fully flexible CPU or DSP to orchestrate the rest of the graph execution, including the rare or unusual operators in the ML graph. IP vendors refer to this division of labor as “Operator Fallback” – the vast majority of computation runs on the non-programmable ML accelerator, but the program “falls back” to the fully programmable CPU or DSP when needed.
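That division of labor amounts to a simple graph-partitioning pass. The sketch below is illustrative only – the `partition` function and the operator names are hypothetical, not any vendor’s actual compiler API:

```cpp
#include <set>
#include <string>
#include <vector>

// One node of an ML graph, identified only by its operator type.
struct Op { std::string type; };

enum class Engine { Accelerator, CpuFallback };

// Assign each operator to the fixed-function accelerator when it is in
// the accelerator's supported set; everything else "falls back" to the
// fully programmable (but much slower) CPU or DSP.
std::vector<Engine> partition(const std::vector<Op>& graph,
                              const std::set<std::string>& accel_ops) {
    std::vector<Engine> placement;
    placement.reserve(graph.size());
    for (const Op& op : graph)
        placement.push_back(accel_ops.count(op.type)
                                ? Engine::Accelerator
                                : Engine::CpuFallback);
    return placement;
}
```

With a typical supported set such as {“Conv2D”, “MatMul”, …}, the common layers land on the accelerator while a rare operator is exiled to the CPU – which is exactly where the trouble begins.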

The Achilles’ heel – the fatal flaw – of this approach is the assumption that Fallback is rare and not performance-critical. A closer look reveals Fallback to be a naughty word – a new F-word to be avoided at all costs. Consider the example of an SoC with a large, general-purpose applications-class CPU, a vector DSP engine tuned for vision processing, and a 4 TOP/s ML accelerator. The compute resource available in each engine is shown in the table below:

Engine                    Parallel multiply-accumulate (MAC) units
4 TOP/s ML accelerator    2048
Vector DSP                64
Applications-class CPU    16

A matrix operation running on the accelerator is fast, taking advantage of all 2048 multiply-accumulate units in the accelerator. But the same or similar operator running on the DSP is 32X slower! And on the CPU it is 128X slower. It doesn’t take an advanced machine learning mathematics degree to see that if even 5% of the total computation of a machine learning workload needs to fall back onto the CPU, that small 5% suddenly becomes the performance bottleneck of the entire inference execution. If 98% of the computation runs blazingly fast on the accelerator but the complex SoftMax final layer of the graph executes 100X or 1,000X slower on the CPU, the entire inference time is dominated by the slow CPU.
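The arithmetic behind that claim is a straightforward Amdahl’s-law calculation. A minimal sketch, using the 5% fallback share and 128X CPU penalty from the example above:

```cpp
// Total inference time relative to an ideal all-accelerator run.
// fallback_fraction: share of the computation forced onto the CPU/DSP.
// penalty: how many times slower that engine is than the accelerator.
double slowdown_vs_accelerator(double fallback_fraction, double penalty) {
    double accel_part = 1.0 - fallback_fraction;    // runs at full speed
    double cpu_part = fallback_fraction * penalty;  // runs 'penalty' times slower
    return accel_part + cpu_part;
}
```

With fallback_fraction = 0.05 and penalty = 128, the whole inference takes 0.95 + 6.4 = 7.35X as long as an accelerator-only run – the 5% that fell back now accounts for roughly 87% of total runtime.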

Fallback only gets worse with the passage of time. Machine learning is rapidly evolving. The reference models of 2022 or 2023 will surely be replaced by newer, more accurate, and more complex ML models in 2025 or 2026 – just as the silicon being designed today enters volume production. Those new ML models will likely have new operator variants or new network topologies, necessitating even more Fallback onto the flexible but slow CPU or DSP. Total performance on the multicore, heterogeneous accelerator architecture will degrade further, leaving the chip design severely underperforming or even completely unsuited for the task. The designers of those failed chips will mutter curses under their breath as they bemoan the failure of Fallback. Fallback will be their F-word, indeed.

What is the alternative?

There is an alternative to Fallback: make the accelerator itself just as programmable as the CPU or DSP. And it must be programmable in C++, so engineers can easily add new operations as ML tasks evolve.
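To make that concrete, here is what writing a new operator in plain C++ might look like – a scalar reference SoftMax, the very layer used as a fallback example above. This is an illustrative sketch, not any vendor’s actual kernel API; on a C++-programmable accelerator, a function like this could be compiled to exploit the machine’s parallel MAC array rather than crawl along on a CPU.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// A custom SoftMax operator in ordinary C++. Subtracting the maximum
// input before exponentiating keeps the math numerically stable.
std::vector<float> softmax(const std::vector<float>& x) {
    float max_v = *std::max_element(x.begin(), x.end());
    std::vector<float> out(x.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::exp(x[i] - max_v);
        sum += out[i];
    }
    for (float& v : out) v /= sum;  // normalize to probabilities
    return out;
}
```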

The Chimera general purpose NPU (GPNPU) from Quadric – available in 1 TOPS, 4 TOPS, and 16 TOPS variants – delivers the matrix-optimized performance you expect from an ML-optimized compute engine while also being fully C++ programmable by the software developer. New ML operators can be quickly written and run just as fast as the “native” operators written by Quadric engineers. With a Chimera programmable accelerator, there is no Fallback, only fast execution – no matter what new forms of operators or graphs the future brings. With Quadric, the F words are good words – Fast, Future-Proof, Fantastic!
