Trapped By Legacy

Just bolting a matrix accelerator onto existing processor IP leads to long-term challenges.


At Quadric, we do a lot of first-time introductory visits with prospective new customers. As a rapidly expanding processor IP licensing company that is starting to get noticed (even winning IP Product of the Year!), such meetings are part of the territory. That means we hear a lot of similar-sounding questions from appropriately skeptical listeners hearing our story for the very first time. The question most asked in those meetings sounds something like this:

“Chimera GPNPU sounds like the kind of breakthrough I’ve been looking for. But tell me, why is Quadric the only company building a completely new processor architecture for AI inference? It seems such an obvious benefit to tightly integrate the matrix compute with general purpose compute, instead of welding together two different engines across a bus and then partitioning the algorithms. Why don’t some of the bigger, more established IP vendors do something similar?”

The answer I always give: “They can’t, because they are trapped by their own legacies of success!”

A dozen different solutions that all look remarkably alike

The long list of competitors in the “AI accelerator” or “NPU IP” licensing market arrived at their NPU offerings from a wide variety of starting points. Five or six years ago, CPU IP providers jumped into the NPU accelerator game to try to keep their CPUs relevant, with a message of “use our trusted CPU and offload those pesky, compute-hungry matrix operations to an accelerator engine.” DSP IP providers did the same, as did configurable processor IP vendors and even GPU IP licensing companies. The playbook for those companies was remarkably similar: (1) tweak the legacy offering’s instruction set a wee bit to boost AI performance slightly, and (2) offer a matrix accelerator to handle the most common one or two dozen graph operators found in the ML benchmarks of the day: ResNet, MobileNet, VGG.

The result was a partitioned AI “subsystem” that looked remarkably similar across all the 10 or 12 leading IP company offerings: legacy core plus hardwired accelerator.

The fatal flaw in these architectures: the algorithm always has to be partitioned to run across two engines. As long as the number of “cuts” in the algorithm remained very small, these architectures worked well enough for a few years. For a ResNet benchmark, for instance, usually only one partition is required, at the very end of the inference, so ResNet can run very efficiently on this legacy architecture. But along came transformers, with a very different and much wider set of graph operators, and suddenly the “accelerator” accelerated little, if any, of the new models and overall performance became unusable. NPU accelerator offerings needed to change, and customers who already had silicon had to eat the cost, a very expensive cost, of a silicon respin.
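To make the partitioning problem concrete, here is a minimal, hypothetical Python sketch. The supported-operator set and the partition() helper are illustrative inventions, not any vendor’s actual toolchain; the point is only to show how a fixed operator list forces a graph to be cut between two engines.

# Hypothetical sketch, not any vendor's real API: a "legacy core + hardwired
# accelerator" subsystem must split an inference graph wherever it hits an
# operator the accelerator cannot run. Each split means flushing tensors
# across the bus and running that layer on the much slower host core.

ACCELERATOR_OPS = {"Conv2D", "DepthwiseConv2D", "MaxPool", "Add", "MatMul"}  # frozen at tape-out

def partition(graph_ops):
    """Split an ordered list of graph operators into alternating NPU/CPU segments."""
    segments = []
    for op in graph_ops:
        target = "NPU" if op in ACCELERATOR_OPS else "CPU"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))   # every new segment is another cut
    return segments

# ResNet-style graph: a single cut at the very end, so the accelerator shines.
print(partition(["Conv2D", "MaxPool", "Conv2D", "Add", "Softmax"]))

# Transformer-style block: the unsupported operators shatter the graph into
# many small segments, and the run is dominated by the host core and the bus.
print(partition(["MatMul", "Softmax", "MatMul", "LayerNorm", "GELU", "MatMul"]))

In the ResNet-style case the sketch yields two segments and a single hand-off. In the transformer-style case every unsupported operator (Softmax, LayerNorm, GELU in this example) opens another segment, which is exactly the pattern that makes a fixed-function accelerator stop accelerating.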

An easy first step becomes a long-term prison

Today, these IP licensing companies find themselves trapped. Trapped by their decisions five years ago to take an “easy” path toward short-term solutions. The reasons all of the legacy IP companies took this same path have as much to do with human nature and corporate politics as with technical requirements.

When the workloads then generally referred to as “machine learning” first burst onto the scene in vision processing tasks less than a decade ago, the legacy processor vendors were confronted with customers asking for flexible solutions (processors) that could run these new, fast-changing algorithms. Caught flat-footed with processors (CPU, DSP, GPU) ill-suited to these new tasks, the quickest short-term technical fix was the external matrix accelerator. The option of building a longer-term technical solution, a purpose-built programmable NPU capable of handling all 2,000+ graph operators found in the popular training frameworks, would take far longer to deliver and incur much more investment and technical risk.

The not-so-hidden political risk

But let us not ignore the human-nature side of the equation faced by these legacy processor IP companies. A legacy processor company choosing to build a completely new architecture, including new toolchains and compilers, would have to declare both internally and externally that the legacy (CPU, DSP, GPU) core was simply not as relevant to the modern world of AI as it had previously been. The breadwinner that currently paid all the bills would need to fund the salaries of a new team of compiler engineers working on an architecture that effectively competed against the legacy star IP. (It is a variation on the Innovator’s Dilemma.) And customers would have to adjust to new, mixed messages declaring that “the previously universally brilliant IP core is actually only good for a subset of things, but you’re not getting a royalty discount.”

All of the legacy companies chose the same path: bolt a matrix accelerator onto the cash-cow processor and declare that the legacy core still reigns supreme. Three years later, staring at the reality of transformers, they declared the first-generation accelerator obsolete and invented a second one that repeated the shortcomings of the first. Now, with the second-generation hardwired accelerator also rendered obsolete by the continuing evolution of operators (self-attention, multi-headed self-attention, masked self-attention, and more new ones daily), they must either double down yet again and convince internal and external stakeholders that this third fixed-function accelerator will solve all problems forever, or admit that they need to break out of the confining walls they have built for themselves and build a truly programmable, purpose-built AI processor.

But the SoC architect doesn’t have to wait for the legacy IP company

The legacy companies might struggle to decide to try something new, but the architect of a new SoC doesn’t have to wait for the legacy supplier to pivot. A truly programmable, high-performance AI solution already exists today.

The Chimera GPNPU from Quadric runs all AI/ML graph structures. It integrates fully programmable 32-bit ALUs with systolic-array-style matrix engines in a fine-grained architecture: up to 1,024 ALUs in a single core, with only one instruction fetch and one AXI data port. That’s over 32,000 bits of parallel, fully programmable performance.
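For contrast, here is a purely conceptual Python sketch of the single-engine model. This is illustrative only, not Quadric’s actual Chimera SDK, kernel library, or instruction set; it simply shows that when matrix compute and general-purpose ALU compute share one core and one instruction stream, a new operator is just more software on the same engine, with nothing to partition.

# Conceptual illustration only; not Quadric's Chimera SDK or toolchain.
# On a single programmable core, matrix-heavy work and "everything else"
# are both just code in the same program; there is no second engine and
# no bus hand-off to manage.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # The matrix multiplies map naturally onto MAC/matrix arrays, while the
    # scaling and softmax are ordinary ALU code: all one program, one core.
    scores = softmax((q @ k.T) / np.sqrt(q.shape[-1]))
    return scores @ v

q = k = v = np.random.rand(8, 64).astype(np.float32)
out = attention(q, k, v)   # a "new" operator is simply application software
print(out.shape)           # (8, 64)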

The result is the flexibility of a processor with the efficiency of a matrix accelerator. Scalable up to 864 TOPS for bleeding-edge applications, Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU operations, so no matter what type of network you choose to run, it runs fast, low power, and highly parallel. When a new AI breakthrough comes along in five years, the Chimera processor of today will run it with no hardware changes, just application software code. Learn more at www.quadric.io.


