Hybrid Architecture Blends Best Of Both Worlds

A true chimera with both CPU and systolic array DNA.


Quadric chose the brand name Chimera to describe the company’s novel general-purpose neural processing unit (GPNPU) architecture. According to the online Oxford dictionary, in biology a chimera is “an organism containing a mixture of genetically different tissues (or DNA).” Quadric made that naming choice to reflect the fact that its Chimera GPNPU has characteristics of both conventional CPU/DSP processors and dataflow systolic array processors optimized for neural network processing. But what, exactly, does that mean? And how does the Quadric Chimera GPNPU pull off this seemingly mythical hybrid (to borrow the alternate meaning of the word “chimera”) combination of attributes without sacrificing either the compute efficiency of an NPU or the flexibility of a C/C++-programmable CPU?

A True C++ Programmable Processor
The Chimera core is a true programmable processor. Utilizing a proprietary instruction set, it employs a conventional seven-stage pipeline, issuing a single 64-bit instruction per cycle. The machine is straightforward and highly deterministic, completing instructions in order. Instructions include fields that control the type of computation to be executed, as well as controlling two levels of built-in DMA that flow data in parallel both within the core and between the core and system memory. This conventional processor behavior is programmed using C++ compiled via the industry-standard LLVM compiler. Therefore, the Chimera GPNPU can run any conventional CPU or DSP code that may be part of a new machine learning workload, or an associated pre-processing or post-processing step in a data pipeline.

What differentiates a Chimera core from conventional DSPs is the execution pipeline. Occupying four stages of the seven-stage pipeline, a Chimera core contains a 2-D array of processing elements, each PE containing a bank of MACs (multiply-accumulate units) and a full 32-bit ALU. Chimera cores scale from 1 TOPS to 64 TOPS by scaling the number of PEs in a single Chimera core. It is important to note that these PEs are all part of the processor pipeline, not independent entities, and all are controlled by a single shared instruction stream generated by an LLVM compiler.

But What About Convolution Efficiency?
After reading the preceding two paragraphs, most NPU designers are thinking, “but you can’t get good convolution efficiency on a pure processor, no matter how wide the array of processing elements.” And those NPU designers would be correct – *if* the Chimera core were just a conventional processor. But Chimera cores have a special mode of operation where, under instruction control, a MAC execution unit (MEU) is triggered. The MEU has the characteristics of dataflow processors highly optimized for convolutions. Under MEU control, the Chimera processor temporarily stops issuing and executing instructions. Instead, the MEU cycles all of the PEs in the machine through a complete convolution operation, with each PE having access not only to the data resident in its slice of local memory but also to all the data in the neighboring (north, east, west, south) PEs. When the MEU is active, a Chimera GPNPU is operating exactly like a dataflow array processor – with exactly the same high performance and energy efficiency you would expect from a hardwired systolic array “accelerator.” Chimera cores can run optimized versions of all convolutions – whether 1 x 1, 3 x 3, 5 x 5, 7 x 7, or larger. Any stride, any dilation, any pattern.

No Data Transfer
The key performance advantage of Chimera cores arises from the fact that both the full 32-bit ALU and the MAC array slices are co-resident, both accessing the same local register memory. When a machine learning graph switches from a convolution operator to a novel activation, normalization, or pooling operation, no data movement is required. Only one clock tick is needed to use the “DNA” of the other side of the machine. And because the Chimera core employs a full ALU with a set of architectural registers managed by the LLVM C++ compiler, any function – even ML operators no one has yet dreamed of – will run, at high performance, on the Chimera core.
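As an illustration of the claim, here is a hedged sketch of a user-defined activation applied directly to convolution accumulators, with no intermediate buffer. The operator itself (softsign, standing in for "an ML operator no one has yet dreamed of") and the function names are our own examples, not Quadric's library:

```cpp
#include <cmath>
#include <vector>

// Hypothetical user-defined activation: softsign, standing in for any
// novel operator a graph might introduce. Ordinary C++ the compiler
// can schedule onto the core's ALU.
inline float my_activation(float x) {
    return x / (1.0f + std::fabs(x));
}

// Apply the activation in place: each accumulator is transformed where
// it sits, loosely mirroring how the ALU can operate on the same local
// registers the MAC slices just filled, with no tensor copy in between.
void activate_in_place(std::vector<float>& acc) {
    for (float& v : acc) {
        v = my_activation(v);
    }
}
```

The same in-place pattern applies to any elementwise normalization or pooling-style follow-on step.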

Programmable, General Purpose NPUs (GPNPU)
Part systolic array and part DSP, the Chimera GPNPU delivers the promise of machine learning efficiency and complete future-proofing. See more at quadric.io.
