Downsize your big, expensive applications-class CPU.
Imagine you are an architect designing a new SoC for an application that needs substantial machine learning inference horsepower. The marketing team has given you a list of ML workloads and performance targets that you need to hit. The in-house designed NPU accelerator works well for these known workloads – things like MobileNet v2 and ResNet-50. The accelerator speeds up 95+% of the compute in the reference workloads on marketing’s wish list, and the only demanding piece of code the accelerator cannot perform is the SoftMax function at the tail end of the classic backbone graph.
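As a reminder of what that tail-end operator involves, here is a minimal C++ sketch of SoftMax over a vector of class scores. The exponentials and the normalizing division are exactly the kind of work a fixed-function MAC array cannot perform, which is why it gets pushed onto a fallback processor.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal SoftMax reference: converts raw class scores (logits) into
// probabilities. The exp() calls and the normalizing divide are the parts
// a pure MAC-array accelerator typically cannot execute.
std::vector<float> softmax(const std::vector<float>& logits) {
    // Subtract the max logit for numerical stability.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float& p : probs) {
        p /= sum;
    }
    return probs;
}
```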
Your proposed chip architecture solves this by pairing the in-house NPU with a big applications-class CPU that boasts a wide vector DSP extension. That CPU – or cluster of CPUs – can also serve as the “insurance policy” for implementing unplanned ML graph operators that might creep into future workloads.
You think this architecture might just do the trick. If you are lucky, the in-house core development team that built the NPU also created a SystemC simulation model of the accelerator, so you can test some of the system workloads in a larger SoC simulation. For the ten known workloads on marketing’s dream sheet, everything looks good so far.
But just to be safe, you take a quick survey of the ML acceleration offerings from some of the better-known CPU, GPU and DSP processor vendors your company licenses IP from. What you discover is that all of those external IP offerings from legacy processor vendors look exactly like what your team has built: a CNN/convolution accelerator paired with a legacy CPU/DSP that handles a few complex layers (SoftMax, ArgMax) and serves as the insurance “fallback” in case something new appears in a year or two. If the leading processor vendors are offering the same concept, you must be on the right track!
All of your homework is pointing in the same direction, but you still have two concerns: (1) what if the fallback processor doesn’t have quite enough horsepower for the unknown AI/ML workloads of the future? And (2) what if that applications CPU does squeak by on performance, but the cost – in both the silicon area of a dual-core or quad-core cluster and the high royalty rate – makes your company’s product uncompetitive in the market? To get the extra “insurance” you need, you choose the quad-core CPU cluster and hope it does the trick – better to be safe on performance at the expense of a 3%, 4% or 5% royalty.
Now, imagine all of this occurred a short three years ago, in the autumn of 2020. Suddenly, today in 2023, the emergence of transformer networks – both vision transformers (ViTs) and large language models (LLMs) – has hampered your chip architecture, because even that expensive quad-core CPU doesn’t quite have the juice to do what you need, and you are still paying a very large fallback insurance “tax”.
Quadric offers an alternative. The Quadric Chimera GPNPU (general-purpose NPU) runs complete deep-learning graphs. The entire graph, all layers – including NMS, ArgMax, SoftMax and other control-code-style layers. In fact, virtually any of the custom Python code embedded in the latest PyTorch models can be rewritten as a C++ custom operator kernel and run on the Chimera core. And Chimera GPNPUs also run many common forms of pre-processing and post-processing found in complex signal chains.
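As a rough illustration of the idea – the function signature and tensor layout below are hypothetical, not Quadric’s actual SDK interface – a graph operator such as ArgMax reduces to ordinary C++ loop code that a fully programmable core can execute in-line with the rest of the graph:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical custom-operator kernel: ArgMax over the last dimension of a
// [rows x cols] score tensor. The signature and layout are illustrative only;
// a real SDK defines its own kernel and registration interface.
void argmax_kernel(const float* scores, std::int32_t* out_indices,
                   std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        const float* row = scores + r * cols;
        std::size_t best = 0;
        for (std::size_t c = 1; c < cols; ++c) {
            if (row[c] > row[best]) {
                best = c;
            }
        }
        out_indices[r] = static_cast<std::int32_t>(best);
    }
}
```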
With a Chimera GPNPU, your design doesn’t need a large, general-purpose applications CPU as the insurance policy. The Chimera core is the insurance policy. Instead of the high-royalty applications-class CPU, you can use a simple real-time microcontroller core as the system host to initialize memory and peripherals and run a simple driver that kick-starts the Chimera core. Once running, the Chimera core is completely self-contained and places zero workload on the system host microcontroller.
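To make that division of labor concrete, a host-side “kick-start” driver can be as small as the sketch below. The register names, addresses and bit definitions are invented purely for illustration; they are not Quadric’s actual programming interface.

```cpp
#include <cstdint>

// Hypothetical memory-mapped registers for an NPU subsystem. The addresses
// and bit fields below are invented for illustration only.
constexpr std::uintptr_t NPU_BASE       = 0x40020000u;
constexpr std::uintptr_t NPU_CODE_ADDR  = NPU_BASE + 0x00;  // location of the compiled graph binary
constexpr std::uintptr_t NPU_CTRL       = NPU_BASE + 0x08;  // control register
constexpr std::uint32_t  NPU_CTRL_START = 1u << 0;

static inline void write_reg(std::uintptr_t addr, std::uint32_t value) {
    *reinterpret_cast<volatile std::uint32_t*>(addr) = value;
}

// The host MCU's entire job: point the core at the compiled graph and start it.
// After this call the MCU returns to its own tasks; the NPU runs the graph
// end to end on its own.
void kick_start_npu(std::uint32_t graph_binary_addr) {
    write_reg(NPU_CODE_ADDR, graph_binary_addr);
    write_reg(NPU_CTRL, NPU_CTRL_START);
}
```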
An SoC design employing a Chimera GPNPU will run any network. All networks. All operators. See for yourself in our online Developer Studio. We have classic networks (ResNet, MobileNet), newer detectors and classifiers (CenterNet, BlazePose), transformers (ViT, LLMs) and the MediaPipe suite of networks. And the benchmark performance figures will astound you – Chimera delivers full programmability with the MAC efficiency and throughput of a dedicated “accelerator”. For more information, visit Quadric.io.