Create your own benchmark models for NPU vendor evaluation.
In our last blog post, we highlighted the ways NPU vendors can shade the truth about performance on benchmark networks, making comparisons of common scores such as “Resnet50 Inferences / Second” a futile exercise. But there is a straightforward, low-investment method for an IP evaluator to short-circuit all the vendor shenanigans and get a solid apples-to-apples result: Build Your Own Benchmarks. BYOB!
Does that sound like a daunting task? Especially all the “data science” parts needed to ensure high inference accuracy? Consider that in the early stages of winnowing the field of possible NPU vendors – and you do need to winnow, given that a dozen or more IP and silicon companies are offering their wares as intellectual property cores – you don’t need to focus on absolute inference accuracy as much as you need to judge key metrics of [1] performance; [2] memory bandwidth; [3] breadth of NN model support; [4] breadth of NN operator support; and [5] speed and ease of porting new networks using the vendors’ toolsets.
All machine learning models begin life in a model training framework using floating-point data representation. The vast majority of published reference models are distributed in their native FP32 format – because it’s easy for the data science teams that invent the latest, greatest model to simply publish the original source. But virtually all device-level inference happens in quantized INT8 or mixed INT8/INT16 formats on energy-optimized NPUs, GPNPUs, CPUs and GPUs, which run inference at 5X to 10X lower energy than an FP32 version of the same network would require.
Quantizing FP32 models to INT8 in a manner that preserves inference accuracy can be a tall order for teams without spare data scientists idling at the coffee pot looking for something to keep them busy. But do you really need a lovingly crafted, high-accuracy model in order to judge the merits of 10 vendor products clamoring for your attention? In short, the answer is: No.
First, recognize that the benchmarks of today (late 2023, early 2024) are not the same workloads your SoC will run in 2026 or 2027 when it hits the market – so any labor spent fine-tuning model weights today is likely wasted effort. Save that effort for the final step of IP vendor selection, not the early rounds.
Second, note that the cycle counts and memory traffic data that will form the basis of your PPA (power, performance, area) conclusions will be exactly the same whether the model weights were lovingly hand-crafted or simply auto-converted. What matters in the early stages of an evaluation is simple: Can the NPU run my chosen network? Can it run all the needed operators? What is the performance?
As we covered in the previous blog, many IP vendors may have laboriously hand-tuned, hand-pruned and twisted a reference benchmark beyond recognition in order to win the benchmark game. You can short-circuit that gamesmanship by preparing your own benchmark model. Many training toolsets have existing flows that convert floating-point models to quantized integer versions (e.g., TensorFlow QAT or PyTorch quantization). Or you can convert to an interchange format such as ONNX and let ONNX Runtime’s quantization tooling do the job. Don’t worry about how well the quantization was performed – just be sure that all layers have been converted to an integer format.
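As a rough illustration of that auto-conversion path, here is a minimal sketch using ONNX Runtime’s post-training static quantization fed with random calibration data. Accuracy is beside the point here; the goal is simply a fully integer model you can hand to every vendor. The model file names, input name, and input shape are placeholders – substitute your own network.

```python
# Sketch: auto-quantize an FP32 ONNX model to INT8 with ONNX Runtime's
# post-training static quantization, using random calibration data.
# Accuracy is irrelevant for early benchmarking -- we only need a fully
# integer model. File names, input name, and shape are assumptions.
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)


class RandomCalibrationReader(CalibrationDataReader):
    """Feeds a few batches of random data -- enough to generate scales/zero-points."""

    def __init__(self, input_name="input", shape=(1, 3, 224, 224), batches=8):
        self._data = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)}
             for _ in range(batches)]
        )

    def get_next(self):
        return next(self._data, None)


quantize_static(
    model_input="resnet50_fp32.onnx",      # assumed FP32 reference model
    model_output="resnet50_int8.onnx",     # fully quantized benchmark model
    calibration_data_reader=RandomCalibrationReader(),
    quant_format=QuantFormat.QDQ,          # insert QuantizeLinear/DequantizeLinear pairs
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
```

The same idea works with TensorFlow or PyTorch quantization flows; the point is that any quick, automated conversion produces an identical yardstick for every vendor.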
Even if all you do is take a handful of common networks with some variety – for example: ResNet-50, an LLM, VGG16, a Vision Transformer, and a pose-estimation network – you know that the yardstick you give to the IP vendors is the same for all of them. Ask them to run the model through their tools and give you both the performance results and the final form of the network they ran, so you can double-check that the entire network ran on the NPU – and none of the operators had to fall back onto a big, power-guzzling CPU. Better yet – ask the vendor if their tools are robust enough to let you run the models yourself!
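A quick way to prepare for that double-check is to enumerate the operator types in your benchmark model up front, so you have a checklist to compare against the vendor’s claimed operator coverage and the final network they return. A minimal sketch, assuming the quantized ONNX file produced above:

```python
# Sketch: list the operator types in a (quantized) ONNX benchmark model so you
# can cross-check vendor operator coverage and spot candidates for CPU fallback.
# The file name is an assumption -- point it at your own benchmark model.
from collections import Counter

import onnx

model = onnx.load("resnet50_int8.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

# Print each operator type and how often it appears in the graph.
for op, count in sorted(op_counts.items()):
    print(f"{op}: {count}")
```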
BYOB will allow you to quickly ascertain the maturity of a vendor toolchain, the breadth of operator coverage, and the PPA metrics you need to pare that list of candidates down to a Final Two that deserve the full, deep inspection to make a vendor selection.
If you’d like to try running your own benchmarks on Quadric’s toolchain, we welcome you to register for an account in our online Developer’s Studio and start the process! Visit us at www.quadric.io