Does Your NPU Vendor Cheat On Benchmarks?

Your spreadsheet of numbers doesn’t tell the whole story.

It is common industry practice for companies seeking to purchase semiconductor IP to begin the search by sending prospective vendors a list of questions, typically called an RFI (Request for Information) or simply a vendor spreadsheet. These spreadsheets request a wide gamut of information, ranging from background on the vendor’s financial status, leadership team, IP design practices, and production history to, most importantly, performance information – aka “benchmarks.”

For the vast majority of categories of IP, these benchmarks are well understood and therefore quite useful to gather. Collecting jitter specs on an analog I/O cell or counting clock cycles for a specific 128-pt complex FFT on a DSP leaves little wiggle room for a vendor to shade the truth. But when it comes to benchmarks for machine learning inference IP (usually called an “NPU” or “NPU accelerator”), the spreadsheet benchmark game is a completely different story!

Two major roadblocks to good benchmark comparison

Two major gaps commonly undermine the collection of useful “apples to apples” comparison data on NPU IP: [1] failing to identify the exact source code repository of each benchmark, and [2] failing to specify that the entire benchmark be run end to end, with any omissions reported in detail.

Which model?

The “original sin” of asking for benchmark data from an IP vendor is failing to give all the vendors the source code of the ML graphs you want benchmarked. Handing a vendor an Excel spreadsheet that asks for “Resnet50 inferences per second” presumes that both parties know exactly what “Resnet50” means. Does that mean the original Resnet50 from the 2015 paper? Or one of the thousands of other source code repos labeled “Resnet50”? The original network was trained using FP32 floating point values, but virtually every embedded NPU runs quantized networks (primarily INT8 numerical formats, some with INT4, INT16 or INT32 capabilities as well), which means the thing being benchmarked cannot be “the original” Resnet50. Who should do the quantization – the vendor or the buyer? What is the starting-point format – a PyTorch model, a TensorFlow model, TFLite, ONNX, or something else? Is pruning of layers or channels allowed? Is the vendor allowed to inject sparsity (driving weight values to zero in order to “skip” computation) into the network? Should the quantization be symmetric, or can asymmetric methods be used?
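To make the ambiguity concrete, here is a minimal sketch (plain NumPy, with made-up tensor values) showing how symmetric and asymmetric INT8 quantization of the very same FP32 tensor produce different quantized values – two vendors can both truthfully claim to have benchmarked “Resnet50 INT8” while running measurably different math.

```python
# A minimal sketch: symmetric vs. asymmetric INT8 quantization of the same
# FP32 tensor yields different scales, zero-points and quantized values.
# The tensor below is synthetic, purely for illustration.
import numpy as np

def symmetric_int8_params(x):
    # Scale maps the largest magnitude to +/-127; zero-point is fixed at 0.
    scale = np.max(np.abs(x)) / 127.0
    return scale, 0

def asymmetric_int8_params(x):
    # Scale and zero-point map the full [min, max] range onto [-128, 127].
    lo, hi = float(np.min(x)), float(np.max(x))
    scale = (hi - lo) / 255.0
    zero_point = int(round(-128 - lo / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    return np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

# Example: a mostly-positive activation tensor (common after a ReLU).
x = np.random.rand(1000).astype(np.float32) * 6.0
s_sym, z_sym = symmetric_int8_params(x)
s_asym, z_asym = asymmetric_int8_params(x)
print("symmetric:  scale=%.4f zero_point=%d" % (s_sym, z_sym))
print("asymmetric: scale=%.4f zero_point=%d" % (s_asym, z_asym))
# The symmetric variant wastes half the INT8 range on negative codes this
# tensor never uses, so accuracy (and on some NPUs, speed) will differ.
```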

If Vendor X aggressively simplifies the benchmark network using the above techniques, but Vendor Y does not, is the buyer still getting a fair yardstick for comparison? Are the model optimization techniques chosen reflective of what the actual users will be willing and able to perform three years later on the state-of-the-art models of the day when the chip is in production? Note that all of these questions are about the preparation of the model input that will flow into the NPU vendor’s toolchain. The more degrees of freedom that a buyer allows an NPU vendor to exploit, the more the comparison yardstick changes from vendor to vendor.

Fig. 1: So many benchmark choices.

The entire model?

Once you get past the differences in source models come the dual questions of [a] does the NPU accelerator run the entire model – all layers – or does the model need to be split, with some graph layers running on other processing elements on the chip, and [b] do any of the layers need to be changed to conform to the types of operators supported by the NPU?

Most NPUs implement only a subset of the thousands of possible layers and layer variants found in modern neural nets. Even for old benchmark networks like Resnet50, most NPUs cannot perform the final SoftMax layer computation and therefore farm that function out to a CPU or DSP, in what NPU vendors typically call “fallback.”
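The scope of the problem is easy to quantify yourself. The short sketch below (Python, using the onnx package) walks an ONNX export of a benchmark model and counts the layers that fall outside a supported-operator list; the list and the file name shown are hypothetical placeholders standing in for whatever a given vendor actually documents and whatever model file you actually intend to benchmark.

```python
# A minimal sketch: count how many layers of an ONNX graph fall outside a
# hypothetical NPU supported-operator list and would "fall back" to the CPU.
from collections import Counter
import onnx

# Illustrative only -- not any real vendor's operator coverage.
SUPPORTED_OPS = {"Conv", "Relu", "MaxPool", "GlobalAveragePool",
                 "Add", "Gemm", "Flatten"}

model = onnx.load("resnet50.onnx")   # the exact file/repo you intend to benchmark
ops = Counter(node.op_type for node in model.graph.node)

fallback = {op: n for op, n in ops.items() if op not in SUPPORTED_OPS}
print("Total layers:", sum(ops.values()))
print("Layers needing CPU/DSP fallback:", sum(fallback.values()), fallback)
```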

This fallback limitation is magnified tenfold when the target is a newer transformer network (Vision Transformers, LLMs) that employs dozens of SoftMax or NMS layers. Another of the most common NPU limitations is a restricted range of supported convolution variants. While virtually all NPUs support 1×1 and 3×3 convolutions very efficiently, many do not support larger kernels or convolutions with unusual strides or dilations. If your network has an 11×11 Conv with stride 5, do you accept that this layer falls back to the slower CPU, or do you engage a data scientist to alter the network to use one of the Conv types that the high-speed NPU can support?
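A similar check catches the convolution variants. In the sketch below the “fast path” limits (1×1 and 3×3 kernels, stride 1 or 2) and the file name are again hypothetical; the point is that a buyer can enumerate exactly which layers a vendor must either fall back on or rewrite before quoting a number.

```python
# A minimal sketch: flag Conv layers whose kernel size or stride falls outside
# a hypothetical NPU "fast path". Substitute each vendor's documented limits.
import onnx

FAST_KERNELS = {(1, 1), (3, 3)}
FAST_STRIDES = {(1, 1), (2, 2)}

model = onnx.load("resnet50.onnx")
for node in model.graph.node:
    if node.op_type != "Conv":
        continue
    # Collect the integer-list attributes (kernel_shape, strides, pads, ...).
    attrs = {a.name: list(a.ints) for a in node.attribute if a.ints}
    kernel = tuple(attrs.get("kernel_shape", []))
    stride = tuple(attrs.get("strides", [1, 1]))
    if kernel not in FAST_KERNELS or stride not in FAST_STRIDES:
        print("Needs fallback or surgery:", node.name,
              "kernel", kernel, "stride", stride)
```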

Taking both of these types of changes into consideration, the buyer needs to look carefully at the spreadsheet answer from the IP vendor and ask, “Does this inferences/sec benchmark include the entire network, all layers? Or is there a workload burden on my CPU that I need to measure as well?” Power benchmarks are also affected: the more the NPU offloads back to the CPU, the better the NPU vendor’s power numbers look – but the actual system power numbers look far, far worse once CPU power and system memory/bus traffic power are counted.
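The arithmetic is worth doing explicitly. The numbers in the sketch below are entirely hypothetical, but they show how an NPU-only power figure can understate what the shipping system will actually draw once CPU fallback and the extra memory/bus traffic are added in.

```python
# A minimal sketch with purely hypothetical numbers: an NPU-only power figure
# vs. an estimate of average system power once fallback layers run on the CPU
# and tensors are shuttled back and forth through system memory.
npu_mw    = 400.0   # hypothetical NPU power while running its layers
cpu_mw    = 900.0   # hypothetical CPU power while running fallback layers
mem_mw    = 150.0   # hypothetical extra DDR/bus power from NPU<->CPU handoffs
npu_share = 0.80    # fraction of inference time spent on the NPU

vendor_reported = npu_mw * npu_share
system_average  = npu_mw * npu_share + cpu_mw * (1.0 - npu_share) + mem_mw

print("NPU-only figure: %.0f mW" % vendor_reported)   # what the spreadsheet shows
print("System figure:   %.0f mW" % system_average)    # what the product actually draws
```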

Gathering and analyzing all the possible variations from each NPU vendor can be challenging, making true apples-to-apples comparisons almost impossible with a spreadsheet-only approach.

A different approach

Not only is the Quadric general purpose NPU (GPNPU) a radically different product – running the entire NN graph plus pre- and post-processing C++ code – but Quadric’s approach to benchmarks is also different. Quadric puts the Chimera toolchain out in the open for customers to see at www.Quadric.io. Our DevStudio includes all the source code for all the benchmark nodes shown, including links back to the source repos. Evaluators can run the entire process from start to finish – download the source graph, perform quantization, compile the graph using our Chimera Graph Compiler and LLVM C++ compilers, and run the simulation to recreate the results. No skipping layers. No radical network surgery, pruning or operator changes. No removal of classes. The full original network with no cheating. Can the other vendors say the same?


