Looking Beyond TOPS/W: How To Really Compare NPU Performance

How the underlying test conditions come into play when looking at benchmark data.


There is a lot more to understanding the true capabilities of an AI engine than TOPS per watt. A rather arbitrary measure of the number of operations an engine performs per unit of power, the TOPS/W metric completely misses the point that a single operation on one engine may accomplish more useful work than a multitude of operations on another. In any case, TOPS/W is by no means the only specification a system architect or chip designer should use in determining the best AI engine for their application. Any experienced system designer will tell you that factors such as the software stack, chip size/area requirements, and fitness for specific neural networks are often more important.

Still, TOPS/W is often one of the first performance benchmarks provided by NPU (neural processing unit) makers, and it can be useful if considered in the context of the underlying test conditions. Unfortunately, there is no standardized way of configuring the hardware or reporting benchmark results, which makes it challenging to compare different NPUs. Let’s explore how the underlying test conditions come into play when looking at TOPS/W benchmark data.

Frequency as a variant in TOPS/W

The TOPS metric is defined by the formula: TOPS = MACs × frequency × 2 (each MAC performs one multiply and one add per cycle). Let’s start with the frequency portion of the equation. There is no ‘best’ or ‘most correct’ frequency that an NPU maker should use; frequency depends as much on the process node as on the NPU design. Consider that a 54K MAC engine at 1GHz produces 108 TOPS, while the same 54K MAC engine at 1.25GHz yields 135 TOPS. Increasing the frequency results in more TOPS, but at the cost of disproportionately increased power consumption. In the case of chip-based NPUs, the maximum processing frequency is fixed, while for IP-based NPUs it depends on the actual silicon implementation. When comparing benchmark results from different vendors, it’s important to normalize for any differences in frequency.
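To make the arithmetic concrete, here is a minimal sketch (in Python, using the illustrative figures from the example above) of computing TOPS from MAC count and frequency, and of normalizing a reported number to a reference frequency before comparing vendors. The function names are ours, not any vendor’s API.

```python
# Illustrative sketch: TOPS from MAC count and clock frequency.
# TOPS = MACs * frequency * 2 (one multiply + one add per MAC per cycle).

def tops(num_macs: int, freq_ghz: float) -> float:
    # MACs * GHz gives giga-MACs/s; *2 for ops; /1e3 converts GOPS to TOPS.
    return num_macs * freq_ghz * 2 / 1e3

print(tops(54_000, 1.00))   # 108.0 -- the 54K MAC engine at 1GHz
print(tops(54_000, 1.25))   # 135.0 -- same engine, higher clock

def normalize_tops(reported_tops: float, reported_ghz: float,
                   ref_ghz: float) -> float:
    # Rescale a vendor's TOPS figure to a common reference frequency.
    return reported_tops * (ref_ghz / reported_ghz)

print(normalize_tops(135.0, 1.25, 1.00))  # 108.0 -- apples to apples
```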

Neural network(s) employed as a variant in TOPS/W

After normalizing for frequency, we next need to examine the neural network(s) used for testing. Neural networks vary widely in characteristics such as operation counts and weight counts.

The figure above highlights how five commonly used neural networks differ enormously in their processing demands. For example, while Yolo V3 has about 60% of the operations of Unet, it has more than 30X the weights. Consequently, the processing requirements (and performance) of each will vary. Run through the same engine, these two networks will produce different results. Run through competing engines, they will likely produce widely varying results that are not necessarily indicative of actual competitive performance. To get an accurate head-to-head comparison of AI engines, we need to run identical networks.
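As a rough illustration of why the choice of network matters, the sketch below computes a simple ‘operations per weight’ ratio for two hypothetical networks (the names and numbers are invented; real figures depend on input resolution and the exact model variant). A low ratio suggests a memory-bound workload, where raw TOPS sit idle waiting on weight fetches.

```python
# Illustrative sketch: two hypothetical networks with comparable compute
# but very different weight counts. All numbers are made up.
networks = {
    "network_A": {"gops": 60.0,  "weights_millions": 62.0},
    "network_B": {"gops": 100.0, "weights_millions": 2.0},
}

for name, n in networks.items():
    # Arithmetic intensity: operations performed per weight fetched.
    ops_per_weight = (n["gops"] * 1e9) / (n["weights_millions"] * 1e6)
    print(f"{name}: {ops_per_weight:,.0f} ops per weight")
# network_A: 968 ops per weight    -- weight-heavy, more memory pressure
# network_B: 50,000 ops per weight -- compute-heavy
```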

Precision as a variant in TOPS/W

It is equally important to consider the precision at which the benchmark network(s) are run. An engine running a neural network at INT4 precision only needs to process half the data of the same network at INT8, greatly reducing the processing load. Ideally, NPU providers would report data at the same level of precision for all layers of the network. In practice, some engines decrease precision in certain processing-intensive layers to stay within power envelopes. While this approach is neither inherently correct nor incorrect, it can be confusing for benchmarking purposes: an AI engine that lowers precision in one or more layers can give the appearance of a processing advantage over an engine that maintains higher precision throughout the network.
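One way to keep precision from muddying a comparison is to convert every reported figure back to a common precision before comparing. The sketch below assumes the engine doubles its MAC throughput when precision is halved, which is common but by no means universal; the conversion table is an assumption to be checked against each vendor’s documentation.

```python
# Illustrative sketch: converting a reported TOPS figure to an INT8
# equivalent. Assumes 2x throughput at INT4 -- verify per vendor.
PRECISION_SPEEDUP = {"INT8": 1.0, "INT4": 2.0}

def int8_equivalent_tops(reported_tops: float, precision: str) -> float:
    return reported_tops / PRECISION_SPEEDUP[precision]

# A "200 TOPS" number quoted at INT4 is roughly a 100 TOPS INT8 engine:
print(int8_equivalent_tops(200.0, "INT4"))  # 100.0
```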

Sparsity and pruning as a variant in TOPS/W

The next question to consider is whether sparsity or pruning is used. Both are generally accepted methods that skip calculations involving zero values in multiplications or remove redundancies in the least important parts of a model. Using these techniques can lead to significant PPA (performance, power, area) gains; some networks tolerate 30% sparsity or more, albeit with some negative effect on accuracy. If left unreported in the TOPS/W assumptions, that 30% sparsity would grant an artificial 30% performance advantage over another NPU benchmarked with zero sparsity on the same network.
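When the sparsity assumption is disclosed, it can be backed out of the reported number so that two NPUs are compared on dense-equivalent terms. A minimal sketch, with illustrative figures:

```python
# Illustrative sketch: removing a sparsity assumption from an "effective"
# throughput figure. sparsity=0.3 means 30% of MACs were skipped.
def dense_equivalent_tops(effective_tops: float, sparsity: float) -> float:
    return effective_tops * (1.0 - sparsity)

# An engine reporting 140 effective TOPS while exploiting 30% sparsity
# delivers about 98 TOPS on a fully dense network:
print(dense_equivalent_tops(140.0, 0.30))  # 98.0
```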

Process node as a variant in TOPS/W

We now need to consider the process node as it relates to power consumption. One of the major advantages of moving to smaller process nodes is that they generally require less power for the same work. For instance, a processor in TSMC 5nm will likely consume about 25% less power than the same processor in TSMC 7nm. This holds for both IP- and chip-based NPUs. Because process node has a powerful influence on power consumption (and therefore TOPS/W), it must be considered when comparing benchmark data. It is important to ask an NPU supplier whether the power numbers are measured in actual silicon, extrapolated from actual silicon, or based on simulation. If actual silicon was used, the vendor should indicate which process node, and that node should be factored into the comparison.
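If the nodes differ, a scaling factor can be applied before comparing power numbers. The sketch below uses the ~25% figure from the 7nm-to-5nm example above; real scaling factors vary by design and should come from foundry or vendor data.

```python
# Illustrative sketch: normalizing power to a common process node.
# The 0.75 multiplier reflects the ~25% savings cited for 7nm -> 5nm.
NODE_POWER_SCALE = {("7nm", "5nm"): 0.75}

def normalize_power_w(power_w: float, from_node: str, to_node: str) -> float:
    if from_node == to_node:
        return power_w
    return power_w * NODE_POWER_SCALE[(from_node, to_node)]

# A 10 W engine measured in 7nm would be expected to draw ~7.5 W in 5nm:
print(normalize_power_w(10.0, "7nm", "5nm"))  # 7.5
```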

Memory power consumption as a variant in TOPS/W

A hot topic within the AI industry is how to account for the power consumption of memory in a TOPS/W conversation. Some AI providers do; others do not. Regardless, all AI engines require memory, and some require far more than others. A reasonable strategy is to determine how much memory is appropriate for your application, normalize reported results accordingly, and ensure that memory power consumption is factored into the figure.
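As a sketch of what ‘factoring in memory power’ can mean in practice, the example below folds an estimated DRAM power draw into the TOPS/W denominator. The energy-per-byte figure is a placeholder; actual values vary widely with the memory technology (on-chip SRAM, LPDDR, HBM) and must be measured or taken from datasheets.

```python
# Illustrative sketch: TOPS/W with memory power included. The 60 pJ/byte
# DRAM energy figure is a placeholder, not a measured value.
def tops_per_watt(tops: float, core_power_w: float,
                  mem_bandwidth_gb_s: float, pj_per_byte: float) -> float:
    mem_power_w = mem_bandwidth_gb_s * 1e9 * pj_per_byte * 1e-12
    return tops / (core_power_w + mem_power_w)

# 100 TOPS engine, 5 W core, streaming 50 GB/s at ~60 pJ/byte (3 W):
print(tops_per_watt(100.0, 5.0, 50.0, 60.0))  # 12.5, vs 20.0 core-only
```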

Utilization as a variant in TOPS/W

As we can see, it is not easy to assess and compare the performance potential of an NPU. And still, there is one more factor that is often overlooked: processor utilization. Like all processors, NPUs are not 100% utilized. For a certain percentage of the time, the processor sits idle, waiting for the next data set to move through memory. Processors may have ‘sweet spots’ within their performance range where utilization is high, but utilization tends to vary widely across that range. High utilization is a good thing because it minimizes the idle time and stalls that impede performance, and it reduces the power and area required for a given workload: in principle, a processor running at 90% utilization can do the work of two comparable processors running at 45% utilization.
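That trade-off is easy to see with a little arithmetic, sketched below with illustrative numbers:

```python
# Illustrative sketch: effective throughput is peak TOPS scaled by
# utilization, which is why high utilization saves silicon and power.
def effective_tops(peak_tops: float, utilization: float) -> float:
    return peak_tops * utilization

print(effective_tops(100.0, 0.90))      # 90.0 -- one engine at 90%
print(2 * effective_tops(100.0, 0.45))  # 90.0 -- two engines at 45%,
                                        # for twice the area and power
```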

Finding the right NPU for your application

It would certainly be easier if all NPU chip and IP makers declared TOPS/W specifications in a standard manner, or at least disclosed all the underlying configuration information discussed above. In the absence of that, those evaluating NPUs should demand this level of disclosure from every vendor they are considering.


