The Murky World Of AI Benchmarks

What works for one application may be wholly inadequate for another; accuracy may vary by use case.


AI startup companies have been emerging at breakneck speed for the past few years, all the while touting TOPS benchmark data. But what does it really mean and does a TOPS number apply across every application? Answer: It depends on a variety of factors.

Historically, every class of design has used some kind of standard benchmark for both product development and positioning. For example, SPECint, Dhrystone, and Geekbench are used as benchmark programs for processor performance. RFC 2544 is used to benchmark networking ASICs. But the development of more customized and heterogeneous hardware, coupled with continuous updates in AI algorithms, has made decisions based on benchmarks murky at best.

“In recent years, we have seen a lot of advancements in hardware choices available for AI – from GPUs to FPGAs to custom hardware ASICs,” said Anoop Saha, market development manager at Mentor, a Siemens Business. “However, the existing benchmarks are not suitable for measuring how the hardware will work for AI applications – either in training or in inference.”

Currently, TOPS (tera operations per second) is the most common metric used for describing the performance of many hardware designs. But buyer beware.

“There are a zillion companies out there doing inference, and they make a lot of claims,” said Steve Teig, CEO of Perceive. “Many of those claims are overblown. When people say, ‘I have this many TOPS and this many TOPS per watt,’ I encourage them to poke at that. I do that when I look at other people’s claims. I think, ‘You claim you run this network at this speed, and you can look up on the web how many ops is this network at that speed. You could figure it out then determine how many TOPS they really run rather than how many they say.'”
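The sanity check Teig describes can be sketched in a few lines. This is only an illustration: the function names and all of the figures below are hypothetical, and the operation count for a real network would have to be looked up from published model statistics.

```python
def delivered_tops(ops_per_inference: float, inferences_per_second: float) -> float:
    """Tera-operations per second actually delivered on a given network."""
    return ops_per_inference * inferences_per_second / 1e12


def utilization(delivered: float, advertised: float) -> float:
    """Fraction of the advertised peak that the workload actually achieves."""
    return delivered / advertised


# Hypothetical numbers, for illustration only.
OPS_PER_INFERENCE = 8e9   # assumed operation count for one pass of the network
CLAIMED_FPS = 500.0       # vendor's claimed inferences per second
ADVERTISED_TOPS = 10.0    # vendor's headline TOPS figure

delivered = delivered_tops(OPS_PER_INFERENCE, CLAIMED_FPS)
share = utilization(delivered, ADVERTISED_TOPS)
print(f"delivered: {delivered:.1f} TOPS, {share:.0%} of the advertised peak")
```

With these assumed inputs, the claimed frame rate accounts for well under half of the headline figure, which is the kind of gap the check is meant to expose.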

Understanding what makes up a TOPS number requires understanding exactly what is being tested. But that does little to provide context for how a device will behave in the real world.

“TOPS measures only the performance of a chip,” said Saha. “Power consumption is a bigger challenge in acceleration of neural networks. That is because of the highly parallel nature of these algorithms, which also have significant memory accesses. A better metric is TOPS per watt — performance per unit of energy. Indeed, most datasheets of AI hardware now describe both TOPS and TOPS per watt as a metric, and both are important. TOPS per watt is even more important for edge devices compared to data center.”
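A small sketch shows why the two metrics can rank hardware differently. The two parts and their figures are hypothetical, chosen only to illustrate the arithmetic, and the `Accelerator` class is not taken from any vendor datasheet or tool.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    name: str
    tops: float   # peak tera-operations per second
    watts: float  # power drawn while delivering that throughput

    @property
    def tops_per_watt(self) -> float:
        return self.tops / self.watts


# Hypothetical parts, for illustration only.
chips = [
    Accelerator("datacenter-card", tops=100.0, watts=250.0),
    Accelerator("edge-module", tops=4.0, watts=2.0),
]

by_tops = max(chips, key=lambda c: c.tops)
by_efficiency = max(chips, key=lambda c: c.tops_per_watt)
print(f"highest TOPS: {by_tops.name}, highest TOPS/W: {by_efficiency.name}")
```

Under these assumptions the data center part wins on raw TOPS while the edge part wins on TOPS per watt, which is why both numbers need to be read together.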

A TOPS number also can vary depending on whether it is measuring fixed or floating point operations, according to Pulin Desai, group director for product marketing, management and business development at Cadence. “For fixed-point operations, usually people will use tera ops numbers. Sometimes they may throw in some floating point operations and refer to giga ops, which is floating point operations per second. A third thing engineering groups say is, ‘I don’t care about raw capacity. You have actual computational capacity. I want you to tell me if I ran this neural network, how many inferences per second I can get?’ They are looking at the different classification networks, including ResNet and others, that classify the images. They want to know how many inferences per second can we run for ResNet. Every user may have their own favorite network that is based on their use case. So somebody wants to classify, somebody wants to detect objects, somebody wants to do something else. They will give those networks and then ask what we can do. There, it boils down to system-level performance versus IP-level performance.”

Looking deeper
Understanding a TOPS number requires some investigative work. For example, what type of operations are included in that number? If a product runs at 1GHz and performs 2,048 8-bit operations per cycle, that equals 2,048 giga ops, or roughly 2 TOPS. But if those are really 4-bit operations, the number of operations might be doubled, and the same product may be advertised as 4 TOPS.

“Generally, we tend to stick to 8-bit because most of the inference happening today is on 8-bit,” Desai said. “And that’s a multiply accumulate operation. But now insert a DSP, where there are a lot of parallel operations. If there is another operation in parallel, because it’s a parallel device, you have to add up the 2,048 multiplier accumulator operations plus this other 8-bit operation. More simply, I say the number of MAC operations and number of 8-bit operations. You add those up and then multiply by the frequency that you’re running, which gives the TOPS.”
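The calculation Desai outlines can be written down as a short sketch. The function name and the `ops_per_mac` parameter are assumptions introduced here: the default of 1 follows the description above, which counts each MAC once, whereas some datasheets count a MAC as two operations. The 512 parallel operations in the second example are a hypothetical figure.

```python
def peak_tops(macs_per_cycle: int, other_ops_per_cycle: int, clock_hz: float,
              ops_per_mac: int = 1) -> float:
    """Peak tera-operations per second: operations per cycle times clock frequency."""
    ops_per_cycle = macs_per_cycle * ops_per_mac + other_ops_per_cycle
    return ops_per_cycle * clock_hz / 1e12


CLOCK_HZ = 1e9  # 1GHz

# 2,048 8-bit MACs per cycle and nothing else running in parallel.
base = peak_tops(2048, 0, CLOCK_HZ)

# The same MAC array plus a hypothetical 512 other 8-bit operations per cycle.
with_parallel_ops = peak_tops(2048, 512, CLOCK_HZ)

# The same array, assuming it can issue twice as many 4-bit operations per cycle.
four_bit = peak_tops(2 * 2048, 0, CLOCK_HZ)

print(f"{base:.3f} {with_parallel_ops:.3f} {four_bit:.3f}")
```

The same silicon yields three different headline numbers depending on what is counted, which is why the data type, the clock frequency, and the list of operations included all need to be stated alongside the figure.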

This kind of detail is essential to determine exactly what is being measured. “If somebody was giving you the tera ops numbers, then what type of data types is it — 2-bit, 4-bit, 8-bit, 16-bit? What did you use? If you calculated tera ops, what frequency did you assume? And is your device only doing this operation, or are there some other operations happening in parallel? If they gave the information on a particular network, then the same question applies. ‘Did you use fixed point to give me this number? Did you give me the floating point to give me the number? What frequency did you run? Is there a specific memory bandwidth you assumed?’ And eventually the customer will ask about power. What’s the power consumption when you are running that? So it’s the raw data, and on top of that frames per second, and then power,” he said.

This is far from an objective analysis, however. Performance can vary depending on the optimizations done on the network, such as pruning, quantization and compression. Many machine learning models can provide similar accuracy with a low-precision integer model instead of using floating point operations. So a TOPS per watt figure for hardware using 32-bit floating-point precision would be different from one using 8-bit integer precision. Depending on the accuracy they require, users can decide on the level of quantization in their model.
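What quantization costs in precision can be illustrated with a minimal sketch of symmetric 8-bit quantization. This is one common scheme, assumed here for illustration rather than taken from any of the vendors quoted, and the weight values are made up.

```python
def quantize_int8(values):
    """Map floats to signed 8-bit codes using a single symmetric scale factor."""
    peak = max(abs(v) for v in values)
    scale = peak / 127.0 if peak else 1.0  # avoid dividing by zero for all-zero input
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.823, -0.417, 0.052, -1.27, 0.336]

codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, worst-case error={max_error:.4f}")
```

Each restored value differs from the original by at most half a quantization step. Whether that loss is acceptable depends on the network and the application, which is why a throughput figure quoted without an accuracy figure is incomplete.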

The bigger question is which applications should be run to provide accurate benchmark metrics. “For a while now, common networks like MobileNet, ImageNet and ResNet are used for those metrics,” Saha said. “What is needed is a standardized set of benchmark applications, which are not only fair and exhaustive, but which have a common dataset to avoid any variations that can happen from the datasets. MLPerf is an industry initiative aimed at developing standardized benchmark applications for AI hardware, both for training and inference. Tiny-MLPerf is a sub-group, which is trying to come up with a benchmark targeted toward edge devices, almost entirely in inference.”

Benchmarking in context
In all of this it is helpful to remember this is a relatively immature market developing literally by the day, noted Andrew Grant, senior director for artificial intelligence at Imagination Technologies. “Things are changing all the time, so for any sort of chip architect or SoC architect they’ve got a lot of stuff to take in and try to understand. If you look at mature markets like CPUs where CoreMark and things like that are quite important — or if you look at GPUs, where there are a whole slew of benchmarks that have been established over the years — you just take that piece of software, run that test for a few minutes, and out pops all the scores. Because these are markets that have been around for 20 years, that’s much easier.”

With AI, there are different benchmarks to look at different aspects of performance. “But even then, you end up with a situation where if you’re looking at say, a mobile GPU, then how that particular GPU performs over a period of time and what its drain on the battery is could be really important. But that isn’t taken into account by the pixel drawing aspects of that benchmark,” Grant explained. “When we look at the AI benchmarks themselves, remember that the whole artificial intelligence, machine learning and neural network space is changing all the time. Every single day something pops in with a new development, which is what makes it such an exciting space to work in. But for the areas that we work in, we hear everything from floating point performance to quantized inferences per second and things like that. TOPS may be one of the most commonly used ways of denominating the performance of something, but it’s not a complete figure. In order to really understand the performance of any chip or any architecture, you need to know what the performance is, at what clock speed, on what nodes, i.e., 16nm, or 7nm, or 5nm, or even an earlier version at 28nm or 40nm, because all of that will affect chip size in the ultimate SoC. So it’s quite a tricky thing to look at.”

To make matters more confusing, different benchmarks apply at different levels in the development of an SoC. In Imagination’s case, because its IP will then go into someone else’s SoC, a lot of the benchmark activity is related directly to a particular network under a particular set of circumstances. “We are often asked whether we can run a representative network of what the customer is likely to actually use, and then we will work through and run that on a GPU or run that on our neural network accelerator. And so, interestingly enough, if we look at our neural network accelerator, we might talk about it being a 10 TOPS accelerator. But when you want to then look at how it actually performs under certain circumstances with a certain network, then other things may become more important,” Grant said.

Beyond MLPerf, a number of organizations are seeking to establish a more comprehensive basket as a benchmark. This is similar to what happened in the CPU and GPU space. For instance, does a lot of work on networks that already have made their way into silicon.

What is needed today is a common understanding of what the different areas of AI benchmarks mean, and what the implications are for unbalanced benchmarks. “There are examples in GPU-land where manufacturers end up having to chase a particular benchmark, because that’s one of the common benchmarks that is quoted, although it may no longer be relevant for the users. You end up in that rather strange situation where the numbers that have been displayed on the billboard are great, but the user actually wants something entirely different. We’re keen to ensure that sort of thing doesn’t happen to the same extent in the neural network accelerator and compute space. People will look at floating point performance at, say, FP 32 or FP 16. In the case of our neural network accelerator, it’s fixed point and it’s a quantized performance. That’s a different animal, so having those sorts of numbers and making sure that tests are run that are relevant and regular, is a really important aspect of that,” Grant said.

Fig. 1: Neural network

Benchmark accuracy
Beyond all of this is the accuracy of the AI benchmark results. “This is a key point that was introduced in the latest version of AI-Benchmark,” said Salvatore De Dominicis, application engineering manager for AI at Imagination. “When you’re analyzing a quantized network, there is a very high risk of losing precision and getting a result that is not actually accurate unless the IP is designed in a way to preserve precision and accuracy when it’s needed. This is something that is not really a factor when you’re talking about CPU or GPUs, because you mostly look at performance there. There is something on image quality for the GPU. But when you’re handling neural network workloads, a key point is that you also want to protect the numerical accuracy. That’s a new concept.”

Accomplishing all of this, while also maintaining performance and low power consumption, is non-trivial. “This is why the benchmark vendors like AI-Benchmark actually pushed the introduction of this new category,” De Dominicis said. “Basically, the score is heavily influenced by the quality of the results, which is in the optic of the whole market being quite new. So the situation is not stable. Things are moving. Initially there were a lot of solutions that had at first looked like they had very good performance, but the accuracy was very bad, which limited the practical usability of some products.”

To make IP more accurate requires designing the hardware so that the internal precision is kept high when necessary, De Dominicis noted. “The problem then becomes understanding when it’s necessary, because you obviously don’t want to pay the cost everywhere. Otherwise, you will have increasing area and power consumption. The idea is that you want to have a good final accuracy. This means that the output of the network is as close as possible to the theoretical output. We provide a set of tools that can help our customers to modify the network so that we can basically recover some of the noise produced by these mathematical operations and keep the accuracy high even when we go to lower resolutions (as in lower number of bits) for the operation. Again, the landscape is changing a lot.”

Implications of architectural choices
Once the applications and the metrics are well-defined, designers face a critical choice. “What are the levers they can play with to get optimal performance and power metrics? Choosing the right architecture is extremely vital here, because it has the biggest ROI for improving results,” Saha said. “Here, too, one size does not fit all. A designer will need to align the goals and use cases to select the right architecture. Whether it is generalized versus specialized, is it targeted toward data centers or at the edge? Will it be predominantly used for training vs inference? That will help decide the right approach.”

Finding the optimal architecture requires running the benchmark applications as early as possible, getting accurate performance and power numbers, and having the ability to tweak the architecture quickly. And then, repeat everything. In other words, system architects need to take a build-measure-learn approach to create a use case-based optimized architecture for the AI hardware.

“There is a special class of application specific hardware accelerators, typically used in edge or IoT devices,” Saha said. “The use cases would be limited to a few specific neural network algorithms. Making the right architecture choices becomes even more vital for these accelerators, which must be tuned for the application. For example, in a deep neural network, different layers might need different architectures. Some layers might have millions of weights, which need to be stored in off-chip RAMs. Other layers might have a different size or type of convolutions. A general-purpose hardware architecture would not be efficient because it will not achieve 100% utilization as well as require redundant read/write from system memories. Typically, data locality will improve power consumption significantly.”

Looking ahead
Within this rapidly evolving area, the challenges are constantly evolving. A network may not be the right one, or there might be a newer one coming.

“The newer one may require more processing capacity, or they may require some different kind of things,” said Cadence’s Desai. “Development teams end up looking at two levels. First, they make sure they have a scalable solution because they need flexibility or future proofing. This is about having the ability to respond to the changes in the marketplace by having a flexible architecture. It could be flexible in terms of programmability or how they make it. It could be scalable. Second, as the new network comes in, how quickly can you take the network definition and convert it into code or a program that can start working on your device, whether it’s an IP or otherwise? That’s where the network compiler comes in that takes the network definition from Caffe2 or TensorFlow or PyTorch. You push a button and hopefully you have an executable that’s optimized for your architecture, so as a new network comes into the field or your scientists come up with a new network, you can deploy very quickly what you are looking for.”
