The Problem With Benchmarks

What makes a good benchmark and who should create it? This is an issue the industry has been slow to address, but progress is being made.

Benchmarks have long been used to compare products, but what makes a good benchmark, and who should be trusted to create one? The answers to those questions are less obvious than they may appear on the surface, and some benchmarks are being used in surprising ways.

Everyone loves a simple, clear benchmark, but that is only possible when the selection criteria are equally simple. Unfortunately, that is rarely the case. Benchmarks often favor an entrenched product or existing architecture over a newcomer, simply because the incumbent has had more time to optimize for that benchmark. Consider how long MIPS (millions of instructions per second) was used as the sole gauge of processor performance. Because there was no standard definition of an instruction, the figure was arbitrary at best, and few people select a device based on a single metric anyway.

Today, performance for a defined application, per joule, per dollar, may be a better metric, but it lacks both simplicity and universality. In addition, the total processing capability of a system is determined by many things other than the processor. Memory may actually be the largest contributor, especially in machine learning systems, and the compiler toolchain can have a huge impact as well.
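As a rough illustration of such a composite metric, the sketch below scores hypothetical devices by throughput per joule per dollar for a fixed workload. The device names and figures are invented for illustration only; a real comparison would need measured numbers for a workload that matters to the buyer.

```python
# Hypothetical composite metric: throughput per joule per dollar for one run
# of a defined workload. All names and numbers below are made up.

devices = {
    # name: (frames processed, energy used in joules, unit cost in dollars)
    "device_a": (10_000, 25.0, 12.0),
    "device_b": (14_000, 60.0, 9.0),
}

def perf_per_joule_per_dollar(frames, joules, dollars):
    """Frames per joule per dollar for one run of the workload."""
    return frames / joules / dollars

for name, (frames, joules, dollars) in devices.items():
    score = perf_per_joule_per_dollar(frames, joules, dollars)
    print(f"{name}: {score:.2f} frames/J/$")
```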

Some people prefer benchmarks that target components rather than systems. “Traditionally, synthetic benchmarks have been used to measure processor performance by providing a framework where everyone has access to a common set of code with which to test their systems,” says Rod Watt, director of applied technology for Arm’s Automotive and IoT Line of Business. “Although these tests will provide an indication of processor capabilities, they are in no way a substitute for running real-life workloads.”

The same type of issue pertains to machine learning processors. “Customers are helped by benchmarks that provide some insight into which acceleration solution can give them the best performance for their neural network model within their dollar and power budget,” says Geoff Tate, CEO of Flex Logix. “Consider that an 8-core processor doesn’t run 8 times faster than a single-core processor. How much faster it runs depends on things like cache hit rates and resolution of bus access contention and shared memory access contention, which are very hard to model. So there needs to be a range of benchmarks that vary depending on the type of application the customer plans to do: the benchmark needs to have computation/operator types and image/memory sizes that are similar to what the customer plans to deploy/develop.”
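Tate’s point about sub-linear multicore scaling can be illustrated with a toy model. The sketch below combines a serial fraction (Amdahl’s law) with a crude contention penalty per added core; both parameters are hypothetical, and real scaling depends on the cache and interconnect behavior he describes.

```python
# Toy model of why N cores rarely deliver N-times the performance of one core:
# a serial fraction (Amdahl's law) plus a crude shared-resource contention
# penalty that grows with core count. Both parameters are hypothetical.

def estimated_speedup(cores, serial_fraction=0.05, contention_per_core=0.03):
    """Estimated speedup over a single core."""
    amdahl = 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)
    contention = 1.0 + contention_per_core * (cores - 1)  # extra stalls per added core
    return amdahl / contention

for cores in (1, 2, 4, 8):
    print(f"{cores} core(s): ~{estimated_speedup(cores):.2f}x")
```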

Processors can be tough to benchmark because there are so many variables. “Synthetic benchmarks tend to concentrate on core performance, with some actually running entirely from the cache without stressing the rest of the system,” adds Arm’s Watt. “It’s very difficult to gauge how a design will run in real life when it has to deal with factors such as memory bandwidth, I/O latency, power consumption, and thermal issues. This is particularly important in the areas of ML and IoT, where the system is dealing with large and varied data sources. Utilizing a test and measurement methodology that takes this into consideration is key to getting a reliable indication of the system’s performance.”
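The cache-resident effect Watt mentions can be seen with a simple micro-experiment: measure the throughput of a streaming reduction as the working set grows past the cache sizes. This is only a sketch of the idea; Python and NumPy overheads blur the picture, and the numbers vary from machine to machine.

```python
# Sketch: throughput of a simple streaming reduction versus working-set size.
# Small arrays tend to stay cache-resident; large ones force trips to main
# memory, so effective bytes/second typically drops. Results are machine-dependent.
import time
import numpy as np

def measure_gbps(n_bytes, repeats=20):
    data = np.ones(n_bytes // 8, dtype=np.float64)  # 8 bytes per element
    data.sum()                                      # warm up caches and the allocator
    start = time.perf_counter()
    for _ in range(repeats):
        data.sum()
    elapsed = time.perf_counter() - start
    return (n_bytes * repeats) / elapsed / 1e9

for size in (32 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    print(f"working set {size // 1024:>6} KiB: ~{measure_gbps(size):.1f} GB/s")
```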

Others have similar observations. “Today’s SoCs are becoming software workload driven,” says Ravi Subramanian, vice president and general manager for Siemens EDA. “So instead of computer architects working in isolation, looking at a set of things, you actually have to look at the workloads, which are going to be the benchmarks for power and performance and drive the SoC architecture.”

But it goes further than that. The toolchain is an important part of the ecosystem, and that includes compilers. David Patterson, professor of computer science at the University of California, Berkeley, recently gave the keynote at the Embedded Vision Summit. He compared GCC to LLVM, and also compared an Arm core to a RISC-V core when using the same compiler. He found that the compilers had a bigger impact on performance than the ISAs. “The lesson for embedded benchmarking is that code size has to be shown with performance. So far, none of the embedded benchmarks include code size, which is needed to get meaningful results. It is also important that we include the geometric standard deviation, as well as the geometric mean, in the results. More mature architectures have more mature compilers, which helps them. But newer architectures will catch up.”
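A minimal sketch of the kind of summary Patterson is asking for: report speed and code size together, summarized with the geometric mean and geometric standard deviation across benchmarks. The per-benchmark ratios below are placeholders relative to some reference platform.

```python
# Summarizing per-benchmark speed and code-size ratios (device under test vs.
# a reference) with the geometric mean and geometric standard deviation.
# The ratios below are placeholders, not real measurements.
import math

speed_ratios = [1.10, 0.95, 1.30, 1.05]   # >1 means faster than the reference
size_ratios  = [1.40, 1.20, 1.35, 1.10]   # >1 means larger code than the reference

def geo_mean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def geo_stddev(xs):
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    var = sum((l - mu) ** 2 for l in logs) / len(logs)
    return math.exp(math.sqrt(var))

print(f"speed: geomean {geo_mean(speed_ratios):.2f}, geo-stddev {geo_stddev(speed_ratios):.2f}")
print(f"size:  geomean {geo_mean(size_ratios):.2f}, geo-stddev {geo_stddev(size_ratios):.2f}")
```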

Roddy Urquhart, senior marketing director at Codasip, provides a concrete example of this. “Consider compiling the CoreMark benchmark with different switches using the common GCC compiler. Figure 1 shows CoreMark/MHz and code size for different compiler settings. The last example is one that is typical of vendor performance data, where many switches are used for CoreMark (CM = ‘-O3 -flto -fno-common -funroll-loops -finline-functions -falign-functions=16 -falign-jumps=8 -falign-loops=8 -finline-limit=1000 -fno-if-conversion2 -fselective-scheduling -fno-tree-dominator-opts -fno-reg-struct-return -fno-rename-registers --param case-values-threshold=8 -fno-crossjumping -freorder-blocks-and-partition -fno-tree-loop-if-convert -fno-tree-sink -fgcse-sm -fgcse-las -fno-strict-overflow’),” he explained. “In this example, the CoreMark/MHz score grows as the switches change from left to right. However, it is interesting to note that the most complex set of switches increases the code size by 40% over ‘-O3’ while the performance only improves by 14%.”

Fig 1. CoreMark performance for different compiler optimizations. Source: Codasip.

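A driver along the following lines could reproduce that kind of comparison: build the same benchmark with different GCC switch sets and report the size of the .text section via the size utility. The source file list and the simple flag sets are placeholders, and the long vendor-style switch string quoted above would just be another entry in the table.

```python
# Sketch of a driver that builds one benchmark with several GCC switch sets
# and reports the resulting .text size. Source files and flag sets are
# placeholders; real CoreMark builds need the full source list and defines.
import subprocess

SOURCES = ["core_main.c", "core_list_join.c"]   # placeholder source files
FLAG_SETS = {
    "O2": ["-O2"],
    "O3": ["-O3"],
    "O3_unroll_inline": ["-O3", "-funroll-loops", "-finline-functions"],
}

def text_size(binary):
    """Return the .text size (bytes) reported by the `size` utility."""
    out = subprocess.run(["size", binary], capture_output=True, text=True, check=True)
    return int(out.stdout.splitlines()[1].split()[0])

for label, flags in FLAG_SETS.items():
    binary = f"coremark_{label}"
    subprocess.run(["gcc", *flags, *SOURCES, "-o", binary], check=True)
    print(f"{label:>18}: .text = {text_size(binary)} bytes")
```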

While this may not be important for some applications, it is a central issue for embedded systems. “People are designing hardware and software for antiquated things, using the wrong benchmarking technology,” says Patterson. “This bothered some of us so much that we decided to try and fix it. We have created an organization called Embench that is trying to do better for embedded computing.”

Keeping benchmarks up to date is also important. “The problem is that common benchmarks like ResNet-50 may not be a good indicator,” says Flex Logix’s Tate. “The reason is that ResNet-50 is an old benchmark that no one actually plans to use. As an old benchmark, its ‘native image size’ is just 224×224 pixels, whereas customers’ image sensors generate megapixel images, and megapixel images give much more accurate results. ResNet-50 does not stress the memory subsystem of an inference chip, and so it can give misleading indications compared with a benchmark that does stress the memory subsystem, such as YOLOv3, which uses larger images and larger intermediate activations.”
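Some back-of-the-envelope arithmetic shows why image size matters so much here: the footprint of an intermediate activation scales roughly with height × width × channels. The channel count, downsampling factor, and 8-bit activations below are hypothetical; the point is only the ratio between a 224×224 input and a megapixel one.

```python
# Rough footprint of one intermediate activation tensor at different input
# resolutions, assuming 8-bit activations, a hypothetical 256 channels, and a
# spatial downsampling factor of 8. Only the relative sizes matter here.

def activation_bytes(height, width, channels=256, downsample=8, bytes_per_val=1):
    """Approximate size of one intermediate activation tensor."""
    return (height // downsample) * (width // downsample) * channels * bytes_per_val

inputs = {
    "ResNet-50 native (224x224)": (224, 224),
    "2-megapixel sensor (1920x1080)": (1920, 1080),
}

for label, (h, w) in inputs.items():
    print(f"{label}: ~{activation_bytes(h, w) / 1e6:.1f} MB per activation tensor")
```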

This poses a challenge. “All parts matter because they are coupled,” says Tim Kogel, principal engineer at Synopsys. “It does not make sense to come up with a beautiful hardware architecture and not have a good ML compiler that takes advantage of all the features in your hardware. The algorithms, which include the quality of the data and the architecture of the network itself, are the domain of the data scientists. When looking at it from the perspective of the semiconductor or system house building an inference chip, it is about mapping these algorithms and running them. It is the toolchain and the hardware together that are responsible for the quality of results and how well the metrics are met. Plus, the combination of metrics is different for different applications. ML can be applied in many areas, and all of them have different requirements for performance, power, and accuracy. It is difficult to come up with one that fits all.”

Who creates the benchmarks?
At first it may seem obvious who should create a benchmark and who has a conflict of interest, but it’s not always that clear. “There is no way to create an objective, universal benchmark, from either a user or a vendor perspective,” says Juergen Jaeger, product management group director at Cadence. “Both sides have different objectives. If I am a vendor, I want to influence the benchmark in a way that makes my product look good and the competition’s product look bad. If I am a user, then I am looking for a benchmark that allows me to make a decision about which product is best for my end user needs.”

That would seem to suggest the user should put together the benchmarks. “Not so fast,” cautions Jaeger. “Users want to put a benchmark together that reflects their use cases and the things that are important to them. For some users, it is all about performance. For others, it’s about how easy it is to use. Some only care about pricing. What are the right criteria for a benchmark, and how do you rank them in order of priority? As a vendor, we see a much larger variety of design styles, language coverage, how customers are putting the code together, etc. These can have a big impact on the results, and the benchmark shouldn’t be biased toward a particular design style or circuit topology.”

Vendors often utilize benchmarks to help themselves improve, as well. “Companies often balance efforts between internal benchmarking and industry benchmarking,” says Dylan Zika, AI/ML product manager at Arm. “Internal efforts focus on improving the processor IP for the needs of specific customers, while industry benchmarking efforts improve processor IP for the broad needs of the industry. In order to achieve this balance in a cost-efficient way, we need industry-wide support to create benchmarks, datasets, and best practices to empower the whole industry. Working collaboratively can be a powerful enabler of improved business performance, but successful collaboration rarely emerges out of the blue and should not be taken for granted.”

Benchmark trap
Some benchmarks have had a negative impact on the industry. Once everyone designs and optimizes for them, the benchmark becomes useless, and the products may have been optimized for something that is irrelevant. “The industry will of course try to optimize. Everyone will try to optimize their tool or product for that benchmark,” says Cadence’s Jaeger. “In the end, after a little bit of time, when the waveform settles down, you will have a situation where everything performs equally well. Then what do you do as a user? They all look the same. They all perform the same.”

The graphics industry is very aware of this problem. “What you don’t want to see is what we have seen in graphics where a particular benchmark becomes dominant because everyone has heard of it and you find companies designing to get the best score on that benchmark,” says Andrew Grant, senior product director at Imagination Technologies. “That distorts what they do. What you need is a basket of relevant benchmarks that get updated over time that the industry can understand and work with, but that do not distort the picture. We should not be trying to drive AI forward by looking in the rear-view mirror optimizing stuff that was important years ago.”

Some industries have done a better job creating benchmarks than others. “Good benchmarks are very common in the PC industry,” says Jaeger. “You have Geekbench for all kinds of systems. In the mobile space, there are things like AnTuTu. In those industries this works. These publicly available benchmark suites are also used by our customers in part to determine the performance of EDA tools.”

There are other cases where benchmarks prove useful outside of their intended scope. In its Q4 2020 newsletter, a team working on an update to EEMBC’s IoTMark-WiFi benchmark reported a surprising discovery. While the benchmark focuses on the battery life of IoT devices, it also exposed vendor-specific variations among access points, such as routers. Some routers caused end devices to consume more power than others if certain configuration options were not set.

An increasing number of organizations are taking up the challenge of creating benchmark suites that focus on a variety of industries. In a recent blog, Arm’s Zika talked about MLCommons. “MLCommons is a global engineering nonprofit that successfully employs a holistic approach to measuring performance, creating datasets, and establishing best practices. The benchmarking group enables open and transparent consensus among competing entities to create a fair playing field, and it is supported by more than 30 founding members from the commercial and research communities. Its practices enforce replicability to ensure reliable results and are complementary to micro-benchmark efforts. MLCommons is keeping benchmarking efforts affordable, so all can participate to help grow the market and increase innovation together.”

It is important that these benchmarks, and the environments they operate in, are kept up to date. Consider the transition of EDA tools into the cloud. “Getting data in and out of the cloud can significantly impact the efficiency of a tool flow,” says Jaeger. “Consider the size of waveform files being generated by emulators, or the amount of data created for power analysis. We are now trying to put more intelligence into the machine. We have additional CPUs, and we sometimes call it edge processing in the machines, so we can offload things. In the past, when we probed data, we would transfer the raw data to the host. Now we do data compression within the machine. This helps with the communication, it saves storage, and it also helps the host workstation create the waveform, because now the waveform is built from a smaller amount of data. The change of environment influences the benchmark, and that, in turn, influences the product itself.”
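The principle Jaeger describes, compressing probe data at the source so less has to travel to the host, can be illustrated with a synthetic trace and a general-purpose compressor. Real emulators use their own formats and hardware-assisted compression, so this is only a sketch of the idea.

```python
# Waveform-style data is highly repetitive (signals toggle rarely between
# samples), so compressing it at the source shrinks what must be shipped to
# the host. The trace is synthetic, and zlib stands in for the real compressor.
import random
import zlib

random.seed(0)
samples, value = [], 0
for _ in range(100_000):
    if random.random() < 0.01:    # roughly 1% chance the signal toggles
        value ^= 1
    samples.append(value)

raw = bytes(samples)
packed = zlib.compress(raw, 6)
print(f"raw: {len(raw)} bytes, compressed: {len(packed)} bytes "
      f"(~{len(raw) / len(packed):.0f}x smaller)")
```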

Conclusion
It takes a dedicated group of diverse people to create and maintain a benchmark suite. The industry is finding those benchmarks need to become more attuned to the needs of particular market segments and reflect the things that are important to them. The notion of universal benchmarks is no longer valid, except in limited cases. System-focused benchmarks are becoming more important, but also a lot more difficult to put together in objective ways.


