Some designs focus on power, while others focus on sustainable performance, cost, or flexibility. But choosing the best option for an application based on benchmarks is becoming more difficult.
Every new processor claims to be the fastest, the cheapest, or the most power frugal, but how those claims are measured and the supporting information can range from very useful to irrelevant.
The chip industry is struggling far more than in the past to provide informative metrics. Twenty years ago, it was relatively easy to measure processor performance. It was a combination of the rate at which instructions were executed, how much useful work each instruction performed, and the rate at which information could be read from, and written to, memory. This was weighed against the amount of power the processor consumed and its cost, although neither was as important then as it is today.
When Dennard scaling ended, clock speeds stopped increasing for many markets and MIPS ratings stagnated. Improvements were made elsewhere: in the architecture, in the memory connection, and by adding more processors. But no new performance metrics were created.
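In textbook terms (a standard formulation, not one given in the article), the first two of those factors combine into the familiar expression for execution time, with memory behavior folded into the effective cycles per instruction (CPI):

\[
\text{CPU time} = \text{instruction count} \times \text{CPI} \times t_{\text{clk}}
\]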
“For the better part of the last two decades there has been a creepy silence,” says Ravi Subramanian, senior vice president and general manager for Siemens EDA. “That silence was created by Intel and Microsoft, which controlled the contract that existed between computer architecture and the workload running on it, the application. That has driven a large part of computing, and especially the enterprise. We now have some very specific types of compute, which are more domain-specific or niche, that broke away from traditional von Neumann architectures. The millions of operations per second per milliwatt per megahertz had been flattening out, and in order to get much greater computation efficiency, a new contract had to be built between the workload owner and the computer architect.”
It became important to consider the application when attempting to measure the qualities of a processor. How well did this processor perform a particular task, and under what conditions?
GPUs and DSPs started the industry down the path of domain-specific computing, but today it is being taken to a new level. “As classic Moore’s Law slows down, innovation has shifted toward domain-specific architectures,” says James Chuang, product marketing manager for Fusion Compiler at Synopsys. “These new architectures can achieve orders of magnitude improvement in performance per watt on the same process technology. They open a vast unknown space for design exploration, both at the architecture level and the physical design level.”
There have been attempts to define new metrics that mimic those from the previous era. “AI applications require some specific capabilities in a processor, most notably large numbers of multiply/accumulate operations,” says Nick Ni, director of product marketing for AI and software and solutions in AMD’s Adaptive & Embedded Computing Group. “Processors define the trillions of operations per second (TOPS) that they can execute, and those ratings have been increasing rapidly (see figure 1). But what is the real performance in terms of performance per watt, or performance per dollar?”
Fig. 1: Growth in AI TOPS ratings. Source: AMD/Xilinx
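To illustrate Ni’s point, the sketch below (Python, with invented numbers for a hypothetical accelerator) shows how a raw TOPS rating translates into the performance-per-watt and performance-per-dollar figures that actually matter. The utilization, power, and price values are assumptions, since real devices rarely sustain their advertised peak.

```python
# Hypothetical accelerator figures -- illustrative only, not vendor data.
peak_tops = 400.0        # advertised peak, trillions of ops/second (assumed)
utilization = 0.35       # fraction of peak sustained on a real workload (assumed)
board_power_w = 300.0    # typical board power in watts (assumed)
unit_price_usd = 8000.0  # purchase price (assumed)

sustained_tops = peak_tops * utilization
tops_per_watt = sustained_tops / board_power_w
tops_per_dollar = sustained_tops / unit_price_usd

print(f"Sustained: {sustained_tops:.0f} TOPS")
print(f"Efficiency: {tops_per_watt:.2f} TOPS/W")
print(f"Cost efficiency: {tops_per_dollar * 1000:.1f} TOPS per $1,000")
```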
With chip sizes reaching the reticle limit, it becomes more expensive and difficult to add transistors to a die, even with process scaling, so performance gains can only come from architectural changes or new packaging technologies.
Multiple smaller processors often are better than a single larger one. Bringing multiple dies together in a package also allows the connections to memory and to other computation cores to undergo architectural improvements. “You might have multiple processing units joined together in a package to provide better performance,” says Priyank Shukla, staff product marketing manager at Synopsys. “This package, which will have multiple dies, will work as a bigger or more powerful compute infrastructure. That system provides the sort of Moore’s Law scaling that the industry is used to seeing. We are reaching the limit where an individual die will not provide the performance improvement. But now these are the systems that are giving you the performance improvement of 2X in 18 months, which is what we are used to.”
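For reference, the cadence Shukla cites works out to an annualized factor of

\[
2^{12/18} \approx 1.59,
\]

i.e. roughly 59% more compute per year, now delivered by package-level integration rather than by a single die.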
Workloads are driving new requirements in computer architectures. “These go beyond traditional von Neumann architectures,” says Siemens’ Subramanian. “Many of the new types of workloads need analysis, and they need to create models. AI and ML have become essentially the workforces to drive model development. How do I model, based on training data, so that I can then use the model to predict? That’s a very new type of workload. And that is driving a very new view about computer architecture. How does computer architecture mate with those workloads? You could implement a neural network or a DNN on a traditional x86 CPU. But if you look at how many millions of operations per milliwatt, per megahertz, you could get, and consider the word lengths, the weights, the depth of these, they can be far better delivered in a much more power efficient way by mating the workload to computer architecture.”
The workloads and performance metrics differ depending upon location. “The hyperscalers have come up with different metrics to benchmark different types of compute power,” says Synopsys’ Shukla. “Initially they would talk about Petaflops per second, the rate at which they could perform floating point operations. But as the workloads have become more complex, they are defining new metrics to evaluate both hardware and software together. It’s not just the raw hardware. It’s the combination of the two. We see them focusing on a metric called PUE, which is power usage effectiveness. They have been working to reduce the power needed to maintain that data center.”
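PUE has a simple, widely used definition (from The Green Grid): total facility power divided by the power delivered to the IT equipment, so an ideal data center approaches 1.0. A minimal sketch with assumed numbers:

```python
# Power usage effectiveness (PUE): facility power divided by IT equipment power.
# All kilowatt figures below are assumed for illustration, not measured values.
it_equipment_kw = 1000.0      # servers, storage, network
cooling_kw = 350.0            # chillers, CRAC units, fans (assumed)
power_delivery_kw = 80.0      # UPS and distribution losses (assumed)
lighting_misc_kw = 20.0       # everything else (assumed)

total_facility_kw = it_equipment_kw + cooling_kw + power_delivery_kw + lighting_misc_kw
pue = total_facility_kw / it_equipment_kw
print(f"PUE = {pue:.2f}")     # 1.45 here; an ideal facility approaches 1.0
```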
What has been lost is the means to compare any two processors, except when running a particular application under optimal conditions. Even then, there are problems. Can the processor, and the system in which it is used, sustain its performance over a long period of time? Or does it get throttled because of heat? What about when multiple applications are running on the processor at the same time, causing different memory access patterns? And is the most important feature of a processor outside of a data center its performance, or is it battery life and power consumption, or some balance between the two?
“If you step back and look at this at a very high level, it’s still about maximum compute capability at the lowest power consumed,” said Sailesh Chittipeddi, executive vice president and general manager of Renesas’ IoT and Infrastructure Business Unit. “So you can think about what kind of computing capabilities you need, and whether it is optimized for the workload. But the ultimate factor is that it still has to be at the lowest power consumption. And then the question becomes, ‘Do you put the connectivity on-board, or do you leave it outside? What do you do with that in terms of optimizing it for power consumption?’ That’s something that has to be sorted out at the system level.”
Measuring that is difficult. Benchmark results are not just a reflection of the hardware, but also of the associated software and compilers, which are far more complicated than they were in the past. This means performance for a particular task may change over time, without any change in the underlying hardware.
Architectural considerations do not stop on the pins of a package. “Consider taking a picture on an advanced smartphone,” says Shukla. “There is AI inference being performed in the CMOS sensor that captures the image. Second, the phone has four cores for additional AI processing. The third level happens at the data center edge. The hyperscalers have rolled out different levels of inferencing at different distances from the data capture. And finally, you will have the really big data centers. There are four levels where the AI inferencing happens, and when we are accounting for power we should calculate all of this. It starts with IoT, the phone in your hand, all the way to the final data center.”
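One way to read Shukla’s point is as a simple accounting exercise: the energy attributable to a single inference-driven action is the sum across every stage that touches it. The sketch below uses invented per-stage figures purely to show the bookkeeping, not real measurements.

```python
# Energy per photo, summed across the four inference stages Shukla describes.
# All millijoule figures are invented for illustration.
energy_mj = {
    "sensor-level inference": 5,     # in the CMOS image sensor
    "on-device AI cores": 40,        # the phone's AI processing cores
    "edge data center": 250,         # nearby inferencing tier
    "core data center": 900,         # large-scale back end
}

total_mj = sum(energy_mj.values())
for stage, mj in energy_mj.items():
    print(f"{stage:>24}: {mj:5d} mJ ({100 * mj / total_mj:4.1f}%)")
print(f"{'total':>24}: {total_mj:5d} mJ")
```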
With so many startup companies creating new processors, it is likely that many will succeed or fail because of the quality of their software stack, not the hardware itself. Adding to the difficulties, the hardware has to be designed well in advance of knowing what applications it may be running. In those situations, there is nothing to even benchmark the processor against.
Benchmarks
Benchmarks are meant to provide a level playing field so that two things can be directly compared, but they remain open to manipulation.
When a particular application becomes significant enough, the market demands benchmarks so that competing solutions can be rated. “There are benchmarks for different types of AI training,” says Shukla. “ResNet is the benchmark for image recognition, but this is a performance benchmark, not a power benchmark. Hyperscalers will show the efficiency of their compute based on hardware plus software. Some even build custom hardware, an accelerator, that can execute the task better than a vanilla GPU, or vanilla FPGA-based implementation. TensorFlow is one example coupled with the Google TPU. They benchmark their AI performance based on this, but power is not part of the equation as of now. It’s mostly performance.”
Ignoring power is a form of manipulation. “A 2012 flagship phone had a peak clock frequency of 1.4GHz,” says Peter Greenhalgh, vice president of technology and fellow at Arm. “Contrast this with today’s flagship phones which reach 3GHz. For desktop CPUs the picture is more nuanced. While Turbo frequencies are only a little higher than they were 20 years ago, the CPUs are able to stay at higher frequencies for longer.”
But not all benchmarks are of a size or runtime complexity to ever reach that point. “As power is consumed, the temperature rises,” says Preeti Gupta, head of PowerArtist product management at Ansys. “And once it goes beyond a certain threshold, then you have to throttle back the performance (as shown in figure 2). Power, thermal, and performance are very tightly tied together. Designs that do not take care of their power efficiency will have to pay the price in terms of running slower. During development, you have to take real use cases, run billions of cycles, and analyze them for thermal effects. After looking at thermal maps, you may need to move part of the logic in order to distribute heat. At the very least, you need to put sensors in different locations so that you know when to throttle back the performance.”
Fig. 2: Performance throttling can affect all processors. Source: Ansys
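The feedback Gupta describes amounts to a control loop: on-die sensors report temperature, and firmware steps the clock down when a threshold is crossed and back up once the die cools. A minimal sketch of that loop is below; the thresholds, frequency steps, and sensor readings are assumed values, not taken from any particular product.

```python
# Simplified thermal-throttling loop; thresholds and steps are assumed values.
FREQ_STEPS_GHZ = [3.0, 2.6, 2.2, 1.8]   # available operating points
THROTTLE_ABOVE_C = 95.0                 # back off above this temperature
RECOVER_BELOW_C = 80.0                  # restore performance below this

def next_operating_point(current_index: int, die_temp_c: float) -> int:
    """Return the index of the frequency step to use for the next interval."""
    if die_temp_c > THROTTLE_ABOVE_C and current_index < len(FREQ_STEPS_GHZ) - 1:
        return current_index + 1        # step down to shed heat
    if die_temp_c < RECOVER_BELOW_C and current_index > 0:
        return current_index - 1        # step back up once cool
    return current_index

# Example: sensor readings over successive intervals
idx = 0
for temp in [70, 92, 97, 99, 96, 85, 78, 72]:
    idx = next_operating_point(idx, temp)
    print(f"{temp:5.1f} C -> {FREQ_STEPS_GHZ[idx]} GHz")
```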
Over time, architectures optimize for specific benchmarks. “Benchmarks continue to evolve and mirror real-world usage, which can be relatively easy to create and deploy using well-established methodologies at the system software level, or at the silicon testing stage,” says Synopsys’ Chuang. “However, analyzing is always after the fact. The bigger challenge in chip design is how to optimize for these benchmarks. At the silicon design phase, common power benchmarks are typically represented only by a statistical toggle profile (SAIF) or a very short sample window — 1 to 2 nanoseconds of the actual activity (FSDB). Instead of ‘what to measure,’ the bigger trend is ‘where to measure.’ We are seeing customers pushing end-to-end power analysis throughout the full flow to accurately drive optimization, which requires a consistent power analysis backbone from emulation, simulation, optimization, and sign-off.”
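The reason a toggle profile can stand in for power at all is the standard dynamic-power relationship (a textbook approximation, not something specific to any one tool):

\[
P_{\text{dyn}} \approx \alpha \, C_{\text{sw}} \, V_{DD}^{2} \, f
\]

where \(\alpha\) is the node toggle rate that SAIF-style statistics summarize and an FSDB window captures cycle by cycle, \(C_{\text{sw}}\) is the switched capacitance, \(V_{DD}\) the supply voltage, and \(f\) the clock frequency.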
Benchmarks can identify when there is a fundamental mismatch between the application and the hardware architecture it is running on. “There can be major dark silicon when you are running realistic workloads on some architectures,” says AMD/Xilinx’s Ni. “The problem is really the data movement. You are starving the engine, and this results in a low compute efficiency.”
Even this does not tell the whole story. “There are an increasing number of standard benchmarks that a consortium of people agree to,” adds Ni. “These are models people consider state-of-the-art. But how effective are they at running the models you may care about? What is the absolute performance, or what is your performance per watt, or performance per dollar? That is what decides the actual OpEx of your cabinets, especially in the data center. The best performance or power efficiency, and cost efficiency, typically are the two biggest care-abouts.”
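To make the OpEx point concrete, here is a rough sketch (Python, all inputs assumed) of the annual energy bill for one rack of accelerators, which is where performance per watt shows up directly:

```python
# Rough annual energy cost for one accelerator rack; all inputs are assumed.
cards_per_rack = 16
board_power_w = 300.0          # sustained power per card (assumed)
pue = 1.4                      # facility overhead factor (assumed)
price_per_kwh_usd = 0.10       # electricity price (assumed)
hours_per_year = 24 * 365

rack_kw = cards_per_rack * board_power_w / 1000.0
facility_kw = rack_kw * pue
annual_cost = facility_kw * hours_per_year * price_per_kwh_usd
print(f"~${annual_cost:,.0f} per rack per year")   # roughly $5,900 with these inputs
```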
Others agree. “From our perspective there are two metrics that are growing in importance,” says Andy Heinig, group leader for advanced system integration and department head for efficient electronics at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “One of them is power consumption or operations per watt. With increasing costs for energy, we expect this will grow in importance. A second growing metric is resulting from the chip shortage. We want to sell products with the smallest numbers of devices, but with the highest performance requirements. This means that more and more flexible architectures are necessary. We need a performance metric that describes the flexibility of a solution regarding changes for different applications.”
A key challenge in chip design is that you don’t know what the future workloads will be. “If you don’t know the future workloads, how do you actually design architectures that are well mated to those applications?” asks Subramanian. “That’s where we’re seeing a real emergence of computer architecture, starting with understanding the workload, profiling and understanding the best types of data flow, control flow, and memory access that will dramatically reduce the power consumption and increase the power efficiency of computing. It really comes down to how much energy are you spending to do useful computation, and how much energy are you spending moving data? What does that overall profile look like for the types of applications?”