The Next Big Leap: Energy Optimization

What’s required to optimize your design for energy? The simple answer is a new EDA flow that goes from conception to implementation.


The relationship between power and energy is technically simple, but its implications for the EDA flow are enormous. There are no tools or flows today that allow you to analyze, implement, and optimize a design for energy consumption, and getting to that point will require a paradigm shift within the semiconductor industry.

The industry talks a lot about power, and power may have become a more important design metric than performance in some markets. Power is important because knowledge about it can be used to correctly size the power distribution network. It also can help predict thermal issues and provide guidance for many types of optimizations.

We often talk about power because we know how to measure, analyze, and optimize it. But what many people really care about is energy, and that presents a lot more challenges.

“Multiple design houses have told us they want to do analysis for energy, not just power,” says Qazi Ahmed, principal product manager for the Calypto group of Mentor, a Siemens Business. “Power has become a first-class metric. In fact, it has just toppled performance as the primary metric for a design goal. But the real goal is to develop IPs that are energy-efficient. In design, energy efficiency may or may not always be equal to low power.”

Power tells you how much energy is being consumed per unit of time. When doing power optimization, attempts are made to remove unnecessary activity, and this is good. But it cannot tell you if the energy spent was useful or if the same task could have been performed using less energy.

“We talk a lot about power, often as a proxy for energy, and occasionally forget the difference,” says James Myers, distinguished engineer at Arm. “The difference, of course, is integrating power over time — but how much time spent doing what?”
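As a rough illustration of that integration, the sketch below sums a sampled power trace into energy using the trapezoidal rule. The function name and both traces are invented for the example and do not come from any tool mentioned here.

```python
def energy_joules(times_s, power_w):
    """Integrate a sampled power trace (watts) over time (seconds).

    Uses the trapezoidal rule, so the samples need not be evenly spaced.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times and power samples must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical traces for the same task on two implementations.
fast_t, fast_p = [0.0, 1.0, 2.0], [2.0, 2.0, 2.0]   # 2 W for 2 s
slow_t, slow_p = [0.0, 2.5, 5.0], [1.0, 1.0, 1.0]   # 1 W for 5 s

print("fast:", energy_joules(fast_t, fast_p), "J")
print("slow:", energy_joules(slow_t, slow_p), "J")
```

In this made-up case the implementation that draws half the power takes 2.5 times as long, so it ends up using more energy for the same task.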

Power, energy, and performance are intertwined, often in complex ways. “While power is a key measure of how efficiently a design uses the available energy, overall energy consumption determines whether a design can operate with the desired performance within the thermal constraints,” says Arti Dwivedi, senior manager, product management at Ansys. “Maximizing design performance requires maximizing energy efficiency.”

What is missing in the definition of power is what constitutes a useful task. Once that is defined, it becomes possible to analyze how much energy was consumed performing that task. Now it becomes possible to tell if one architecture or implementation produces the same result more efficiently. How much energy is your system wasting on housekeeping functions? Do you actually reduce total energy by using a smaller, slower processor rather than running the same task on a faster processor? Does that processor extension allow your software to become more energy efficient?

The focus on power permeates the development process. When you run place-and-route, you are primarily optimizing for performance. But how different would the layout be if you were optimizing for power? And how would it change again if you were optimizing for energy? Moving from power optimization to energy optimization means that all tools would need to become task-driven. That requires understanding which tasks are most important to the device, and then using that information to ensure those tasks consume the minimum amount of energy.

This approach requires a deep collaboration with the ecosystem. “This is not trivial,” says Rob Knoth, product management director at Cadence. “The easiest thing that many of us have been doing is attacking the problem indirectly. Rather than identifying units of work, what we’re doing is more pervasively trying to optimize power, because we have those tools today. We do not waste work by optimizing power. At the end of the day, when we do identify those units of work, we’re going to need all these same tools — tools that we built into the flow that we are using to pervasively optimize power.”

This can get very complicated just on the power side. “There are several scaling vectors of interest in assessing and projecting power during the architecture phase,” says Dan Cermak, vice president of architecture and product planning for Ambiq Micro. “There is architectural scaling to account for new architectures and design features such as frequency changes, new hardware functions such as accelerators, power domain partitioning, and potentially voltage changes. There is process scaling to account for new or updated process parameters to determine Ceff (effective capacitance), wire loading effects, VT, voltage shifts, etc. Then there are design-related optimizations to take into account. All of these scaling vectors need to be assessed in the context of representative workloads.”

What is missing is an industry-standard way to define the tasks, scenarios, and workloads that are important to a system being designed. The Portable Stimulus Standard (PSS) is an attempt to define that capability. It is a high-level testbench language based on control and data flow through a design. But it is unclear at this point whether the standard is deficient in some way, making it too difficult to perform this role, or if it is just taking time to become accepted within the industry. The goal of PSS was to have a single way to define testbench scenarios that could be used throughout the development flow, because the input description was agnostic about the execution engine the design was to be run on.

Energy vs. power
Energy encompasses both active and leakage power. “Mobile and IoT devices are typically heavily duty cycled, so standby power is important as this will integrate over long standby times,” says Arm’s Myers. “But even in IoT, the active power and compute throughput can be as important. For example, executing TinyML neural networks for voice or image classification. Increased power here will be an energy win if the time to result is reduced by a larger amount, and this is why we are seeing continually increased processing capability in these devices.”

There are other ways to get to extremely low power device operation. “We can design at near-threshold voltages to take advantage of square law power reduction,” adds Myers. “But it’s possible to lower voltage and frequency to such a point that while power is decreased, active energy ends up increasing due to lower leakage over much longer time.” (See figure 1.)

Fig. 1: Power versus energy considerations. Source: Arm
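The voltage tradeoff Myers describes can be sketched with a toy model. Everything here is an assumption made for illustration: the constants, the constant leakage current, and the simplified frequency model f = K(V - Vt)^2 / V, which is not accurate near threshold and is not taken from Arm's data.

```python
# Toy model: all constants are hypothetical, chosen only to make the trend visible.
C_EFF = 1e-9     # effective switched capacitance per cycle, farads
V_T = 0.3        # threshold voltage, volts
K = 1e9          # frequency scale factor for f = K * (V - V_T)^2 / V
I_LEAK = 5e-3    # leakage current, amps (assumed constant across voltage)
CYCLES = 1e8     # clock cycles needed to finish the task


def task_metrics(vdd):
    """Return (average power in W, energy in J) to finish the task at vdd."""
    freq = K * (vdd - V_T) ** 2 / vdd       # simplified delay model
    runtime = CYCLES / freq
    e_dynamic = C_EFF * vdd ** 2 * CYCLES   # C * V^2 per cycle
    e_leakage = I_LEAK * vdd * runtime      # leakage power times runtime
    energy = e_dynamic + e_leakage
    return energy / runtime, energy


VOLTAGES = [1.0, 0.8, 0.6, 0.5, 0.4, 0.35]
results = {v: task_metrics(v) for v in VOLTAGES}
for v in VOLTAGES:
    power, energy = results[v]
    print(f"Vdd={v:.2f} V  power={power * 1e3:8.2f} mW  energy={energy * 1e3:7.2f} mJ")
```

With these made-up numbers, power falls at every step down in voltage, while energy per task bottoms out near 0.4 V and rises again at 0.35 V because leakage accumulates over the much longer runtime.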

Tradeoffs between energy and power can be non-intuitive even when concentrating on active power. “If you have an SoC with two cores — a high-performance core and a low-performance core — the high-performance core does more work and consumes more power,” says Mentor’s Ahmed. “The low-performance core may have 50% of the throughput compared to the high-performance core, and may consume 30% to 40% less power. In this case, the low-performance core is not as energy-efficient as the high-performance core, and running a task on that core will result in lower power but more total energy.”
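The arithmetic behind Ahmed's two-core example can be worked through directly, assuming a fixed amount of work and constant power while it runs. The helper name is hypothetical.

```python
def relative_energy(relative_power, relative_throughput):
    """Energy for a fixed task, relative to a reference core.

    Runtime scales as 1 / throughput, and energy = power * runtime.
    """
    return relative_power / relative_throughput


# Low-performance core: 50% of the throughput, 30% to 40% less power.
for saving in (0.30, 0.40):
    ratio = relative_energy(1.0 - saving, 0.5)
    print(f"{saving:.0%} less power -> {ratio:.2f}x the energy of the big core")
```

Under those assumptions the small core uses roughly 1.2 to 1.4 times the energy of the large core for the same task, even though its power draw is lower.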

The challenge is translating this into a design. “You need a tremendous amount of high-quality data about the system to analyze and drive exploration and implementation,” says Cadence’s Knoth. “If you don’t have that data, you’re going to make very short-sighted decisions, which are potentially erroneous. This is because you may be dealing with a local minima as opposed to a global minima.”

Knowing the relationship between power and energy can help with improvements around a minimum. “Power regressions for different workloads with varying utilizations are being adopted in power methodologies to identify power bugs, which lead to redundant energy consumption,” says Ansys’ Dwivedi. “Yadong Wong from Qualcomm shared their methodology of using differential energy analysis with the same test, but different workloads to measure change in energy consumption and identify design inefficiencies. An increase in energy consumption of the design with the same test, but lower utilization, indicates redundant switching of data and clocks when no useful work is being done.”
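One way to picture that kind of differential check is a small script that compares energy across runs of the same test at different utilizations. This is only a sketch of the idea as described in the quote. The function, the data layout, and the numbers are assumptions, not the Qualcomm or Ansys methodology.

```python
def flag_redundant_energy(runs):
    """Find cases where a lower-utilization run consumed more energy.

    runs: list of (utilization, energy_joules) tuples from the same test.
    Returns a list of (higher_util_run, lower_util_run) pairs worth a look.
    """
    ordered = sorted(runs, key=lambda run: run[0], reverse=True)
    flagged = []
    for higher, lower in zip(ordered, ordered[1:]):
        if lower[1] > higher[1]:
            flagged.append((higher, lower))
    return flagged


# Hypothetical measurements: energy should fall as utilization falls.
runs = [(1.0, 10.0), (0.5, 6.0), (0.25, 6.5)]
for higher, lower in flag_redundant_energy(runs):
    print(f"utilization {lower[0]:.0%} used {lower[1]} J, "
          f"more than {higher[1]} J at {higher[0]:.0%}")
```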

Energy drivers
There are certain markets that will drive this. “They’re the ones who are going to invest in it,” says Knoth. “When we started originally talking about power, as opposed to just frequency, the cellphone chips were driving that and the people building data center servers didn’t care because they were plugged into a wall. They didn’t have that little battery to constrain them. But now, the data center is worried about the amount of cooling they need. And if they can optimize the power efficiency on one of the chips, when they multiply that by the thousands, it’s going to have a material impact on their operating costs.”

One common component across markets is the processor core. “The focus on energy is primarily being driven by IP vendors,” says Ahmed. “There are CPUs and GPUs. There are people working on machine learning and AI accelerators, and network companies — anybody who has a large design operating with different types of modes and who wants to get low power, energy efficiency, or because they need to meet environmental requirements.”

A key driver is the ability to set metrics for a processor. “It could be looking at instructions and how much work is being done per watt,” Ahmed explains. “You could concentrate on different operations like arithmetic operations, and you can actually look at the utilization and the amount of power they consume. So people can plot something like energy linearity checks, which basically means how much energy is being consumed for a given performance or utilization. For 100% utilization, a certain amount of energy might be consumed. If you reduce the operations, CPU performance may be reduced to 50%. Is the energy still 50% or 60%? There could be different ways to do that.”
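An energy linearity check of the kind Ahmed describes might look something like the sketch below, which compares measured energy at each utilization level against an ideal that scales in proportion to utilization. The proportional ideal, the function name, and the sample data are assumptions made for illustration.

```python
def linearity_report(samples):
    """Compare energy at each utilization with a perfectly proportional ideal.

    samples: dict mapping utilization (0..1) to energy in joules measured
    over the same interval. Must include an entry for 1.0 (full load).
    """
    full_load = samples[1.0]
    report = {}
    for utilization, energy in sorted(samples.items()):
        actual = energy / full_load
        report[utilization] = {
            "ideal": utilization,            # proportional scaling
            "actual": actual,                # fraction of full-load energy
            "excess": actual - utilization,  # energy not tracking the work
        }
    return report


# Hypothetical data: at 50% utilization the block still uses 60% of the energy.
samples = {1.0: 100.0, 0.5: 60.0, 0.25: 40.0}
for utilization, row in linearity_report(samples).items():
    print(f"util={utilization:.0%} actual={row['actual']:.0%} excess={row['excess']:+.0%}")
```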

Defining tasks, scenarios and workloads
One of the difficulties is that modern SoCs rarely perform one task at a time. When multiple tasks are operating on a device, they interact with each other. The question then becomes how to define the energy consumed by a specific task. How much additional energy is being consumed by its interactions with other tasks? Without this knowledge, it is difficult to know if running them in parallel is the right choice or if they should be run serially, assuming no other constraints.

“The same is true for scaling components of our systems,” says Myers. “Larger systems may create performance and energy bottlenecks in other components. Assumptions can be verified with existing power analysis tools toward the end of the design flow, but earlier insight would be very beneficial.”

Use cases matter, too. “It is likely that people would start measuring power consumed by each task under ideal conditions,” says Ahmed. “Then they may have different scenarios where somebody is playing a game while watching a video, and at the same time in the background some other app is running, as well. Or maybe the device is doing two or three different things, so the combined scenario needs to be there. There has to be a way to run a large number of workloads, and then make decisions for powers.”

Scenarios have to run long enough that any heat they generate can be taken into account. For example, a game may start out consuming a certain amount of energy per minute of play time, but that figure may rise as the device heats up.

Representative workloads are important. “Assuming the workloads are known — which is a huge assumption since this is typically one of the most difficult aspects of power analysis — the next challenge is how to effectively predict/model these scaling vectors to estimate power for a given workload,” says Ambiq’s Cermak. “Probably the easiest method, or at least the most accessible, is using a spreadsheet model or similar. These models tend to be extremely complicated and unwieldy. Yet, when properly managed, they can be very effective.”

There are a lot of moving pieces to understand, though. “This is all complicated by the time and energy to transition between operating modes, whether standby to active and back, or between DVFS operating points,” says Myers. “Consider the path from a triggering event, through system control processor, to voltage regulator output changes, through power gate controls, following any macro-specific control sequencing, releasing clocks and resets, and then we’re ready to go. How long does this take, and how much energy is consumed? How often do we want to make such changes? This is not covered in standard benchmarks that focus on active power and avoid device-specific power management, though ULPMark Core Profile is a notable exception in the IoT domain.”
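The cost of those transitions can be framed as a break-even question: how long must the device stay idle before a trip into standby pays for itself? The sketch below is a simplified model with hypothetical parameter names and values. It lumps the entry and exit sequences into one transition energy and one transition time.

```python
def break_even_idle_s(p_idle_w, p_sleep_w, e_transition_j, t_transition_s):
    """Shortest idle period for which entering standby saves energy.

    Staying awake for T seconds costs p_idle_w * T.
    Sleeping costs e_transition_j + p_sleep_w * (T - t_transition_s).
    """
    if p_idle_w <= p_sleep_w:
        raise ValueError("standby must draw less power than idle")
    t = (e_transition_j - p_sleep_w * t_transition_s) / (p_idle_w - p_sleep_w)
    # The idle period can never be shorter than the transition itself.
    return max(t, t_transition_s)


def should_sleep(idle_s, **params):
    """True when the expected idle period exceeds the break-even time."""
    return idle_s > break_even_idle_s(**params)


# Hypothetical figures: 10 mW idle, 1 mW standby, 0.5 mJ and 2 ms per round trip.
params = dict(p_idle_w=0.010, p_sleep_w=0.001,
              e_transition_j=0.0005, t_transition_s=0.002)
print(f"break-even idle time: {break_even_idle_s(**params) * 1e3:.1f} ms")
print("sleep for a 100 ms gap?", should_sleep(0.100, **params))
print("sleep for a 10 ms gap? ", should_sleep(0.010, **params))
```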

It all comes back to defining representative workloads. “You’re looking at how to effectively use functional verification to drive implementation and optimization,” says Knoth. “If we’re talking about climbing the pyramid, where the top is energy, we’re getting pretty close. When we’re talking about units of work, we have to be talking about the functionality of the system. We have to be talking about what the widget is doing. And so there’s a broad recognition that there needs to be a pervasive use of functional verification in concert with the design realization.”

Tool requirements
While the topic is still somewhat academic, tool vendors are attempting to address the issue of energy. “For each use case, they need an energy number, as well as the power numbers,” says Ahmed. “Then they can do an overlay and try to extract information through data analysis. What people want to see is detailed reporting with powerful visualizations so that what they see at the end is meaningful. There’s a need to have some standard intelligence built into the tools for that.” (See figure 2.)

Fig. 2: Building energy intelligence into tool flow. Source: Mentor, A Siemens Business

Cadence is approaching the problem with three steps, according to Knoth. “The first is understanding, the second is exploration, and the third is implementation. Understanding is critical before you start doing any work. It’s critical that the whole ecosystem takes a step back and says, ‘For this thing that I’m building, I need to understand its function. What are the workloads?’ Then we can start to explore with things like high-level synthesis, or early prototype RTL synthesis, RTL power estimation, etc. You spend a lot of time in the exploration stage, trying different architectures, trying different data flows, trying different components that go into the product. Then you get to implementation, where we continue using the same engines that were used in the exploration phase. We’re using the same stimulus that enabled us to understand the design. We use that stimulus to drive all of the synthesis, and place-and-route. We’re choosing the right architecture and micro-architectures, we’re optimizing the clock network, etc.”

The quantity of analysis involved is much higher than in the past. “You might have a design that has 1,000 different use scenarios, and some might be more important, some less,” says Ahmed. “We need to get the power numbers and the energy metrics for all of them, and somehow have the ability to generate an average for all of those scenarios. Then you need to feed that back, in a meaningful way, to the RTL designer to help them focus on optimizing for power that will result in attaining energy efficiency.”
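One simple way to combine many scenarios into a single figure is a weighted average, with weights reflecting how important or how frequent each scenario is. The quote does not say how the average would be formed, so the weighting scheme, the scenario names, and the numbers below are all assumptions.

```python
def weighted_energy(scenarios):
    """Weighted average energy (J) across scenarios.

    scenarios: list of dicts with 'name', 'weight', and 'energy_j' keys.
    Weights do not need to sum to 1; they are normalized here.
    """
    total_weight = sum(s["weight"] for s in scenarios)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s["weight"] * s["energy_j"] for s in scenarios) / total_weight


def rank_by_contribution(scenarios):
    """Scenario names ordered by their share of the weighted energy."""
    ordered = sorted(scenarios, key=lambda s: s["weight"] * s["energy_j"],
                     reverse=True)
    return [s["name"] for s in ordered]


# Hypothetical scenario set: weight is the fraction of time spent in each.
scenarios = [
    {"name": "standby", "weight": 0.6, "energy_j": 2.0},
    {"name": "video", "weight": 0.3, "energy_j": 5.0},
    {"name": "gaming", "weight": 0.1, "energy_j": 10.0},
]
print(f"weighted energy: {weighted_energy(scenarios):.2f} J")
print("largest contributors first:", rank_by_contribution(scenarios))
```

Ranking scenarios by contribution is one possible way to point an RTL designer at the use cases where optimization would matter most.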

The back-end tools have to change, as well. “Most tools are currently built for performance optimization,” adds Ahmed. “Place-and-route has to be driven from an energy efficiency point of view rather than performance. None of the downstream physical tools have the capability to do any routing or placement from the perspective of power or energy. That still needs to be built in. It will require new kinds of technologies, new methods, and new kinds of integration with upstream tools.”

That integration with the upstream tools is important. “During the design phase, physical design specific detail is unknown,” says Cermak. “Clock trees do not exist, wire loading is unknown, and intrinsic effects of gate delays/propagation are unclear. However, there needs to be some way to effectively project power to feed back any issues that may require architectural changes and additional design optimizations. Generally speaking, these tools are wildly inaccurate in predicting physical design effects, and either end up radically pessimistic or optimistic, depending on the design’s complexity.”

While power optimization has been an important step forward for the industry, it is not the top of the pyramid. The industry has started to assess how it gets to being energy-aware, but that is not going to be an easy change to make. We have started to look at power from a task, scenario, and workload perspective, but the industry has to agree on the ways that this is going to be accomplished. If it is not going to use PSS, it needs to quickly work on an alternative. This is a gating function.

The industry then must make a concerted effort throughout the development flow, because without all stages of the flow being made energy-aware, accuracy will suffer. That means the industry will be slow to adopt it. Accuracy has held back power optimization for quite some time, and users in general still find large gaps between what was predicted and what turned out to be true in silicon. Maybe a focus on energy will lead to a greater understanding and more predictability.


Kevin Cameron says:

After 20+ years of trying to get Verilog-AMS into the mainstream of EDA for combined timing and power design closure, I can fairly confidently say we are little closer now than a decade ago – as far as the big EDA companies are concerned.

On the plus side: that means anyone who does have the tools has a distinct advantage in designing the bleeding edge circuits needed for AI and HPC.
