Getting The Power/Performance Ratio Right

Concurrent power and performance analysis is essential, but the path there is not always straightforward.

Getting to market quickly means determining as soon as possible if a concept for a new design will work or not, particularly where power and performance are concerned. Making this determination requires intimate knowledge of the scenarios in which the device will operate — and that is just the start.

In order to set things up, you need to somehow model the system, which could be done in a spreadsheet. But it’s typically better to do it in something that’s more executable, noted Drew Wingard, CTO of Sonics. “That model is designed to give you some feedback on how well your current architecture is going to be able to handle that case.”

About 80% of the scenarios that engineering teams use to analyze performance and to provision the throughput and latency of the system overlap with the ones they use to optimize power and energy, he pointed out. “It’s difficult enough to build these models – it takes enough time to do that that it seems very attractive to be able to reuse the same basic model for both power and performance.”

Another approach is to use blanket margining and to silo the metrics, observed Tobias Bjerregaard, CEO of Teklatech. But once those practices are set aside, power and performance are inseparable. “Except for a few special cases, today it is not really about achieving a certain performance or a certain power. It is about getting the right performance/power ratio. How much computational performance do you get out of a chip per watt of power? As such the two cannot be seen as separate measures. The performance-to-power trade-off has always been there. Now it is just much more pertinent, as power is an important competitive measure. In mobile it is about battery life. In other cases it is about getting the chip to work in an inexpensive package. We see some design groups fighting for the last percent of total power.”

Further, he asserted, performance has a real impact on power throughout the backend flow, as cell swapping, drive-strength reduction, and other path-level power optimization methods are used to trade timing for power. The better the timing quality of results (QoR) of the design coming in, the better its power will be coming out. As such, the two must be analyzed together.

Power and performance analysis must be concurrent, as voltage drop directly affects timing, Bjerregaard explained. “That is especially critical at advanced nodes, where the performance penalty caused by voltage drop can be quite extreme due to the nature of finFET transistors in conjunction with operating at very low supply voltages. Since the penalty is high, a single dynamic voltage drop target cannot be applied to the entire chip. Voltage drop does not have the same effect everywhere. The effect of the voltage drop on path timing is what matters. Hence, voltage drop along the critical paths, where it has a real impact on timing, is more serious than voltage drops elsewhere.”

What is required is for the designer to analyze power and the resulting voltage drop, and apply that actual voltage drop to the timing extraction. The alternative is to over-design and lose potential reduction of power along non-critical paths, where drive-strength reduction and cell swapping could have been applied, he added.
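To make the point about path-level voltage drop concrete, here is a minimal sketch in Python. It is not how any sign-off tool works. It assumes the textbook alpha-power-law delay model (delay proportional to V / (V - Vth)^alpha), and the threshold voltage, alpha value, supply, clock period, and path data are all hypothetical numbers chosen for illustration.

```python
def derated_delay_ns(nominal_delay_ns, vdd_v, drop_v, vth_v=0.35, alpha=1.3):
    """Scale a path delay for a local supply drop.

    Assumption: alpha-power-law model, delay ~ V / (V - Vth)**alpha.
    vth_v and alpha are illustrative values, not process data.
    """
    effective_v = vdd_v - drop_v
    if effective_v <= vth_v:
        raise ValueError("effective supply must stay above the threshold voltage")

    def k(v):
        return v / (v - vth_v) ** alpha

    return nominal_delay_ns * k(effective_v) / k(vdd_v)


def slack_ns(clock_period_ns, nominal_delay_ns, vdd_v, drop_v):
    """Timing slack once the actual voltage drop on the path is applied."""
    return clock_period_ns - derated_delay_ns(nominal_delay_ns, vdd_v, drop_v)


# Hypothetical paths: the same 50 mV drop on a critical and a relaxed path.
paths = [
    {"name": "critical", "delay_ns": 0.95, "drop_v": 0.05},
    {"name": "relaxed", "delay_ns": 0.60, "drop_v": 0.05},
]
for p in paths:
    p["slack_ns"] = slack_ns(1.0, p["delay_ns"], 0.7, p["drop_v"])
    status = "violates timing" if p["slack_ns"] < 0 else "still has margin"
    print(f'{p["name"]}: slack {p["slack_ns"]:+.3f} ns, {status}')
```

Under these assumed numbers the same drop breaks timing on the critical path while the relaxed path keeps enough margin to be a candidate for drive-strength reduction, which is why a single chip-wide drop target either over-designs or under-protects.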

The power modeling challenge
Describing the performance and power scenarios, and defining what success would look like, typically falls to the system architect.

“Power modeling is still not an easy thing to do,” said Wingard. “The models aren’t readily available. It’s not that they are difficult to create. It’s that they are difficult to characterize because the actual power dissipation has a lot to do with the functional mode of the block. So it’s an interesting characterization exercise to try to figure out how to characterize those models. People have tended to do very simple things with spreadsheets. ‘I’ve got this many thousands of gates clocking at this rate, and I’m going to assume a certain activity factor because at this process node I’m going to have that much leakage.’ You can do some pretty simple algebra to come up with a baseline power number. But the real key factor is what is the actual activity inside the circuit, and there’s no substitute there for having realistic traffic.”
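The "simple algebra" Wingard describes can be sketched in a few lines of Python. This is an illustrative back-of-the-envelope model, not anything from Sonics: it assumes the standard dynamic power relation (activity × capacitance × V² × frequency) plus a per-gate leakage term, and every parameter value below is hypothetical.

```python
def baseline_power_w(gate_count, clock_hz, activity_factor,
                     cap_per_gate_f, vdd_v, leakage_per_gate_w):
    """Spreadsheet-style baseline: dynamic switching power plus leakage.

    All inputs are assumptions supplied by the estimator; the activity
    factor in particular is a guess until realistic traffic is available.
    """
    dynamic_w = (activity_factor * gate_count * cap_per_gate_f
                 * vdd_v ** 2 * clock_hz)
    leakage_w = gate_count * leakage_per_gate_w
    return dynamic_w + leakage_w


# Hypothetical block: 500k gates at 1 GHz with a guessed 10% activity factor.
estimate_w = baseline_power_w(
    gate_count=500_000,
    clock_hz=1e9,
    activity_factor=0.10,
    cap_per_gate_f=1e-15,      # 1 fF switched per gate (assumed)
    vdd_v=0.8,
    leakage_per_gate_w=1e-9,   # 1 nW per gate (assumed)
)
print(f"baseline estimate: {estimate_w * 1e3:.1f} mW")
```

The estimate scales linearly with the activity factor, so a guess that is off by a factor of two moves the dynamic term by the same factor. That sensitivity is the reason realistic traffic matters more than the rest of the arithmetic.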

He noted that for the most part, experience is a good starting point. “Let the past be your guide. The important point is characterizing what you’ve already built so that you can at least start off with a more accurate idea than licking your thumb and putting it up in the air. You can start with something that is far more realistic, even if you’re going to tweak that block or subsystem for the next chip just by looking at what you did last time.”

That doesn’t mean using traditional methods, though, where numbers are manually entered into a spreadsheet or transaction-level models are built based on estimates from architects. Bryan Bowyer, director of engineering at Mentor Graphics, said engineering teams are looking for rapid proof-of-concepts of new designs to concurrently explore performance and power. As such, they need to decide on the implementation well before any RTL coding can take place. But they still need accurate data.

Here is where high-level synthesis tools can come in handy because they can be used to quickly generate RTL for multiple algorithms and architectures. Bowyer noted this is not about small, micro-architecture tradeoffs. Instead, these companies are trying out multiple implementations of new algorithms, exploring the performance and power of implementing these algorithms in hardware or software, and examining the tradeoffs of running on an ASIC, FPGA, or even a CPU or GPU. Tradeoffs are also explored by varying the number of hardware cores and experimenting with memory and FIFO sizes.

The typical approach for rapid proof-of-concept and exploring performance and power tradeoffs is to perform experiments in the high-level synthesis tool, automatically generate the RTL for a target technology, and then use an RTL synthesis tool to get final area information. The teams then simulate the netlist to collect switching activity to perform power analysis. This flow provides much more accurate data than estimating power based only on experience with previous designs, and it is fast enough to use while making decisions about algorithms and architectures.
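Once a flow like this has produced latency, power, and area numbers for each candidate, choosing among them is a small selection problem. The Python sketch below shows one plausible way to do it, assuming a hard real-time deadline and energy per frame as the figure of merit. The `Candidate` type, the `pick_lowest_energy` helper, and all of the numbers are hypothetical and do not come from any tool or design mentioned here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One generated implementation and its measured results (hypothetical)."""
    name: str
    latency_ms: float  # time to process one frame
    power_mw: float    # average power from switching-activity-based analysis
    area_mm2: float    # area reported by RTL synthesis

    @property
    def energy_uj(self):
        # mW * ms = microjoules per frame
        return self.power_mw * self.latency_ms


def pick_lowest_energy(candidates, deadline_ms):
    """Return the feasible candidate with the lowest energy per frame, or None."""
    feasible = [c for c in candidates if c.latency_ms <= deadline_ms]
    return min(feasible, key=lambda c: c.energy_uj, default=None)


# Made-up results for three core counts against a 30 frames/s deadline.
results = [
    Candidate("1-core", latency_ms=40.0, power_mw=50.0, area_mm2=0.4),
    Candidate("2-core", latency_ms=22.0, power_mw=80.0, area_mm2=0.7),
    Candidate("4-core", latency_ms=12.0, power_mw=170.0, area_mm2=1.3),
]
best = pick_lowest_energy(results, deadline_ms=1000 / 30)
print(best.name if best else "no candidate meets the deadline")
```

With these invented numbers the slowest design misses the deadline and the fastest one spends more energy per frame than needed, so the middle option wins. A real exploration would likely also weigh area and memory or FIFO sizing.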

Nvidia used Mentor’s technology on its real-time superpixel accelerator design to explore algorithms and architectures. The goal was to optimize area and energy efficiency, while still achieving real-time performance. To analyze the performance and power concurrently the team used high-level synthesis to generate multiple implementations, then used power estimation data based on switching activity to make architectural tradeoffs.

At the same time, it’s important to note that if performance and power analysis were added to the SoC development schedule as wholly independent endeavors, the expense would be tremendous, perhaps prohibitive, noted Steve Carlson, product management group director at Cadence. “Instead, design teams have leveraged the huge investment in the functional verification environment to serve as the foundation for other analyses. This serves not only to amortize the cost of verification, but also unifies device configuration, test setup, and reporting practices.”

Additionally, as complex power management capabilities are embedded into SoCs, the accompanying impact on system performance becomes more complicated to predict. Design teams have responded by incorporating extensions to the normal functional verification/regression process that keep tabs on power and performance. At the SoC level, the broad adoption of software-driven hardware testing has proven extremely valuable for examining performance and power characteristics under actual end-user scenarios, such as playing Angry Birds or swiping smoothly while web browsing, he pointed out.

Concurrent timing and power analysis tools have been available for some time to meet power and timing closure goals, and they allow engineering teams to analyze leakage and dynamic power in the context of timing corners, said Mary Ann White and Bernadette Mortell, both product marketing directors in the Design Group at Synopsys. To perform the analysis, an additional power analysis command is included in the timing analysis run. Typically the engineer responsible for timing is also responsible for power closure. “The advantage of the integrated solution is that setup is done once, a common data set is used, and reporting is consistent for both power and performance. Since not all timing corners are critical for power analysis, the engineer can customize the analysis to focus only on the power critical timing corners.”

When it comes down to it, concurrent power and performance analysis is required because otherwise, a performance change might have a significantly detrimental effect on power, or vice-versa, noted Peter Greenhalgh, an ARM fellow and director of technology for the company’s CPU Group. “There are few markets in the world that would, for example, tolerate a 10% power impact for 1% performance. Instead, micro-architects strive toward 1% performance for 1% power — or better if that’s possible.”
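Greenhalgh's rule of thumb can be written as a simple acceptance check. The Python sketch below is only an illustration of the arithmetic. The function name, the default threshold of 1.0, and the handling of changes that save power are assumptions, not ARM practice.

```python
def worth_keeping(perf_gain_pct, power_cost_pct, min_ratio=1.0):
    """Accept a change if it buys at least min_ratio % performance per 1% power.

    Assumption: a change that costs no extra power is kept as long as it
    does not reduce performance.
    """
    if power_cost_pct <= 0:
        return perf_gain_pct >= 0
    return perf_gain_pct / power_cost_pct >= min_ratio


print(worth_keeping(perf_gain_pct=1.0, power_cost_pct=10.0))  # the 10%-for-1% case
print(worth_keeping(perf_gain_pct=1.0, power_cost_pct=1.0))   # the 1%-for-1% target
```

Running such a check on every build only works if the performance and power numbers come from the same version of the design.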

The challenge and the approach depend on the level of design, he explained. “For example, at the instruction set architecture level, determining which are the optimal instructions requires software profiling, compiler optimizations and an understanding of how a range of micro-architectures might decode and execute the instructions. Adding instructions that aren’t commonly used is a waste of decode space, as well as power in the decoders and other associated structures.”

At the other end of the spectrum, the RTL level, designers need to measure power, timing and performance with each build of the design, he said. “While different EDA tools are used for each, they are all run on the same version of the design to ensure that each RTL change is of value. It’s hopefully intuitive that some of the optimizations that could be made to improve timing, such as logic duplication and/or wider execution cones, will result in higher power. There’s no magic to solving these challenges other than designer experience, innovation and lots of trials,” Greenhalgh said.
