Lots Of Little Knobs For Power

A growing list of small changes will be required to keep power, heat, and noise under control at 10/7nm and beyond.

Dynamic power is becoming a much bigger worry at new nodes as more finFETs are packed on a die and wires shrink to the point where resistance and capacitance become first-order effects.

Chipmakers began seeing dynamic power density issues with the first generation of finFETs. While the 3D transistor structures reduced leakage current by providing better gate control, the fins made it more difficult to dissipate heat. That problem has only gotten worse at 10/7nm, and leakage current is beginning to creep up again.

As a point of reference, it’s possible to pack 3 transistors into the same amount of space at 5nm as 2 transistors would require at 7nm, according to IBM. At 7nm, 20 billion transistors would take up an area about the size of a fingernail.

That makes power one of the top issues to solve, and there is no simple way to solve it. Unlike at past nodes, there is no single big knob to turn. Even with gate-all-around FETs, dynamic power density will continue to result in thermal problems, which can affect signal integrity, reliability, and performance, among other things. That puts the onus on design teams to develop sophisticated power management schemes that take enough real-world use cases into account to avoid problems.

“With the hundreds of millions of dollars it takes to design at these advanced technology nodes, profits are dependent on first-silicon success,” said Preeti Gupta, director of RTL product management at Ansys. “Power methodologies are evolving to meet these growing needs.”

At the foundation of all of these approaches is much earlier analysis and verification, and better planning for yield and test at the conceptual stage of designs.

“Early analysis enables high-impact architectural decisions, guides key downstream decisions such as power delivery network design, and identifies power issues early enough so that significant design changes can still be made without a severe impact on schedule,” Gupta said. “In fact, early RTL power analysis and reduction methodologies are being adopted by the hitherto power-insensitive networking applications, as well as by those who are now worrying about cooling costs and rack budgets.”

Another emerging trend that is quickly becoming mainstream is early power and thermal profiling of system-level application activity. In the past, methodologies have focused on short-duration activity for power, but they run the risk of missing power-critical events that may occur when the chip is exposed to its real traffic, such as an OS boot-up or a high-definition video frame, Gupta explained. “High-performance engines now can profile hundreds of milliseconds of activity coming from emulators early at RTL and identify power problems and critical peak and di/dt scenarios that can impact the power supply.”
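To make the idea of profiling a long activity window concrete, here is a minimal Python sketch. It assumes a per-interval power trace has already been exported from an emulator run (the trace values, the function names, and the millisecond sampling are all hypothetical, not taken from any vendor tool). It finds the highest-power window and the largest sample-to-sample step, which at a fixed supply voltage is a rough proxy for a di/dt event.

```python
from typing import List, Tuple


def peak_window_power(trace_mw: List[float], window: int) -> Tuple[int, float]:
    """Return (start_index, average_power_mw) of the highest-power window."""
    if window <= 0 or window > len(trace_mw):
        raise ValueError("window must be between 1 and len(trace_mw)")
    running = sum(trace_mw[:window])
    best_start, best_sum = 0, running
    # Slide the window one sample at a time, updating the sum incrementally.
    for i in range(window, len(trace_mw)):
        running += trace_mw[i] - trace_mw[i - window]
        if running > best_sum:
            best_start, best_sum = i - window + 1, running
    return best_start, best_sum / window


def worst_step(trace_mw: List[float]) -> Tuple[int, float]:
    """Return (index, delta_mw) of the largest change between adjacent samples.

    At a fixed supply voltage, a large power step implies a large current
    step, so this serves as a crude di/dt indicator.
    """
    if len(trace_mw) < 2:
        raise ValueError("need at least two samples")
    deltas = [trace_mw[i + 1] - trace_mw[i] for i in range(len(trace_mw) - 1)]
    idx = max(range(len(deltas)), key=lambda i: abs(deltas[i]))
    return idx, deltas[idx]


# Hypothetical trace in mW, one sample per millisecond of emulated activity.
trace = [100, 110, 105, 400, 420, 410, 120, 115]

start, avg = peak_window_power(trace, window=3)
step_idx, step = worst_step(trace)
print(f"peak 3 ms window starts at sample {start}, average {avg:.1f} mW")
print(f"largest step is {step:+} mW after sample {step_idx}")
```

A short-duration testbench that only covered the first three samples would miss both the peak window and the step, which is the risk Gupta describes.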

Noise issues
Feature shrinks are causing other issues, as well. As the metal density increases and wires become increasingly thin, there is less space between components and less insulation everywhere. That raises the level of noise of all sorts, some of which shows up as crosstalk.

“In the past, design teams survived ignoring the inductance completely,” said Magdy Abadir, vice president of marketing at Helic. “They may have analyzed inductance effects within very small components such as RF components or certain SerDes blocks or the PLLs. Within small blocks they may do analysis to make sure that the design of these specific components is fine. But worrying about interference between different blocks, or across different hierarchies, has never been considered until perhaps a little bit before 16nm—but definitely at 16nm, 10nm and now 7nm. It’s now becoming a necessity, not a luxury. Everybody has to worry about this.”

Compounding this problem are rising data rates and clock frequencies. “You would not design something at these nodes and have it operate slowly,” Abadir said. “The whole idea is to get more performance. So the higher the frequency, the more the electromagnetic waves are generated as a result of signals changing. The faster they change, the stronger the electromagnetic waves.”

Small changes add up
However, incorporating these concerns into a design flow and shifting a methodology is a non-trivial amount of work. So wherever possible, engineering teams are utilizing a variety of technologies that allow definitive power improvements to be achieved with minimal disruption, and augmenting that flow with smaller changes.

Luke Lang, product engineering director for low power products at Cadence, pointed to three major areas where improvements can be made with minimal disruption. “Number one, you’ve got to start doing a good job in power estimation. It’s no longer automated now. You look at your chip to determine where it is burning the most power. Then you must figure out how to change the architecture to reduce the power because the automation is pretty much all done.”

Second, he said the best results come not from manually tuning RTL, but from tools like high-level synthesis, where an algorithm can be modeled in a high-level language like C++ or SystemC. Tools here contain compilers that will look at a design and try different scenarios to reduce power.

Lang points to a JPEG design to explain how this technology is being used. “It’s a circuit that handles JPEG. We run it through high-level synthesis with various parameters and microarchitectures, and we will generate 61 RTLs. Then you run the power estimation analysis to find out which one is best based on whatever criteria you have. In a typical design team there is no way you can code 61 different versions of RTL. You don’t have the resources and you don’t have the schedule.”
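The selection step Lang describes, choosing among many generated RTL variants "based on whatever criteria you have," can be sketched in a few lines of Python. This is an illustration only: the `Candidate` fields, the variant names, and every number below are made up, and a real flow would read these figures from power-estimation reports rather than hard-code them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """One generated RTL variant with its estimated metrics (hypothetical)."""
    name: str
    power_mw: float
    area_um2: float
    latency_cycles: int


def pick_lowest_power(
    candidates: Iterable[Candidate],
    max_area_um2: float,
    max_latency_cycles: int,
) -> Optional[Candidate]:
    """Return the lowest-power variant that meets area and latency limits."""
    feasible = [
        c for c in candidates
        if c.area_um2 <= max_area_um2 and c.latency_cycles <= max_latency_cycles
    ]
    # default=None covers the case where no variant meets the constraints.
    return min(feasible, key=lambda c: c.power_mw, default=None)


# Illustrative data only; a real exploration might produce dozens of variants.
variants = [
    Candidate("jpeg_unroll1", power_mw=42.0, area_um2=90_000, latency_cycles=640),
    Candidate("jpeg_unroll2", power_mw=35.0, area_um2=120_000, latency_cycles=330),
    Candidate("jpeg_unroll4", power_mw=31.0, area_um2=210_000, latency_cycles=170),
    Candidate("jpeg_pipelined", power_mw=38.0, area_um2=150_000, latency_cycles=180),
]

best = pick_lowest_power(variants, max_area_um2=160_000, max_latency_cycles=400)
print(best.name if best else "no variant meets the constraints")
```

The point of the sketch is that the criteria are cheap to change once the variants and their estimates exist; the expensive part, producing the variants, is what the synthesis tool automates.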

A third area for squeezing power out of a design is to examine the impact of software on power. “With early generation smartphones, the number one complaint was the battery didn’t last very long. Firmware upgrades helped make the batteries last longer. The hardware didn’t change, but the software reduced power usage. It’s getting so hard to squeeze more power out and you just can’t ignore the software.”

Part of the problem is that the software is scheduling different activities, and things are happening at different times. In most cases, perhaps two, three or four functions don’t all need to happen at the same time. By spreading them out or prioritizing the scheduling, power can be further reduced.
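A toy model shows what spreading activities out buys. The sketch below assumes each task draws constant power while it runs (the task tuples and time slots are hypothetical). In this simple model, staggering lowers the peak draw, and with it the stress on the power supply and the thermal envelope, while the total energy stays the same; any further energy savings would have to come from effects the model does not capture.

```python
from typing import List, Tuple

Task = Tuple[int, int, float]  # (start_slot, duration_slots, power_mw)


def power_profile(tasks: List[Task], horizon: int) -> List[float]:
    """Sum the power of all tasks active in each time slot."""
    profile = [0.0] * horizon
    for start, duration, power_mw in tasks:
        for t in range(start, min(start + duration, horizon)):
            profile[t] += power_mw
    return profile


# Three hypothetical functions, first launched together, then one after another.
concurrent = [(0, 2, 300.0), (0, 2, 200.0), (0, 2, 250.0)]
staggered = [(0, 2, 300.0), (2, 2, 200.0), (4, 2, 250.0)]

for label, tasks in (("concurrent", concurrent), ("staggered", staggered)):
    profile = power_profile(tasks, horizon=6)
    print(f"{label:10s} peak={max(profile):.0f} mW  energy={sum(profile):.0f} mW-slots")
```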

“We really need to shift to look at the design team,” said Lang. “Did you write the best RTL possible because downstream we’ve done all the squeezing possible? We really need to squeeze the dynamic power, and we need to look at the design. Did you architect your design where it is switching a lot? Can you change your design, re-architect it or modify it so that it switches less but still does the same function?”

Software-defined hardware
This is the whole idea behind software-defined hardware, where the functionality of the software is used to develop the most efficient hardware to run that software. Rather than building a standard processor and relying on the operating system to connect everything together using application programming interfaces, the hardware and software are both tweaked to improve performance, lower power and, at least in theory, be more secure.

There is wide industry consensus, much of which harkens back to a 2009 ISQED presentation (see fig. 1 below), that once the RTL is coded, 80% of the power is locked in. This brings the problem up to the level of the application, which is what the architect can see.


Fig. 1: Power saving potential throughout the design flow. Source: Accellera/ISQED 2009

“This is where the architect can understand where the tradeoffs really lie,” said Drew Wingard, CTO of Sonics. “So at the level of the application they can say, ‘Yes, it’s okay to sacrifice the response time of this part of the design because the user is never going to notice,’ and, ‘No, it’s not okay to sacrifice responsiveness over there.’”

Stuart Clubb, a product marketing manager at Mentor, a Siemens Business, agreed. “If you just take an RTL design and say to the engineers, ‘Don’t really change anything, just try and throw some clock gating in there,’ then you’re kind of done because we’re concerned about dynamic power. It’s not like you can say, ‘Hey, you can save power by turning it off,’ because it has to do something. You’re going to assume that if it’s not needed to do anything, you’ve already turned it off. This starts getting into architectural considerations. Many times engineering teams are necessarily pushing the frequency capabilities of the process technology, and we’ve seen this actually in some customer designs where, for any process technology, you can take a piece of RTL and synthesize it for a low speed, and it will be small area. Then you can basically decrease the clock period until it starts getting bigger because these synthesis tools are more area-centric.”

Wingard believes one barrier to improving this situation is that many engineers still approach power architectures with a spreadsheet. It’s not uncommon for them to think about some of the key operating scenarios they want to support at the chip level, then come up with an estimate of how much power is going to be used by the major building blocks of the chip. Then they try to determine how much power management might be needed to get the associated energy down to something that’s appropriate for the use case.
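The spreadsheet approach amounts to arithmetic like the following Python sketch. The block names, power figures, scenario duration, and energy budget are invented for illustration; the calculation simply multiplies summed block power by scenario duration and reports how much reduction would be needed to hit a budget.

```python
from typing import Dict


def scenario_energy_mj(block_power_mw: Dict[str, float], duration_s: float) -> float:
    """Energy for one scenario in millijoules (mW x s = mJ)."""
    return sum(block_power_mw.values()) * duration_s


def required_reduction(energy_mj: float, budget_mj: float) -> float:
    """Fraction of energy that must be removed to meet the budget (0 if already met)."""
    if energy_mj <= budget_mj:
        return 0.0
    return 1.0 - budget_mj / energy_mj


# Hypothetical per-block estimates for a single operating scenario.
video_playback = {"cpu": 500.0, "gpu": 300.0, "modem": 200.0}

energy = scenario_energy_mj(video_playback, duration_s=2.0)
gap = required_reduction(energy, budget_mj=1500.0)
print(f"estimated {energy:.0f} mJ, needs a {gap:.0%} reduction to meet budget")
```

The crudeness Wingard points to is visible here: each block is a single number per scenario, with no notion of how activity inside the block changes over time.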

“The spreadsheet models are relatively crude and the number of these cases is relatively large, so what often happens is the chip team is left with no choice but to kind of wing it. ‘I think I should use this power saving technique over here. I’ll just do some kind of clock gating. And over here I’ll do some power gating. Then here I’ve got this processor complex so I think I really want to apply dynamic voltage and frequency scaling over there.’ And there really is not any effective feedback loop that lets the designer know if they are making good choices,” he said.
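One way to get even rough feedback on such choices is the textbook switching-power estimate, P = activity x C x Vdd^2 x f. The Python sketch below applies it to a single hypothetical block. The parameter values are invented, and clock gating is modeled simply as a lower effective activity factor, which is a simplifying assumption. Power gating is not modeled, since it mainly removes leakage and idle power rather than changing this formula.

```python
def dynamic_power_w(activity: float, cap_f: float, vdd_v: float, freq_hz: float) -> float:
    """Textbook switching-power estimate: P = activity * C * Vdd^2 * f."""
    return activity * cap_f * vdd_v ** 2 * freq_hz


# Hypothetical parameters for one block; not taken from any real design.
BASE = {"activity": 0.2, "cap_f": 1e-9, "vdd_v": 0.8, "freq_hz": 1e9}

baseline = dynamic_power_w(**BASE)
# Assumption: clock gating halves the effective switching activity.
clock_gated = dynamic_power_w(**{**BASE, "activity": 0.1})
# Assumption: DVFS drops the block to 0.6 V at half the clock frequency.
dvfs = dynamic_power_w(**{**BASE, "vdd_v": 0.6, "freq_hz": 5e8})

for label, watts in (("baseline", baseline), ("clock gated", clock_gated), ("DVFS", dvfs)):
    print(f"{label:12s} {watts * 1e3:6.1f} mW  ({watts / baseline:.0%} of baseline)")
```

Because voltage enters as a square, the DVFS case saves more than the halved frequency alone would suggest, though it also halves throughput, which is the kind of tradeoff only the architect can judge.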

Still, managing dynamic power through a number of different steps will add up to the big power knob to turn for the time being, Clubb said. “To my mind, it’s definitely dynamic power because all of the other stuff of turning down the voltage when you don’t need to run it fast, and turning it off completely when you don’t need to use it at all — that’s already done. We’re not worried about leakage current so much. FinFET, after we got away from planar, or FD-SOI, we pretty much threw out the concerns about leakage anyway. But dynamic unfortunately went up big time with finFET, so it’s more the case of every superfluous piece of switching.”

Like Cadence’s Lang, he pointed to technologies like high-level synthesis, which have come to the fore in terms of being able to determine where there is switching activity that is superfluous—or conversely, that there is no more power to be saved by throwing in clock gating or the usual tricks.

“You can scream at an RTL team to your heart’s content to reduce power, and it doesn’t necessarily mean it’s going to happen,” Clubb noted. “And that knob, while it’s being turned by some, is kind of like the very top of the mountain of corporate attitude. There are still way too many design teams who are focused on functionality. ‘Does the thing actually work? Did I get coverage? I don’t want to change the code because then I’ll have to re-do my verification, and the boss is screaming that the regression runs take too long, anyway, in RTL. So I don’t want to change that unless somebody comes to me and says this is going to save a ton of power.’ The problem is somebody comes to you and says, ‘Here are 10 things that you can do to each save a little bit of power. Each on their own, they’re not really much. But when you add them up, they start to make a difference.’”

This pressure is getting pushed to the RTL engineers, but it’s not particularly onerous. “In order to get power estimates, you need to run a simulation, and you need switching data,” he said. “You’re already doing that. It’s called functional verification.”

Power estimation doesn’t give the absolute power signoff number, but it can get within 10% to 15% of gate-level simulation much more quickly. And this rough power estimation is useful to drill down through the hierarchy to get down all the way to an individual operator. “I have this multiplier or adder in my RTL, how much power is it consuming? You’d bring that up to the top-level hierarchy, but it’s like going to the doctor with your kid running a temperature, and he sticks a thermometer in their mouth and says, ‘Yep, your kid’s got a temperature,’ but not saying what to do to fix the problem. Even going down to the backend and doing gate-level simulation, that’s still only going to give you a number. It’s not going to expose to you where you’re actually wasting power or whether you are wasting power. This technology shows the engineer on a flop where they are wasting power, where they have a power leak.”
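Clubb does not spell out how a tool decides that a flop is wasting power. One common notion, assumed here purely for illustration, is a register that is clocked on cycles where its input already equals the value it holds, which makes it a candidate for clock gating. The Python sketch below scores flops by that measure from sampled data inputs; the instance names and traces are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def redundant_clock_fraction(d_samples: Sequence[int], initial_q: int = 0) -> float:
    """Fraction of clock edges on which the flop reloads the value it already holds."""
    if not d_samples:
        return 0.0
    q = initial_q
    redundant = 0
    for d in d_samples:
        if d == q:
            redundant += 1  # clocked, but the stored value does not change
        q = d  # the flop captures D on every (ungated) clock edge
    return redundant / len(d_samples)


def rank_flops(traces: Dict[str, Sequence[int]]) -> List[Tuple[str, float]]:
    """Sort flops from most to least redundant clocking."""
    scored = [(name, redundant_clock_fraction(d)) for name, d in traces.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical D-input samples, one per clock cycle, taken from simulation.
traces = {
    "u_add/sum_reg[3]": [0, 1, 0, 1, 0, 1, 0, 1],
    "u_mult/acc_reg[0]": [0, 0, 1, 1, 1, 0, 0, 0],
}

for name, fraction in rank_flops(traces):
    print(f"{name:20s} {fraction:.0%} of clock edges redundant")
```

The switching data this needs is the same data functional verification already produces, which is the point made above about the burden on RTL engineers.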

Conclusion
There is no single way to improve power at advanced nodes. But there are many improvements that can be made to get to the same goal. The tradeoff is that this takes some sophisticated tooling early enough in the design cycle to be able to thoroughly analyze multiple operating scenarios and use cases.

“Applying emerging technologies such as elastic computing on big data architectures will make it feasible to run designs billions of instances in size with drastically reduced runtimes,” said Ansys’ Gupta. “More importantly, such platforms also enable cross-domain analytics such as power and timing and will bring focus to the right problems and reduce margining.”

Standards work in the power arena, particularly around UPF, has at least provided a foundation for these improvements. “It allowed normal size design teams to apply some of the more aggressive power-saving techniques,” said Wingard. “Now that the UPF has been standardized and the large EDA vendors and the flow providers have working flows, it becomes feasible for the techniques to be deployed on a much larger scale. By a much larger scale I mean both by smaller design teams and things that aren’t the flagship chips of their companies necessarily, but also to a larger extent inside all designs. We now have tooling and flows that allow us to partition our chips into larger numbers of more defined and controlled pieces. The next step is to consolidate the gains available from the techniques we’ve already been using at the high-end for a number of years and make complete design methodologies that can leverage those on a regular basis as part of the main design instead of as an afterthought at the end of the design. That’s a big change.”

The big change is using methodologies that allow people to think about power and energy in addition to functionality early in the design phase so that some of the choices designers make are for energy-saving reasons rather than just functionality.




  • Kev

    I wrote myself a white-paper on why you need to move to asynchronous design implementation for sub 28nm about a decade ago, and the arguments are still the same.

    Handling power in the design flow requires tools that understand power – Verilog-AMS was designed with that intent, and the folks doing UPF should have chipped in on that rather than doing the half-assed standalone standard, i.e. RTL (Verilog) + UPF goes into synthesis, Verilog-AMS should come out, but it doesn’t, so you can’t verify the intent was implemented or what your power usage actually is.

    Recent attempts at getting DVFS/body-biasing verification into SystemVerilog have also gone nowhere. However, it is possible to move the problem into analog domain and verify there using behavioral modeling – here’s what you need:

    https://xyce.sandia.gov/

    http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470226099,subjectCd-EEJ0.html

    These guys will tell you how to do asynchronous CPUs –

    http://etacompute.com/

    Here’s an approach for asynchronous front-end design –

    http://parallel.cc

    • David Stringfellow

      Please cite a single successful chip that was implemented using an asynchronous design approach.