As chips and systems grow in complexity, power budgets are getting stretched. Just shifting left doesn’t solve all problems.
Power optimization is playing an increasingly vital role in chip and system designs, but it’s also becoming much harder to achieve as transistor density and system complexity continue to grow.
This is especially evident with advanced packages, chiplets, and high-performance chips, all of which are becoming more common in complex designs. Inside data centers, racks of servers are struggling to keep pace with rapid increases in data, fueled by AI and more sensors everywhere. And at the other end of the spectrum, mobile devices now include a growing array of features — high-resolution cameras with built-in AI, more complex CPUs, and other components that consumers take for granted — all of which need to run off a single battery. Each of these opens the door to a slew of challenges involving power dissipation and optimization, and engineers at all levels are feeling the pressure.
“People are worried about power and energy because of the magnitude of power that is being consumed by some of the compute devices, whether it be for generic cloud or for AI machine learning,” said Neil Hand, director of strategy for IC verification solutions at Siemens EDA. “You look at the energy requirements of a new data center and it’s mind boggling. People talk about Moore’s Law dying, but then you look at something like Koomey’s Law, which shows that from a power perspective we’re still doing pretty well. Every 2.5 years we’re getting a doubling of the efficiency of the compute, but if you look at the data for compute needs, these are growing greater than exponentially.”
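As a back-of-the-envelope check on the Koomey’s Law figure Hand cites (this sketch is illustrative, not from the article), a doubling of compute efficiency every 2.5 years compounds quickly, yet demand still outruns it:

```python
# Back-of-the-envelope check of Koomey's Law as cited above:
# compute efficiency doubles every 2.5 years.

def efficiency_gain(years, doubling_period=2.5):
    """Multiplicative improvement in compute per joule after `years`."""
    return 2 ** (years / doubling_period)

# A decade of Koomey scaling: 2^(10/2.5) = 2^4 = 16x more compute per joule.
print(efficiency_gain(10))  # -> 16.0
```

Even a 16x efficiency gain per decade falls behind compute demand that, as Hand notes, is growing greater than exponentially.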
As performance increases and transistors shrink into 1.x nm range, chipmakers are weighing tradeoffs between energy efficiency and performance. Performance still often wins out as the critical metric, with cost and time a secondary consideration. But power increasingly is both the gating factor and the top consideration for design teams.
“These challenges are no different than we normally have,” said Hand, “And if you talk about energy efficiency, the challenge and the solution are going to change, depending on who you’re talking to in a particular company, because it’s such a wide area. But now it’s an area that has influence on every step, from concept all the way through manufacturing.”
Glitch power and shrinking transistors
The tradeoffs aren’t always obvious, however. For example, chips designed for high performance lose more than 33% of their power to glitch, a huge jump compared to the 13% of a much simpler CPU, and almost seven times more than an SoC, according to numbers supplied by Ansys.
“This is a staggering number, really,” said Suhail Saif, principal product manager at Ansys. “CPUs are typically very tightly clocked, so their glitch power is still less. These are real designs, which are already in production. You have the phone in your hand with a chip in it, or a computer in your lab with a chip in it, or something that is going to be released this year. These are very recent real-world production designs, and the trend shows an alarming increase in the glitch power component in these designs.”
Losses to glitch power have become especially prevalent in AI accelerator designs due to the large amounts of arithmetic functions they perform, according to William Ruby, director of product management for Synopsys’ power analysis products.
“Glitch power, which is essentially multiple transitions per clock cycle, has become really important,” he said. “Adders and multipliers are notoriously glitchy elements. If you can analyze where your glitch sources are, and how much glitch power they’re consuming, then you can take action. You can optimize your microarchitecture and reduce glitch power.”
There are two different types of glitch power. “We have transport glitches, which are what everyone thinks of as traditional glitch,” said Rob Knoth, group director for Strategy and New Ventures at Cadence. “It’s a logic swing that doesn’t get attenuated by the gate, so it causes a swing on the output of the gate, which then will continue to propagate. Another type of glitch that often gets ignored is an inertial glitch. These occur where there’s an energy pulse that comes in on an input of the CMOS gate, but doesn’t have enough energy to do a full rail swing on the output. That’s still going to burn some power inside the gate, so we model that. The whole reason, and the beauty, of differentiating the two types of glitch power is that it allows us to more accurately and effectively optimize glitch power.”
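The transport-glitch mechanism can be illustrated with a toy model (not from the article): when the input bits of a combinational function arrive with different delays, its output can toggle several times within a single clock cycle, and each extra toggle burns dynamic power with no functional benefit.

```python
# Toy model of transport glitches: count output toggles of an XOR-reduce
# (think of a parity tree or an adder's carry logic) when input bits
# arrive at the gate with different delays within one clock cycle.

def output_transitions(arrival_times, old_bits, new_bits):
    """Count how many times the XOR-reduce output toggles while the
    inputs settle from old_bits to new_bits, given per-bit arrival times."""
    xor_all = lambda bits: sum(bits) % 2
    # Process the input changes in the order they arrive at the gate.
    events = sorted(range(len(new_bits)), key=lambda i: arrival_times[i])
    bits = list(old_bits)
    prev = xor_all(bits)
    toggles = 0
    for i in events:
        bits[i] = new_bits[i]
        cur = xor_all(bits)
        if cur != prev:
            toggles += 1
        prev = cur
    return toggles

# Four input bits all flip, each arriving 1ns later than the previous one.
# The output toggles on every arrival: 4 transitions, even though the final
# XOR value is unchanged, so zero transitions were functionally required.
print(output_transitions([1, 2, 3, 4], [0, 0, 0, 0], [1, 1, 1, 1]))  # -> 4
```

In this sketch, all four output swings in the cycle are wasted switching energy, which is exactly the “multiple transitions per clock cycle” behavior Ruby describes.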
Glitch power is largely a side effect of shrinking transistor size, Ansys’ Saif said. “The issue was not a concern five or 10 years back, but it is such a significant concern now.”
Further, the continued shrinking of digital logic transistors is affecting power efficiency. As a result, improvements in the near future are likely to be incremental. “I don’t think we’ve hit maximum efficiency,” said Steve Roddy, chief marketing officer at Quadric. “As geometries continue to shrink, 3nm this year, 2.5 or 2nm in a couple of years, you’ve got more transistors to play with, but your power budgets are still limited. So you’re going to see more variation of processing elements in that same slab of silicon.”
Solutions lie in shifting left
Increasing functionality in the same power envelope is an industry-wide challenge, whether it’s in cutting-edge data centers or a Ring doorbell, noted Roddy. This has put a spotlight on specialized processors that can do one thing exceptionally well, versus a generalized processor that can do many things, but less efficiently.
“[The general processor will] take longer, burn a lot more power, and burn out your battery or thermally limit what you can do all at the same time,” said Roddy. “The number one takeaway is more and more specialization and more unique tailored processing. Instead of just putting a single NPU into a chip, you have to have two, maybe three, depending on the use case — two or three vision processors, two or three audio processors, a handful of generic Arm cores, and you have to have the GPUs, too. Next thing you know, you’ve got dozens of processing types, and performance levels scattered around the chip and you’ve got silicon transistors to spare, because if you get a doubling of available transistors every year, but your power envelope stays the same, you have to find ways to activate and turn things on, and turn things off. That’s something the cellphone chip makers have been chasing for 15 years, and we’re starting to see it in lots of other segments, data centers in particular. What was the data center chip five years ago? It was just 16 copies of an x86, or 64 copies of an Arm, and you’re done. Now, go look at a big NVIDIA chip. It’s got all kinds of accelerators, and all kinds of specialty stuff, because they’re power limited.”
Designing systems with that kind of complexity is challenging, and even more so when attempting to optimize it for power efficiency without affecting the required performance level. That necessitates tradeoffs, which are very difficult to make without analyzing the system as a whole.
“If you are missing even one aspect of it, you will be able to make decisions, but you won’t be able to rely on them because they might be wrong,” Saif said. “You might be missing something, so a system-level analysis tool is needed — a solution that builds all this data together from the bottom up. Typically, the systems are huge, and as the size of the system goes up, it becomes very difficult to analyze everything accurately. EDA solutions can do approximations and get results much, much faster for a very large system, but it will be an approximate result.”
Regardless of the abstraction level, all of this needs to happen earlier in the design flow. “[Shifting left] is becoming especially prevalent as we go into software-defined products,” said Siemens’ Hand. “When you look at software-defined products, the semiconductors that are enabling that software are no longer something buried deep inside, where how much power they use is almost an afterthought. They are a primary energy user in those systems. How do you then adapt to that? You need to get those control system algorithms, the understanding of the architecture, from the ideation stage all the way through, because that’s how you’re going to fundamentally change the energy profile of the design.”
Synopsys’ Ruby agreed, noting this is particularly important for reducing glitch power. “The earlier you start examining glitch power, quantifying how much you’re consuming, quantifying and analyzing what the sources of glitch power are, the earlier you can take action to minimize it. Even in the very early stages of the design, at register transfer level coding, there are techniques that can be employed to reduce glitch power. There is also the notion of micro-architectural tradeoffs. Different types of micro-architectures can be used to reduce glitch power. For example, if I have glitchy signals coming into a big multiplier, and that multiplier starts transitioning many times during the clock cycle and creates a lot of glitch power, can I isolate those sources? Can I let the source settle down in terms of its transitions first, because that has a very small fanout with small capacitance, and only then feed it to the multiplier? The multiplier must work with the final result, not continuously changing inputs.”
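The operand-isolation idea Ruby describes can be sketched at the cycle level (the model and names here are illustrative, not from any particular tool): let the glitchy source settle behind a small register so the large multiplier sees only one value per clock, instead of re-evaluating on every intermediate toggle.

```python
# Cycle-level sketch of operand isolation for a glitchy multiplier input.
# The "activity" counted here is a proxy for how many times the big,
# high-capacitance multiplier array re-evaluates within one clock cycle.

def multiplier_activity(intermediate_values, isolated):
    """Return how many distinct operand values the multiplier sees in one
    clock cycle. intermediate_values is the glitchy operand's trajectory
    within the cycle; its last element is the settled, functional value."""
    if isolated:
        # Isolation register: only the settled value reaches the multiplier.
        seen = [intermediate_values[-1]]
    else:
        # Unisolated: every intermediate toggle propagates into the array.
        seen = []
        for v in intermediate_values:
            if not seen or v != seen[-1]:
                seen.append(v)
    return len(seen)

# One cycle in which the operand glitches through five values before settling:
trajectory = [3, 7, 5, 7, 6]
print(multiplier_activity(trajectory, isolated=False))  # -> 5
print(multiplier_activity(trajectory, isolated=True))   # -> 1
```

The small isolation register toggles on the glitchy signal instead, but as Ruby notes, it has a tiny fanout and capacitance, so that switching is far cheaper than re-exercising the multiplier.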
New approaches are also being employed, such as glitch-absorbing cells. Those are specifically designed not to propagate the glitch pulse downstream, so it covers the whole gamut of architecture, micro-architecture implementation, and then even the cell library design, Ruby said.
But shift left does have limitations. The earlier in the design stage, the less accurate the estimates due to a lack of information about workloads, resistances, capacitances, and gate sizes.
“You have to be paying attention to power early, but you also have to be paying attention to power during implementation, during sign-off, etc., because if you leave that on the table, you’re still screwing up,” Knoth said. “The hard work doesn’t just go away because you shifted left. Are the big power savings there? Yes, but if you don’t have accurate estimates — if you don’t have good formal verification methods, if you’re not continuing to do the work during the rest of the flow and checking the answers to make sure that those savings you thought should materialize actually do — then you just wasted a lot of time and fooled yourself.”
Saif pointed out that while there is a theoretical limit to how far glitch power can be reduced in high-performance chips, nobody has yet approached it because of the difficulty of doing so in the practical world, given the constraints of time and cost. While EDA tools are getting faster, reducing glitch power remains an iterative process, with each iteration taking time to run and analyze.
“What design teams do is try to find the highest violators they’ve got,” said Saif. “They go for the easiest targets that they can work on in a given time, and the ones that will give them highest ROI. This is the reality of the situation. Most EDA solutions, including analysis, focus on that. If there are 10,000 violations, we don’t focus on the last 1,000 or last 5,000 even. We don’t even care about the remaining 50% because they are too difficult to get to and too little ROI.”
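The triage Saif describes can be sketched as a simple greedy ranking (the field names and figures below are hypothetical, purely for illustration): sort violations by return on investment and spend the available schedule on the best targets, leaving the long tail alone.

```python
# Illustrative sketch of ROI-driven violation triage: rank fixes by power
# saved per engineering hour and fill the schedule greedily.

def triage(violations, budget_hours):
    """Pick violations in descending ROI (mW saved per hour of work)
    until the time budget runs out."""
    ranked = sorted(violations, key=lambda v: v["mw_saved"] / v["hours"],
                    reverse=True)
    plan, spent = [], 0
    for v in ranked:
        if spent + v["hours"] <= budget_hours:
            plan.append(v["name"])
            spent += v["hours"]
    return plan

violations = [
    {"name": "glitchy_mult",     "mw_saved": 40, "hours": 8},
    {"name": "clock_gating_gap", "mw_saved": 25, "hours": 2},
    {"name": "long_tail_cell",   "mw_saved": 1,  "hours": 6},
]
print(triage(violations, budget_hours=10))  # -> ['clock_gating_gap', 'glitchy_mult']
```

The low-ROI “long tail” violation is deliberately skipped, mirroring Saif’s point that the last thousands of violations are too difficult and too little ROI to chase.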
These approaches remain valid for even more complex chip designs, such as 3D-ICs, even though these stacked models come with added complications, such as managing thermal effects.
“There is some work going on here in terms of efficiency of signaling between different dies,” Ruby noted. “With low-voltage differential signaling, small-swing signals will reduce the power, because there is less voltage swing for a given trace capacitance, and power is a function of capacitance and voltage. You can also look at the fact that you may not be able to change the power consumption, but through different types of 3D-IC packaging technologies, you can see how the packaging side can mitigate some of these thermal issues. You’re not really going to be able to reduce the power consumption, but you may be able to drop the overall temperature or reduce temperature gradients through some of these advanced packaging techniques.”
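Ruby’s swing argument follows from the standard CMOS dynamic-power relation (the formula is textbook material, not stated explicitly in the article):

```latex
% Dynamic power of driving a trace of capacitance C at switching
% activity \alpha and clock frequency f:
P_{dyn} = \alpha \, C \, V^{2} \, f
% Because the voltage term is squared, halving the signal swing on an
% inter-die link cuts this term roughly fourfold for the same trace
% capacitance and data rate -- the motivation for low-voltage
% differential signaling between dies.
```

The quadratic dependence on voltage is why reducing swing is so much more effective than reducing capacitance or frequency alone.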
The integration of AI into EDA tools may also help further improve power efficiency, according to Knoth. “Using AI, you can explore a broader solution space more effectively,” he said. “The end result will appear to be that fewer iterations are needed. But the truth is the AI is actually exploring more iterations, more potential permutations, options, etc., and presenting fewer for the human to choose from. So the result is fewer iterations, but the reality is that to get there you need to explore more.”
Data center impact
While the size of transistors is shrinking, the power consumption of data centers is going in the opposite direction, possibly reaching 35 gigawatts in the United States by 2030, twice what they consumed in 2022, according to a report from consultancy firm Newmark. That presents a challenge for existing data centers, which were built with densities of approximately 10 kilowatts per rack in mind, according to Mark Seymour, a distinguished engineer at Cadence. That puts pressure on engineers to find ways to maximize energy efficiency while simultaneously finding solutions for one of power’s key side effects: heat.
“The thing that was always going to happen, which is transitioning back to liquid cooling, is going to happen more and more,” Seymour said. “You’re seeing more liquid-cooled equipment coming onto the market. That’s absolutely essential for the highest-power equipment, because the power density of the chip is such that we’re getting to a point where you simply can’t do it with air.”
Seymour also expressed optimism about the projected rise in power consumption, noting that the industry has done a good job of keeping data center consumption flat over the past decade. “Power usage effectiveness (PUE) historically was approximately 2.5. For many years, it’s been down at 1.5. But best-in-class data centers are down at about 1.1, so there’s some scope there for us to continue to push PUE down in the existing data centers to save energy,” he said.
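PUE is simply total facility power divided by IT equipment power, so Seymour’s figures translate directly into overhead (this worked example, including the 1 MW load, is illustrative):

```python
# PUE = total facility power / IT equipment power.
# Everything above 1.0 is overhead: cooling, power delivery losses, etc.

def facility_power_kw(it_load_kw, pue):
    """Total power drawn by the facility for a given IT load and PUE."""
    return it_load_kw * pue

def overhead_kw(it_load_kw, pue):
    """Non-IT overhead power (cooling, distribution losses, etc.)."""
    return facility_power_kw(it_load_kw, pue) - it_load_kw

# For the same 1 MW of IT load, moving from a PUE of 1.5 to a
# best-in-class 1.1 shrinks the overhead from 500 kW to 100 kW.
print(round(overhead_kw(1000, 1.5)))  # -> 500
print(round(overhead_kw(1000, 1.1)))  # -> 100
```

By the same arithmetic, the historical PUE of 2.5 meant the overhead exceeded the IT load itself, which is why the drop to 1.5 and below represents such a large energy saving.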
Conclusion
As chip complexity grows, so do the number and intensity of power-related issues. There is no single solution for everything, and not everything will work perfectly every time.
Shifting more of the tradeoffs left in the design flow can help, particularly when accompanied by AI/ML. But more innovation will be required to address glitch power and overall system design efficiency, coupled with a better understanding up front about how and where a device will be used. Tradeoffs are now standard practice in chip and system design, and it’s important to understand the end application and real-world workloads in order to make better decisions.
Related Reading
Glitch Power Issues Grow At Advanced Nodes
Problem is particularly acute in AI accelerators, and fixes require some complex tradeoffs.
New Approaches Needed For Power Management
Limited power budgets, thermal issues, and increased demand to process more data faster are driving some novel solutions.
Managing KW Power Budgets
Strategies for dealing with increasing compute demands from AI and other applications.