Using Less Power At The Same Node

When going to a smaller node is no longer an option, how do you get better power performance? Several techniques are possible.


Going to the next node has been the most effective way to reduce power, but that is no longer true or desirable for a growing percentage of the semiconductor industry. So the big question now is how to reduce power while maintaining the same node size.

After understanding how the power is used, both chip designers and fabs have techniques available to reduce power consumption. Fabs are making efforts to improve older nodes and chip designers are targeting things like clock trees and power gating.

No new nodes
A node migration may not make sense for several reasons. “From a cost perspective, an increasing number of designs are not migrating to lower nodes at the rate they used to,” says Sunil Bhardwaj, senior director of business operations for IP cores at Rambus. “Not all designs will see a cost advantage from moving to a lower technology, especially once mask set costs, IP cost, and wafer costs are considered. Very high-end performance and large size or high volume may be the few that continue to see advantages, offsetting the cost of a technology transition.”

And even with high-volume designs, reliability is becoming a growing problem at each new node. “Variation is one piece of this,” says Aveek Sarkar, vice president in Synopsys’ Custom Compiler Group. “The second piece is reliability itself, which is electromigration, self-heat and variation. You need to factor in how each of those affects your design, and the post-layout variation. So with RC (resistance/capacitance), you may model the effects for pre-layout simulation to some extent, but because of variability the post-layout effect can be very big.”

These and other effects are making it far more difficult to get designs right the first time, and they are adding significantly to the overall cost. As a result, chipmakers are rethinking what moves to the next node, and what stays at more mature nodes.

“GPUs and artificial intelligence do require the new nodes, but it is very expensive,” says Mohammed Fahad, consulting staff lead for synthesis solutions at Mentor, a Siemens Business. “Not all companies can afford that. There are a lot of design companies still working on higher nodes such as 28nm. There are performance improvements when you move to advanced nodes in terms of speed, and there is a power reduction, but given the costs associated with that companies are trying to explore new avenues and other methods to reduce power.”

As fewer designs start on the new nodes, it creates a ripple effect through the industry. “We have seen a trend that started with the mobile phone industry and is now increasing across the industry,” says Rob Knoth, product management director in the Digital & Signoff Group at Cadence. “We see people who are either not jumping to the next node or have to stay on the current nodes because of IP availability or other reasons.”

If you stay at the existing node, you need to keep improving a product using effectively the same number of transistors. “You have to invest more heavily in things such as architecture and RTL optimization where you can have a big impact and still stay at the same technology node,” says Joe Davis, director of product marketing for Calibre Interfaces at Mentor, a Siemens Business. “When you have to stuff more and more into the same technology and die area, you have to look harder at optimizing. The margins are squeezed, so the cost/benefit tradeoff changes. And designers are suddenly interested in features and tools that previously didn’t provide sufficient value.”

Knowing where the power goes
But without a full understanding of where power is being consumed, any attempt at optimization likely will be a wasted effort. “We are investing more time analyzing power data to understand where the power is consumed,” says Marc Galceran-Oms, senior manager for ASIC engineering at eSilicon. “FinFET technologies consume more in dynamic power and less in leakage power, which means that activity factors are very important to the power analysis.”
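The importance of activity factors follows directly from the standard first-order dynamic power formula, P = α·C·V²·f. A toy Python sketch (all numbers are invented for illustration) shows why both activity and supply voltage matter so much:

```python
def dynamic_power(alpha, c_load, v_dd, freq):
    """First-order switching power: P = alpha * C * Vdd^2 * f.

    alpha  -- activity factor (fraction of clock cycles a node toggles)
    c_load -- switched capacitance in farads
    v_dd   -- supply voltage in volts
    freq   -- clock frequency in hertz
    """
    return alpha * c_load * v_dd ** 2 * freq

# Illustrative, made-up numbers: halving the activity factor halves
# dynamic power, and a 10% Vdd reduction saves ~19% (quadratic term).
p_base      = dynamic_power(0.2, 1e-9, 0.80, 1e9)
p_low_alpha = dynamic_power(0.1, 1e-9, 0.80, 1e9)
p_low_vdd   = dynamic_power(0.2, 1e-9, 0.72, 1e9)
```

Leakage power, by contrast, is independent of α, which is why finFET designs with low leakage live or die by how well their switching activity is characterized.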

And you have to know what to look for. “Rather than just trying to hit a performance target or a power target, people are beginning to look at energy,” says Knoth. “It is more than just the power you are consuming. It is what are you accomplishing with that power. This is changing the way in which design flows are created, the way in which RTL is created. This is marrying together not just the implementation of the device, but also the device in terms of its functional context. How power efficient is my device for doing its intended task?”

For some companies, the thermal implications of power need to be considered. “We identify, based on activity, the parts of the design that are potential hot spots,” says Mentor’s Fahad. “Based on switching activity, which is dependent on use case, we can generate heat maps at the RTL, gate and system level. You get to see an activity plot over time. You can also identify the areas of peak power. Peak power provides a fair indication of where heat and other power problems can occur.”
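Peak-power identification from an activity trace is conceptually simple: slide a window over per-cycle switching counts and find the worst stretch. A minimal sketch, with an invented activity trace and energy-per-toggle figure:

```python
def peak_power_window(toggles_per_cycle, energy_per_toggle, window):
    """Slide a window over a per-cycle toggle trace and return
    (start_cycle, average_power) of the worst window -- a toy version
    of peak-power identification from switching activity."""
    powers = [t * energy_per_toggle for t in toggles_per_cycle]
    best_start, best_avg = 0, 0.0
    for start in range(len(powers) - window + 1):
        avg = sum(powers[start:start + window]) / window
        if avg > best_avg:
            best_start, best_avg = start, avg
    return best_start, best_avg

# Made-up trace: a burst of switching in the middle of the run.
trace = [10, 12, 11, 80, 95, 90, 12, 10]
start, avg = peak_power_window(trace, energy_per_toggle=1e-12, window=3)
# 'start' lands on the burst -- the candidate hot-spot interval
```

Real tools do this per block across millions of cycles, which is what turns a raw activity plot into a heat map.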

That does not mean that older analysis techniques should be dropped. “We suggest that the user starts to perform analysis before they even have vectors,” adds Fahad. “It enables them to see if there is anything they can do at the very early stage. These early design checks are based on linting and look at the structure of the design. Later, when the vectors become available, we can do analysis for things like sequential clock gating for memory and flops.”

Working on the back end
When the whole industry is not chasing after the latest node, everyone has more time for optimization. That starts with the foundries. For example, TSMC has been improving its 40nm process technology. New additions include 40nm enhanced LP and 40nm Ultra Low Power (ULP) processes. Compared with the 40nm LP process, the 40nm enhanced LP boosts performance by up to 30%, while 40nm ULP cuts leakage current by up to 70% and lowers power consumption by up to 30%.

“Foundries are treating their technologies like products and creating product suites that span the market needs for power, performance, and specialty requirements,” says Mentor’s Davis. “The result is that many mature technologies are getting not just an updated PDK, but a complete refresh.”

Custom libraries also may make more sense. “Custom libraries are needed and increasingly being used for high speed logic in serializers/deserializers in our designs,” says Rambus’ Bhardwaj. “These blocks run at high speed and face not only gate capacitance but the interconnect R and C, and do not always see much of a power benefit with technology.”

Others agree. “With finFET technologies, we have been developing our own special standard cells much more than in the past,” says eSilicon’s Galceran-Oms. “These are highly optimized for power. We only develop those cells that have a large impact on our designs. A few examples would be low-power flip-flops, large clock buffers and special muxes. These low-power flip-flops typically save us from 5% to 10% overall power reduction, depending on the design. We also are using multi-bit flip-flops more than in the past, and we invest more effort in understanding when and how to use these low power flip-flops and multi-bit flip-flops to reduce power while meeting timing.”

Leakage is a problem for both planar and finFET technologies. “Newer technologies do offer voltage supply reduction, which directly impacts power,” notes Bhardwaj. “But the power supply voltage is not decreasing at the rate it used to in the past. Smaller geometries also increase leakage power, and techniques such as power gating are being used extensively.”

“People have become more sophisticated in their usage of power gating techniques,” says Jerry Zhao, product management director in the Digital & Signoff Group at Cadence. “Chips may have a dozen or two dozen power domains. In these domains they have hundreds or thousands of power gating switches that will optimize power consumption.”
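The payoff of many power domains can be sketched with a toy leakage model (domain names, leakage figures, and duty cycles below are all hypothetical): each domain only contributes leakage for the fraction of time it is powered on.

```python
def leakage_with_gating(domains):
    """Sum leakage across power domains, counting only the fraction of
    time each domain is powered on. Retention and switch overheads are
    ignored in this toy model."""
    return sum(leak_watts * on_duty for leak_watts, on_duty in domains)

# Hypothetical chip: three domains as (leakage_W, on-duty-cycle) pairs.
always_on = [(0.05, 1.0), (0.20, 1.0), (0.10, 1.0)]  # no gating
gated     = [(0.05, 1.0), (0.20, 0.3), (0.10, 0.1)]  # two domains gated
```

With these invented numbers, gating cuts standby leakage from 0.35W to about 0.12W, which is why designers accept the UPF overhead described below.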

Not every company is using power gating yet. “There is a lot of overhead for the designers when they start doing power gating,” admits Fahad. “You have to get involved with UPF. Unlike many RT-level optimizations, where the cost of change is small, power gating schemes are different, and designers see writing a good UPF as an overhead.”

If an IP core is going to last longer and have a greater chance to be reused, it can often be made better. “Parasitic-aware flows and topologies can leverage pre-laid-out templates with known parasitics,” says Bhardwaj. “These are being used to reduce the cycle time and obtain power reductions, which do not come easily unless the right topologies and resistance and capacitance (RC) estimates are built in from day one. In a way, this is a hybrid approach where bottom-up and top-down design phases coexist from the very beginning to ensure the right circuit estimates are going into architectural selection and power optimization.”

Thermal is a downside of power consumption and has to be managed. “If power increases, what is the impact on my silicon?” asks Zhao. “You may have a hot spot and that could create a thermal runaway. But, if power is uniformly distributed, the impact to the silicon on the thermal side is not as bad. On that piece of silicon, some places will have high temperature and you can monitor them to check that the temperature is OK, but you want to make sure that the implementation will distribute the power as evenly as possible.”

Embedded analysis provides more opportunities to squeeze margins. “With smaller geometries, there is much more emphasis on reducing core supplies,” says Stephen Crosher, chief executive officer for Moortec. “The use of higher accuracy supply and temperature sensors allow for tighter voltage and thermal guard-banding which means that you can increase the utilization of cores within a chip for given power and temperature conditions.”

Many of these optimizations require more information from the front-end flow. “We are definitely using more advanced EDA features to squeeze margins,” says Galceran-Oms. “One example is relying more heavily than before on dynamic voltage drop data based on realistic application scenarios to decide whether we can waive marginal timing paths.”

Working at the front end
It is often said that the earlier you start, the greater the benefits. “There are tools at the RT-level that will tell you much more than looking at the back-end,” says Fahad. “You do not just have to change the node to get power improvement. You can use RTL power reduction tools and methodologies. They start by showing you where power is being wasted in the RTL.”

Clock trees are one of the most power-hungry aspects of the design. “Many teams already consider local clock gating at the flop or module level,” adds Fahad. “Using deep sequential analysis, we often can find a common minimum gating expression at the top level or ones that contain two or three modules. We can then pull that expression up in the hierarchy and try and find a condition that can block the clock of many modules together.”
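The idea of pulling a gating expression up the hierarchy can be illustrated with a toy model (module names and enable traces below are invented): if no module under a subtree needs the clock on a given cycle, the clock to the whole subtree can be blocked.

```python
def common_clock_gate(module_enables):
    """Given per-cycle enable traces for several modules, derive a
    top-level gating condition: the shared clock is needed only on
    cycles where at least one module is active -- a toy model of
    sequential clock-gating analysis."""
    cycles = zip(*module_enables.values())
    return [any(en) for en in cycles]

# Hypothetical enables for three modules over 8 cycles.
enables = {
    "dma":  [0, 0, 1, 1, 0, 0, 0, 0],
    "uart": [0, 0, 0, 1, 1, 0, 0, 0],
    "spi":  [0, 0, 1, 0, 0, 0, 0, 0],
}
gate = common_clock_gate(enables)
saved = gate.count(False)  # cycles where the subtree clock can be off
```

In this made-up trace the shared clock can be stopped on 5 of 8 cycles, even though each individual module is already locally gated, which is the win the deep sequential analysis is after.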

Another power hog is memory. “We have a heavy reliance on customized memory compilers and more specifically customized memory instances,” says Galceran-Oms. “In large networking, HPC and AI designs, there is a clear trend toward highly repetitive, tile-based architectures. As such, these designs sport a reduced number of different memory instance sizes (words x bits) repeated many times. Given that memories represent 30% to 50% of overall power consumption and die area in these designs, the impact is very significant.”

Why stop there? “We see some companies that go farther back upstream because they recognize that it is at the architecture level, and even the software architecture, that is the biggest lever for overall power efficiency,” says Knoth. “There is only so much you can do sizing gates and placing gates where you are polishing the problem, but it is farther up that you decide if you should use a ripple carry adder or a carry look-ahead adder. Even farther up, where you are looking at how many cores will we need on this chip, how will the workload be distributed across them, when are we shutting things off, when are we going into sleep—do we need a deeper sleep mode?”
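The adder choice Knoth mentions is a classic example of an architectural lever. A first-order delay sketch (the gate-delay constants and grouping factor are simplifying assumptions, not any vendor's timing model) shows why the choice matters:

```python
import math

def ripple_carry_delay(bits, gate_delay=1.0):
    """The carry ripples through every bit, so delay grows linearly:
    roughly two gate levels per full adder (toy model)."""
    return 2 * bits * gate_delay

def carry_lookahead_delay(bits, gate_delay=1.0, group=4):
    """Group carries are computed in parallel in a tree, so delay grows
    roughly logarithmically with width (toy model)."""
    levels = math.ceil(math.log(bits, group)) if bits > 1 else 1
    return (2 + 2 * levels) * gate_delay  # P/G generation + carry tree
```

A 32-bit ripple adder is far slower than a look-ahead one under this model, but it uses fewer gates and less switched capacitance, so a low-activity datapath may actually be more energy-efficient with the slower adder. That trade-off is exactly the kind of decision that is invisible once you are down at gate sizing.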

One question is how such high-level changes get made. “On top of clock and memory gating, we provide a rich set of guidance that we suggest to the user for further improvements in dynamic power consumption,” says Fahad. “An example would be a redundant mux toggle. If there is a mux in the design and one of the pins is toggling when the select is looking at another input of the mux, the design can be altered. Another example would be a suggestion to change a shift register to a circular buffer. The user then looks at these recommendations and can manually implement changes that bring about power reduction. Customers do not like tools that change RTL because controllability of the RTL is lost.”
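The shift-register-to-circular-buffer rewrite saves power because far fewer bits are written per sample. A back-of-envelope Python comparison (the depth and sample counts are made up):

```python
def shift_register_writes(depth, samples):
    """In a shift register, every new sample moves every stage, so each
    sample costs 'depth' register writes (toy accounting)."""
    return depth * samples

def circular_buffer_writes(depth, samples):
    """In a circular buffer, only the slot at the head pointer is
    written per sample; the small pointer update is ignored here."""
    return samples

# Made-up numbers: a 16-deep delay line fed 1,000 samples performs
# 16,000 register writes as a shift register but only 1,000 as a
# circular buffer -- the same data, a fraction of the toggling.
```

The data is identical either way; only the amount of switching changes, which is why a tool can flag this pattern purely from activity analysis.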

But in some tools, that is the point. “With high-level synthesis, it is not a human who is picking the end RTL representation,” says Knoth. “You can explore architectures more efficiently at the C-level and make a quantitative choice based on how power efficient the architecture is rather than trusting what someone thought and started coding in RTL. We are seeing that a lot in imaging processors where they do spend more time at the algorithm level.”

Tying them together
All power optimizations can benefit from knowing context. “A good example that benefits everyone is the usage of emulation technology to do power analysis,” says Zhao. “That is a quick way to see if one architecture is more dynamic power efficient than another choice. Once they have that, they can provide the activity as guidance to the implementation side. The activity from emulation runs can help narrow down where and when the peak power will happen. Then I can look at the silicon to see if it will be a problem because of thermal.”

Design and fabrication come together in another way. “For more sophisticated users, or users with bigger systems, multi-die solutions may be advantageous,” adds Knoth. “Here you can optimize which parts of the design are more ideal for certain process technologies, and that will help to improve the overall power efficiency and the power density. If you are not trying to cram everything into one piece of silicon, that can be a good mitigation strategy that is process node independent.”

Conclusion
Moving to the latest node used to be the easy path to reduced power, but that is no longer the case. That means companies have to invest more time and effort, at every stage of the flow, in reducing the power of the designs they already have.

It is also clear that having good use-cases that faithfully represent typical workloads of your system can heavily influence the power optimization strategies.

What can you hope to achieve? “This is a number that everyone wants to be quoted, but it is not possible,” says Fahad. “It all depends on the maturity of the design, the application, the activity and other factors. We have seen reductions that range from 2% to 50%. It depends upon how good the design was to start with. If you are a very good designer who is power-aware, you will not get a large reduction. But if you are less experienced, then you may see larger improvements.”

Related Stories
Power Delivery Affecting Performance At 7nm
Slowdown due to impact on timing, and dependencies between power, thermal and timing that may not be caught by signoff tools.
Power Issues Rising For New Applications
Why managing power is becoming more difficult, more critical, and much more expensive.
Taming NBTI To Improve Device Reliability
Negative-bias temperature instability can cause an array of problems at advanced nodes and reduced voltages.
2.5D, 3D Power Integrity
Things to consider in advanced packaging.


