Power Issues Causing More Respins At 7nm And Below

Power may be the top reason why advanced chips are failing, but you have to look behind the numbers to draw conclusions.


Power consumption has been a major design consideration for some time, but it is far from being a solved issue. In fact, an increasing number of designs have a plethora of power-related problems, and those problems are getting worse in new chip designs.

Many designs today are power-limited or, perhaps more accurately, thermal-limited. A chip can consume only as much power as it can dissipate while keeping every part of the chip below the temperature at which it would fail. Thermal analysis is difficult because the relevant timescales are on the order of seconds or minutes, whereas the devices operate at multiple GHz. That means the vector sets needed for analysis are enormous, and run times are extremely long.
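As a rough illustration of the thermal limit, a steady-state approximation treats the whole package as a single thermal resistance between junction and ambient. The sketch below makes that assumption, and the temperature and resistance values are hypothetical rather than taken from any design discussed here. It also ignores local hot spots, which is exactly what a full thermal analysis has to capture.

```python
def max_sustained_power_w(t_junction_max_c, t_ambient_c, theta_ja_c_per_w):
    """Steady-state power budget: P = (Tj_max - T_ambient) / theta_JA."""
    if theta_ja_c_per_w <= 0:
        raise ValueError("thermal resistance must be positive")
    return (t_junction_max_c - t_ambient_c) / theta_ja_c_per_w


# Hypothetical values: 105 C junction limit, 45 C ambient, 0.5 C/W package.
budget_w = max_sustained_power_w(105.0, 45.0, 0.5)
print(f"Sustained power budget: {budget_w:.1f} W")  # 120.0 W
```

Lowering the thermal resistance (better packaging or cooling) or raising the allowed junction temperature both enlarge the budget, which is why the limit is thermal rather than purely electrical.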

How bad is the problem? It’s not entirely clear, because the numbers hide the true story. In the latest functional verification study by Wilson Research and Siemens EDA (see figure 1), power issues show up as the third-largest cause of chip respins. But that is just the average number for all designs.

Fig. 1: Cause of ASIC respins. Source: Siemens EDA

“Roughly one in four ASIC respins are due to power issues,” says Harry Foster, chief verification scientist at Siemens EDA and the driver behind these studies. “While the data may suggest an improving trend over the past six years, power issues remain the third leading cause of respins according to the data. However, I believe this is a conservative number since some projects are likely to lump a power controller bug in with a functional bug. I can’t extract this type of information from the survey data.”

A full understanding of what is behind the numbers requires a deeper dive, so Foster cross-correlated that data with design sizes and fabrication nodes. “There is a statistically significant increase of respins due to power issues that occurred in 7nm or smaller,” he says. “In fact, it is twice as much as 28nm to 10nm. Above that, we see that power-related respins start to decline significantly as node size increases. As for gate count, there is also a statistically significant increase of respins due to power issues that occurred in designs greater than 10M gates. In fact, the largest bucket was on designs greater than 10 billion gates. Conclusion: The data suggest smaller node size and larger gate count are a contributing factor to power-related respins.”

That means power could be the number one cause of respins for the biggest designs at the latest nodes, and that is before adding in the failures that are not recorded as power-related. For example, timing-related issues may well be caused by power.

“There are situations where you didn’t anticipate a dip in voltage,” says Marc Swinnen, director of product marketing at Ansys. “When that happens, it is not that a transistor stops working. It just switches slower. Those slowdowns in a path cause it to not work at frequency. The drop in fmax that people are experiencing is because they didn’t capture the right switching combinations that caused this situation.”
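One common first-order way to reason about this effect is the alpha-power-law delay model, in which gate delay scales as V / (V - Vth)^alpha. The sketch below uses that model as an assumption; the supply, threshold, and alpha values are hypothetical, and real sign-off tools rely on characterized library data rather than a closed-form expression.

```python
def relative_delay(vdd, vth, alpha):
    """Alpha-power-law gate delay, up to a constant factor."""
    if vdd <= vth:
        raise ValueError("supply must exceed threshold voltage")
    return vdd / (vdd - vth) ** alpha


def fmax_scale(v_nominal, v_drooped, vth=0.30, alpha=1.3):
    """Ratio of achievable frequency under droop to nominal frequency."""
    return relative_delay(v_nominal, vth, alpha) / relative_delay(v_drooped, vth, alpha)


# Hypothetical 10% droop on a 0.75 V supply.
scale = fmax_scale(0.75, 0.675)
print(f"fmax under droop is {scale:.1%} of nominal")
```

The transistors still switch at the lower voltage, but every gate on the path is slower, so a path that met timing at the nominal supply can miss it during the dip.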

There is a methodology gap emerging. “When you’re talking about bringing a software aspect into hardware design, you’re potentially talking billions of cycles of data,” says Piyush Sancheti, vice president of system architects at Synopsys. “For example, we have a customer that wants to profile the energy in their chip for five minutes of operation. Five minutes at roughly a gigahertz frequency basically equates to 300 billion cycles. They need to emulate 300 billion cycles of the actual application running, and want to do the power and energy consumption profile for that time period. This calls for a very high-speed emulation solution.”
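The cycle count follows directly from run time multiplied by clock frequency: 300 billion cycles at roughly 1 GHz corresponds to about 300 seconds of operation. The sketch below reproduces that arithmetic. The 1 MHz emulator throughput is a hypothetical figure chosen only to show how wall-clock time scales, not a number from any vendor.

```python
def cycles_for(duration_s, clock_hz):
    """Number of clock cycles executed in duration_s at clock_hz."""
    return round(duration_s * clock_hz)


def emulation_hours(total_cycles, emulator_hz):
    """Wall-clock hours needed to emulate total_cycles at emulator_hz."""
    return total_cycles / emulator_hz / 3600


total = cycles_for(300, 1_000_000_000)     # 300 s at 1 GHz
hours = emulation_hours(total, 1_000_000)  # hypothetical 1 MHz emulator
print(f"{total:,} cycles, about {hours:.0f} hours of emulation")
```

Even under this optimistic assumption the run takes days, which is why the quote calls for a very high-speed emulation solution.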

But the methodology issue goes deeper than that. “We see systems failing when a new version of the software is installed,” says Amir Attarha, product manager for Veloce Power at Siemens EDA. “This is really an indicator of a problem with the flow. Typically, verification people run tiny tests. They are written by verification engineers, and run during simulation and estimating power. They should be focusing on workloads or end user software, which will often reveal problems missed by small test cases. For power estimation, the trend is shifting left. It currently spans all the way from sign-off, and now to much earlier power analysis. This means that regression-based power analysis should no longer be relegated to an afterthought. Key power indicators must be re-assessed weekly when they run the functional verification. That means they need a regression suite for power.”
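A power regression can be as simple as comparing a handful of key power indicators against a stored baseline each time the functional regression runs. The sketch below is one minimal way to do that. The indicator names, values, and 5% tolerance are hypothetical, and in practice the numbers would come from a power estimation or emulation tool.

```python
def power_regressions(baseline, current, tolerance=0.05):
    """Return indicators that grew by more than `tolerance` versus baseline."""
    flagged = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None or base <= 0:
            continue  # indicator missing or baseline unusable
        growth = (now - base) / base
        if growth > tolerance:
            flagged[name] = growth
    return flagged


# Hypothetical weekly snapshot, in milliwatts.
baseline = {"avg_power_mw": 820.0, "peak_power_mw": 1500.0, "idle_power_mw": 40.0}
current = {"avg_power_mw": 835.0, "peak_power_mw": 1710.0, "idle_power_mw": 39.0}

flagged = power_regressions(baseline, current)
for name, growth in flagged.items():
    print(f"{name} regressed by {growth:.1%}")
```

Running the same check on real workloads or end-user software, rather than on small directed tests, is what exposes the problems described above.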

The past few years also have witnessed an increase in the number of companies attempting large chips. “There are two segments that jump out with significantly more power issues,” says Foster. “They are automotive, and CPU/microprocessor/embedded processors/etc. Automotive doesn’t surprise me. I have been in multiple discussions with automotive companies recently that have jumped into the silicon business. Some have a very naive view about complex IC design, and some have made incorrect assumptions. In addition, my gut feeling is that some system houses are also naive, but I can’t quantify it with the data I have. Concerning processors, I am unable to distinguish, for example, if this could have been a custom processor being created by a system company.”

Another group of companies wants power optimization to stay hidden and be handled by tools. “Energy optimization tools cannot make the software development process more difficult, and it is felt that energy savings are meant to happen behind the scenes,” says Scott Hanson, CTO and founder of Ambiq. “The desired result is to provide device users with a more favorable user experience, as the device will benefit from the increased battery lifetime and performance as the energy-saving operations will happen ‘magically’ behind the scenes.”

Interrelated optimization
As systems become more complex, many of the optimizations become more difficult. “There are multiple tradeoffs to make,” says Sergio Marchese, senior member of technical staff at SmartDV. “For example, the more refined the clock gating function, the more gates it will cost, as well as engineering effort to implement and verify. The challenge is that IP developers often cannot measure the impact of their decisions. Only system integrators can tell if the savings achieved are worth the effort and extra logic. A streamlined process is needed to quickly and reliably adapt the IP power-saving functions to the specific needs of the system so that it can deliver the optimal balance.”

Power issues also can get entangled with aging, and dealing with this requires sophisticated on-chip monitoring. It can show you how certain library elements shift with time and the remaining timing margin along critical paths. It requires careful planning. “Where do you put the sensors? If you scatter them evenly across a grid, the chip only has certain hot spots,” says Ansys’ Swinnen. “So you are wasting a lot of sensors. But you don’t want to miss a hot spot. If you do enough analysis, you know where the hotspots are. Tools to provide this information are fairly new. How much safety margin do you build in? The more you can predict this, the less you have to guess and guard-band. Whoever has the best thermal tools can come closer to the margin and get higher performance or reliability for the same cost.”
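The placement question can be illustrated with a simple greedy heuristic: take a predicted thermal map, pick the hottest cells first, and skip candidates that sit too close to a sensor already placed. This is only a sketch of the idea, not how any commercial tool works, and the temperature grid and spacing rule are hypothetical.

```python
def place_sensors(thermal_map, count, min_spacing=2):
    """Greedily choose up to `count` hot cells at least `min_spacing` apart."""
    cells = sorted(
        ((temp, r, c) for r, row in enumerate(thermal_map) for c, temp in enumerate(row)),
        key=lambda cell: -cell[0],  # hottest first
    )
    chosen = []
    for _, r, c in cells:
        if len(chosen) >= count:
            break
        # Manhattan distance keeps sensors from clustering on one hot spot.
        if all(abs(r - pr) + abs(c - pc) >= min_spacing for pr, pc in chosen):
            chosen.append((r, c))
    return chosen


# Hypothetical predicted temperatures (C) on a 4x4 grid.
thermal_map = [
    [60, 62, 61, 60],
    [63, 95, 90, 61],
    [62, 88, 64, 60],
    [61, 63, 62, 85],
]
sensors = place_sensors(thermal_map, count=2)
print(sensors)  # [(1, 1), (3, 3)]
```

With only two sensors, the heuristic covers both hot regions, whereas an even grid would spend most of its sensors on cool areas.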

This is a field that is developing rapidly. “Aging can impact all forms of communications,” says Tomer Morad, group director for R&D in Synopsys’ Embedded Software Development Group. “We now have monitors for die-to-die signal integrity. They are measuring the quality of the signal crossing the balls basically, from one die to the next. It’s high-rate sampling of the signal. From that sampling, you can generate an eye diagram which gives you basically the quality of your signal. Stress is a mechanical thing. As the device heats up, then cools down, stress will be there. And that’s one of the concerns now, because from die-to-die, there are thousands of connections. The probability of something failing is increasing, and also the speed going through those connections is increasing. So there is high interest in monitoring.”

Advances are happening in many areas. “You cannot always predict the workload in a data center,” says Firooz Massoudi, member of the technical staff at Synopsys. “Because of that, you cannot set a frequency or voltage statically. You want to be able to observe what’s going on. If a core has something running on it, it will be running at a higher temperature, and you may have to adjust the voltage accordingly. You also can use that feedback in many ways. For example, if you have many cores, and some of the cores are closer to the perimeter of the die, they will likely run cooler. So the optimizer software can distribute the software load accordingly, and maybe cooler cores will get the load versus another.”
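The load-distribution idea can be sketched as a greedy scheduler that always hands the next task to the coolest core and then raises that core's estimated temperature. The core names, temperatures, and the fixed heating coefficient are hypothetical. A real optimizer would use live sensor readings and a proper thermal model.

```python
import heapq


def assign_tasks(core_temps_c, task_loads, heat_per_load_c=2.0):
    """Assign each task to the currently coolest core, heaviest task first."""
    heap = [(temp, core) for core, temp in core_temps_c.items()]
    heapq.heapify(heap)
    placement = {}
    for task, load in sorted(task_loads.items(), key=lambda kv: -kv[1]):
        temp, core = heapq.heappop(heap)
        placement[task] = core
        # Assumed linear heating: each unit of load adds heat_per_load_c.
        heapq.heappush(heap, (temp + heat_per_load_c * load, core))
    return placement


# Hypothetical sensor readings: perimeter cores run cooler than center cores.
cores = {"edge0": 55.0, "edge1": 57.0, "center0": 70.0, "center1": 72.0}
tasks = {"a": 3, "b": 2, "c": 1}

placement = assign_tasks(cores, tasks)
print(placement)  # {'a': 'edge0', 'b': 'edge1', 'c': 'edge0'}
```

In this example all three tasks land on the cooler perimeter cores, matching the behavior the quote describes.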

Modeling and analysis
If you can’t measure something, you can’t optimize it. This is a stumbling block for many power reduction techniques, and it all starts with models.

“A certain amount of accuracy is needed, but not so much accuracy that the turnaround time for the experiments is untenable,” says Rob Knoth, product management director in Cadence’s Digital & Signoff Group. “We do have UPF (Unified Power Format) that defines the power intent. Originally, this was ad hoc stuff that people who were building the first cell phone chips were cooking up themselves. But then they needed to scale it, to productize it, to automate it, etc. Now, pretty much every design is using some form of power intent.”

Those UPF files can become very complex. “UPF is the mechanism to capture sophisticated power management intent,” says Synopsys’ Sancheti. “We’ve seen designs where the UPF power intent is the same number of lines, or more, than the design itself. It is absolutely the case that complexity of power management is going up very rapidly.”

But this leaves two problems. First, how do you verify the UPF contents? And second, this does not provide the power or energy profiles. It only defines the power features that should exist in the design.

“If somebody wants to do a true software-level power analysis, you can’t expect to be doing power estimation the old fashioned way,” Sancheti says. “That required converting the design into gates, and then computing the power of each gate in the design. That’s useful for RTL or implementation level power, but what is needed for system-level power analysis is modeling. And these have to be higher-level system models. System-level power modeling has been a well-understood but unsolved problem in the industry for the last 20-plus years.”
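The gate-by-gate approach described in the quote amounts to summing a per-gate contribution across the whole netlist. A minimal sketch for dynamic power uses the standard switching model, activity × capacitance × V² × f, per gate. The gate list and operating point below are hypothetical, and leakage and short-circuit power are ignored.

```python
import math


def dynamic_power_w(gates, vdd, clock_hz):
    """Sum of activity * capacitance * Vdd^2 * f over all gates."""
    switched_cap = sum(g["activity"] * g["cap_f"] for g in gates)
    return switched_cap * vdd ** 2 * clock_hz


# Hypothetical two-gate netlist: switching activity and load capacitance (farads).
gates = [
    {"activity": 0.1, "cap_f": 2e-15},
    {"activity": 0.5, "cap_f": 1e-15},
]
power = dynamic_power_w(gates, vdd=0.8, clock_hz=1e9)
print(f"{power * 1e6:.3f} uW")
```

The cost of this method grows with both gate count and the number of cycles needed to obtain realistic activity figures, which is why it does not scale to software-level workloads.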

There was hope for progress a few years back. “In 2019 there was the release of what was called the Unified Power Model (UPM),” says Gabriel Chidolue, senior principal product engineering manager for Siemens EDA. “Originally developed within Si2, it was then transferred to the IEEE 2416 committee. There was a lot of buzz at that time. But like all these new standards, a lot of the users need to figure out how that fits into their workflow. People have struggled to figure out how to estimate power for a long time, and they’ve just settled on some solutions. But they have not exploited the existing infrastructure just yet.”

Instead, users appear to continue concentrating on lower-level power optimizations, and those require more accurate power estimation than often can be attained using system-level modeling. But an array of techniques is being used by some companies.

“We use a lot of modeling and simulation ahead of time to determine which hardware and software mechanisms and policies should be implemented,” says Chris Redpath, technology lead of power software within Arm’s Central Engineering. “We also use power instrumentation, use-case measurement, and analysis of scheduling data to figure out how effectively the controls are working, and to find further improvements. In the software teams, we work closely with our hardware teams to understand what they implement and provide guidance on which controls and metrics could be used effectively by software. The end result is a virtuous circle, providing well-integrated systems that run efficiently and provide high peak performance.”

Virtual prototypes can help in this regard. “Some of our customers are applying hardware-based emulation to generate a robust set of functional stimuli that model different workloads,” says Cadence’s Knoth. “Then, using high-level synthesis and RTL-based power estimation, they are able to explore many different scenarios (see figure 2). Should this be in software? Should we put this on a DSP? Should we do a hardware accelerator, or a dedicated hardware accelerator?”

Fig. 2: Using virtual prototype to perform hardware/software tradeoffs. Source: Cadence

This often requires fairly crude power models. “A lot of power management is being conducted in software,” says Sancheti. “You need a representation of the hardware that allows you to do that level of tradeoffs before you commit to writing a single line of RTL. There is a lot more emphasis on architectural-level planning and exploration, which can help system designers do this type of what-if IP selection, the domain partitioning, the voltage-versus-DVFS strategy. It all needs to happen at the architectural level, and this is a big area of investment for our customers.”

While power may be one of the biggest problems facing the semiconductor industry, there are no well-defined methodologies or tools that assist with power optimization above the register transfer level. Power modeling, analysis, and verification techniques are crude compared with those adopted for functional verification, and there is no indication this will improve in the near future. We can therefore conclude that power will continue to cause a lot of respins, and that the trend for advanced designs is likely to worsen while ad hoc techniques continue to be used.
