The Rising Price Of Power In Chips

More data requires faster processing, which creates a host of problems — not all of which are obvious, or even solvable.

Power is everything when it comes to processing and storing data, and much of the news on that front isn’t good. Power-related issues, particularly heat, dominate chip and system designs today, and those issues are widening and multiplying.

Transistor density has reached a point where these tiny digital switches are generating more heat than can be removed through traditional means. That may sound manageable enough, except it has created a slew of new problems that the entire industry will need to solve: EDA companies, process equipment makers, fabs, packaging houses, field-level monitoring and analytics providers, materials suppliers, research groups, and more.

Underlying all of this activity is a continued focus on packing more transistors into a fixed area, and an associated and accelerating battle with leakage power. FinFETs solved leaky gate issues at 16/14nm, but the problem resurfaced just two nodes later. At 3nm, a very different transistor structure is being introduced with gate-all-around FETs (a.k.a. nanosheets), which make design, metrology, inspection, and test significantly more challenging and expensive. At 2nm/18A, power delivery will begin flipping from the frontside of a chip to the backside, alleviating the routing congestion that makes it difficult just to get sufficient power to the transistors, and beyond that the industry is likely to change its transistor structure once again, this time to complementary FETs (CFETs). That is a lot of process and structural change in a short time window, and each new node will add more issues that need to be tackled.

For example, a growing concern in high-density chips and packages is transient thermal gradients. They can move in unpredictable ways, sometimes very quickly and other times not, and they can vary with the workload. At 40nm, with thicker dielectrics and substrates and more relaxed pitches, these gradients were considered annoyances. With today’s leading-edge process technology, they need to be taken much more seriously.

“The base leakage is lower than the previous technology, but the overall total power is higher,” said Melika Roshandell, product management director at Cadence. “So at the end of the day, your thermal is going to be worse because you’re packing a lot more transistors into one IC and you’re pushing the performance. You want to go to higher and higher frequencies, and to do that you’re increasing the voltage, you’re increasing the power. And now your total power is more than the previous generation, so your thermal is going to be worse. Not only that, when you go to a lower node, your area is also shrinking. That area shrinkage and increase in total power is going to be a recipe sometimes for disaster for your thermal, and you’re not going to meet your performance because your thermal is going to hit a lot faster than what you were expecting.”
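
Roshandell’s arithmetic can be reduced to a few textbook first-order relations, which are useful for intuition even though they are no substitute for signoff analysis (here α is the switching activity factor, C the switched capacitance, A the die area, and θ_JA a lumped junction-to-ambient thermal resistance):

```latex
P_{\mathrm{total}} \;=\; \underbrace{\alpha\,C\,V^{2}f}_{\text{switching}} \;+\; \underbrace{V\,I_{\mathrm{leak}}(T)}_{\text{leakage}},
\qquad
p_{\mathrm{density}} \;=\; \frac{P_{\mathrm{total}}}{A},
\qquad
T_{j} \;\approx\; T_{\mathrm{ambient}} + \theta_{JA}\,P_{\mathrm{total}}
```

Raising the voltage to hit a higher frequency grows the switching term quadratically, shrinking the area raises power density even at constant total power, and because leakage itself climbs with temperature, the loop feeds on itself unless the cooling path improves.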


Fig. 1: Thermal-mechanical co-simulation of 3D-IC design under operation. Source: Cadence

Heat is becoming every hardware engineer’s shared nightmare, and it sets up some vicious cycles that are difficult to break and to model up front:

  • Heat accelerates the breakdown of dielectric films (time-dependent dielectric breakdown, or TDDB) used to safeguard signals, and it adds mechanical stress, which can cause warping.
  • It accelerates electromigration and other aging effects, which can narrow data paths. That adds more heat, due to the higher resistance in those circuits and the extra energy required to drive signals through them, until the signals can be re-routed (if possible); a simple sketch of how temperature accelerates these wear-out mechanisms follows this list.
  • It can impact the speed at which memory operates, slowing overall performance of a system.
  • It creates noise, which impacts signal integrity. That noise can be transient, too, which makes partitioning much more difficult.
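
The temperature sensitivity of those wear-out mechanisms is commonly modeled with an Arrhenius-type acceleration factor, the same temperature term that appears in Black’s equation for electromigration. A minimal sketch, with a placeholder activation energy (real values are mechanism- and process-specific):

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration(t_use_c: float, t_stress_c: float, ea_ev: float = 0.7) -> float:
    """Acceleration of a thermally activated wear-out mechanism (e.g. electromigration)
    when the junction runs at t_stress_c instead of t_use_c.
    ea_ev is a placeholder activation energy; real values depend on the mechanism."""
    t_use = t_use_c + 273.15      # convert to Kelvin
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# A die budgeted for 85°C that sits in a 105°C hotspot ages roughly 3x faster.
print(f"{arrhenius_acceleration(85, 105):.1f}x faster wear-out")
```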

All of these factors can shorten the lifespan of a chip, or even a portion of a chip. “Thermally degrading transistors can easily make or break a chip or IP,” said Pradeep Thiagarajan, principal product manager for analog and mixed-signal verification solutions at Siemens EDA. “Fortunately, self-heating analysis of most devices can be done to assess the impact of localized heating on a design by means of a transient measurement for every MOS device, followed by loading that temperature delta data and assessing the impact of the waveforms. Now, novelty is required across the board, given increasing data transfer rate requirements. So the better every thermal interface material is modeled, the higher the chances of addressing those effects and making any appropriate design changes to avoid a short-term or long-term hardware failure. The net is that you need novel thermal solutions, but you also have to model it properly.”
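
In practice, the self-heating flow Thiagarajan describes often ends in a bookkeeping step: extract a temperature delta per device from a transient run, annotate it back onto the device models, and flag anything whose rise eats into margin. A minimal, tool-agnostic sketch of that last step (the report format, column names, and 10°C budget are hypothetical):

```python
import csv

DELTA_T_BUDGET_C = 10.0  # hypothetical per-device self-heating budget

def flag_self_heating(report_path: str) -> list[tuple[str, float]]:
    """Read a per-device self-heating report (columns: device, delta_t_c) and
    return the devices whose temperature rise exceeds the budget."""
    hot_devices = []
    with open(report_path, newline="") as f:
        for row in csv.DictReader(f):
            delta_t = float(row["delta_t_c"])
            if delta_t > DELTA_T_BUDGET_C:
                hot_devices.append((row["device"], delta_t))
    # Devices on this list get re-simulated with their elevated temperature
    # annotated onto the MOS models before the waveforms are re-checked.
    return sorted(hot_devices, key=lambda d: d[1], reverse=True)
```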

Power issues abound
Many chipmakers are just starting to wrestle with these issues, because most chips are not developed at the most advanced processes. But as chips increasingly become collections of chiplets, everything will have to be characterized and operate under conditions that are foreign to planar chips developed at 40nm or higher.

What is not always obvious is that increasing transistor density, whether in a single chip or inside an advanced package, isn’t necessarily the biggest knob to turn for boosting performance. It does, however, increase the power density, which limits the clock frequency. As a result, many of the big improvements are peripheral to the transistors themselves. Those include hardware-software co-design, faster PHYs and interconnects, new materials for insulation and electron mobility, more accurate pre-fetch with shorter recovery times for misses, sparser algorithms, and new power delivery options.

“The understanding of the full system stack is really important,” said Vincent Risson, senior principal CPU architect at Arm. “The computer, of course, has an important contribution to the power, but the rest of the system is also very important. That’s why we have different levels of cache, and the size of the cache is different. We have increased that over the last generation because it’s better to have something local so that the power downstream sees compute as local. And as we scale to 3D, we can imagine having 3D stacked caches, which is an opportunity to basically reduce data movement and improve efficiency.”

The key is to add efficiencies into every aspect of the design cycle, not just the hardware. The chip industry has been talking about hardware-software co-design for the past couple of decades, but systems companies have made it a priority with their custom-designed micro-architectures, and mobile device makers are pushing to extend battery life significantly further for competitive reasons.

“There is a lot of tuning to extract more, and that’s a big focus for the CPU,” said Risson. “We are continuing to make improvements in all the pre-fetch engines, for example, to improve the accuracy of that and to reduce the downstream traffic. So we have better coverage, but we also initiate less traffic on the interconnect.”

That’s one piece of the puzzle, but more are required. Consider the breakdown of dielectric films over time, for example. It can be accelerated by different workloads or operating conditions, particularly inside a package filled with chiplets. “TDDB is a problem because we have so many signals and so many polygon nets running on different voltages,” said Norman Chang, fellow and chief technologist for Ansys’ Electronics, Semiconductor, and Optics business unit. “If you have a net next to a signal net with a different voltage, then the dielectric will see different voltages. As time goes on, you will see a time-dependent dielectric breakdown. This is a new problem, and we need to come up with a solution for it.”

Inconsistencies
Thermal gradients are another challenge, particularly when they are transient, varying greatly from one workload to another. This problem is particularly acute in 2.5D, where it can cause warpage, and in 3D-ICs, which are expected to roll out sometime in the next couple years. In both cases, heat can become trapped, creating a snowball effect.


Fig. 2: Thermal and mechanical analysis results showing temperature gradients on 2.5D IC, including warpage at 245°C. Source: Ansys

“If you look at the power consumption in a 3D-IC, it’s very much related to temperature,” said Chang. “When the temperature increases, the leakage power will increase, and the thermal gradient distribution is the center of the multi-physics interaction in a 3D-IC. Temperature will affect power, but it also will affect the resistance. The resistance will increase when the temperature increases, and that also will affect the dielectric constant. That will affect the signal integrity and the power integrity, and it will affect the stress. And when you are mixing digital and analog in a 3D-IC, the analog is more sensitive to stress. You have to know where is the thermal gradient, where is the thermal hotspot, because you have to move the analog components away from the hotspot. If you see a thermal cycling for the analog component, you will speed up the aging of the device, you will start seeing a transistor mismatch, and the efficiency of the analog circuit will decline rapidly compared to the digital logic.”
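
Two first-order relations sit behind much of that coupling: leakage current is thermally activated, so it rises steeply with temperature, while metal resistance rises roughly linearly with it. Both are simplifications of what a foundry model actually captures, but they show why a hotspot degrades power integrity, timing, and signal integrity at the same time:

```latex
I_{\mathrm{leak}}(T) \;\propto\; e^{-E_{a}/(k_{B}T)},
\qquad
R(T) \;=\; R_{0}\,\bigl[1 + \alpha_{\mathrm{TCR}}\,(T - T_{0})\bigr]
```

More leakage means more heat, and higher resistance means more IR drop and more Joule heating for the same current, which is exactly the feedback loop Chang describes.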

And this is just getting started. Kenneth Larsen, senior director of product management at Synopsys, noted that getting the placement wrong for various elements in stacked die can create unexpected issues such as thermal cross-talk, which also can degrade overall performance. “We’ve gone from monolithic to chiplet-based design, which is disaggregated, and now these devices are getting closer and can influence each other. When you put one device on top of another, how does the heat escape? This is a big challenge. With 3D-ICs, the first concern is can you build systems with structural integrity. But you also have other mechanical, thermal, and power concerns — the whole shebang.”

In the past, the simplest approach to managing heat was to lower the voltage. That approach is starting to run out of steam, because at extremely low voltages the slightest irregularity can cause problems. “Noise is a topic for very low power technologies, like near-threshold or sub-threshold devices, as well as for high-power devices,” said Roland Jancke, design methodology head in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “It’s also a topic that is hardly understood because it typically does not appear in simulation. It appears later on in the real world, and then you have to understand it and cope with it.”

Cross coupling, for example, can create noise in the substrate, but that isn’t always obvious in the design phase. “We started some years ago with a substrate simulator to figure out what are the cross couplings across a substrate,” said Jancke. “You’re thinking about a single device and the neighboring devices. You don’t think about the cross coupling at the input stage, which is far away, but it’s coupled through the substrate.”

These types of issues can cause problems in DRAM, as well, particularly as bit cell density increases, which also makes it more susceptible to noise. “There’s definitely thermal noise,” said Onur Mutlu, professor of computer science at ETH Zurich. “Also, when you access a cell, you’re creating a noise in the structure because of the electrical interference caused by the toggling of wires, for example, or the access transistor. That activation causes noise, and this leads to some reliability issues. We call it cell-to-cell interference. The row hammer problem is just one example of that. You’re activating one row and causing disturbance in adjacent rows. RowPress is another example, where you keep one row open for a longer period of time, and this disturbs what’s happening in other rows adjacent to it. This sort of cell interference is getting more prevalent as we reduce the size of each cell and put cells closer to each other and increase density. You can get silent data corruption, and that may be what’s happening in the field.”

With power, there are always unexpected issues. “For whatever clock frequency you’re running at, you’d like to run it at the lowest voltage possible, because that’s where you’re going to use the least amount of energy,” said Barry Pangrle, power architect at Movellus. “There’s a certain amount you can model, but as with any models, sometimes you have surprises. I can take a chip, run it under different conditions, and I can play around with the voltage and frequency and get an idea of where it will work under different workloads. ‘Okay, I can use these points, and if I want to be a little bit more conservative, I can always back off a bit and put in a little bit of margin.’ But people aren’t going to do that for every chip. So do you create bins and say, ‘Okay the ones that fall in this category we’ll run at this clock and this voltage?’ Then, some of the granularity will be left up to whoever is selling that chip.”
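
Conceptually, the binning Pangrle describes is simple, even if the characterization behind it is not: for each part, pick the lowest measured voltage that still meets the target frequency, add a guard-band, and sort parts into a handful of V/F bins. A toy sketch of that selection (the pass points and margin below are illustrative, not from any real product):

```python
# Each part is characterized as a list of (voltage_v, fmax_mhz) pass points.
parts = {
    "die_A": [(0.65, 1800), (0.70, 2100), (0.75, 2400)],  # illustrative numbers
    "die_B": [(0.65, 1500), (0.70, 1900), (0.75, 2200)],
}

TARGET_MHZ = 2000
GUARD_BAND_V = 0.02  # conservative margin added to the measured minimum voltage

def min_passing_voltage(points, target_mhz):
    """Lowest characterized voltage at which the part meets the target frequency."""
    passing = [v for v, fmax in points if fmax >= target_mhz]
    return min(passing) if passing else None

for name, points in parts.items():
    vmin = min_passing_voltage(points, TARGET_MHZ)
    if vmin is None:
        print(f"{name}: falls into a lower-frequency bin")
    else:
        print(f"{name}: run at {TARGET_MHZ} MHz, {vmin + GUARD_BAND_V:.2f} V")
```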

Other issues
There also is a monetary aspect to power, and that spans everything from the resources required to create a complex design, to the amount of power consumed in a data center. The higher the transistor density, the more energy it takes to power and cool a rack of servers. And with various flavors of AI, the goal is to maximize transistor utilization, which in turn consumes more power, generating more heat and requiring more cooling.

“These applications are drawing huge amounts of power, and they’re exponentially rising,” said Noam Brousard, vice president of engineering solutions at proteanTecs. “Efficient power consumption will eventually translate into significant savings in the data center. That’s number one. Aside from that, we also have the environmental impact. And, we want to extend the lifetime of the electronics.”


Fig. 3: Impact of power on chips. Source: proteanTecs

Nor are power-related effects confined just to a chip. “With 2.5D, thermal stress is going to cause warpage, and because of that you run the risk of breaking the balls that connect the substrate to the PCB,” said Cadence’s Roshandell. “If it cracks, you get a short, and then your product is not going to work. So how you address that, and how you model it, is important. It has to happen in the earliest stages of the design, where you can envision it and do something about it.”

Things get even more complex in 3D-ICs. Once again, the emphasis is on sizing up the problems early in the design cycle, but in 3D-ICs there are additive effects. “Dynamic switching power is really tricky for 3D-ICs compared to an SoC,” said Ansys’ Chang. “We have to consider the physical architecture as early as possible, because if you have 15 chiplets in a 3D-IC, how do you partition the power among the 15 chiplets for dynamic workflow and time dimension? At a different time you may have a different workload on that chiplet, and that may create a thermal hotspot. But if the top die has a local hotspot and the bottom die also has a local hotspot, if the two local hotspots line up at a certain time, then that hotspot will become a global thermal hotspot. It may be 10 or 15 degrees hotter than the local hotspot if the other die is not switching. This caught 3D-IC circuit designers completely off guard, because when you run emulation for a chiplet in a 3D-IC, you probably cannot run an emulation for the whole 3D-IC with a realistic workflow.”
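
The failure mode Chang describes, where two individually tolerable hotspots line up vertically at the same moment, is easy to miss when each die is analyzed on its own. A simplified sketch of the screening step, assuming each die’s standalone thermal analysis produces a per-tile temperature map per time step (the tile grid and threshold are illustrative):

```python
import numpy as np

LOCAL_LIMIT_C = 95.0  # per-die hotspot threshold (illustrative)

def aligned_hotspots(top_die_maps, bottom_die_maps):
    """top_die_maps / bottom_die_maps: arrays of shape (time_steps, rows, cols)
    holding per-tile temperatures from each die's standalone thermal analysis.
    Returns the (time, row, col) indices where both dies exceed the limit at once,
    i.e. the candidates for a global hotspot neither standalone run would show."""
    top_hot = np.asarray(top_die_maps) > LOCAL_LIMIT_C
    bottom_hot = np.asarray(bottom_die_maps) > LOCAL_LIMIT_C
    return np.argwhere(top_hot & bottom_hot)

# Tiles and time windows flagged here get promoted to a full coupled thermal run,
# since the stacked temperature can land well above either standalone estimate.
```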

The problem is that there are so many dependencies that everything needs to be understood in the context of something else. “There is no way you can optimize these devices independent of each other,” said Niels Faché, vice president and general manager for Keysight’s design and simulation portfolio. “You might have an objective around thermal, such as maximum temperature, heat dissipation, but you need to understand that in the context of mechanical stress. You have to be able to model these individual physical effects. If they are very tightly coupled, you have to do it in the form of a co-simulation. We do that, for example, with an electro-thermal simulation. So when you look at the current that flows through a transistor, it’s going to have an effect on heat. Then, heat has an impact on electrical characteristics, which changes the electrical behavior, and you have to model those interactions.”
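
At its simplest, the electro-thermal co-simulation Faché describes is a fixed-point loop: solve the electrical side at an assumed temperature, feed the resulting power into a thermal model, and repeat until the temperature stops moving. The sketch below uses a single lumped thermal resistance and placeholder leakage parameters, so it illustrates the coupling rather than any particular tool’s solver:

```python
# Placeholder model parameters -- illustrative only, not from any real design.
P_DYNAMIC_W = 20.0        # switching power, assumed temperature-independent here
LEAK_REF_W = 3.0          # leakage power at the reference temperature
LEAK_REF_C = 25.0         # reference temperature for the leakage fit
LEAK_DOUBLING_C = 20.0    # assumed temperature rise that doubles leakage
THETA_JA_C_PER_W = 0.5    # lumped junction-to-ambient thermal resistance
T_AMBIENT_C = 35.0

def leakage_power(t_c: float) -> float:
    """Exponential leakage-vs-temperature fit (a common first-order simplification)."""
    return LEAK_REF_W * 2.0 ** ((t_c - LEAK_REF_C) / LEAK_DOUBLING_C)

def solve_temperature(max_iters: int = 50, tol_c: float = 0.01) -> float:
    """Iterate electrical power <-> temperature until the estimate converges."""
    t_c = T_AMBIENT_C
    for _ in range(max_iters):
        total_power = P_DYNAMIC_W + leakage_power(t_c)
        t_next = T_AMBIENT_C + THETA_JA_C_PER_W * total_power
        if abs(t_next - t_c) < tol_c:
            return t_next
        t_c = t_next
    raise RuntimeError("thermal runaway or non-convergence")

print(f"Converged junction temperature: {solve_temperature():.1f} °C")
```

If the leakage growth outpaces what the thermal path can remove, the loop never settles, which is the numerical signature of thermal runaway in this toy model.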

Solutions
There is no single, comprehensive solution for power-related issues, but there are plenty of partial ones.

One approach, and probably the simplest, is to limit overdesign. “It all starts with focusing on the target use cases and defining the necessary features to address them,” said Steven Woo, Rambus fellow and distinguished inventor. “It’s tempting to add features here and there to address other potential markets and use cases, but that often leads to increased area, power, and complexities that can hurt performance for the main applications of the chip. All features must be looked at critically and judged in an almost ruthless manner to understand if they really need to be in the chip. Each new feature impacts PPA, so maintaining focus on the target markets and use cases is the first step.”

This can have a significant impact on the overall power consumption, particularly with AI. “In AI there are many options to consider, especially for edge devices,” Woo said. “Some options include how the chip will be powered, thermal constraints, if it needs to support training and/or inference, accuracy requirements, the environment in which the chip will be deployed, and supported number formats just to name a few. Supporting large feature sets means increasing area and power, and the added complexity of gating off features when they aren’t in use. And with data movement impacting performance and consuming large amounts of the power budget, designers need a good understanding of how much data needs to be moved to develop architectures that minimize data movement at the edge.”
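
A rough data-movement budget is often the first sanity check: count the bytes a workload moves at each level of the memory hierarchy, multiply by an energy-per-byte figure for that level, and compare the result against the compute energy. The per-byte numbers below are placeholders for illustration only; real values depend heavily on process, interface, and memory type:

```python
# Energy per byte moved, in picojoules -- illustrative placeholders, not measured values.
ENERGY_PJ_PER_BYTE = {
    "local_sram": 1.0,
    "on_chip_interconnect": 5.0,
    "off_chip_dram": 100.0,
}

def data_movement_energy_mj(bytes_moved: dict[str, float]) -> float:
    """Total data-movement energy in millijoules for a per-level byte count."""
    total_pj = sum(ENERGY_PJ_PER_BYTE[level] * n for level, n in bytes_moved.items())
    return total_pj * 1e-9  # pJ -> mJ

# Example: an edge inference pass that streams most of its weights from DRAM.
traffic = {"local_sram": 5e8, "on_chip_interconnect": 2e8, "off_chip_dram": 1e8}
print(f"{data_movement_energy_mj(traffic):.1f} mJ spent just moving data")
```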

Another approach is to run real workloads on a design. “What some customers are doing is saying, ‘Let’s run representative workloads because we don’t know what we don’t know,’” said William Ruby, senior director of product management for low power solutions at Synopsys. “It’s like power coverage. ‘What do we believe is a sustained worst case? What do we believe is a good idle type of workload?’ But what they don’t know is how a new software update may change the entire activity profile. Hopefully it’s an incremental change and they’ve somehow budgeted for that, as opposed to being pessimistic and a little bit more conservative. But how do you predict what’s going to happen with a firmware update?”

Backside power delivery is another option, particularly at the most advanced nodes. “At some point you hit diminishing returns because you’ve got the stuff from the top layers down to the bottom, and a lot of time the stuff in the top layers is your power and ground routing,” said Movellus’ Pangrle. “If you can deliver that from the backside, and you don’t have to go through 17 metal layers up top, that’s a lot of layers you don’t have to go through. Being able to bypass that whole metal stack and come in the back door so you can be closer to the transistors and not have to worry about going through all those vias is like manufacturing magic.”

Using sensors inside of chips and packages to monitor changes in power-related behavior is yet another approach. “In the field there are many things that can degrade performance, so we have to bake in voltage guard-bands,” said proteanTecs’ Brousard. “We know there will be noise. We know there will be excessive workloads. We know that the chip will go through aging. All these factors force us to apply more voltage than is necessary in best-case VDDmin.”

On top of that, copper wires can be used to conduct heat to where it can be dissipated. “You can do simple things like optimizing TSV placement with stacked die, and you may be able to use thermal vias, as well,” said Synopsys’ Larsen. “It’s very complex, but we have always dealt in exponentials in EDA. It’s things we will go and solve. But when you want to mitigate something, you add something that takes away some of the values you’re looking for, and that has to be addressed. For reliability, you may add in redundancies, which could be TSVs or hybrid bonds in the stack.”

Conclusion
Power has been a problem for leading-edge chipmakers for the past couple of decades. A smartphone will warn that it is running too hot and shut down until it cools off, and a rack of servers may shift a load to another rack for the same reason. But chips increasingly are being decomposed into various components and packaged together, and as industries such as automotive begin developing chips at 5nm and below, power issues will fan out in all directions.

Architecture, place-and-route, signal integrity, heat, reliability, manufacturability, and aging are all tightly coupled with power. And as the chip industry continues to combine different features in unique ways to address unique markets, the entire industry will need to learn how to work with or around power-related effects. Unlike in the past, when only the highest-volume chipmakers were concerned with power, it will be the rarer design that can ignore it.

—Ann Mutschler and Karen Heyman contributed to this report.

Related Reading
Next-Gen Power Integrity Challenges
Dealing with physical and electrical effects in advanced nodes and stacked die.
Backside Power Delivery Adds New Thermal Concerns
Lack of shielding, routing issues, and new mechanical stresses could have broad impact on standard cell design.
3D-ICs May Be The Least-Cost Option
Advanced packaging has evolved from expensive custom solutions to those ready for more widespread adoption.


