On-Chip Power Distribution Modeling Becomes Essential Below 7nm

Why and when it’s needed, and what tools and technologies are required.


Modeling power distribution in SoCs is becoming increasingly important at each new node and in 3D-ICs, where tolerances involving power are much tighter and any mistake can cause functional failures.

At mature nodes, where there is more metal, power problems continue to be rare. But at advanced nodes, where chips are running at higher frequencies and still consuming the same or greater power, a lot more current needs to feed into the chip in a much smaller area. Understanding where the power is going, how it gets there, and what can disrupt the flow of electrons is becoming a major challenge.

“Power density is increasing, and because of that voltage drop is increasing, as well,” said Rajat Chaudhry, product management director at Cadence. “Transistors need to operate at a lower voltage relative to 10 years ago, and that means there is less room to play with how much drop we can have on the power network. We need to be much more accurate about our analysis to make sure the transistors are getting the right voltage so the chip operates at the right frequency.”

That has an impact on timing analysis, during which it is expected that the transistors will have a suitable voltage available at their terminals. “For example, you may do the timing while assuming the voltage doesn’t drop below 0.8 volts, or that there is a variation of maybe 20 or 30 millivolts around that,” Chaudhry said. “If the voltage at the transistor terminals drops a lot, or goes much higher than that, it can cause timing failures and functional failure of the chip.”

Others agree. “As transistors are being switched faster, at very low voltages, designs can’t afford to waste any voltage from drop,” said Marc Swinnen, director of product marketing at Ansys. “In the past, if 200 millivolts was lost to drop on 5 volts, no one cared. It was still plenty for the transistors to switch properly. But if you have a fraction of a volt, you can’t afford to lose any of that in voltage drop along the way. While transistors are switching faster, metal layers have also gotten thinner and narrower, so the resistance of the metal layer has been going up significantly. Suddenly, more power has to be squeezed through longer wires that are thinner, so the voltage drop problem has become increasingly acute, and the power supply network must be analyzed to determine the voltage drop.”

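To put rough numbers on that, here is a minimal back-of-the-envelope check of an IR-drop budget against the kind of tight margins Chaudhry and Swinnen describe. Every value in it is an assumption chosen for illustration, not data from any real design or tool.

```python
# Hypothetical IR-drop budget check. All numbers are illustrative assumptions.
supply_v = 0.75        # nominal core supply voltage (V)
budget_frac = 0.05     # allow 5% of the supply to be lost in the grid (assumed)
budget_v = supply_v * budget_frac

rail_res_ohm = 0.8     # effective resistance from bump to cell (assumed)
peak_current_a = 0.04  # peak local current demand of the cells it feeds (assumed)

drop_v = peak_current_a * rail_res_ohm          # Ohm's law: V = I * R
print(f"IR drop {drop_v * 1000:.1f} mV vs. budget {budget_v * 1000:.1f} mV")
print("within budget" if drop_v <= budget_v else "violates drop budget")
```
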
Modeling power distribution was not considered essential in the early days of chip design.

“The power supply consisted of power and ground rails, and the transistors connected between the power and ground,” Swinnen said. “Each row had a rail, and a double ring went around the chip where all the rails tied into — a power ring and a ground ring. Each rail tied into the right one, and that was it. The power rails were big enough that it wasn’t a problem. Metals were thick and wide. Distances were short. Voltage drop was not an issue you had to deal with much.”

As chip sizes increased, the rails between the rings grew so long that power coming down from the ring had to travel a considerable distance along each rail. The middle of a rail is farthest from the ring, so voltages started dropping, reaching a minimum at the rail’s halfway point. Straps were added to alleviate this, but over the past decade power has emerged as an acute problem.

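That midpoint effect is easy to reproduce with a toy model: a rail tied to the ring at both ends, divided into equal resistive segments, with each interior node drawing the same current. The sketch below, in which every value is an assumption, solves the nodal equations and shows the voltage bottoming out halfway along the rail.

```python
import numpy as np

# Toy model of a power rail tied to the ring at both ends. All values assumed.
N = 20          # rail divided into N equal segments
r = 0.5         # resistance of each segment, ohms (assumed)
i_tap = 1e-3    # current drawn at each interior node, amps (assumed)
vdd = 0.75      # ring voltage (V)

g = 1.0 / r
n_int = N - 1                        # interior nodes 1..N-1
G = np.zeros((n_int, n_int))
rhs = np.full(n_int, -i_tap)         # each interior node sinks i_tap

for k in range(n_int):
    G[k, k] = 2 * g                  # two neighboring segments per node
    if k > 0:
        G[k, k - 1] = -g
    if k < n_int - 1:
        G[k, k + 1] = -g

# The end nodes are held at vdd by the ring; fold them into the right-hand side.
rhs[0] += g * vdd
rhs[-1] += g * vdd

v = np.linalg.solve(G, rhs)
print(f"Worst rail voltage {v.min():.4f} V at node {v.argmin() + 1} of {n_int},"
      " i.e. the midpoint of the rail")
```
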
“Power went from being a non-problem to a minor problem,” Swinnen said. “Now it’s a major problem, to the point that it’s one of the significant sign-off checks. The number one technique for reducing power on the chip is to reduce the voltage, and since the power is proportional to the square of the power supply voltage, lowering the voltage has a huge impact on the power across the board. It’s a simple, direct way of lowering power, which is why voltages have continually gone lower, even to ultra-low voltages that barely scrape above half a volt.”

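The square-law relationship Swinnen refers to is the standard dynamic power expression, P = a·C·V²·f. The snippet below plugs in invented numbers purely to show how strongly the supply voltage leverages power.

```python
# Dynamic power scales with the square of the supply voltage: P = a * C * V^2 * f.
# The numbers below are illustrative assumptions, not figures for any real chip.
alpha = 0.15        # average switching activity factor (assumed)
c_total = 2.0e-9    # total switched capacitance in farads (assumed)
freq = 2.0e9        # clock frequency in hertz (assumed)

def dyn_power(vdd):
    return alpha * c_total * vdd ** 2 * freq

for vdd in (0.9, 0.75, 0.55):
    print(f"Vdd = {vdd:.2f} V  ->  dynamic power = {dyn_power(vdd):.2f} W")

# Dropping Vdd from 0.9 V to 0.75 V alone cuts dynamic power by about 31%,
# because (0.75 / 0.9)**2 is roughly 0.69.
```
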
Modeling power distribution on SoCs is essential for IR drop analysis, but it also is necessary for thermal and timing work.

“Historically, engineers have margined their static timing analysis with thermal distribution budgets and IR drop budgets,” said Javier DeLaCruz, distinguished engineer and senior director of system integration at Arm. “However, this extra timing margining means less performance, so more accurate thermal and voltage models are needed to leave less performance on the table.”

Modeling on-chip power distribution is especially critical for 3D-ICs, noted Sutirtha Kabir, R&D director at Synopsys. “The concern is that designers were already struggling to get power up to the transistors even for a single die design, and now they’re hearing they have to push power through all these stacked die, and there will be voltage drops everywhere.”

3D-IC has been discussed for years, but it remains an emerging field due to the challenges of thermal dissipation, various types of noise, and complex floor-planning.

“Single-die design has been done for 20+ years,” Kabir said. “There’s a lot of experience and things to look back at. 3D-IC is still very new. It’s not that the design team has done five of these designs and they know exactly how to build it. Now, maybe for the first time, they are worried that, ‘I’m now going to take this power delivery network early in my design, and it has to be done in the context of the whole 3D-IC. I can’t just design this for its own power demand. It has to be the power demand for this IC plus whatever is over there. So if I don’t do even a back-of-the-napkin, really decent calculation very upfront, and have a prototype power delivery network design, I cannot go back and fix this later on.’ If changes are made at a later stage, they will impact something over there and something over here, and you’d have to go back and re-design the entire power network for the whole 3D-IC.”

Cadence’s Chaudhry agrees. “Previously, when you just had the traditional one chip in a package, the power used to come from the board, then through the package and onto the chip,” he said. “Now, there are multiple chips or chiplets packaged together, and sometimes they are stacked on top of each other. In many cases, the power distribution doesn’t just come through the package to the chip. It actually comes through the package, through one chip to the other chip. That adds another level of complexity of modeling the electrical characteristics of the power network. It adds to the size of the power network, and now you have multiple chips. That’s an area where there will be more innovation required, and it already is reflected in industry work with foundries.”

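One way to picture the added complexity is as a longer series path from the board to the transistors. The sketch below walks a hypothetical two-die stack (package, bottom die, TSV array, top die) and accumulates the drop seen by the top die. Every resistance and current value is an assumption, and only the top die’s own current is traced, for simplicity.

```python
# Hypothetical series walk of a 3D stack's power delivery path
# (package -> bottom die -> TSVs -> top die), tracing only the top die's current.
# Every resistance and current value here is an assumption.
stages = [
    ("package plane + bumps", 0.002),   # ohms (assumed)
    ("bottom-die grid",       0.006),
    ("TSV array",             0.003),
    ("top-die grid",          0.008),
]
i_top_die = 3.0     # current drawn by the top die, amps (assumed)
vdd = 0.75          # supply at the package ball (V)

v = vdd
for name, r in stages:
    drop = i_top_die * r
    v -= drop
    print(f"{name:22s} drop {drop * 1000:5.1f} mV  ->  {v:.3f} V")
```
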
Impact on reliability
Because the power distribution network carries current to all the transistors, over time that current flow causes electromigration, in which the flow of electrons gradually displaces metal atoms and can result in structural changes in the wires.

“The more unidirectional current, which happens in the power network, the bigger the electromigration issue,” Chaudhry said. “Analysis is needed to make sure we don’t have defects related to electromigration. We need to measure how much current each wire in the power network will carry and, based on the current density in that wire, how long we expect it can function without some structural damage to the wire.”

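A common first-order way to relate current density and temperature to wire lifetime is Black’s equation. The sketch below uses it with placeholder fit constants, not foundry data, so only the relative trend is meaningful. It also hints at why the thermal-aware analysis discussed later matters: the same current density ages a wire much faster when it runs hotter.

```python
import math

# First-order electromigration lifetime estimate using Black's equation:
#   MTTF = A * J**(-n) * exp(Ea / (k * T))
# A, n, and Ea are process-dependent fit parameters; the values below are
# placeholders, not foundry data, so only the relative trend is meaningful.
K_BOLTZ_EV = 8.617e-5    # Boltzmann constant in eV/K

def relative_mttf(j_a_per_cm2, temp_c, a=1.0, n=2.0, ea_ev=0.9):
    t_kelvin = temp_c + 273.15
    return a * j_a_per_cm2 ** (-n) * math.exp(ea_ev / (K_BOLTZ_EV * t_kelvin))

base = relative_mttf(1e6, 85)
for temp in (85, 105, 125):
    ratio = relative_mttf(1e6, temp) / base
    print(f"{temp} C: lifetime relative to 85 C = {ratio:.2f}")
```
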
Alongside that, the continued shrinking of features requires thinner metal layers, and at 7nm and below the higher resistance of those layers can cause a lot of localized drop due to simultaneous switching.

“When there are a lot of cells together, switching simultaneously, they can cause very high drops for a very short duration. These drops can cause functional failures, so now we need to also model the switching,” Chaudhry said. “At 28nm or 40nm, we were more concerned about the general level of drop across the power grid. But now a big part of the drop comes from the lower levels of metal, and in a localized way. So we need to start modeling the localized switching of the transistors, and we need to basically model everywhere. At the same time, we need to understand at what time each cell switches and how it’s switching simultaneously with the others, and this requires higher accuracy about when something switches. We also need to cover many more scenarios, because previously it was, ‘Let me figure out the average power consumption of this chip.’ But now I need to start modeling every local area of the chip and the possible combinations of switching. And given that cells switch in different parts of the clock cycle, those cells that switch simultaneously are the ones that are going to cause the drop, and you need to be very careful. You need to model when they switch as well, so the timing becomes critical, too.”

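Conceptually, the localized analysis Chaudhry describes amounts to asking, for each small region of the die and each slice of the clock cycle, how much switching current lands there at once. The toy sketch below bins randomly generated cells into (tile, time) buckets to find the worst local peak; the cell population, currents, and switching times are all invented for illustration.

```python
import numpy as np

# Toy screen for localized dynamic demand: bin each cell's switching current
# into (tile, time) buckets and look for the worst local coincidence.
# The cell population, currents, and switching times are all invented.
rng = np.random.default_rng(0)
n_cells = 10_000
n_tiles = 8 * 8                 # coarse 8x8 grid over the die (assumed)
n_bins = 100                    # time bins across one clock cycle (assumed)

tile_of_cell = rng.integers(0, n_tiles, n_cells)
switch_bin = rng.integers(0, n_bins, n_cells)       # when each cell switches
peak_i = rng.uniform(5e-6, 50e-6, n_cells)          # per-cell peak current (A)

demand = np.zeros((n_tiles, n_bins))
np.add.at(demand, (tile_of_cell, switch_bin), peak_i)

tile, t = np.unravel_index(demand.argmax(), demand.shape)
print(f"Worst local demand: {demand.max() * 1e3:.2f} mA in tile {tile}, time bin {t}")
```
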
Complexity related to simultaneous switching and local transient effects became particularly troublesome at 7nm and below.

“5nm was the point when the industry realized it really needed to do something about it, and the way we handle it is with vector-less modeling, where different switching scenarios are modeled,” he said. “The problem is there are infinite possibilities. Designers can only get maybe 10 or 15 vectors, but you have infinite possibilities. So we have to come up with vector-less methods, whereby the tools give designers the ability to model a lot more switching scenarios. These methods are becoming more complex, and they allow designers to cover more of those scenarios.”

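The intuition behind vector-less methods can be illustrated crudely: rather than relying on a handful of simulation vectors, assign each cell a toggle probability and sample many synthetic switching scenarios, then look at the tail of the resulting current demand. The sketch below does exactly that with made-up probabilities and currents, and is not how any particular commercial tool works.

```python
import numpy as np

# Crude illustration of the vector-less idea: sample many synthetic switching
# scenarios from per-cell toggle probabilities instead of relying on a handful
# of vectors, then inspect the tail of the resulting current demand.
# Probabilities and currents are invented, not derived from a real design.
rng = np.random.default_rng(1)
n_cells = 5_000
toggle_prob = rng.uniform(0.05, 0.30, n_cells)      # assumed activity per cell
cell_current = rng.uniform(5e-6, 50e-6, n_cells)    # assumed current when switching

n_scenarios = 2_000
totals = np.empty(n_scenarios)
for s in range(n_scenarios):
    switching = rng.random(n_cells) < toggle_prob   # which cells toggle this cycle
    totals[s] = cell_current[switching].sum()

print(f"Mean demand   : {totals.mean() * 1e3:.1f} mA")
print(f"99.9th pctile : {np.percentile(totals, 99.9) * 1e3:.1f} mA")
```
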
Another problem is the sheer size of the power distribution network. This is particularly acute for AI chips, which tend to be extremely large; some have as many as 100 billion nodes on their power distribution networks. To simulate that, the tools must be able to handle that capacity and still complete the analysis in a reasonable amount of time.

“Looking at a chip with 50 billion transistors, that means there are 50 billion ground and 50 billion power points to connect to,” said Swinnen. “In this power supply network, each little piece of wire has to be modeled as a resistor, so you end up with billions and billions of resistors. There are designs now with 60 billion to 100 billion nodes on that electrical model, which has to be reduced so it can be simulated.”

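Under the hood, static IR-drop analysis reduces to solving a very large, very sparse linear system: the grid becomes a resistor network, and the nodal equations G·v = i yield the voltage at every node. The toy mesh below, with assumed segment resistances, per-node currents, and corner “bump” connections, shows the structure of that computation at a scale of thousands of nodes rather than billions.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Tiny stand-in for static IR-drop analysis: model the grid as a resistor mesh
# and solve the sparse nodal system G * v = i for the voltage at every node.
# Real grids have billions of nodes; this toy mesh only shows the structure.
n = 50            # n x n grid of nodes (assumed)
r_seg = 0.2       # resistance of each grid segment, ohms (assumed)
i_cell = 2e-5     # current sunk at every node, amps (assumed)
vdd = 0.75        # supply voltage at the "bumps" (V)

g = 1.0 / r_seg
n_nodes = n * n

def idx(x, y):
    return x * n + y

rows, cols, vals = [], [], []
rhs = np.full(n_nodes, -i_cell)
for x in range(n):
    for y in range(n):
        k = idx(x, y)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            xx, yy = x + dx, y + dy
            if 0 <= xx < n and 0 <= yy < n:
                rows += [k, k]
                cols += [k, idx(xx, yy)]
                vals += [g, -g]

# Tie the four corners to vdd through one segment resistance each ("bumps").
for k in (idx(0, 0), idx(0, n - 1), idx(n - 1, 0), idx(n - 1, n - 1)):
    rows.append(k)
    cols.append(k)
    vals.append(g)
    rhs[k] += g * vdd

G = sp.csr_matrix((vals, (rows, cols)), shape=(n_nodes, n_nodes))
v = spla.spsolve(G, rhs)
print(f"Voltage map ranges from {v.min():.3f} V to {v.max():.3f} V (VDD = {vdd} V)")
```
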
Advances in EDA simulators make that kind of analysis possible. They can simulate these designs to produce a point-by-point map of exactly where the current is going and what the voltage is at every node.

“Electromigration comes along with this, and because electromigration is a reliability issue, you have to know the current flowing through all the wires,” Swinnen noted. “We just calculated that from the voltage drops, so we might as well do electromigration, too. And since electromigration is highly temperature-dependent, thermal-aware electromigration analysis is also needed.”

All of that typically is defined before place-and-route. “The placement assigns rows that are empty, and that’s where the cells are going to go,” Swinnen explained. “You put the power supply in, and then place the cells and the rest. The cells don’t just go anywhere. They fit on the structure that’s there. At the planning stage, since the design isn’t placed and routed yet, you analyze based on predictions. And at that point, it is an optimization problem for the whole chip. We’re finding there are lots of degrees of freedom, like how wide do I make my wires? How many straps? What’s the pitch between the straps? You can experiment with lots of aspects.”

AI/ML-driven optimization tools can be helpful here. They can take a number of variables, do some modeling on them, and produce a mathematical model of the impact of each one. Then, a Monte Carlo simulation can be run across all possible combinations of those variables to identify their sensitivities. For example, what is the sensitivity of the strap pitch versus the width of the wire versus the size and number of vias? Such tools can take a very complex, multi-dimensional optimization problem and crunch it into an optimal solution.

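A hedged sketch of that idea: sample the design knobs (strap width, strap pitch, via count), evaluate a surrogate drop model, and rank how strongly each knob correlates with the result. The surrogate formula and parameter ranges below are invented for illustration, not taken from any EDA tool.

```python
import numpy as np

# Monte Carlo sensitivity sketch over power-grid design knobs. The surrogate
# drop model and parameter ranges below are invented for illustration only.
rng = np.random.default_rng(7)
n_trials = 20_000

width  = rng.uniform(0.5, 2.0, n_trials)     # strap width, um (assumed range)
pitch  = rng.uniform(5.0, 30.0, n_trials)    # strap pitch, um (assumed range)
n_vias = rng.integers(1, 8, n_trials)        # vias per strap crossing (assumed)

# Surrogate: drop grows with pitch, shrinks with width and via count (assumption),
# plus noise standing in for everything the surrogate ignores.
drop_mv = (40.0 * (pitch / 20.0) / (width * np.sqrt(n_vias))
           + rng.normal(0.0, 1.0, n_trials))

for name, knob in (("width", width), ("pitch", pitch), ("vias", n_vias)):
    corr = np.corrcoef(knob, drop_mv)[0, 1]
    print(f"{name:6s} correlation with drop: {corr:+.2f}")
```
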
However, things are not always smooth sailing. “Let’s say you’ve done place-and-route and you’ve been refining your timing,” Swinnen said. “You’ve done a lot of work to get your timing to close. You do a voltage drop check, and you find some voltage drop issues. You need to fix them, but the fix is often very disruptive to your timing, so it’s always been difficult. How do I fix IR drop without disrupting my timing, given that it comes late in the flow? The tendency has been to avoid IR drop by paying the price of over-dimensioning the power supply. But power supply analysis depends on the activity of the circuit. If there’s nothing switching, there’s no power being drawn. So just as in power analysis, activity is central to this. Where do you get your activity from? We have the same issues as with power analysis. So we have vectored activity, where the user can provide a list of vectors, or there is a vector-less approach, where we calculate it ourselves under the hood.”

3D-ICs muddy the waters, and designers must be aware of through-silicon vias (TSVs), which go from the back to the front of the chip and take up significant real estate.

“You cannot do place-and-route or place macros where TSVs are,” said Kabir. “And TSVs are what carry the power from the back of the chip to the front. This means if I haven’t planned my power design along with my placement, I might not have room to do placement and routing later on. Then somebody will come back later and say, ‘I have to punch TSVs through your macro,’ and that’s not going to work.”

In addition to static analysis, you also have to worry about dynamic and switching power, he said. “Something on the PCB actually may end up frying your chip. And if you don’t take that into account and do a system-level power integrity and signal integrity analysis, the chances are good that your system may fail.”

Finally, all of this modeling has to be done earlier in the design flow. “Previously, you would design the chip, and this power distribution check was typically what we call sign-off, right before you tape out the chip. You make sure everything’s fine, and you get a few errors and fix them,” Chaudhry said. “Now, because this problem is becoming so localized, if you don’t solve this issue really early on as part of the design, you could end up with tons of errors near the tape-out date, and you won’t have time to fix them.”

This is another area where the EDA ecosystem is innovating as part of the implementation tool flow, incorporating more power distribution analysis as part of the implementation process.

The same is true for 3D-ICs. Early power modeling and analysis are essential, because once the stack of die is set, it typically can’t be changed without a complete redesign.


