Pain vs. gain in optimizing the power delivery network in complex chips and packages.
The power delivery network (PDN) is a necessary overhead that typically remains in the background — until it fails.
For chip design teams, the big question is how close to the edge are they willing to push it? Or put differently, is the gain worth the pain? This question is being scrutinized in very small geometry designs, where margins can make a significant difference in device performance.
The design of the PDN for a semiconductor device may not be the sexiest job in the design community. Someone has to stitch up the power and ground rail to pretty much every transistor in that device. Miss one, or fail to correctly estimate the current that needs to flow along any particular wire segment of that network, and the device will fail — sometimes spectacularly, other times in much more subtle ways.
Like verification and other such tasks, the PDN adds nothing to the functionality of the device (although it does enable it). However, a weakness in the associated methodology can remove functionality. Tasks that are seen as overhead are funded only to the level that risk is reduced to a manageable threshold. The tricky part is that setting the bar too low on the PDN, which consumes resources in the device, can make your offering non-competitive.
It is easy to add excessive margins early in the flow because your analysis is less than complete, but that is wasteful and causes problems further down the line. In addition, floor-planning could make a significant difference in the amount of overhead that the PDN will consume, but that requires early analysis on less than complete data.
Today, design teams are looking to reduce waste wherever they can. At the same time, the complexity of the analysis requires increasing amounts of processing power and knowing intimate details about how the device is going to be used in the field, something which is not always possible. Temperature affects power, which affects temperature, which affects timing, which affects aging. They are all entangled with each other.
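The power and temperature feedback described above can be illustrated with a small fixed-point iteration. This is a minimal sketch, not a sign-off method. It assumes a single lumped thermal resistance and leakage that doubles over a fixed temperature interval. The function name, the parameters, and every number are hypothetical.

```python
def solve_operating_point(p_dynamic_w, p_leak_ref_w, t_ambient_c, r_thermal_c_per_w,
                          t_ref_c=25.0, leak_doubling_c=25.0,
                          tol_c=1e-6, max_iter=1000):
    """Iterate temperature -> leakage -> total power -> temperature until it settles.

    Assumed model (illustrative only):
      leakage(T) = p_leak_ref_w * 2 ** ((T - t_ref_c) / leak_doubling_c)
      T          = t_ambient_c + r_thermal_c_per_w * total_power
    """
    temp_c = t_ambient_c
    for _ in range(max_iter):
        p_leak = p_leak_ref_w * 2.0 ** ((temp_c - t_ref_c) / leak_doubling_c)
        p_total = p_dynamic_w + p_leak
        new_temp_c = t_ambient_c + r_thermal_c_per_w * p_total
        if abs(new_temp_c - temp_c) < tol_c:
            return new_temp_c, p_total
        temp_c = new_temp_c
    # The loop did not settle within max_iter steps.
    raise RuntimeError("no convergence: possible thermal runaway")


# Hypothetical block: 2 W dynamic, 0.5 W leakage at 25 C, 10 C/W to a 25 C ambient.
temp_c, power_w = solve_operating_point(2.0, 0.5, 25.0, 10.0)
print(f"settled at {temp_c:.1f} C drawing {power_w:.2f} W")
```

A real flow would add timing and aging to the same loop, which is why the analysis becomes so expensive.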
How much risk are you willing to accept?
There are several things that contribute to this question. “Modern compute-systems exhibit peak currents that have been steadily trending upwards due to reliance upon specialized execution and reduced efficiency gains through technology-scaling,” says Shidhartha Das, distinguished engineer at Arm. “Rising peak currents have performance implications, as increasing voltage droop due to resistive grids and inductive package parasitics need design-time margins. Similarly, reliability metrics are also compromised due to electromigration concerns. This makes the design of PDNs a formidable challenge in modern compute-systems.”
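To a first order, the two droop contributors Das mentions, the resistive grid and the inductive package parasitics, are often approximated as V_droop ≈ I·R + L·dI/dt. The sketch below evaluates that approximation. The function name and all component values are hypothetical and do not come from any real design.

```python
def estimate_droop_v(i_peak_a, r_grid_ohm, l_package_h, di_a, dt_s):
    """First-order supply droop: resistive (I*R) term plus inductive (L*dI/dt) term."""
    resistive_v = i_peak_a * r_grid_ohm
    inductive_v = l_package_h * (di_a / dt_s)
    return resistive_v + inductive_v


# Hypothetical values: 10 A peak current, 1 mOhm effective grid resistance,
# 10 pH effective package inductance, and a 5 A current step over 1 ns.
droop_v = estimate_droop_v(i_peak_a=10.0, r_grid_ohm=1e-3,
                           l_package_h=10e-12, di_a=5.0, dt_s=1e-9)
print(f"estimated droop: {droop_v * 1000:.1f} mV")
```

Under these assumed numbers the inductive term is larger than the resistive one, which is consistent with the emphasis on fast current steps in the quote.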
Another issue is shrinking voltages. “The days when you had 1.3 volts, where you could afford to lose a few millivolts on margins are long gone,” says Marc Swinnen, director of product marketing at Ansys. “We’re down at fractions of volts, and you really can’t afford to waste it on margin just because you can’t analyze things properly.”
The problem is that analysis is becoming more complicated with every node shrink. “The PDN is an area where electrical and mechanical requirements work against each other,” says John Parry, strategic business development manager for Siemens EDA. “In the package substrate, the stiffness of copper and its CTE mismatch with other packaging materials presents challenges. Power delivery within the die surface metallization, normally aluminum, increases as the current draw increases. In addition, the wires closest to the transistors shrink at each process node, with the thinner wires offering more electrical resistance to the current, and so dissipate more heat. Getting the required amount of current into and out of a package and to where it is needed on the die surface, with an acceptable voltage drop, is now an extremely tough issue and one that will get worse if the supply voltage is further reduced.”
Setting the goal
Establishing the goal for the PDN is the first step. “Whenever you’re doing a PDN analysis, you should establish an IR drop threshold,” says Mallik Vusirikala, director, product specialist at Ansys. “You use that IR drop threshold in the voltage-to-timing sign-off. For example, if you set an 8% voltage drop threshold, you can analyze that and make sure that you are passing that threshold. It is a fixed IR drop margin that you pass into your timing.”
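A fixed threshold check of the kind Vusirikala describes could look like the sketch below. The 8% figure comes from the quote. The instance names, the nominal voltage, and the per-instance voltages are hypothetical, and a real tool would read them from an IR drop analysis report.

```python
def ir_drop_violations(nominal_v, instance_voltages, threshold_fraction=0.08):
    """Return the instances whose supply voltage falls below the allowed minimum."""
    min_allowed_v = nominal_v * (1.0 - threshold_fraction)
    return {name: v for name, v in instance_voltages.items() if v < min_allowed_v}


# Hypothetical analysis results for a 0.75 V rail (allowed minimum: 0.69 V).
voltages = {"u_core/alu": 0.70, "u_core/fpu": 0.68, "u_mem/ctrl": 0.72}
violations = ir_drop_violations(0.75, voltages)
print(violations)
```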
Every company sets a different goal. “There is constant pressure to minimize the overall power consumption while enhancing the performance per watt,” says Ravindra Gali, senior staff product marketing engineer at Xilinx. “At Xilinx, our PDN target for core power rails has historically been specified at 3% of the nominal voltage. However, the nominal voltage for the core power rail has been decreasing from one process node to the next (28nm – 1V; 20nm/16nm – 0.85V; 7nm – 0.72V), thereby reducing the actual PDN margin in millivolts.”
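The arithmetic behind Gali's point can be worked through directly. A constant 3% target applied to the nominal voltages he quotes gives a shrinking budget in millivolts. The variable names are illustrative.

```python
# Nominal core-rail voltages per node, as quoted above.
nominal_v = {"28nm": 1.00, "20nm/16nm": 0.85, "7nm": 0.72}
PDN_TARGET_FRACTION = 0.03  # 3% of nominal

margin_mv = {node: round(v * PDN_TARGET_FRACTION * 1000, 1)
             for node, v in nominal_v.items()}

for node, mv in margin_mv.items():
    print(f"{node}: {mv:.1f} mV")
# 28nm: 30.0 mV
# 20nm/16nm: 25.5 mV
# 7nm: 21.6 mV
```

The percentage is unchanged, but the absolute budget drops from 30 mV to 21.6 mV, a reduction of 28%.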
Even without a goal adjustment, the target becomes more difficult to reach with each node. “Customers don’t reduce the margins as they go from node to node,” says Ansys’ Vusirikala. “One reason is the complexity of IR drop analysis. With millions of instances switching, you either do a vector-based analysis or a vectorless analysis. You’re trying to analyze just a few scenarios in the power grid. The lack of coverage in the analysis is a significant showstopper for reduction of the margins. They do not have the confidence that there might be some other scenario which hasn’t been analyzed, which could happen during deployment.”
Failure to identify those scenarios can have dramatic consequences. “The problem with worst case is that it’s hard to determine exactly what the worst case is,” says Ansys’ Swinnen. “Which combination of switching elements will impact the power distribution network the most? It used to be that static voltage drop analysis was dominant, meaning that you don’t consider activity, or you just consider average activity. Dynamic voltage drop looks at activity that is going to cause a local dip in voltage. That has become much more dominant.”
Selecting those critical scenarios is a crucial part of the methodology. “Our on-chip PDN design methodology takes into account various use cases across different applications,” says Xilinx’s Gali. “Then we come up with representative current profiles to mimic the worst-case conditions. We go through extensive simulation analysis and hardware validation to ensure our power integrity (PI) solution has enough margin against the various worst case current profiles.”
Dynamic analysis is becoming widely adopted. “Dynamic margin optimization is one of the design techniques that has shown considerable promise in recent years,” says Arm’s Das. “Instead of relying upon statically-budgeted margins at design-time, runtime optimization of margins achieves performance objectives while still ensuring robust execution. This is particularly evident in the management of large-magnitude inductive droops, where techniques such as AC noise compensation, integrated voltage-regulation, and adaptive-clocking are increasingly being deployed in high-end systems. While these techniques have a demonstrable impact on system efficiency, they still need to be signed off for robust operation, thus complicating post-silicon test and tuning. Consequently, there has been a considerable investment in product engineering and industrial research to ease the barrier to the deployment of these margin optimization techniques in high-volume manufacturing.”
Some of the factors also have become layout-dependent. “When an aggressor switches, it can cause its neighbors to suffer a voltage drop,” adds Swinnen. “The hard part is discovering which set of aggressors to consider. But you still can’t assume that they all switch together. That would be way too pessimistic, and you would never get a chip out the door that way. So you need some way of confidently knowing that you’ve found the worst case.”
Early planning
PDN design used to be a back-end task, often left until after the design was complete, but that is no longer the case. “When teams do floor planning, they do comparative parametric studies,” says Melika Roshandell, product marketing director at Cadence Design Systems. “When comparing design alternatives A and B, they may see a 3% improvement in some design criteria. Then they start evaluating them for other factors, such as thermal or area. The PDN team will determine if they can accommodate the better design alternative. The accuracy of power estimates used is not that important if you can do parametric studies.”
Early planning is even more important when 3D systems are being considered. “The emergence of 3D integrated designs creates conflicting constraints between efficient power-delivery and thermal-management,” says Arm’s Das. “Integrated regulation can provide better control of voltage-transients in 3D stacks, although at the expense of creating potential hotspots around power-FETs. Careful floorplanning of power-FETs and accurate runtime power-introspection are some of the key tools that need to be developed for high-performance system-designs that are increasingly limited by power-delivery and thermal concerns.”
There are additional concerns with designs that utilize multiple voltages or multiple dies. Power domains may not be independent. They may share the same ground pin, and that creates coupling between them. “You have to look at the whole thing, starting from the voltage regulator on the PCB that goes to the package,” says Gary Yeap, senior R&D manager for the Design Group at Synopsys. “Then you look at the package and you look at the stacking, you look at the interposer or the base or the active interposer, and then you look at the whole thing. You need to analyze the whole thing. You cannot analyze a single die, and then assume an ideal power supply coming from the next die as you go through the die stack. You will have to analyze the whole system. There is a lot of emphasis on multi-die analysis tools, and optimization tools, and that includes signal, that includes power. I need to know the IR drop going through these multi-die stacks. I need to know the IR drop of every individual chip on the silicon interposer, including the silicon interposer itself.”
Considering aging
In some industries, other factors have to be taken into account. “When transistors age, the drive is not as high compared to a freshly fabricated transistor,” says Vusirikala. “The power supply noise may not change, but the impact on timing is much higher when you talk of aged transistors.”
That information is baked into the libraries. “Typically, the way aging is accounted for is by characterizing the standard cells with models which are aged models,” adds Vusirikala. “These are known as end-of-life models. There are SPICE models for regular mode of operation, and SPICE models for aged transistors. You can characterize the cell for delays with the aged model. That is how aging is accounted for in timing margins, but it is different from IR drop impact on timing.”
This is particularly important to understand with devices being used for extended lifetimes of a decade or more. “Aging impacts voltage drops, but the bigger problem is timing,” says Swinnen. “It increases the number of simulations you have to do and increases the number of libraries you have to drag around. But the more fundamental problem is that aging is activity-dependent. If a part of a chip is very active, but other parts are rarely on, they age at different rates. The timing of these transistors will not all age together. One part of the chip will be younger than the other, which means if you have a path that crosses from one to the other, you’re going to see much more skew on the clock. The younger part of the chip is still delivering the clock very quickly, but the older part is a bit slower. The effective skew on your clock is increased, which means that your timing can either fail or you have to take that into account. The industry hasn’t quite solved that problem yet.”
Some of this can be compensated for by using appropriate voltage monitors in the design. “From a purely technical point of view, you can have ring oscillators measuring the voltage performance, you have LDOs, which can be controlled by measuring the performance of ring oscillators,” says Vusirikala. “If you reduce your voltage, your throughput goes down. This is a way to deal with yield.”
But there are limitations. “Voltage monitoring is more suitable for static voltage drop,” says Swinnen. “With dynamic voltage drop, you’re talking about a very short transient dip that is very local. It’s hard to put chip-wide controllers to control what’s happening around a handful of gates over these fractions of a nanosecond. As the gates become smaller, the switching speeds are higher. As we go to new technologies, the resistivity of the power distribution network increases. That means, from the point of view of a gate looking at its supply wire, its supply pin is far away electrically, with lots of resistance between the local transistor and the supply pin. When the local power dips, current can rush in from the distance, but it has to go through so much resistance, and it takes a while to get there.”
Considering safety
Another requirement for certain markets is safety. “There are additional monitoring circuits that come into play for safety critical applications,” says Gali. “These add additional design complexity and margin requirements that need to be factored in as part of the PDN design methodology.”
Does the PDN represent a single point of failure for a chip? “Unlike signal nets, power nets actually have a lot of redundancy,” says Vusirikala. “Even though you call it a single net, the number of geometries of the power net are humongous. Think of a simple cell which is sitting on a metal 1 rail. You have vias on either end. That is redundancy. But for signals, there is only one driver to one receiver. The impact of a single signal net failure is much higher than power net.”
Another problem is electromigration. “Electromigration is temperature dependent,” says Swinnen. “The higher the temperature, the more mobile the atoms become, and the worse the electromigration becomes. When you calculate the lifespan, it has to be temperature informed.”
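The temperature dependence Swinnen describes is commonly modeled with Black's equation, MTTF = A · J^(-n) · exp(Ea / (k·T)). The sketch below uses it to compare lifetime at two temperatures for the same current density, so the A and J terms cancel. The activation energy of 0.9 eV is an assumed, illustrative value. Real values depend on the metal and the process, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def em_lifetime_ratio(t_ref_c, t_hot_c, ea_ev=0.9):
    """Ratio MTTF(t_hot) / MTTF(t_ref) from Black's equation at equal current density.

    ea_ev is an assumed activation energy, for illustration only.
    """
    t_ref_k = t_ref_c + 273.15
    t_hot_k = t_hot_c + 273.15
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_hot_k - 1.0 / t_ref_k)
    return math.exp(exponent)


# How much shorter is the projected lifetime at 125 C than at 105 C?
ratio = em_lifetime_ratio(t_ref_c=105.0, t_hot_c=125.0)
print(f"lifetime at 125 C is {ratio:.2f}x the lifetime at 105 C")
```

Because the temperature sits inside an exponential, a modest error in the assumed operating temperature changes the projected lifespan considerably, which is why the calculation has to be temperature informed.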
“Electromigration is a really interesting problem because you cannot test for it,” adds Swinnen. “You can test until you are blue in the face, and the chip works fine. But six months down the road it’s going to fail because of a creeping electromigration failure. There’s nothing you can do in the test lab to protect from that. The only way to protect yourself from electromigration failure is to design it in. It’s a pure design issue and cannot be tested or detected until it happens.”
This is primarily an issue for power nets. “The currents that go through power nets are higher and unidirectional, but these metal wire segments are much wider than signal nets,” says Vusirikala. “Signal nets are always routed at minimum width, but power nets use non-default rules, and as a result they are thicker and wider. When you do power supply analysis for voltage drop, you also do analysis for electromigration. There are statistical ways of doing an electromigration check. Instead of threshold-based checks, you try to see the probability of failure of each and every wire segment. From that you can calculate the probability of failure for the whole chip.”
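One simple way to roll per-segment probabilities up to a chip-level figure is sketched below. It assumes segment failures are independent and that any single segment failure is fatal, which is pessimistic given the redundancy in power nets described earlier. This is an illustrative model, not the method any particular tool uses, and the function name is hypothetical.

```python
import math


def chip_failure_probability(segment_failure_probs):
    """P(chip fails) = 1 - product(1 - p_i), assuming independent, non-redundant segments.

    Computed in log space so that very small per-segment probabilities
    are not lost to floating-point rounding.
    """
    log_survival = sum(math.log1p(-p) for p in segment_failure_probs)
    return -math.expm1(log_survival)


# Hypothetical chip: one million wire segments, each with a 1e-9 failure probability.
p_chip = chip_failure_probability([1e-9] * 1_000_000)
print(f"chip-level failure probability: {p_chip:.6f}")
```

Even a tiny per-segment probability adds up across millions of segments, which is the motivation for a statistical check rather than a single pass/fail threshold.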
Conclusion
The power delivery network may not be glamorous, but the task of building it is becoming a lot more difficult. There is scant room for large margins, meaning the only way to do the analysis necessary for optimization is by using more detailed models, and considering an increasing array of inter-connected factors. That takes time and lots of compute power.
There is certainly something to be gained from this work, but there is a lot to be lost from skimping.