Managing P/P Tradeoffs With Voltage Droop Gets Trickier

Higher current densities set against lower power envelopes make meeting specs more challenging, especially at advanced nodes.

Experts at the Table: Semiconductor Engineering sat down to talk about voltage droop/IR drop with Bill Mullen, distinguished engineer at Ansys; Rajat Chaudhry, product management group director at Cadence; Heidi Barnes, senior applications engineer at Keysight Technologies; Venkatesh Santhanagopalan, product manager at Movellus; Joe Davis, senior director for Calibre interfaces and mPower EM/IR product management at Siemens EDA; and Karthik Srinivasan, R&D director in Synopsys' EDA Group. What follows are excerpts of that conversation. Part two of this discussion is here. Part three is here.

SE: Why is voltage droop, or IR drop, such a significant issue today?

Davis: With technology scaling, resistances and capacitances are going up. You hope they’d go in the opposite direction, but they tend not to. Plus, our ability with transistors to drive those RCs is being challenged. And as you see the trend in advanced technologies, if you’re going to make that investment of hundreds of millions of dollars for an advanced node technology chip, you’re going to make it as big as possible. Chips at 3nm tend to be really, really large. You’re driving lots and lots of things across the chip, and it just becomes a massive scaling issue. That means voltage drop is important for both performance and reliability, as well as just making sure that our chips work. If you need your chip to work at 3 GHz, 2.5 GHz doesn’t cut it. We need to make sure the electrophysical, the as-implemented electrical performance, meets the specs that we are designing to. It’s simply more of what we’ve always had as we’ve moved this challenge along.

Santhanagopalan: With the move to advanced process nodes, another thing we've noticed is that the current densities get higher and higher at each of these nodes. And as we talk about different applications, there's always a power envelope that keeps getting pushed down, so you have these opposing forces. The current densities are getting higher, and the shrinking power envelope can't absorb those transients. That causes a lot of these EM/IR and voltage droop issues. The other part of it is that these workloads also have been changing over time, and the type of data patterns you have in each workload, as well as the dynamic variations of each of those workloads, add complexity.

Chaudhry: I'll second what the previous two people have said, and add another concern. Since resistance has increased so much at the lower levels, the issue is becoming very localized. As a result, there are a lot of dynamic noise issues due to simultaneous switching of cells, or some big cells being aggressors. What's happening is that's really expanded the coverage space. It used to be that you could get by with high-power vectors, or vectors with large di/dt. But now it's becoming a localized problem, especially on the advanced nodes, N5 and below. Covering all the various failure mechanisms in that space has exploded exponentially, so that's a real problem. Then, of course, the power network size has increased a lot, so we need to have tools to simulate that.

SE: What does this mean for meeting design goals?

Srinivasan: When we are talking about successful design, we are talking about power, performance, and area. To meet the power budget, you scale the voltage. To meet your target performance, you switch to a newer node. FinFETs and gate-all-around give you the capability to drive more current through your interconnects, which also means they're going to suck more current through the power grid network. Then, in order to reduce your power envelope, you have to lower the supply voltage. What used to be 1.1 or 1.2 volts in the good old days is now going down to 0.6 or 0.55 volts, which is very near the threshold voltage. That's one of the main reasons IR drop is becoming a key care-about. The industry is actually looking at this as one of the key things, so behind power, performance, and area, IR drop is the big thing people are worrying about.
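
As a rough, hedged illustration of the margin squeeze Srinivasan describes (the supply, threshold, and drop values below are hypothetical, not tied to any particular node), the same absolute IR drop consumes a far larger share of the usable headroom once the supply approaches the threshold voltage:

```python
# Hypothetical numbers only: the same 50 mV of IR drop matters far more
# at a scaled supply, because the headroom above Vth shrinks faster
# than the supply itself.

def ir_drop_impact(vdd, vth, drop):
    """Return the drop as a % of the supply and as a % of (Vdd - Vth) headroom."""
    return 100 * drop / vdd, 100 * drop / (vdd - vth)

for vdd in (1.1, 0.75, 0.55):
    pct_supply, pct_headroom = ir_drop_impact(vdd, vth=0.35, drop=0.05)
    print(f"Vdd={vdd:.2f} V: 50 mV drop = {pct_supply:4.1f}% of supply, "
          f"{pct_headroom:5.1f}% of (Vdd - Vth) headroom")
```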

Barnes: Going back to how we're fighting the larger capacitance and larger resistance, in the power delivery world I like capacitance because it's charge storage. But if you increase the capacitance, your load requires more power to switch that transistor. You see very quickly there's a conflict of design goals here. You're trying to reduce the capacitance in your design technology to get faster switching of your devices and require less power. But at the same time, you need really high capacitance to deliver that larger demand for the di/dt, and to get that charge closer to the load with less inductance in the path. You see very quickly that you need a process technology that can integrate a high-capacitance structure for Vdd decoupling to provide that charge storage, as well as low capacitance on the transistors to reduce the power of the load. What's interesting is that the same problem you have on the IC shows up on the package, and the same sort of thing on the printed circuit boards. As you increase the density, your die is getting much larger. It's simple transmission line theory. You're starting to feel the impacts of the path inductance (the impedance) between the power delivery and the load. Before, we could use a lot of tools that were lumped element-type simulations. But now we're starting to see that, especially in the package world, there's a lot of EM simulation going on. It's not just trying to do a SPICE-level RLC-type model. We're probably going to see more of that moving toward the die. These AI chips and giant semiconductor structures are seeing a lot more of these parasitic issues of distributed power trying to use the same power rail over a large area relative to the di/dt switching speed. Then there's also the fact that your margins are shrinking, and your power rail voltage is much lower. Your allowed ripple, measured in millivolts or even microvolts, is getting smaller and smaller.
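
A common first-order way to put numbers on what Barnes describes is the target-impedance calculation, where the allowed ripple divided by the worst-case transient current sets how low the PDN impedance has to be, plus an L·di/dt estimate for the inductive part of the droop. The sketch below uses illustrative values only, not figures from any real design:

```python
# First-order PDN sizing sketch with illustrative numbers (not a real design).

vdd = 0.75            # supply voltage (V)
ripple_frac = 0.05    # allowed ripple: 5% of the supply
i_transient = 20.0    # worst-case step in load current (A)

# Target impedance: the PDN must stay below this across the frequencies
# where the load can demand current, or the ripple budget is exceeded.
z_target = vdd * ripple_frac / i_transient
print(f"Target impedance: {z_target * 1e3:.2f} mOhm")

# Inductive droop from a current step: V = L * di/dt
l_path = 50e-12       # effective loop inductance to the load (H)
dt = 20e-9            # how fast the current step ramps (s)
v_droop = l_path * i_transient / dt
print(f"L*di/dt droop for a {i_transient:.0f} A step in {dt * 1e9:.0f} ns: "
      f"{v_droop * 1e3:.1f} mV")
```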

Mullen: I'd like to echo what was said about simultaneous switching. The activity on neighboring instances has a huge effect on the voltage drop of every victim. This means you need different techniques to analyze that and make sure you've covered it. In addition, what people care about is the impact on timing. It's not just how much voltage drop an instance's PG pins see. It's whether it causes a timing failure. In this way, the link to static timing is critical. Then, with technology trends, the complexity is immense. You now have billions of instances on a single die, and the introduction of 3D-IC systems makes power delivery a much larger and much more complex problem. We've seen cases where the Vdd signals come in through the interposer, go through some dies to a power gate, and go through some other dies back to the package. It's incredibly complex, and having the capacity and performance to handle these kinds of systems is critical.
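
One standard way to reason about the timing link Mullen points to is the alpha-power-law delay model, in which gate delay scales roughly as Vdd / (Vdd - Vth)^alpha. The sketch below uses assumed parameters, not values from any real cell library, to show why the same droop costs far more timing at a near-threshold supply:

```python
# Illustrative delay sensitivity using the alpha-power-law model:
# delay ~ Vdd / (Vdd - Vth)**alpha. Vth and alpha are assumptions,
# not values from any characterized library.

def relative_delay(vdd, vth=0.35, alpha=1.3):
    return vdd / (vdd - vth) ** alpha

for vdd_nom in (1.1, 0.55):
    droop = 0.05  # 50 mV of droop seen at the instance
    slowdown = relative_delay(vdd_nom - droop) / relative_delay(vdd_nom) - 1
    print(f"Vdd={vdd_nom:.2f} V: a 50 mV droop slows the gate by "
          f"{100 * slowdown:.1f}%")
```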

SE: When it comes to the ways to mitigate voltage droop/IR drop, do we have an agreed-upon methodology to approach it today?

Mullen: You want to avoid it as much as possible. You want to design a good power delivery network that's robust and will handle different placements. You want to give early feedback to people about potential problems. We have static checks and other techniques that provide insight into the quality of the power delivery network, independent of the placement. You want to tie closely with early place-and-route iterations to give feedback as early as possible, so that you're not trying to fix things at the last minute. And then there's also the need to have high-quality analysis that feeds into an ECO and repair process. So at every phase of design, power integrity feedback to the user with actionable information is critical.

Chaudhry: We need to shift left more to bring power integrity into the implementation phase earlier, and have a strategy where we avoid the IR drop problems rather than fix them at the end. Customers on advanced nodes have so many violations at the sign-off stage that they're basically waiving most of them, or they're only fixing the ones on the critical paths. It's becoming an unmanageable problem at the sign-off stage. The EM/IR industry has focused a lot more on analysis and not so much on early mitigation. We really need to change our strategy on that and provide a solution where we try to avoid the problem in the first place. We also need to start making it like timing, in the sense that we want to bound the problem because we're not going to cover every possible scenario. So, just like static timing, we need to start bounding this problem, and start really early in the implementation phase.

SE: How do you go about bounding the problem of IR drop?

Chaudhry: We will have to use some statistical techniques. Because the problem is becoming more local, it’s getting a bit easier to bound it. You can do a lot of local simulation and try to bound it from that perspective. That’s what we have to do, because otherwise it’s just becoming unmanageable.
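
One way to picture the kind of statistical bounding Chaudhry mentions, in a deliberately simplified form that is not any vendor's actual algorithm, is to sample which cells in a small local region switch together, convert that activity to a droop through an effective local grid resistance, and take a high percentile as the bound:

```python
# Simplified, hypothetical sketch of statistically bounding local droop.
# All parameters are assumptions for illustration.
import random

random.seed(0)
n_neighbors = 200      # cells sharing the local grid region
p_switch = 0.25        # probability a given cell switches in the window
i_cell = 0.4e-3        # peak current per switching cell (A)
r_eff = 0.8            # effective local grid resistance (ohm)

samples = []
for _ in range(20_000):
    switching = sum(random.random() < p_switch for _ in range(n_neighbors))
    samples.append(switching * i_cell * r_eff)

samples.sort()
bound = samples[int(0.999 * len(samples))]
print(f"99.9th-percentile local droop bound: {bound * 1e3:.1f} mV")
```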

Santhanagopalan: From the perspective of voltage droop, if we think about the di/dt type of events, we've seen designers try to margin workloads for the worst case they might see. This, in turn, affects long-term reliability or performance. The problem is that worst-case workloads typically don't happen all the time, so you're giving up that performance for the rest of the period, and there is some margin that's lost. We've also seen architectural ways of handling or mitigating this, mostly by ensuring that the architectural events known to cause large di/dt don't occur. What designers are trying to do is mitigate the problems by essentially limiting or bounding the performance characteristics. But even then, some of the dynamic workloads may not be well-known, especially because voltage droop is a system-level problem. You have the architectural portion, you have the on-die optimization, but then you also have the packaging, and you have to bring all of them together. Getting all of these to come together in a closed-loop system very early in the process is really hard. While there are different ways and methodologies to make a lot of these things come together, it's still a really complex system-level problem that designers are trying to understand.
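
The cost of the worst-case margining Santhanagopalan describes can be seen with simple arithmetic. The numbers below are hypothetical, chosen only to show that a guard-band sized for a rare droop event is paid on every cycle, not just during the event:

```python
# Hypothetical arithmetic: guard-banding the clock for a rare worst-case
# droop event gives up frequency during normal operation as well.

f_unmargined = 3.0      # GHz achievable when droop is nominal
droop_penalty = 0.12    # fraction of frequency reserved for worst-case droop
event_fraction = 0.01   # fraction of time the worst-case event actually occurs

f_margined = f_unmargined * (1 - droop_penalty)
lost_mhz = droop_penalty * f_unmargined * 1e3
print(f"Guard-banded clock: {f_margined:.2f} GHz; {lost_mhz:.0f} MHz of headroom "
      f"is given up even though the worst-case event occurs only "
      f"{100 * event_fraction:.0f}% of the time")
```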

Davis: We've been primarily talking about digital here, and as was said, architectural and structural implementation strategies are key. The challenging piece is that to make things high-performance, you want to put them close together. Putting them close together causes localized IR drop, so you've got two forces here that are working against each other. Again, the architectural challenge is to partition so you can get locally good performance and communicate across those modules. That's really an architectural challenge, as well as an implementation challenge. Going back and forth with the models, you've got to have some indication of the impact your architectural changes are going to have on IR drop in the implementation. That's really not a well-established practice today. Designers always want to do IR drop analysis as early as possible, even before they have any implementation. What is the model that you have for that? That's one of the challenges today. There are structural pieces, like backside power delivery, which almost every foundry now offers. There are other aspects. 3D is another way of looking at that problem, partitioning things onto different die so you're not transmitting across as large a distance. But again, that creates another analysis problem across those 3D-ICs. Designers have a tremendous number of levers to use to address this. Where we are challenged as an industry is in having good predictive capability early in the design flow, because you don't have implementation data. Where are those models going to come from? Then there's the analog side, where it's really a scale issue. SPICE is wonderfully accurate for a given model, and you can even do statistical analysis and get statistics out of it, but it runs out of steam at scale because of capacity. You need a way to do verification and simulation at scale for analog, because the precision requirements are much higher.



