Managing Voltage Variation

How to regain lost power and performance, and what to watch out for when you do.

Engineers make many tradeoffs when designing SoCs to meet design specifications. Power, performance, and area (PPA) are the primary goals, and all three affect the cost of the implementation. For example, higher power and higher performance can both require more expensive packaging for power and signal integrity, as well as for cooling. The larger the die area, the fewer die per wafer, which drives up cost. Using more metal layers increases wafer cost, but that is sometimes recovered through a smaller die. All in all, it’s complicated. One solution doesn’t work for all circumstances, and the same is true when looking to manage voltage variation.

Designing a power distribution network (PDN) involves investigating the distribution of current from the board level, where power management ICs (PMICs) operate, through delivery to the chip, and down to the power routing that feeds the cells containing the actual transistors. If the current demand is too high, the voltage will drop. It’s entirely possible for a usable voltage to appear across a cell’s VDD and VSS pins one moment and then, due to activity either nearby or elsewhere on the chip, for that voltage to drop and fluctuate.

At the 60th Design Automation Conference, recently held in San Francisco, Shane Stelmach gave a presentation, “System Power Integrity Analysis: from the PMIC to the Transistor”[1], in which he reported on an approach to incorporate dynamic voltage effects across the whole system. He proposed substituting a piecewise linear (PWL) model, derived from SPICE simulations, for the voltage variation at the die bump and incorporating that model into the die-level voltage drop analysis. His results indicated that using SPICE-derived models of bump voltages greatly improved the turnaround time (TAT) and capacity of these analyses. Improvements in flows and tools certainly help engineers address these issues, but design complexity keeps increasing, especially with chiplet approaches that rely on heavier die-to-die communication. In cases where voltage fluctuations fall outside the specified design corners, causing timing and chip functionality to fail, designers can 1) redesign the PDN, 2) increase the voltage to create more margin to handle a larger voltage drop, or 3) reduce the activity level on the chip.
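To make the PWL idea concrete, here is a minimal Python sketch. It is not the flow from the presentation, only an illustration of what a piecewise linear bump-voltage model looks like when it is consumed by a later analysis step. The breakpoint values, the constant `on_die_drop`, and all function names are hypothetical; in a real flow the breakpoints would come from a system-level SPICE run and the on-die drop from the die-level analysis.

```python
from bisect import bisect_right

def pwl_voltage(t, times, volts):
    """Evaluate a piecewise linear waveform at time t, clamping at both ends."""
    if t <= times[0]:
        return volts[0]
    if t >= times[-1]:
        return volts[-1]
    i = bisect_right(times, t)            # index of first breakpoint after t
    t0, t1 = times[i - 1], times[i]
    v0, v1 = volts[i - 1], volts[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)

# Hypothetical bump-voltage breakpoints (time in ns, voltage in V).
BUMP_T = [0.0, 10.0, 20.0, 40.0]
BUMP_V = [0.80, 0.74, 0.78, 0.80]

def worst_case_cell_voltage(sample_times, on_die_drop):
    """Lowest cell voltage over the samples, assuming a constant on-die drop."""
    return min(pwl_voltage(t, BUMP_T, BUMP_V) - on_die_drop for t in sample_times)

if __name__ == "__main__":
    samples = [0.0, 5.0, 10.0, 15.0, 20.0]
    print(f"worst-case cell voltage: {worst_case_cell_voltage(samples, 0.03):.3f} V")
```

The point of the substitution is that the die-level analysis only needs a handful of breakpoints for the bump voltage instead of re-simulating the board and package every time.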

The above-mentioned remedies all have drawbacks. Redesigning the PDN can be costly in terms of design time, time to market, and engineering resources. Increasing the voltage is a straightforward fix, but it increases power and has thermal and reliability downsides. Reducing activity typically means cutting the clock frequency and/or reducing the amount of computation, which reduces performance. We’re going to look at two scenarios: one that requires a more significant change to the PDN, and another that lends itself to a more adaptive approach.

At this year’s 70th ISSCC, held in San Francisco, CA, Graphcore presented a paper entitled “Wafer-Level Stacking of High-Density Capacitors to Enhance the Performance of a Large Multicore Processor for Machine Learning Applications”[2]. The design, known as the Colossus Mk2x, is targeted toward large-scale AI workloads.


Fig. 1: Graphcore Colossus Mk2x Operating Cycle[2]

Figure 1 shows the unique operating cycle for the Graphcore design. It goes through three phases: Synchronize, Communicate, and Compute. Each phase has a different activity profile, as shown in the figure. The diagram illustrates a roughly 6x variation in the “switching capacitance,” going from just over 50nF to over 300nF (presumably approximately per tile, and there are 1472 tiles + 64 spares). This phase-based variation, combined with the periodic way the chip cycles through the phases, puts an incredible amount of stress on the PDN as it tries to maintain a stable voltage.


Fig. 2: Original Graphcore Mk2x Dynamic Voltage Variation[2]

Figure 2 shows the impact on the original PDN for the Graphcore design. There’s a clear periodic oscillation at ~25MHz, with the voltage ranging between ~700mV and ~945mV. That is roughly a ±15% swing in voltage, which is well outside typical ±10% design corners and even further from the ±5% voltage range that more voltage-aggressive designs target.
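The ±15% figure follows from the two approximate voltage readings. A short Python check, where the only inputs are the ~700mV and ~945mV values quoted above and the function name is just for illustration:

```python
def swing_percent(v_min, v_max):
    """Return (midpoint, +/- swing as a percentage of the midpoint)."""
    mid = (v_min + v_max) / 2
    half_range = (v_max - v_min) / 2
    return mid, 100 * half_range / mid

# Approximate readings from Figure 2, in volts.
mid, pct = swing_percent(0.700, 0.945)
print(f"midpoint {mid * 1000:.1f} mV, swing +/-{pct:.1f}%")
```

The swing is measured here relative to the midpoint of the excursion (about 822.5mV), which gives just under ±15%. Measuring against a different nominal supply voltage would shift the percentage slightly.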

Clearly, this example falls outside the range of minor changes. Since the required current is a function of the switched capacitance multiplied by the clock frequency, cutting the clock frequency would reduce the size of the voltage excursions in Figure 2, but it would also significantly reduce the performance of the part. Increasing the voltage to maintain Vmin would hurt power and reliability. Graphcore chose to significantly modify the power distribution network by bonding a second piece of silicon, containing many deep trench capacitors (DTCs), to the original chip using TSMC’s Wafer-on-Wafer (WoW) technology. Graphcore estimates that this was good for ~750mF of additional “decoupling” capacitance, which is enough to drastically reduce the dynamic voltage variation for the Graphcore design.
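The relationship between switched capacitance, clock frequency, and current can be sketched with the standard first-order CMOS approximation I = C × V × f. In the Python snippet below, the two capacitance values are the approximate extremes read from Figure 1; the supply voltage and clock frequency are placeholder assumptions, not figures from the Graphcore paper, so only the ratios are meaningful.

```python
def dynamic_current(c_switched, vdd, f_clk):
    """First-order average switching current in amps: I = C_switched * VDD * f_clk."""
    return c_switched * vdd * f_clk

VDD = 0.85      # volts, assumed for illustration only
F_CLK = 1.0e9   # hertz, assumed for illustration only

i_low = dynamic_current(50e-9, VDD, F_CLK)     # ~50nF phase (Figure 1)
i_high = dynamic_current(300e-9, VDD, F_CLK)   # ~300nF phase (Figure 1)
i_high_half_clock = dynamic_current(300e-9, VDD, F_CLK / 2)

print(f"current ratio between phases: {i_high / i_low:.1f}x")
print(f"high-phase current at half clock: {i_high_half_clock / i_high:.0%} of full clock")
```

Under this approximation the 6x swing in switched capacitance becomes a 6x swing in current demand, and halving the clock halves the current, which is why frequency reduction shrinks the voltage excursions at a direct cost in performance.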


Fig. 3: Graphcore Mk2x with Additional Bonded Chip Capacitance[2]

Figure 3 shows the impact on the dynamic voltage variation with the additional capacitance added into the power distribution network. The plot in Figure 3 is at twice the resolution (20mV/div) versus the one shown in Figure 2 (40mV/div). The dynamic voltage variation is virtually eliminated and there’s a residual IR-drop of about 40mV (~4.7%) remaining. This enables Graphcore to run their chips at higher frequencies more reliably without incurring large voltage excursions on their PDN.

While quite effective, the solution does come at the additional cost of having to use another wafer for the DTCs and then machining it down and annealing the two wafers together using WoW bonding. For a large, high-margin part like the Graphcore Mk2x, this is within an acceptable range for the total cost of the assembled part. For many other applications though, this process would be deemed too costly. Next, we’ll look at a scenario that is more common in many designs today.


Fig. 4: Example Dynamic Voltage Drop Response Curve[3]

Figure 4 shows an example waveform response for a design described in “Dynamic Voltage (IR) Drop Analysis and Design Closure: Issues and Challenges”[3]. This waveform has been used numerous times to illustrate how a system can respond to an increase in switching activity. While the dynamic behavior causes the voltage to drop below the expected “steady state” IR-drop level, the overall difference isn’t that great. Enabling a voltage droop mitigation scheme could still be useful in this situation, but the benefit is limited by the size of the gap between the static IR drop and the dynamic voltage drop.

In a recent Semiconductor Engineering article, “Battling Over Shrinking Physical Margin In Chips”, Marc Swinnen, director of marketing at Ansys, is quoted as saying, “They’re [chip designers] seeing a mysterious 10% drop in frequency due to voltage drop situations they didn’t anticipate, and it’s probably due to dynamic voltage drop, which has become completely dominant over the good old static proper voltage drop.” This means that dynamic voltage drop (DVD) response curves are looking more like Figure 5 than like the one in Figure 4.


Fig. 5: Example of Current Leading Edge Design DVD Response Curve

This now-common scenario is much more inviting to Dynamic Frequency Management (DFM) techniques for addressing DVD. As discussed earlier, cutting the clock frequency is one way to reduce DVD, but it also reduces performance. What if it were possible to have both reduced DVD and a fast clock? The idea is analogous to a memory (cache) hierarchy, where designers arrange to access a smaller, faster memory most of the time, with a larger, slower memory (or memories) behind it, to create the illusion of a large, fast memory. With DFM, we want to operate at the fast clock most of the time and only reduce the frequency for very short periods to mitigate voltage excursions. This creates the effect of a system running at a fast clock and at the expected operating voltage without experiencing large voltage drops. Such a system requires a detector to sense when the voltage is changing and a dynamic clock frequency management system that responds quickly by changing frequency as needed to counter the drop in voltage. By closely coupling these components, it’s possible to achieve a system that runs at the desired clock frequency and voltage while avoiding large voltage excursions. It is expected that more designers will consider this approach in the future to reduce margins and obtain better power and performance ratings.
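The detector-plus-clock-manager pairing can be sketched as a simple threshold controller with hysteresis. This Python example is a toy model under stated assumptions, not a description of any particular product: the trip and release thresholds, the reduced frequency, the `DfmConfig` and `next_frequency` names, and the voltage trace are all made up for illustration. The trace is also replayed open loop, whereas in a real system lowering the clock would itself pull the voltage back up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DfmConfig:
    v_trip: float = 0.74      # below this, the detector asks for a slower clock
    v_release: float = 0.78   # above this, the nominal clock is restored
    f_nominal: float = 1.0    # normalized nominal frequency
    f_reduced: float = 0.75   # normalized reduced frequency

def next_frequency(v_sensed, f_now, cfg):
    """Pick the clock frequency for the next interval from the sensed voltage."""
    if v_sensed < cfg.v_trip:
        return cfg.f_reduced
    if v_sensed > cfg.v_release:
        return cfg.f_nominal
    return f_now              # inside the hysteresis band: hold the current setting

def replay(trace, cfg):
    """Run the controller over a sensed-voltage trace, one decision per sample."""
    f = cfg.f_nominal
    freqs = []
    for v in trace:
        f = next_frequency(v, f, cfg)
        freqs.append(f)
    return freqs

# Hypothetical sensed-voltage trace (volts) with one short droop event.
TRACE = [0.80] * 8 + [0.76, 0.72, 0.70, 0.73, 0.77, 0.79] + [0.80] * 6

freqs = replay(TRACE, DfmConfig())
average = sum(freqs) / len(freqs)
print(f"reduced-clock samples: {freqs.count(0.75)} of {len(freqs)}")
print(f"average frequency: {average:.2f} of nominal")
```

On this made-up trace the clock is reduced for 4 of 20 samples, so the average frequency stays at 95% of nominal. The hysteresis band between the two thresholds keeps the controller from toggling the clock on every small ripple.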

Design margins are eating away at larger portions of power and performance, and designers are looking for ways to claw some of that back. To handle dynamic voltage droop, designers can 1) redesign the PDN, 2) increase the voltage to create more margin to handle a larger voltage drop, or 3) reduce the activity level on the chip. As shown, in extreme cases it may be necessary to go back and redesign the PDN. For many of the dynamic voltage stability issues today’s designers face, though, a dynamic frequency management system composed of a tightly coupled detector and a fast clock frequency manager presents the opportunity to create designs that operate at the desired voltage and at a clock frequency that’s practically indistinguishable from the desired frequency.

References
[1] Shane Stelmach, “System Power Integrity Analysis: from the PMIC to the Transistor”, 60th Design Automation Conference (DAC), 2023.

[2] Stephen Felix, Shannon Morton, Simon Stacey, John Walsh, “Wafer-Level Stacking of High-Density Capacitors to Enhance the Performance of a Large Multicore Processor for Machine Learning Applications”, 70th International Solid-State Circuits Conference (ISSCC), 2023.

[3] S K Nithin, Gowrysankar Shanmugam, Sreeram Chandrasekar, “Dynamic Voltage (IR) Drop Analysis and Design Closure: Issues and Challenges”, 11th International Symposium on Quality Electronic Design (ISQED), 2010.


