Designers Face Growing Problems With On-Chip Power Distribution

At the latest nodes, it is becoming impossible to analyze IR drop correctly, leading to chip-killing problems.

The evolution of semiconductor manufacturing technology has led to chips with ever-higher power densities, which in turn is causing serious problems with on-chip power distribution. Specifically, the problems surrounding voltage drop, or IR drop (from V = I × R), have become so acute that multiple companies have started to get dead silicon back from the fab.

For example, a recent 7nm chip designed to run at 3GHz failed to get above 2.7GHz in silicon. The failure was due to excessive IR drop on the power and ground supply lines, which went undetected even though the design passed all signoff tool checks and followed all methodology recommendations. It is this unpredictability that is raising concerns in the design community, and it indicates that we need to reconsider our approach to IR drop signoff.

On a detailed level, IR drop causes chip failures because standard cells and macros slow down dramatically if their supply voltage is inadequate. This effect is not new, and designers have been dealing with it for years. But something has changed in manufacturing that is making existing verification methodologies obsolete. The main culprit is the dramatically increased resistance of semiconductor interconnect, which has risen almost 10X from 28nm to 7nm, and this trend is expected to intensify for nodes below 7nm. By contrast, capacitance has changed very little across recent nodes. The other contributing factor is the heightened sensitivity of advanced-node libraries to variations in supply voltage, particularly in ultra-low-voltage and near-threshold operating regimes. High-Vt cells suffer particularly badly from this. We have seen cases of up to 25% variation in delay for a swing of just 10mV at 0.5V.
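To illustrate why near-threshold operation and high-Vt cells are so sensitive, here is a minimal sketch using the textbook alpha-power-law delay approximation (delay proportional to V / (V - Vt)^alpha). This is not the characterization model behind any real library, and the threshold voltages and the alpha value are assumptions chosen purely for illustration, so the percentages it produces will not match the silicon figures quoted above.

```python
def relative_delay(vdd, vt, alpha=1.3):
    """Alpha-power-law gate delay, up to a constant factor.

    alpha=1.3 is an assumed velocity-saturation exponent, not a measured value.
    """
    if vdd <= vt:
        raise ValueError("model only applies above threshold (vdd > vt)")
    return vdd / (vdd - vt) ** alpha


def delay_increase_pct(vdd_nominal, droop, vt, alpha=1.3):
    """Percent delay increase when the supply sags by `droop` volts."""
    nominal = relative_delay(vdd_nominal, vt, alpha)
    drooped = relative_delay(vdd_nominal - droop, vt, alpha)
    return 100.0 * (drooped / nominal - 1.0)


if __name__ == "__main__":
    # Hypothetical thresholds for a 0.5V supply with a 10mV dip.
    for label, vt in (("lower-Vt cell", 0.30), ("high-Vt cell", 0.40)):
        pct = delay_increase_pct(vdd_nominal=0.5, droop=0.010, vt=vt)
        print(f"{label} (Vt={vt:.2f}V): +{pct:.1f}% delay")
```

The closer the supply sits to the threshold, the smaller the overdrive (V - Vt), so the same 10mV dip removes a larger share of it. That is the qualitative reason the high-Vt case comes out worse in this model.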

The increase in wire resistance has also conspired to invalidate traditional techniques for mitigating IR drop. Traditionally, IR drop was kept in check by over-dimensioning the power grid and by adding decoupling capacitors. But over-dimensioning is a brute-force approach that is becoming much too expensive in PPA (power, performance, area), and decoupling caps are no longer as effective because heightened resistance has made IR drop a very local phenomenon. Indeed, there is so much resistance between a local problem and distant capacitors that the time constant for any current surge is too long to help with local, instantaneous dips. This increased resistive shielding is what makes global solutions like over-dimensioning and decoupling caps less effective. Another consequence is that the effect of local aggressors (nearby standard cells that cause the local voltage to dip when they switch) is accentuated and plays a growing role in the IR problem. Without understanding the impact of these local aggressors, it is becoming impossible to analyze IR drop correctly.
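The resistive-shielding argument can be made concrete with a first-order RC sketch: a decoupling capacitor reached through a grid resistance R responds with time constant tau = R × C, and only a fraction 1 - exp(-t / tau) of its available charge arrives within a switching window of length t. The capacitance, resistances, and window length here are hypothetical round numbers, and a real power grid is a distributed network rather than a single RC.

```python
import math


def decap_time_constant_s(r_path_ohm, c_decap_f):
    """Time constant of a decap seen through a lumped grid resistance."""
    return r_path_ohm * c_decap_f


def charge_fraction_delivered(window_s, tau_s):
    """Fraction of the decap's available charge delivered within the window
    (first-order RC step response)."""
    return 1.0 - math.exp(-window_s / tau_s)


if __name__ == "__main__":
    WINDOW_S = 20e-12   # assumed duration of the local current surge
    C_DECAP_F = 1e-12   # assumed decap value
    # Assumed path resistances; the second is 10X the first.
    for label, r_ohm in (("baseline grid", 5.0), ("10X resistance", 50.0)):
        tau = decap_time_constant_s(r_ohm, C_DECAP_F)
        frac = charge_fraction_delivered(WINDOW_S, tau)
        print(f"{label}: tau = {tau * 1e12:.0f} ps, "
              f"{100 * frac:.0f}% of charge arrives in time")
```

With capacitance unchanged, a 10X rise in resistance stretches the time constant 10X, so most of the distant capacitor's charge arrives after the local dip has already happened.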

Today’s voltage-sensitive libraries mean that certain paths are inherently voltage-sensitive because of the combination of standard cells, slews, and loading that they contain. If just the right set of local aggressors all switch at the right time, there will be a significant local dynamic voltage drop, and the delays on these paths will differ wildly from standard delay calculations that do not consider this specific activity pattern around this specific path. Such paths can descend into timing failure even if they originally had plenty of positive slack and were on no one’s radar as a ‘critical’ path. In the IR timing failure example mentioned above, the culprit turned out to be a path that was not timing-critical at all. In fact, it ranked beyond 200,000th in the list of critical paths.
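A toy calculation shows how a path with comfortable slack can still fail under a local dip. The sketch assumes a linear delay-versus-droop model and a hypothetical six-stage path; the 2.5% per mV sensitivity is a linear reading of the "up to 25% for 10mV" figure and is an assumption, as are the stage names, stage delays, and droop values. Real flows would use voltage-dependent library data rather than a single coefficient.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    nominal_delay_ps: float
    sensitivity_pct_per_mv: float  # assumed linear delay increase per mV of droop


def path_delay_ps(stages, droop_mv):
    """Sum stage delays, scaling each by the droop seen at that instance.

    `droop_mv` maps stage name -> local supply droop in mV (missing = 0).
    """
    total = 0.0
    for s in stages:
        scale = 1.0 + s.sensitivity_pct_per_mv * droop_mv.get(s.name, 0.0) / 100.0
        total += s.nominal_delay_ps * scale
    return total


def slack_ps(stages, droop_mv, clock_period_ps):
    return clock_period_ps - path_delay_ps(stages, droop_mv)


if __name__ == "__main__":
    PERIOD_PS = 333.0  # roughly a 3GHz clock
    path = [Stage(f"u{i}", 40.0, 2.5) for i in range(6)]  # hypothetical path

    print("no droop:   ", slack_ps(path, {}, PERIOD_PS), "ps slack")
    # Four of the six stages sit next to aggressors that switch together.
    local_dip = {name: 30.0 for name in ("u1", "u2", "u3", "u4")}
    print("local droop:", slack_ps(path, local_dip, PERIOD_PS), "ps slack")
```

In this made-up case the path starts with 93 ps of positive slack, far from critical, and still goes negative once the four affected stages see the dip at the same time.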

That is why we are seeing some designs at 7nm and below pass all traditional IR signoff methodologies with flying colors, and then fail in silicon on the test bench.

Given this analysis, the outlines for a solution present themselves:

  • IR-drop analysis must become much more closely linked to static timing analysis (STA) so that IR drop-aware timing can analyze the actual timing impacts of a given voltage drop on a path. IR-drop tools must become timing-aware.
  • The simulations designed to analyze various activity patterns of aggressor nets must be improved to have a realistic chance of finding the one-in-a-million switching pattern that will trigger the real worst-case path failure. This will most likely require some sophisticated version of vectorless activity. Simple random vectorless activity has a vanishingly small chance of finding the real worst-case activation pattern, and vectored analysis won’t have enough vectors to give reliable coverage. Remember, the vector activation must trigger both the voltage-sensitive path and the right set of aggressors at the same time.
  • The IR drop analysis must be integrated into the physical implementation tool so that it can be performed early and often during the design process, and also take advantage of automatic timing-correct ECO fixing techniques available in implementation.
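The claim about random vectorless activity can be sanity-checked with a back-of-the-envelope probability sketch. It assumes, unrealistically, that the victim path and each aggressor toggle independently in the critical window with fixed probabilities; the probabilities and the aggressor count are invented for illustration, and real switching activity is correlated.

```python
import math


def hit_probability(p_victim, p_aggressor, n_aggressors):
    """Chance that one random activity sample exercises the victim path
    while all n aggressors switch in the same window (independence assumed)."""
    return p_victim * p_aggressor ** n_aggressors


def chance_of_at_least_one_hit(p_hit, n_samples):
    """1 - (1 - p_hit)**n_samples, computed stably for tiny p_hit."""
    return -math.expm1(n_samples * math.log1p(-p_hit))


if __name__ == "__main__":
    # Hypothetical: 10% toggle chance each, six aggressors must align.
    p = hit_probability(p_victim=0.1, p_aggressor=0.1, n_aggressors=6)
    print(f"per-sample hit probability: {p:.1e}")
    for n in (1_000, 1_000_000):
        print(f"{n:>9} random samples -> "
              f"{100 * chance_of_at_least_one_hit(p, n):.2f}% chance of a hit")
```

Under these assumptions even a million random samples would most likely miss the pattern, which is the motivation for steering activity generation with timing and voltage-sensitivity information instead of sampling blindly.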

The solution methodology suggested above is not a completely new idea. The concept of timing-aware IR drop has been around for a decade or more. But the sheer volume of data that needs to be pulled together in one place (complete layout, full STA with SI, all timing window information, dynamic IR drop simulation, vectorless data generation, etc.) has made the approach impractical, particularly if P&R, IR drop, and STA are being done by three different tools from different vendors. So, over and above the detailed algorithmic challenges embedded in this vision, there is an overriding and urgent need for a deeply integrated full-flow solution.

I am confident that the EDA community will rise to this challenge, as it has done so many times in the past, but designers will have to follow and learn to design with a smarter, more integrated approach to power supply distribution and integrity.


