Spray And Pray Wastes Power

Given the importance of power to many chip designs, it is amazing how few tools take power seriously.


For quite some time I have felt that the way the industry approaches power is less than optimal. Techniques such as clock gating and power gating have been used to reduce the amount of unnecessary activity and leakage, but is there more activity that does not contribute to an intended action?

While unnecessary activity may be unimportant in the functional sense, it all represents power that is being wasted. Some power is intentionally consumed in the hope that it increases performance, like branch prediction, and while that is wasted in some cases, it provides benefit in others. At the micro level, power is wasted from glitches, while at the macro level there could be output that is being produced and ignored. A gross example of this happens on my PC every day. The screens go to sleep and yet the GPUs continue cranking to deliver content to them. Why? Quite simply, there is no back pressure telling anything when work is unnecessary.

One thing that promotes this kind of waste is the predominant verification strategy being used: constrained random test pattern generation. This was seen as a huge advance back in the 1980s because it allowed stimulus to be created automatically instead of all verification runs being manually created and maintained, which was a huge task. Constrained random used human effort to create models that were then used to generate stimulus. But those models sprayed stimulus into the design and hoped it would do something useful.

Because the stimulus was random, a model of the system had to be generated so that you could see if what happened was correct. Another advancement was also required, coverage, which would give an indication of whether anything useful was happening in the design. These results allowed verification folks to decide if a particular random seed increased design coverage and if it should be used again for regression runs. In some cases, nothing actually cares if the right activity is being created, and unnecessary activity is actually an advantage if it checks off more coverage.
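To make the seed-selection step concrete, here is a minimal sketch of how a regression list might be built from coverage results. The seed numbers, coverage bin names, and the `select_regression_seeds` helper are all hypothetical, and real flows typically rely on the simulator's own coverage merging and ranking rather than anything this simple.

```python
def select_regression_seeds(seed_coverage):
    """Keep a seed only if it hits at least one coverage bin that no
    previously kept seed has hit (a simple greedy pass, in input order)."""
    covered = set()
    kept = []
    for seed, bins in seed_coverage.items():
        new_bins = set(bins) - covered
        if new_bins:
            kept.append(seed)
            covered |= new_bins
    return kept, covered


# Hypothetical coverage results: seed -> bins that seed happened to hit.
seed_coverage = {
    101: {"fifo_full", "fifo_empty"},
    102: {"fifo_full"},              # adds nothing new, so it is dropped
    103: {"fifo_empty", "dma_done"},
}

kept, covered = select_regression_seeds(seed_coverage)
print(kept)             # [101, 103]
print(sorted(covered))  # ['dma_done', 'fifo_empty', 'fifo_full']
```

Note that nothing in this selection asks whether the activity behind a bin was necessary or even checked. A seed earns its place purely by ticking a new box.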

As design sizes have grown, this methodology has become increasingly wasteful, and it has become harder to reach the required coverage. Additionally, the fact that the coverage metrics are only proxies for actual functionality makes it difficult to tie things back to requirements and specifications. Even more troubling is that if coverage is not directly linked to results checking, the coverage has absolutely no meaning. It is checking off a box without actually having done the work.

About a decade ago, EDA companies and standards bodies attempted to create a new verification methodology that would start from the target goals of a device and work out how to make that happen in the design. This effort was called Portable Stimulus — a name whose historical context was dubious, and which only adds to the confusion about what it can do.

But the semiconductor industry does not like change, and it will resist any changes until there is no alternative. Chips have been limited by power and heat for some time. As the thermal density of new technology nodes increases, an increasing number of designs contain large amounts of dark silicon. This is necessary to absorb excess heat from neighboring elements.

The power crunch is happening in multiple ways. At the macro level, the industry would rather spend more time and money removing heat from a package than reducing the amount of heat generated. At the design level, power/energy/thermal is still treated as a second-class citizen.

Back to Portable Stimulus. With PSS, the end goal is defined first. Perhaps you want to see data move from point A to point B. A suitable tool will find out how to make that happen. It starts with a necessary action and looks at the preconditions for that to start, then works out how to make those happen, which in turn creates more preconditions — until you satisfy everything necessary to complete the goal. That creates a reverse cone of activity, similar to that developed by formal verification tools. However, formal is trying to prove that something is always true or false, whereas PSS is trying to find examples that satisfy the goal.
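The backward walk from goal to preconditions can be sketched in a few lines. This is only an illustration of the idea, not how any PSS tool is implemented: the action names and conditions are made up, the action graph is assumed to be acyclic, and real tools use constraint solving over a much richer model.

```python
# Hypothetical action model: each action lists the conditions it needs
# before it can run and the conditions it establishes once it has run.
ACTIONS = {
    "power_up_dma": {"needs": [], "gives": ["dma_powered"]},
    "configure_dma": {"needs": ["dma_powered"], "gives": ["dma_configured"]},
    "fill_buffer_a": {"needs": [], "gives": ["data_at_a"]},
    "dma_transfer": {"needs": ["dma_configured", "data_at_a"],
                     "gives": ["data_at_b"]},
}


def plan(goal, actions, schedule=None, satisfied=None):
    """Work backward from a goal condition and return the actions, in a
    runnable order, that establish it. Assumes the action graph is acyclic."""
    if schedule is None:
        schedule = []
    if satisfied is None:
        satisfied = set()
    if goal in satisfied:
        return schedule
    for name, action in actions.items():
        if goal in action["gives"]:
            # Satisfy every precondition first; each may pull in more actions.
            for precondition in action["needs"]:
                plan(precondition, actions, schedule, satisfied)
            schedule.append(name)
            satisfied.update(action["gives"])
            return schedule
    raise ValueError(f"no action establishes {goal!r}")


schedule = plan("data_at_b", ACTIONS)
print(schedule)
# ['power_up_dma', 'configure_dma', 'fill_buffer_a', 'dma_transfer']
```

The returned schedule is the reverse cone for the goal: every action in it is there because something downstream needs it, and nothing else is included.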

When a stimulus set generated by PSS is ‘played’ through the design, it should achieve the desired goal. If not, there is an error in the design. But it is also possible that it generates additional activity outside of the intended cone of influence. This could be an unnecessary and wasteful activity.

Now let’s move on to power. Portable Stimulus could be an important tool for power optimization. If the activity generated by a simulation is compared to the activity required to accomplish a desired goal, as generated by something like a Portable Stimulus tool, then any unnecessary activity is potential power wastage. It should at least be a candidate for investigation. There will be some activity that is background noise, like counters being used to create asynchronous events, and of course, that may also trigger software actions, but they are at a much higher level of activity and should be easy to isolate.
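As a rough sketch of that comparison, assume the simulator can report per-signal toggle counts and that the set of signals in the required cone is available from the stimulus tool. Both are assumptions, as are the signal names and the `find_unneeded_activity` helper below. Toggle count is used here only as a crude proxy for dynamic power.

```python
def find_unneeded_activity(toggle_counts, required_cone, background=frozenset()):
    """Return (signal, toggles) pairs for signals that toggled but are neither
    in the required cone nor known background, busiest first."""
    suspects = {
        signal: toggles
        for signal, toggles in toggle_counts.items()
        if toggles > 0
        and signal not in required_cone
        and signal not in background
    }
    return sorted(suspects.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toggle counts from one simulation run.
toggle_counts = {
    "dma.req": 40,
    "dma.data": 512,
    "gpu.frame_out": 9000,  # still producing frames nobody is looking at
    "timer.count": 1200,    # free-running counter, expected background
    "uart.tx": 0,
}
required_cone = {"dma.req", "dma.data"}  # what the goal actually needs
background = {"timer.count"}             # known and accepted noise

suspects = find_unneeded_activity(toggle_counts, required_cone, background)
print(suspects)  # [('gpu.frame_out', 9000)]
```

Anything that lands in the suspect list is not automatically a bug. It is a candidate for investigation, ranked by how much activity it represents.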

What is needed is a form of coverage to be implemented on the design side that can be compared to the coverage generated by PSS. Finding wasted energy and reducing it saves in so many ways, and if less power is consumed, less cooling is required. If just a fraction of the time and money being spent on cooling was invested in reducing energy consumed, a lot of progress could be made.


