Gaps In Performance, Power Coverage

Coverage tells us when we have done enough functional verification, but what about power and performance? How do you know you have found the worst case?

The semiconductor industry has always used metrics to define progress, and in areas such as functional verification significant advances have been made. But so far, no effective metrics have been developed for power, performance, or other system-level concerns, which essentially means that design teams are running blind.

On the plus side, the industry has migrated from code coverage metrics to metrics that indicate whether certain functionality has been executed. While there are still problems with those metrics, such as difficulty defining completeness, they have served the industry well. More recently, though, factors other than functionality have grown in importance, and so far engineers have no indication of whether the design will meet its specs or whether the worst-case conditions have been found. And while not all designs need to be optimized for these extreme cases, they are conditions that could cause the chip to fail.

Metrics for power and performance are not as well defined as those for functionality. There is no notion of pass or fail unless critical limits are exceeded. They are vector-dependent, and a notion such as power contains many sub-requirements, such as total energy consumed for a given task, peak power, average power, power density, and many others. Many of these are values over time, and when thermal impacts are added the time constants involved can be considerable. Similarly for performance, sustained throughput may be more important in some cases than maximum throughput.
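To make those sub-requirements concrete, the following is a minimal, hypothetical sketch of how several of them could be derived from a single simulated power-over-time trace. The function name, trace format, and all numbers are illustrative, not taken from any real tool.

```python
# Hypothetical sketch: deriving the power sub-metrics named above from one
# power trace (watts sampled at fixed intervals). Illustrative only.

def power_metrics(trace_w, dt_s, area_mm2):
    """Summarize a power-over-time trace into common sub-metrics."""
    total_energy_j = sum(p * dt_s for p in trace_w)  # energy for the task
    peak_power_w = max(trace_w)                      # worst instantaneous draw
    avg_power_w = total_energy_j / (len(trace_w) * dt_s)
    peak_density = peak_power_w / area_mm2           # W/mm^2, a thermal proxy
    return {
        "total_energy_j": total_energy_j,
        "peak_power_w": peak_power_w,
        "avg_power_w": avg_power_w,
        "peak_power_density_w_mm2": peak_density,
    }

# Example: 1 ms of samples at 1 us resolution for a 2 mm^2 block
trace = [0.5, 0.7, 2.1, 0.6] * 250
print(power_metrics(trace, dt_s=1e-6, area_mm2=2.0))
```

Note that each metric answers a different question: total energy relates to battery life, peak power to supply sizing, and power density to heat.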

Power management
There is one aspect of power coverage that is easily dealt with. “We have to separate the notions of low power design and power management,” explains Srikanth Jadcherla, low power architect in the verification group of Synopsys. “Power management is simply making sure you turn things off that you are not using and that the resources you need are available and ready to operate at the right times.”

Power management is well serviced by the existing metrics and tools. “If you plan to shut something off, then you need to verify that it is actually shut off,” says Krishna Balachandran, product management director at Cadence. “You need to ensure that the control signals are all in the right place, the state machine works properly, and that retention is working. You could have assertions automatically extracted for these kinds of things and you could use a formal tool to check them.”
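The kinds of checks Balachandran describes would normally be written as assertions in the verification environment; as a hypothetical illustration, the same properties can be sketched as a post-simulation trace check in Python. Each record is one sampled cycle of a power domain's control and state signals, and every field name here is invented.

```python
# Hypothetical sketch of shutdown/retention checks over a simulation trace.
# Field names ("power_on", "isolated", "retained_state") are illustrative.

def check_power_management(trace):
    """Return a list of (cycle, message) violations found in the trace."""
    violations = []
    saved_state = None
    for t, cyc in enumerate(trace):
        if not cyc["power_on"]:
            # While the domain is off, its outputs must be isolated.
            if not cyc["isolated"]:
                violations.append((t, "outputs not isolated while off"))
            if saved_state is None:
                saved_state = cyc["retained_state"]  # value latched at shutdown
            elif cyc["retained_state"] != saved_state:
                violations.append((t, "retention register lost its value"))
        else:
            saved_state = None  # domain back on; retention no longer checked
    return violations

trace = [
    {"power_on": True,  "isolated": False, "retained_state": 5},
    {"power_on": False, "isolated": True,  "retained_state": 5},
    {"power_on": False, "isolated": True,  "retained_state": 7},  # corruption
]
print(check_power_management(trace))
```

As the article notes, properties like these are mechanical enough that assertions can be extracted automatically and handed to a formal tool.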

But there is at least one area of this that does create problems. “Power management is often controlled by the operating system,” adds Jadcherla. “People have struggled with the notion of what to cover, how to cover it and what coverage means in terms of the design. It is at the hardware-software interface, so whose job it is becomes part of the problem.”

In the past, standards such as the Advanced Configuration and Power Interface (ACPI) have been used to define this interface for computers, and there is an IEEE effort underway (IEEE P2415) to define how the OS interacts with the chip, with the goal that this information can be communicated to design and verification tools in the future.

But where does power management end and low-power design start? “When a part of the die gets too hot, you want to switch context and move activity to another part of the die,” says Jadcherla. “That requires you to take on overhead. A lot of people are grappling with this today because heat, or power density, are such difficult spatial phenomena. You have something in the operating system or algorithm running a policy based on some analog parameter. This makes it a complex problem.”

While the focus is often on power, it is generally heat and current that create failure conditions, while energy optimization provides greater battery life. “Power is a balance of various scenarios,” explains Jadcherla. “Once you have a thermal model integrated, you can speak to power at a much higher level of abstraction.”

Metrics and coverage may not be the right approach for all verification problems. Drew Wingard, chief technology officer at Sonics, argues that performance should be treated in the same way that timing used to be. Before static timing verification, timing was verified at the gate level. “There were a series of major microprocessor designs that encountered disastrous results when design complexity reached the point where there was no one on the design team who could keep track of critical timing paths at the module boundaries. We need an algebra that allows us to describe some requirements of a system and then validate that we are achieving those.”

Wingard believes that the algebra for performance is not that difficult, but the insertion process may create more difficulties. “The challenge is in getting the people who create the sub-components to describe and formally verify the performance requirement and characteristics of their components. With that information we could have static performance analysis just like static timing analysis.”
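As a hypothetical sketch of what such a performance algebra could look like: each sub-component publishes a contract (worst-case latency, sustainable throughput), and a static check composes contracts along a path instead of simulating it, analogous to summing delays in static timing analysis. The component names, numbers, and requirements below are invented for illustration.

```python
# Hypothetical sketch of a static "performance algebra": contracts composed
# in series add their latencies, while throughput is limited by the slowest
# component. All names and figures are illustrative.
from typing import NamedTuple

class PerfContract(NamedTuple):
    worst_latency_ns: float     # worst-case latency through the component
    min_throughput_gbps: float  # throughput it can sustain

def compose_pipeline(contracts):
    """Compose contracts along a series path."""
    return PerfContract(
        worst_latency_ns=sum(c.worst_latency_ns for c in contracts),
        min_throughput_gbps=min(c.min_throughput_gbps for c in contracts),
    )

path = [
    PerfContract(12.0, 8.0),   # e.g., CPU cluster interface
    PerfContract(30.0, 6.4),   # e.g., interconnect
    PerfContract(45.0, 12.8),  # e.g., memory controller
]
system = compose_pipeline(path)
# Static check against an assumed system-level requirement:
assert system.worst_latency_ns <= 100.0
assert system.min_throughput_gbps >= 6.0
```

The insertion challenge Wingard identifies is getting component authors to supply contracts like `PerfContract` in the first place; the composition itself is simple once they exist.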

Today, many people use assertions to check for worst-case latency or other system-level timing and performance issues. While simple, questions remain about whether the right scenarios have been executed to find those worst-case conditions. This topic was discussed in more detail in a related article, “Abstraction: Necessary But Evil.”

Static verification methods can verify the edge conditions, but they may not be able to provide information about typical or sustained throughput within a system. In a roundtable conducted earlier in the year, Harry Foster, chief scientist at Mentor Graphics, stated that functional coverage is the wrong metric for the system level. “The problem is that functional coverage is very static in the way you describe it,” Foster said. “I am now dealing with distributed state machines and I have to worry about states that are dynamic over time, and this means that I have to run many simulations and extract a lot of data to find out what is happening from a system perspective. You cannot express that with our existing metrics.”

Machine learning and data mining may provide the basis for future tools that could sift through large quantities of traces from virtual prototypes, emulation, or FPGA prototypes to answer these questions.

Low power verification
So what makes power so difficult? “Let’s think about where heat comes from,” explains Jadcherla. “Power is dissipated due to a flow of current at the junction from drain to source and also in the wires that ferry the current through the chip. So there is heating in the wires and in the junction. As you go from 28nm to 10nm, the junctions become close together, the density of the heat increases. For the same power, the area heats up a lot more.”

The industry already has power estimation tools that do a reasonable, and improving, job with that part. Balachandran describes a recent Cadence tool introduction. “At RTL, the tool does fast synthesis, brings in the physical effects, and converts the switching activity created by an emulator or simulator into a power profile. There are also utilities that will allow you to identify the peak and to narrow down the window. The signoff tool can use that narrow window rather than regenerating the complete stimulus at the gate level for sign-off.”
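The window-narrowing step Balachandran describes is essentially a sliding-window search over the power profile. As a hypothetical sketch (the profile format and window size are invented), it might look like this:

```python
# Hypothetical sketch of locating the peak-power window in an RTL activity
# profile, so only that window needs re-running at gate level for sign-off.

def peak_power_window(profile_w, window):
    """Return (start_index, average_power) of the highest-power window."""
    best_start = 0
    running = sum(profile_w[:window])       # sum of the first window
    best_avg = running / window
    for start in range(1, len(profile_w) - window + 1):
        # Slide the window: add the entering sample, drop the leaving one.
        running += profile_w[start + window - 1] - profile_w[start - 1]
        avg = running / window
        if avg > best_avg:
            best_start, best_avg = start, avg
    return best_start, best_avg

profile = [1, 1, 1, 5, 6, 5, 1, 1]  # illustrative per-cycle power samples
print(peak_power_window(profile, window=3))
```

The point of the technique is economy: a short window found at RTL can be replayed at the gate level instead of the full stimulus.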

But that is not the end of the story. “That heat then has to pass through the material,” Jadcherla says. “This is where time comes into play, and time is not a friend. We are packing so much into a square millimeter that it heats up within timeframes such as a millisecond. That time constant is nowhere near the heat dissipation time constant of the die and package. So hot spots will develop – even when the chip is not particularly active. That one spot may become unusable. You may have to wait for it to cool down before using it again and switch processing to another part of the chip. This is a power/performance tradeoff.”
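The mismatch of time constants Jadcherla describes can be sketched with a simple first-order thermal model: a hotspot with a small thermal mass heats up within milliseconds, while the die and package respond far more slowly, so heat builds locally before it can spread. All constants below are invented round numbers, not characterized silicon.

```python
# Hypothetical first-order thermal sketch: temperature rise of a node with
# thermal resistance r_th and time constant tau after t seconds of constant
# power. Constants are illustrative, not from a real thermal model.
import math

def temp_rise(power_w, r_th_c_per_w, tau_s, t_s):
    """Temperature rise (deg C) of a first-order thermal node."""
    return power_w * r_th_c_per_w * (1.0 - math.exp(-t_s / tau_s))

# Hotspot: low thermal mass -> short time constant (~1 ms)
hot = temp_rise(power_w=2.0, r_th_c_per_w=30.0, tau_s=1e-3, t_s=1e-3)
# Die/package: long time constant (~1 s) barely responds in the same 1 ms
die = temp_rise(power_w=2.0, r_th_c_per_w=30.0, tau_s=1.0, t_s=1e-3)
print(round(hot, 1), round(die, 3))
```

In this toy model the hotspot climbs tens of degrees in the same millisecond in which the die-level node moves by a small fraction of a degree, which is exactly why a spot can become unusable while the chip as a whole is not particularly active.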

Finding these situations requires running simulations that are long enough, with the right scenarios. “You have to start from a set of goals, and these need to be captured in an executable fashion,” says Wingard. “The best we can do is to validate against those goals. The original goal was to have the following use cases with a power envelope that looks like this for each one of them, and then as we get through the implementation we can double check those.”

But we come back to the question about running the right scenarios. “Some scenarios may have come from the marketing department for how they are going to promote the product, but we know that people discover new use cases and models along the way,” Wingard adds. “Surprises are not uncommon, especially for use cases that were not considered in the original goal set.”

So what are the right coverage metrics to ensure that an adequate set of scenarios has been selected? “I am not sure if the word coverage is broad enough for that,” says Jadcherla. “We do need to be able to predict the worst case scenarios. We have to think about which would be the worst case. A few hundred scenarios are what most companies use. You are looking for the quantification of impact. Then the notion of coverage comes in so you can say these are the top 10 or top 5 scenarios. Many semiconductor companies have ad-hoc methods they use for this today.”
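The "quantification of impact" step Jadcherla mentions can be sketched, hypothetically, as scoring each candidate scenario on the metrics that matter and reporting the top N, so verification effort concentrates on the likely worst cases. The scenario names, weights, and scores below are all illustrative.

```python
# Hypothetical sketch of ranking candidate scenarios by a weighted impact
# score to pick the "top 10 or top 5". All names and numbers are invented.

def rank_scenarios(scenarios, weights, top_n=5):
    """Sort scenarios by weighted impact score, highest first."""
    def impact(s):
        return sum(weights[metric] * s[metric] for metric in weights)
    return sorted(scenarios, key=impact, reverse=True)[:top_n]

scenarios = [
    {"name": "4k video record", "peak_power": 0.9, "power_density": 0.8},
    {"name": "idle screen-on",  "peak_power": 0.2, "power_density": 0.1},
    {"name": "game + modem",    "peak_power": 0.8, "power_density": 0.95},
]
weights = {"peak_power": 0.4, "power_density": 0.6}  # illustrative weighting
top = rank_scenarios(scenarios, weights, top_n=2)
print([s["name"] for s in top])
```

The hard part, as the article makes clear, is not the ranking arithmetic but obtaining trustworthy per-scenario scores in the first place, which is where the ad-hoc methods come in.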

Jadcherla admits that there are some missing elements in these solutions. “They are beginning to target a higher level of abstraction and need some static and dynamic methods to go after those scenarios. Then you need a higher-level method of abstracting power itself. Then you need very accurate modeling of that power for heat or current phenomena.”

It appears that we are not yet ready to define what coverage means for many of these design factors that are growing in importance. Apple is finding this out the hard way with the latest release of the iPhone 6s, which uses chips manufactured by different foundries. While the functionality of the two is the same, they reportedly have very different power profiles under some use cases.

Up until this point, design companies have been relying on their instincts. “We all know what happens when complexity meets instincts,” says Wingard.