
Brute-Force Analysis Not Keeping Up With IC Complexity

How to make sure the most important issues in a design have been dealt with, now that finding those spots matters far more than it used to.


Much of the current design and verification flow was built on brute force analysis, a simple and direct approach. But that approach rarely scales, and as designs become larger and the number of interdependencies increases, ensuring the design always operates within spec is becoming a monumental task.

Unless design teams want to keep adding increasing amounts of margin, they have to locate the areas of the design that are the most sensitive to some form of variability in order to deal with them appropriately. Those can be functional, safety, voltage, temperature, or manufacturing variability.

“With rising manufacturing costs at advanced nodes, designers are introducing ever-increasing design margins to manage sensitivity risk and ensure yield,” says Rahul Deokar, product marketing director for the Digital Design Group at Synopsys. “These margins are limiting semiconductor innovation, and are a significant challenge when designing the fastest, most efficient SoCs, thus leaving power, performance, and area on the table.”

Many of the statistical techniques used in the past no longer work, as some of those interdependencies become multi-physics in nature. These complex interactions can produce non-linear or even discontinuous responses. As a result, finding the necessary minima or maxima becomes computationally impractical.

Functional interdependencies also create problems that cannot be dealt with through margining. Verification tools and flows often do not help to locate the most important areas to verify first, because they treat every line of code and every coverage goal as equally important.

“While we have much more computing power than in the past, we do not have the luxury of using every possible method and every possible piece of data that we want,” says Uri Feigin, product lead for Vtool. “If you do, you will find out that computing power is still not enough. We have to reduce the amount of garbage data.”

The industry has relied on abstraction and statistical sampling in the past, but these are becoming more difficult and, in some cases, increasingly inaccurate. Some of the industry is looking toward artificial intelligence (AI) to help solve the problem, while others believe that more basic intelligence is needed in how the problems are approached.

The right abstraction
The industry often has relied on abstraction to simplify a problem. By getting rid of unnecessary details, a lot more analysis can be done that is focused on the primary problem. This still works well in many cases, but it is important to ensure that the abstraction is both valid and helpful.

“Virtual prototyping tools and models enable SoC architects to perform sensitivity analysis for early architecture analysis,” says Tim Kogel, principal applications engineer for Synopsys. “A typical example is the optimization of SoC interconnect and memory configuration, such that the specific bandwidth and latency requirements for all IP subsystems are satisfied. This requires sensitivity analysis of potentially hundreds of design configuration parameters like clock frequencies, data width, cache and buffer sizes, or outstanding transactions, against performance related metrics like bandwidth, latency, utilization, and contention per component. Power, energy, and area are also included into the sensitivity analysis to trade off performance improvements against cost related metrics.”

A good abstraction does not always imply high-level. “At the transistor level, you can run simulations to understand how sensitive each transistor is to process variation,” says Nebabie Kebebew, senior product manager for AMS verification at Mentor, a Siemens Business. “You may need to handle the sensitivity of your analog circuit to process variation, because that’s impacting your power consumption. You can run device size sweeping and see how your design is performing. This is using worst case PVT corners that you found earlier by running simulation across the entire circuit. That feedback, coupled with designer’s insight, enables them to make tradeoffs to meet a low power consumption goal.”

Parameter sweeps are used in other areas, as well. “It is a straightforward approach,” says Synopsys’ Kogel. “You can vary design and configuration parameters, which are known or suspected to move the needle on the target key performance indicators (KPIs). The results can be post-processed by using pivot charts for plotting the impact of design parameters on KPIs, as an example. Figure 1 shows power, energy, and latency of an inference accelerator in the context of a DDR memory controller over various design parameters like number of parallel cores, DDR speeds, and further controller configuration parameters.”

Fig. 1: Runtime and energy comparisons with design parameter sweeps. Source: Synopsys
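
The sweep-and-pivot workflow Kogel describes can be sketched in a few lines. This is only an illustration: the parameter names, value ranges, and the `toy_model` function are made up here and stand in for a real virtual-prototype simulation.

```python
from itertools import product
from statistics import mean

# Hypothetical design parameters and candidate values (not from the article).
PARAMS = {
    "cores": [1, 2, 4, 8],
    "ddr_mtps": [1600, 3200, 4800],
    "buffer_kb": [64, 128],
}

def toy_model(cores, ddr_mtps, buffer_kb):
    """Made-up analytical stand-in for one simulation run."""
    compute_ms = 80.0 / cores
    memory_ms = 64000.0 / ddr_mtps * (1.0 + 16.0 / buffer_kb)
    latency_ms = max(compute_ms, memory_ms)  # the slower side dominates
    power_w = 0.5 * cores + ddr_mtps / 3200.0
    return {"latency_ms": latency_ms, "energy_mj": power_w * latency_ms}

def sweep(params, model):
    """Evaluate the model at every combination of parameter values."""
    names = list(params)
    rows = []
    for values in product(*(params[n] for n in names)):
        point = dict(zip(names, values))
        rows.append({**point, **model(**point)})
    return rows

def pivot(rows, param, kpi):
    """Average a KPI over all other parameters, one entry per value of param."""
    values = sorted({r[param] for r in rows})
    return {v: mean(r[kpi] for r in rows if r[param] == v) for v in values}

def sensitivity(rows, params, kpi):
    """Spread of the pivoted KPI per parameter: a crude 'moves the needle' score."""
    scores = {}
    for p in params:
        piv = pivot(rows, p, kpi)
        scores[p] = max(piv.values()) - min(piv.values())
    return scores

rows = sweep(PARAMS, toy_model)
ranking = sensitivity(rows, PARAMS, "latency_ms")
for name, score in sorted(ranking.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: latency spread {score:.1f} ms")
```

A full-factorial sweep like this grows multiplicatively with every parameter added, which is why it only stays practical when the list is limited to parameters known or suspected to matter.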


But not all historical abstractions remain useful, and the problem may have to be reformulated. Consider design for test (DFT), which used to be based on stuck-at and transition fault models. These were simple models, and they used to correlate fairly well with manufacturing defects.

Changing the model can be tough. “DFT teams can spend several years establishing ATPG targets, such as coverage goals, pattern size, or some other metric, for stuck-at and transition fault models,” says Ron Press, director of technology enablement for Mentor. “Results have been published with defect-oriented tests, which no longer model defects based on a fault model abstraction. Instead, it uses a physical modeling of the design to determine where defects can occur. It has been shown to be much more accurate to model and test for real defects. The trouble is figuring out which patterns from the traditional and new patterns are most effective to use.”

Selecting patterns is a problem for several application areas. Within DFT, Mentor’s Press describes the process as “first determining the likelihood of defects occurring based on their critical area, a new approach called Total Critical Area, which can sort the various pattern sets considering the defects they detect and then choose the most effective patterns to apply.”
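
One way to picture this kind of selection is a greedy pass that repeatedly picks the pattern adding the most not-yet-covered defect likelihood. The sketch below is a generic weighted-coverage heuristic, not Mentor's Total Critical Area algorithm. The pattern names, defect names, and likelihood values are invented for illustration.

```python
def select_patterns(detects, weight, budget):
    """Greedy weighted coverage.

    detects: pattern name -> set of defects that pattern detects
    weight:  defect -> relative likelihood (e.g. derived from critical area)
    budget:  maximum number of patterns to keep
    """
    covered = set()
    chosen = []
    for _ in range(budget):
        best, best_gain = None, 0.0
        for pattern, defects in detects.items():
            if pattern in chosen:
                continue
            # Only defects not already covered count toward the gain.
            gain = sum(weight[d] for d in defects - covered)
            if gain > best_gain:
                best, best_gain = pattern, gain
        if best is None:  # nothing left adds coverage
            break
        chosen.append(best)
        covered |= detects[best]
    return chosen, covered

# Hypothetical data: five defects with relative likelihoods, four patterns.
weight = {"d1": 0.5, "d2": 0.3, "d3": 0.1, "d4": 0.05, "d5": 0.05}
detects = {
    "A": {"d3", "d4", "d5"},
    "B": {"d1"},
    "C": {"d2", "d3"},
    "D": {"d1", "d2"},
}

chosen, covered = select_patterns(detects, weight, budget=2)
print(chosen, sorted(covered))
```

With this data, pattern D is picked first because it covers the two most likely defects, even though pattern A detects more defects by count.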

Similar problems happen with power analysis. “I may have thousands of vectors,” says Preeti Gupta, head of PowerArtist product management at Ansys. “How do I identify those which have the most active signals that are common across all these vectors? If I have timing critical paths, how do I get more data about the timing power sensitivity along those paths in order to make design decisions? You can’t possibly have a transient analysis tool work on millions of cycles because you would never get the chips taped out. You have to minimize the number of patterns down to a handful. Then you can run the analysis and look at dynamic voltage drop on the chip. We don’t have the luxury to run complete analyses across all the possible scenarios. Instead, we need to be smart and deploy careful but safe pruning.”
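
As a rough illustration of pruning a long activity trace down to a handful of candidate windows, the sketch below ranks fixed-length windows by total toggle count and keeps the top few that do not overlap. It is a simplified stand-in, not how any commercial power tool works. The per-cycle toggle counts are invented, and real flows would weight activity by capacitance, location, and timing criticality.

```python
def top_activity_windows(toggles, window, k):
    """Return start cycles of up to k non-overlapping windows with the highest toggle sums."""
    n = len(toggles)
    if window <= 0 or window > n:
        return []
    # Sliding-window sums in O(n).
    sums = [sum(toggles[:window])]
    for i in range(window, n):
        sums.append(sums[-1] + toggles[i] - toggles[i - window])
    # Visit windows from most to least active, skipping overlaps.
    order = sorted(range(len(sums)), key=lambda i: sums[i], reverse=True)
    picked = []
    for start in order:
        if all(abs(start - other) >= window for other in picked):
            picked.append(start)
            if len(picked) == k:
                break
    return sorted(picked)

# Hypothetical per-cycle toggle counts from a long simulation.
toggles = [1, 1, 9, 9, 1, 1, 1, 8, 7, 1]
windows = top_activity_windows(toggles, window=2, k=2)
print(windows)
```

Only the selected windows would then be handed to the expensive transient analysis.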

Focus
Having tools and metrics that allow you to focus your attention helps. “It often takes years of production fail data for a company to decide on appropriate manufacturing test goals,” says Press. “Consider targeting test coverage for all potential bridge faults, which could be a huge list. You might achieve 99% detection of all bridge faults but miss hundreds of the most likely bridges. To reduce DPM, it is more effective to choose the subset of bridges that is most likely to occur.”
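
A small calculation shows how a raw coverage number can hide this. The population below is hypothetical: 10,000 bridges, 100 of which are assumed to be 50 times more likely to occur than the rest, and a test set that detects everything except those 100.

```python
def coverage(detected, weights):
    """Fraction of total weight that falls on detected faults."""
    total = sum(weights.values())
    return sum(w for fault, w in weights.items() if fault in detected) / total

# Hypothetical population of bridge faults (numbers invented for illustration).
faults = range(10_000)
likely = set(range(100))                     # the most likely bridges
uniform = {f: 1.0 for f in faults}           # every bridge counts the same
by_likelihood = {f: (50.0 if f in likely else 1.0) for f in faults}

detected = set(faults) - likely              # misses exactly the likely ones

plain = coverage(detected, uniform)
weighted = coverage(detected, by_likelihood)
print(f"unweighted coverage: {plain:.1%}")
print(f"likelihood-weighted coverage: {weighted:.1%}")
```

Under these assumptions the unweighted figure is 99%, while the likelihood-weighted figure is only about 66%, because the missed bridges carry a third of the total defect likelihood.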

Functional verification has a similar challenge. “The notion of functional coverage in UVM, and SystemVerilog in general, is oriented to whatever is easy to cover,” says Darko Tomusilovic, verification manager at Vtool. “What often happens is that a lot of cover points were defined simply because they were easy to add, even if they do not provide much meaningful value. There have been many times when real corner cases were not covered, simply because it was too difficult to express them in existing coverage languages. We have a long way to go to achieve a similar level of quality in the area of coverage. Plenty of times it is simply seen as an overhead because by the time we get to coverage we already know that this design more or less works.”

Statistical analysis
One technique that has been employed is statistical analysis. This allows you to take samples and extrapolate the likelihood that you have found an acceptable number of the items you were targeting. “Statistical sensitivity analysis is a concept that has existed for decades but was never feasible for deployment beyond a small design block, even less so in a large-scale SoC production flow,” says Synopsys’ Deokar. “Monte Carlo simulation for true-statistical analysis would need to repeat the same analysis thousands, or even millions, of times for high-sigma accuracy, which is much too time-consuming to keep up with the fast pace during SoC design implementation and signoff.”
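
The scale of the problem follows from simple arithmetic. The sketch below assumes the metric of interest is normally distributed, considers a one-sided tail, and uses an illustrative target of observing about 10 failures so the estimate is not based on a single event. That target is an assumption made here, not a figure from the article.

```python
import math

def tail_probability(sigma):
    """One-sided probability that a normal variable exceeds `sigma` standard deviations."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))

def samples_needed(sigma, expected_failures=10):
    """Plain Monte Carlo runs needed to expect `expected_failures` tail events."""
    return math.ceil(expected_failures / tail_probability(sigma))

for sigma in (3.0, 4.5, 6.0):
    p = tail_probability(sigma)
    print(f"{sigma} sigma: tail probability {p:.3e}, about {samples_needed(sigma):,} runs")
```

Under these assumptions, 3 sigma needs on the order of 7,400 runs, 4.5 sigma needs millions, and 6 sigma needs roughly 10 billion, each of which is a full circuit simulation.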

For many sectors, high-sigma is becoming a necessity. “Any chip that goes into automotive, or any chip that gets into mission-critical applications or anything that requires a lot more precision requires high sigma,” says Sathish Balasubramian, senior product manager for AMS Verification at Mentor. “There is no way, in a given time, that they can verify analog circuitry using Monte Carlo sampling. Theoretically, you could do it if you have unlimited resources and enough time. But even with this, people are trying to take the worst case scenario and asking if I’m building enough margin in my design that it should work. This may require more guard-band, which affects both performance and area.”

A new approach
For many of these problems, a new approach is required. “Sensitivity analysis is highly desirable for thermal and also thermo-mechanical optimization on the packaging side,” says Andy Heinig, group leader for advanced system integration and department head for efficient electronics in Fraunhofer IIS’ Engineering of Adaptive Systems Division. “Some of the thermal and thermo-mechanical effects are non-linear in nature. That means it is very important that critical sections are detected. Sensitivity analysis could be a step into this direction, but standard sensitivity methods are not enough, and they must be coupled with more advanced artificial intelligence (AI) methods because the root of the problem varies with different packages.”

Mixing old and new technologies could be a solution. “Armed with machine learning technology, we can complete statistical sensitivity analysis in minutes, on a single execution host, that previously could take days, and with the same accuracy,” says Deokar. “Machine learning coupled with Monte Carlo simulation (which because of its repetitive nature is an excellent application for machine prediction, and can speed things up by 100X to 10,000X) enables full-chip SoC or high-sigma analysis that was never feasible before. We can overcome the previous turnaround time challenges with machine learning technologies to enable analysis and optimization for every design of any size with HSPICE accuracy within minutes, versus days or weeks required by full statistical simulations. This helps designers to deliver silicon designs that are resilient to variations and sensitivity vulnerabilities, and are faster, lower power, more robust, and more cost-effective.”

Similar techniques can help in functional verification. “In the typical ML application, first you have the training process, where you basically teach ML to find the appropriate algorithm,” says Vtool’s Tomusilovic. “Based upon the input and output you find the appropriate algorithm. You usually run test cases in multiple loops. Each of these loops has an outcome, let’s say, either pass or fail. What we try to achieve is to find what is common for all the cases in which there is a passing scenario, and then to use ML to help us find the disturbance, in order to understand what is unique, and what is specific to all the failing scenarios.”
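
The idea of separating what is common to passing runs from what is specific to failing ones can be illustrated without any ML library. The sketch below is a much simpler frequency-count stand-in for the learning step: for each setting value it compares the failure rate of runs using that value against the overall failure rate. The settings, values, and the failure rule are all invented for this example.

```python
from collections import Counter
from itertools import product

def rank_fail_signatures(runs):
    """runs: list of (settings dict, passed bool).

    Returns (lift, (setting, value)) pairs sorted by how much more often
    runs with that setting value fail compared with the overall failure rate.
    """
    seen, failed = Counter(), Counter()
    for settings, passed in runs:
        for item in settings.items():
            seen[item] += 1
            if not passed:
                failed[item] += 1
    base_rate = sum(1 for _, passed in runs if not passed) / len(runs)
    scored = [(failed[item] / count - base_rate, item) for item, count in seen.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Hypothetical regression: a made-up bug shows up only with burst=16 in low-power mode.
runs = []
for burst, mode, low_power in product([4, 8, 16], ["A", "B"], [False, True]):
    settings = {"burst": burst, "mode": mode, "low_power": low_power}
    passed = not (burst == 16 and low_power)
    runs.append((settings, passed))

for lift, (name, value) in rank_fail_signatures(runs)[:3]:
    print(f"{name}={value}: failure rate is {lift:+.2f} versus the average")
```

Single-setting counts like this miss interactions between settings, which is one place where a trained model, or the engineer's own knowledge of the design, has to take over.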

But it doesn’t appear to work everywhere. “Ever-growing complexity drives the need to automate system-level design space exploration,” says Kogel. “The classic approach is to use evolutionary algorithms, and more recently AI-based methods like reinforcement learning. While this works well for new AI-enabled back-end implementation tools, we have yet to see a breakthrough success in the context of system-level architecture optimization.”

It is important to apply AI to the right problems. “Human intelligence remains mandatory,” says Tomusilovic. “It can’t be replaced. Someone has to provide the appropriate data and must ask the machine tool meaningful questions. You cannot simply ask it for the solution to the problem. You must be very specific and know what kind of solution you want, and that means you have to understand the problem that you have.”


