Overcoming Regression Debug Challenges With Machine Learning

Automatically discover the root causes of simulation regression failures.


Development of a modern semiconductor device requires running many electronic design automation (EDA) tools, each many times, over the course of the project. Every stage, from architectural exploration and design to final implementation and manufacturing preparation, involves multiple methodology loops that must be repeated again and again.

Even in such a complex development flow, functional simulation stands out. It takes billions of simulation cycles to verify that a chip design is doing everything it’s supposed to do without unintended behavior. This is not a one-time effort. Every time that any part of the design changes, the entire simulation test suite—or at least a very good portion of it—must be rerun. The suite expands throughout the verification and development effort as tests are added to verify new features or increase focus on areas of the design where bugs are being found.

Simulation regressions require running a large number of tests on a regular basis, usually nightly for a sample set and weekly for the full suite. Running these tests consumes substantial compute resources, and an even bigger challenge arises whenever tests fail. Engineers make mistakes when adding new features to the design and enhancing the test suite, so the resulting errors must be debugged and resolved.

Furthermore, some previously passing tests fail in updated regression runs. New features often break existing features, and any code edits can have ripple effects. Sometimes every test fails, especially after major changes are made to the verification environment. Debugging these failures is primarily a manual effort, requiring multiple steps:

  • Check in the latest changes to the design and testbench code
  • Run the regression simulations
  • Analyze the log files containing thousands of test failures
  • Categorize the failures and sort them into “bins” based on the type of error (a simple sketch of this binning step follows the list)
  • Triage each bin to determine where the problem most likely occurred
  • Perform root cause analysis (RCA) to try to pinpoint the actual bug
  • Change the design or verification code to try to fix the bug
  • Start the loop all over again!
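
To make the binning step concrete, here is a minimal, hypothetical sketch of grouping failing tests by error type from their simulation log files. The log directory, file naming, and error patterns are illustrative assumptions rather than any specific tool's behavior; a real regression would typically key off UVM message severities, assertion names, or tool-specific error codes.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical error signatures used to sort failing tests into bins.
ERROR_PATTERNS = [
    ("assertion",  re.compile(r"assertion .* failed", re.IGNORECASE)),
    ("scoreboard", re.compile(r"scoreboard mismatch", re.IGNORECASE)),
    ("timeout",    re.compile(r"timeout|watchdog", re.IGNORECASE)),
    ("x_value",    re.compile(r"unknown \(X\) value|x-propagation", re.IGNORECASE)),
]

def bin_failures(log_dir: str) -> dict[str, list[str]]:
    """Scan each simulation log and drop it into the first matching bin."""
    bins = defaultdict(list)
    for log in Path(log_dir).glob("*.log"):
        text = log.read_text(errors="ignore")
        for name, pattern in ERROR_PATTERNS:
            if pattern.search(text):
                bins[name].append(log.name)
                break
        else:
            bins["unclassified"].append(log.name)
    return bins

if __name__ == "__main__":
    for bin_name, tests in bin_failures("regression_logs").items():
        print(f"{bin_name}: {len(tests)} failing tests")
```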

This process relies heavily on the expertise of the development engineers. Years of experience help them develop a sense of how best to bin the failures, triage the bins, and assign the failures to the correct design and verification engineers for root cause analysis and fixes. However, it’s difficult to find enough experts, so this manual approach consumes significant project time and resources. Chip development teams have long been clamoring for a better way to manage and debug regression loops.

Recently, artificial intelligence (AI) using machine learning (ML) technology has become available to automatically analyze, bin, triage, probe, and discover the root causes of regression failures. By leveraging the enormous amount of information gleaned from thousands of regression runs on the project, AI acts as a companion to traditional engineering expertise. By automating and accelerating three steps in each loop, ML techniques can provide faster and more accurate debug than manual methods. By helping engineers find, understand, and fix bugs much more quickly, ML can reduce overall debug effort by up to 30X.

The Regression Debug Automation (RDA) capabilities in Synopsys Verdi Automated Debug System use such ML techniques to automatically discover the root causes of simulation regression failures. RDA classifies and analyzes raw regression failures and identifies the root causes of failures in the design and testbench. Automating the regression log analysis, binning, triage, and RCA greatly reduces manual effort.

RDA starts by collecting data from the regression run, including simulation log files, value change dump (trace) files, and compiled simulation databases with the design and testbench. It uses ML to mine relationships among the verification log failures and bin the results. This process has been shown to be 90% accurate in determining related results, reducing the overall triage time. After binning, RDA performs failure analysis and triage. It takes the bins of failures and determines whether the issues are from the design or the testbench based on the characteristics of the failures.
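
As a rough illustration of the general idea (not the specific algorithm inside Verdi RDA), failure messages can be vectorized and clustered by textual similarity so that related failures land in the same bin. The example messages, distance threshold, and scikit-learn usage below are assumptions made for the sake of the sketch.

```python
# Cluster failure messages by textual similarity so related failures
# end up in the same bin. Generic sketch only.
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

failure_messages = [
    "UVM_ERROR scoreboard: data mismatch on port 3 at time 1200ns",
    "UVM_ERROR scoreboard: data mismatch on port 7 at time 4810ns",
    "UVM_FATAL timeout: no response from DUT after 10000ns",
    "Assertion fifo_no_overflow failed at time 880ns",
]

# Convert each message to a TF-IDF vector, then merge similar messages
# into clusters without fixing the number of bins in advance.
vectors = TfidfVectorizer().fit_transform(failure_messages).toarray()
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0
).fit_predict(vectors)

for label, message in sorted(zip(labels, failure_messages)):
    print(label, message)
```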

RDA uses multiple technologies to find the root causes of failures. For design failures, it compares the values of signals from passing and failing tests to isolate the points near the test errors where behavior diverges. Visualization shows the RCA path along with the signal value changes in the design. To root-cause testbench failures, the RDA debug facilitator automatically collects debug data for each failure bin. It shows protocol transactions with associated details and uses a reverse debug capability to trace the source of the issues back in time.
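
The pass/fail comparison can be pictured with a simplified sketch: given per-signal value traces from a passing run and a failing run, report where the two first diverge in a window just before the failure. The trace format, window size, and signal names below are hypothetical; a real flow would read waveform databases such as VCD or FSDB.

```python
# Report the signals and times where a failing run first diverges from a
# passing run, looking only in a window just before the reported failure.
# Traces are a hypothetical {signal: [(time, value), ...]} dictionary.
def first_divergences(passing: dict, failing: dict, fail_time: int, window: int = 100):
    suspects = []
    for signal in passing.keys() & failing.keys():
        pass_vals = dict(passing[signal])
        for time, value in failing[signal]:
            in_window = fail_time - window <= time <= fail_time
            if in_window and pass_vals.get(time) != value:
                suspects.append((time, signal, pass_vals.get(time), value))
                break  # only the first divergence per signal
    return sorted(suspects)

passing_run = {"ready": [(90, 1), (120, 1)], "data": [(90, 0xA5), (120, 0xA5)]}
failing_run = {"ready": [(90, 1), (120, 0)], "data": [(90, 0xA5), (120, 0xFF)]}

for time, signal, good, bad in first_divergences(passing_run, failing_run, fail_time=130):
    print(f"{signal} diverges at t={time}: passing={good}, failing={bad}")
```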

Synopsys Verdi RDA includes additional capabilities to save the engineers even more time and effort:

  • Failing tests are automatically rerun in simulation with reverse debug and other debug features enabled
  • Testbench RCA includes awareness of the widely used Universal Verification Methodology (UVM)
  • RCA is performed on test failures related to unknown (X) values to reduce the number of groups
  • Test failures due to simulation X-pessimism are filtered out

All these automated techniques, backed by the power of ML, accelerate the three most challenging steps in regression loops. More accurate debug means that fixes are much more likely to be correct the first time, considerably reducing the number of loops throughout the project. Verdi RDA saves significant time and effort on every failing test that does need debug while reducing the number of failing tests that must be debugged individually. This maximizes regression utilization, focuses manual efforts on high-value debug rather than automatable tasks, and cuts the overall regression debug effort on a chip project in half.

For further information, a white paper is available.


