Can machine learning bring debug back under control?
Debug consumes more time than any other aspect of the chip design and verification process, and it adds uncertainty and risk to semiconductor development because there are always lingering questions about whether enough bugs were caught in the allotted amount of time.
Recent figures suggest that the problem is getting worse, too, as complexity and demand for reliability continue to rise. The big question now is whether new tool developments and different approaches can stem the undesirable trajectory of debug cost.
Time spent in debug has been tracked by Mentor, a Siemens Business, and the Wilson Research Group over several years. The latest numbers, shown in Figure 1, reveal that debug consumes the largest share of an IC/ASIC verification engineer's time, at 44%. Synopsys' Global User Survey corroborates that result, noting that debug consistently has been one of the top two verification challenges that customers face.
Fig. 1: Where IC/ASIC verification engineers spend their time. Source: Wilson Research Group and Mentor, a Siemens Business
What this chart does not show is the upward trend. The 2010 study showed debug consuming 32% of a verification engineer’s time. By 2012 it had risen to 36%, 2014 saw it at 37%, and 2016 at 39%. That increase cannot be explained by any statistical anomaly and clearly shows that debug is becoming a bigger problem. So far, tools have not managed to keep up with the growing complexity of the problem, despite increased investment from the tool companies.
The reasons for the increase in debugging complexity are plentiful. Moses Satyasekaran, product marketing manager at Mentor, a Siemens Business, notes that several areas are having a negative impact on debug.
Older methods for debugging abound, for better or worse. “The majority of users lean on visual aids for debug,” notes Satyasekaran. “Hence, debug strategies haven’t changed very much. Customers still rely on waveforms and schematics.”
Dave Kelf, CMO for Breker Verification Systems, agrees. “It is surprising that more attention has not been paid to this area. Signal-level debug has remained the mainstay of debug systems for more than 25 years, leveraging waveform tools reminiscent of the logic analyzer.”
While tools have improved significantly, some of the underlying practices have remained outdated. “Debug in the past has been isolated into various domains corresponding to team responsibility,” says David Hsu, director of product marketing for Synopsys. “While this simplified the local triage problem—finding root cause and responsible owners for bugs was reasonably straightforward—it also had the consequence of pushing out debug of system-level failures very late in the product cycle, in many cases into post-silicon. Combined with the growth in design and verification complexity and size, this of course created extremely long and difficult late-stage debug and triage cycles.”
Systems are no longer just hardware. “With the surveys indicating the amount of time developers spend finding and fixing bugs, the cost of software debug across the industry is indeed very substantial and has our full attention,” says Guilherme Marshall, director of Arm’s Development Solutions. “In recent years, among other factors, the increased complexity of software stacks and more pervasive use of SOUP (Software of Unknown Provenance) have, undoubtedly, pushed the industry toward longer debugging cycles. To make it even more challenging for software developers, physical debug interfaces for run-control and execution trace have been increasingly designed out of production silicon so that semiconductor vendors are able to achieve device miniaturization and/or cost reduction requirements.”
Predictability is another problem. “The effort and time to fix bugs is very difficult to predict,” notes David Otten, development tools marketing manager for Microchip Technology. “As systems become more capable with increasing integrations and multiple processors, the multitude of interactions complicate the task. It may be necessary to consider their statistical nature. As the difficulty of a bug increases, the probability of its occurrence likely decreases but there will likely be many more simple bugs to fix. In most cases it’s probably a non-linear relationship between the effort needed to fix a bug and the number of occurrences for them. Picture a Poisson distribution with lambda of 1.”
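Otten’s analogy is easy to make concrete. The short sketch below is purely illustrative, not drawn from Microchip data; it simply tabulates the Poisson probability mass function with lambda = 1, showing that most of the probability mass sits at low “difficulty” values and that each step up in difficulty becomes markedly rarer.

```python
# Illustrative only: tabulate the Poisson PMF with lambda = 1 to show how
# probability mass concentrates at low "difficulty" values.
import math

def poisson_pmf(k: int, lam: float = 1.0) -> float:
    """P(X = k) for a Poisson(lam) random variable."""
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(6):
    print(f"difficulty {k}: P = {poisson_pmf(k):.3f}")
# Prints roughly 0.368, 0.368, 0.184, 0.061, 0.015, 0.003 -- simple bugs
# dominate, and each additional step up in difficulty is far less likely.
```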
Part of the problem is the difficulty of automating the task. “It is very difficult to build a generic automatic debug tool that can fix any software or HDL problem,” says Daniel Hansson, CEO for Verifyter. “But there are examples of automatic debug solutions for simple, well-defined classes of errors.”
“The way automatic debug will conquer the world is probably by identifying other classes of errors that can be automated,” adds Hansson.
The most important aid to debug may be to have a well-defined process. “Static analysis tools remain popular because they can detect issues beyond compiler syntax, such as suspicious code and coding errors,” says Microchip’s Otten. “Automated testing ensures legacy code continues to meet project standards. Many development systems can ensure new code passes these regression tests before it can be added to the project repository. It’s also becoming more common for developers to create a test suite prior to writing their application code.”
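The gating step Otten describes, in which new code cannot enter the repository until it passes the regression suite, is simple to sketch. The script below is a generic illustration rather than Microchip’s flow; the pytest command and the test directory are placeholders for whatever suite a project actually runs.

```python
# A minimal pre-commit gate: run the regression suite and block the check-in
# if anything fails. "pytest" and the tests/regression path are placeholders.
import subprocess
import sys

def regression_passes(test_dir="tests/regression"):
    """Return True only if every regression test passes."""
    result = subprocess.run(["pytest", "-q", test_dir])
    return result.returncode == 0

if __name__ == "__main__":
    if regression_passes():
        sys.exit(0)
    print("Regression failures detected -- commit rejected.")
    sys.exit(1)  # a non-zero exit from a git pre-commit hook aborts the commit
```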
Debug triage
Most companies have large regression suites and deploy continuous integration. “Regression test failures are one such class of errors where engineers spend a lot of their time,” says Hansson. “Continuous integration is a solution for compilation errors and small directed test suites, but for the larger, random test suites that the ASIC development world is based on, continuous integration is not enough.”
Debugging starts when a test fails. “Today’s triage flows are mainly manual,” says Synopsys’ Hsu. “Triage requires comprehensive visibility into the verification environment, and powerful analytical tools to correlate bugs found in huge quantities of verification data into useful root causes.”
Part of the process to get to a bug confirmation can be automated. “Triaging large regression runs can be automated and made predictable,” says Mentor’s Satyasekaran. “For example, automation could back out the fix and rerun the test to confirm the issue. The time that users spend is then on debugging problems rather than on triaging large regression runs. Another automation is to enable debug visibility automatically for the failed regression tests so the engineer is ready to go.”
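Satyasekaran’s first example, backing out a suspect change and rerunning the failing test, can be scripted around any version control system. The sketch below is a generic illustration built on git worktrees, not a Mentor flow; the test command, the repository layout, and the assumption that the test can run in a fresh checkout are all placeholders.

```python
# A rough sketch of automated triage: check out the state just before a
# suspect commit in a throwaway git worktree, rerun the failing test there,
# and confirm the commit as the culprit only if the test passes without it.
import os
import subprocess
import tempfile

def test_fails(workdir, test_cmd):
    """Run the test command in workdir; True means the test failed."""
    return subprocess.run(test_cmd, cwd=workdir).returncode != 0

def confirm_culprit(repo, suspect_sha, test_cmd):
    """True if the test fails at HEAD but passes with the suspect commit backed out."""
    fails_now = test_fails(repo, test_cmd)
    with tempfile.TemporaryDirectory() as tmp:
        baseline = os.path.join(tmp, "baseline")
        # Parent of the suspect commit, checked out in a side worktree.
        subprocess.run(["git", "worktree", "add", baseline, suspect_sha + "^"],
                       cwd=repo, check=True)
        try:
            fails_without = test_fails(baseline, test_cmd)
        finally:
            subprocess.run(["git", "worktree", "remove", "--force", baseline],
                           cwd=repo, check=True)
    return fails_now and not fails_without
```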
Using multiple verification technologies complicates this further. “Finding and fixing bugs earlier in design and verification cycles entails having advanced debug technologies that work across domains and flows,” adds Hsu. “This can be accomplished by natively integrating debug tools with simulation, static, formal, emulation and prototyping verification engines. However, this creates a new triage challenge: In order to really fulfill the promise of Shift Left, identifying the domain, root cause, and responsible owners for verification failures quickly enough is becoming a critical issue.”
In some cases, different technologies can help each other out. “One key factor in debug is the simulation test that triggered and detected a given design bug,” explains Sasa Stamenkovic, senior field application engineer at OneSpin Solutions. “The shorter and more focused the test, the faster and easier it is to diagnose and fix the error. When the test is generated by formal verification rather than by a simulation testbench, debug is easiest. Formal verification considers all possible stimuli to prove or violate the assertions against the design. If a violation is found, a formal tool displays the precise stimulus sequence, known as a counterexample, that triggered the bug. Many formal tools also can export the counterexample as a simulation test to enable debug in a familiar environment. Since formal verification determines exactly how the bug was triggered and which signals are relevant, the generated test is indeed shorter and more focused than tests from constrained-random testbenches.”
This can be extremely helpful when a bug is found deep within a run. “In the past it may have been possible to collect all debug data, store it away and then go back to it when needed,” says Frank Schirrmeister, senior group director for product management and marketing at Cadence. “That’s no longer feasible with fast engines like emulation where data can be created fast, but the ‘time to waveforms’ is impacted by the processing it takes to prepare raw information for debug. The engines generating the debug information need to offer the appropriate flexibility to give debug engineers options for data collection.”
That may be creating problems for some companies. “Large emulation systems continue to fall out of favor relative to their less expensive debugger counterparts,” claims Otten. “As processors continue to increase in complexity and speed, there is too much data to transfer to make them advantageous for their price premium. Smaller, less expensive debuggers offer most of the same functionality and align well with distributed development.”
This is most likely to happen when software is involved. “It is necessary to invest in integrated software and hardware technologies for software development,” says Arm’s Marshall. “IP components offer SoC designers a solution that offers secure debug and trace channels over existing physical links such as USB, CAN bus or WiFi. To complement the hardware solution, virtual prototypes provide both instruction accuracy and full program execution visibility, free of probe effects. Both development targets allow tools to be employed within rigorous DevOps workflows—such as continuous integration—to accelerate bug isolation, discovery and fixing.”
Debug issues can go beyond pre-silicon. “Internet-of-things (IoT) applications continue to grow which results in new classes of bugs that relate to communication, security and compliance,” says Otten. “New tools are available that can identify such issues automatically each time the project is built. Many of these applications incorporate a bootloader which permits application code to be updated after deployment to add features, fix bugs or respond to a threat. These updates are difficult to test as specific operating conditions are unknown. Also, many customers are reluctant to perform these upgrades due to fear of breaking the system or not knowing how to apply these updates.”
Automation isn’t always straightforward. “Any system that interacts with other hardware will be difficult to automate,” Otten adds. “Basic connectivity can be easily tested but it will be difficult to exhaustively know which types of hardware will be attached, which types of signals will be exchanged and the range of operating conditions it will be exposed to. On the software side, performance bottlenecks are difficult to detect algorithmically.”
New approaches
In order to reduce debug time, new, more powerful debug techniques are necessary. “There is a field of academic research called ‘program repair’ where the ambition is to build tools that automatically debug and fix software bugs,” explains Hansson. “So far, they have had moderate success. One approach is to make every regression test failure pass again by automatically modifying the code until the tests pass. This only works on regression failures, i.e., tests that used to pass. It does not fix the problems but uses the test passes as proof that its debug analysis was correct. The fix is still something the engineer needs to do.”
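The generate-and-validate style of program repair Hansson describes can be summarized in a few lines. The loop below is a deliberately naive sketch of the research idea, not any shipping tool; the mutation operator, the way candidates are installed into the workspace, and the test command are all stand-ins.

```python
# A naive "generate and validate" repair loop: keep proposing candidate edits
# until the previously passing regression tests pass again. The mutate and
# write_source callables are placeholders for real mutation operators.
import subprocess

def tests_pass(test_cmd):
    return subprocess.run(test_cmd).returncode == 0

def search_repair(source, mutate, write_source, test_cmd, budget=100):
    """Return a mutated source that makes the regression tests pass, or None."""
    for _ in range(budget):
        candidate = mutate(source)   # e.g. swap an operator, tweak a constant
        write_source(candidate)      # install the candidate into the workspace
        if tests_pass(test_cmd):
            return candidate         # candidate repair for an engineer to review
    write_source(source)             # restore the original if the budget runs out
    return None
```

As Hansson notes, such a candidate mainly serves as evidence that the analysis found the right place to look; the real fix is still up to the engineer.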
Machine learning (ML) is receiving lots of attention. “We see opportunities to automate triage with ML and artificial intelligence (AI),” says Hsu. “Using big data platforms and ML techniques to automate root cause analysis (RCA) is key to further major reductions in triage cycle turnaround time. Looking even farther out, actually predicting when/where bugs may occur is a goal, and an important stepping stone toward fully prescriptive verification.”
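At its simplest, ML-driven root cause analysis might look like clustering of failure signatures, so that an engineer debugs one representative per cluster instead of every failing test. The sketch below is a generic illustration using scikit-learn on invented log messages, not Synopsys’ platform.

```python
# Cluster regression failure messages so that failures likely to share a root
# cause land in the same bucket. The log strings are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

failure_logs = [
    "UVM_ERROR axi_monitor: RRESP mismatch at addr 0x40",
    "UVM_ERROR axi_monitor: RRESP mismatch at addr 0x80",
    "Assertion fifo_no_overflow failed at time 1200ns",
    "Assertion fifo_no_overflow failed at time 3400ns",
    "Timeout waiting for interrupt irq_done",
]

vectors = TfidfVectorizer().fit_transform(failure_logs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for label, log in sorted(zip(labels, failure_logs)):
    print(f"cluster {label}: {log}")
# Debugging one representative failure per cluster covers the likely root
# causes without wading through every individual test failure.
```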
Debug has some unique challenges. “Debug is a classic situation in which ‘you don’t know what you don’t know’,” says Schirrmeister. “While with 20/20 hindsight the issues leading to specific defects may be obvious, finding them in reality often remains tough, even nightmarish. Applying machine learning techniques is a definite must. It likely will help to trigger smart inputs about the dependencies and issues that debug engineers may not have thought about in the process.”
How close to reality is this? “Machine learning has an emerging place in the debug process, but its current impact is very small,” says Otten. “Its future in debugging code is bright, but because the types of bugs are almost infinite, there is no sufficient data set for machine learning to reference. The first emergence of machine learning in this capacity would likely be anomaly detection. Any meaningful contributions of artificial neural networks or support vector machines are probably decades away.”
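The anomaly detection Otten anticipates does not need deep models to get started. The sketch below flags regression runs whose coarse metrics drift away from the historical norm; the metrics and numbers are invented for illustration, and the isolation forest is just one convenient off-the-shelf choice.

```python
# Flag regression runs whose coarse metrics (runtime, error count, peak
# memory) look unlike the historical norm. All figures here are made up.
from sklearn.ensemble import IsolationForest

# Each row: [runtime_seconds, error_count, peak_memory_mb] for one nightly run.
runs = [
    [410, 0, 1200], [395, 0, 1180], [402, 1, 1210], [420, 0, 1190],
    [405, 0, 1205], [398, 0, 1185], [2600, 14, 5100],  # last run looks suspicious
]

model = IsolationForest(contamination=0.15, random_state=0).fit(runs)
for run, flag in zip(runs, model.predict(runs)):  # -1 marks an outlier
    print("ANOMALY" if flag == -1 else "ok", run)
```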
Getting the necessary data could change that. “New tools are emerging that take a more holistic view to debug,” says Breker’s Kelf. “By extracting data from across the entire design it can present a holistic view of the debug scenario. The application of ML to these holistic datasets allows for the classification of multiple failures providing rapid direction to the root cause. To enable this, debug needs to become an integral part of the verification process, rather than an afterthought. Using the right Portable Stimulus tools, it is possible to include debug detail directly into generated test sets, zeroing in on common failure modes that can then be run with advanced ML-based debuggers. This new paradigm to debug allows system-level corner cases from emulation runs to be properly analyzed and rapidly resolved.”
The abstraction level also needs to change. “Debug needs to transition from a signal-by-signal to a comprehensive, structured view, where the entire design and test scenario are inspected from an abstract global view,” says Hagai Arbel, CEO for VTool. “ML has its place in these new environments, automatically classifying data that allows engineers to make fast and accurate decisions, shortening or even eliminating the debug cycle. The key to enabling next-generation debug is the combination of these technologies in an effective and cohesive manner.”
An important question is whether data from a single company is enough. “The challenge will be users’ willingness to share data to train the machine,” says Satyasekaran. “Most companies want to keep the data in-house, so I could see initial deployment of ML being local vs. global.”
Unfortunately, development and deployment of these ML-based debug tools is not mainstream today, which means we can expect to see further pressure on existing tools and a continuation of the trend of increasing debug times. At the moment debug appears to be the victor, but nobody is willing to declare defeat.