Using AI And Bugs To Find Other Bugs

New methodologies are being developed to deal with increasing complexity.


Debug is starting to be rethought and retooled as chips become more complex and more tightly integrated into packages or other systems, particularly in safety- and mission-critical applications where life expectancy is significantly longer.

Today, the predominant bug-finding approaches use the ubiquitous constrained random/coverage driven verification technology, or formal verification technology. But as designs become more complex, new methodologies and approaches need to be applied to ensure quality over time. This can include everything from AI/ML to simply updating methodologies to include more automation and less bug tracking in a notebook or spreadsheet.

“For constrained random, the usual approach is to create a coverage model from the verification plan and objectives, which serves as a map of where to target the stimulus to test the important design features thoroughly,” said Mike Stellfox, a Cadence fellow. “In a constrained random verification scenario, the tests are generating all kinds of random stimulus that exercise different paths through the design that you wouldn’t necessarily think of as a human, and bugs will be found. Because you have coverage goals, you know whether you exercised what you thought were the important things. What typically happens is a bug will be found, such as a certain behavior, either triggered by a checker identifying that this didn’t behave correctly, or another second-order effect. This could include not having a direct checker, a state machine hung, or another erroneous behavior occurred. Once the bug is found, a root cause analysis will be done with the designer to determine the cause, and that’s where you go through the debug process.”
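The flow Stellfox describes — random stimulus generated under constraints, measured against a coverage model — can be sketched in a few lines. This is a toy illustration, not a real UVM flow; the transaction fields and coverage bins are hypothetical:

```python
import random

# Hypothetical coverage model: bins for a tiny bus-transaction feature set.
COVERAGE_BINS = {
    ("read", "aligned"), ("read", "unaligned"),
    ("write", "aligned"), ("write", "unaligned"),
}

def random_transaction(rng):
    """Constrained-random stimulus: only legal ops and addresses are drawn."""
    op = rng.choice(["read", "write"])   # constraint: legal operations only
    addr = rng.randrange(0, 256)         # constraint: valid address range
    align = "aligned" if addr % 4 == 0 else "unaligned"
    return op, addr, align

def run_until_coverage_closed(seed=0, max_txns=10_000):
    """Generate random stimulus until every coverage bin has been hit."""
    rng = random.Random(seed)
    hit = set()
    for n in range(1, max_txns + 1):
        op, addr, align = random_transaction(rng)
        hit.add((op, align))             # record which bin this txn exercised
        if hit == COVERAGE_BINS:
            return n, hit                # coverage goals met
    return max_txns, hit

txns_needed, bins_hit = run_until_coverage_closed()
```

The coverage set plays the role of the verification plan's map: it tells you whether the random walk actually exercised the features you cared about, independent of whether any individual test writer thought to target them.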

Usually, multiple people are involved in finding a bug. It can involve the designer, who accidentally inserted it, and the verification engineer who actually discovered it. The designer fixes it, and as part of that process they may verify the problem was fixed the way it was supposed to be. That’s where there is an opportunity to do things more systematically. Today, this mostly depends on the experience and intuition of the verification engineer.

“They will then look for ‘cousin bugs,’ since the bug may have been fixed under a set of certain stimulus conditions,” Stellfox said. “The nice thing about constrained random is if you keep things a little bit open, besides the original test, you can run lots of other stimulus in that area. You try to soak the area to see if you’ll find other cousin bugs. This is an area where there are opportunities to leverage a data-driven approach.”

The same thing happens in formal verification. “What’s become really popular in formal verification is bug hunting, where the goal isn’t to close on a proof,” he noted. “This is similar to what happens in a constrained random approach, but instead of using a constrained random stimulus engine to exercise a design, you’re using a formal engine to exercise it, and these use a similar concept of coverage points to look for the areas of concern. The nice thing with formal technology is the ability to explore a lot more space relatively quickly, under all cases. But when you’re in bug hunting mode, you’re typically trying to stimulate the design in some specific area to find bugs while you’re achieving your coverage goals.”

A number of industry players are investigating ways to automate the debug process. As systems become more complex, and as tolerances tighten, what used to be considered third-order problems are now second- or first-order problems. That means bugs need to be hunted down in more places. Good starting points include RTL and testbenches, where variables are typically introduced or where most of the changes occur.

“There’s a high correlation between the source code revision control system that’s tracking where revisions are, and where the verification engineer should be focusing bug hunting, because those tend to be the highest areas where there’s new bugs that are introduced,” Stellfox said. “Once a bug is introduced, and the sooner you find it, the easier it is to debug and perform root cause analysis on. If a designer is working on a block and they introduce the bug, because they’re the one who wrote it, if they can find a bug there, that’s the fastest way to get the bug root caused and removed. Because one person is involved, it’s a small scope, so you can run either simulations or formal very quickly. There’s been a lot of focus on that area to get designers finding lower-level bugs earlier.”
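The correlation Stellfox points to — recent revision-control churn predicts where new bugs live — is easy to exploit mechanically. Below is a minimal sketch, assuming commit data has already been parsed (for example, from `git log --numstat`) into per-commit dictionaries; the file names are hypothetical:

```python
from collections import Counter

def rank_bug_hunting_targets(commits, top_n=3):
    """Rank source files by recent churn (lines touched across commits).

    `commits` is a list of {filename: lines_changed} dicts. Files with
    the most recent change activity are the highest-value targets for
    focused bug hunting.
    """
    churn = Counter()
    for commit in commits:
        for path, lines in commit.items():
            churn[path] += lines
    return [path for path, _ in churn.most_common(top_n)]

# Example: three recent commits (hypothetical numstat output).
history = [
    {"rtl/alu.sv": 120, "tb/alu_tb.sv": 40},
    {"rtl/alu.sv": 85},
    {"rtl/fifo.sv": 10, "tb/alu_tb.sv": 15},
]
targets = rank_bug_hunting_targets(history)
```

A real flow would weight churn by recency and author, but even this crude ranking encodes the point: point the constrained-random or formal effort at the files that just changed.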

As the scope increases, such as with an IP, there is a verification engineer creating a UVM constrained random environment or a formal environment around a bigger scope. That helps, on one level, because it adds a new perspective. But because it’s a bigger scope, it can take longer to get to the bug and longer to fix it, he said.

All of that happens on the hardware side. On the software side, things are different. Software engineers are used to dealing with continuous upgrades of the technology, and that’s very evident with applications involving machine learning.

Still, for many verification engineers, their personal notebook is the best way they’ve found to keep track of bug-finding activities. That is quickly proving to be inadequate.

“Somehow my notebook has to be automated, and instead of me going through my notebook and reading about all the bugs I found before, the notebook should come to me and remind me when I’ve already done something,” said Darko Tomusilovic, verification lead at Vtool. “Ideally, there should be a tool that once I log a debugging cycle, the next time I do the same debugging the tool should be smart enough to suggest whether it is the same problem that might have happened before. The tool should be able to recognize, or at the very least to suggest, that I take a look at a particular signal if it helped before. Or, ‘Look at this message. It helped before. Maybe it will help again.’ Then the user should be able to tell the tool if it is right or wrong, and the tool should learn from it. It would become increasingly smarter by using the user’s feedback.”
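The "automated notebook" Tomusilovic describes could be prototyped as a similarity search over logged debug sessions, with user feedback adjusting how strongly each past session is trusted. This is a hypothetical sketch — the symptom keywords, fix notes, and the simple Jaccard-plus-weight scoring are all illustrative, not a real tool:

```python
class DebugNotebook:
    """Sketch of an automated debug notebook: log each session's symptoms,
    match new failures against past ones, and learn from user feedback."""

    def __init__(self):
        self.sessions = []  # each entry: [symptom_set, fix_note, weight]

    def log(self, symptoms, fix_note):
        self.sessions.append([set(symptoms), fix_note, 1.0])

    def suggest(self, symptoms):
        """Return (session_id, fix_note) of the best-matching past session."""
        symptoms = set(symptoms)
        best, best_score = None, 0.0
        for i, (past, note, weight) in enumerate(self.sessions):
            overlap = len(symptoms & past) / len(symptoms | past)  # Jaccard
            if overlap * weight > best_score:
                best, best_score = i, overlap * weight
        return None if best is None else (best, self.sessions[best][1])

    def feedback(self, session_id, helpful):
        """User tells the tool whether a suggestion helped; it learns."""
        self.sessions[session_id][2] *= 1.5 if helpful else 0.5

nb = DebugNotebook()
nb.log(["fifo_overflow", "irq_storm"], "check back-pressure on wr_en")
nb.log(["no_clock", "x_propagation"], "reset tied off in tb")
hit = nb.suggest(["no_clock", "hang"])
```

The feedback step is the key piece of the proposal: a confirmed-helpful suggestion is boosted, a rejected one is damped, so the notebook "becomes increasingly smarter" with use.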

Debug by the numbers
Complicating matters is that debug is difficult, and it’s becoming more so as dependencies between devices continue to grow. If there were a tool to trace each step every time debug is done, and then allow user interaction, the engineer could act at the beginning so the tool could learn and understand the steps.

“This might help at the end with some conclusion about the next bug, or at least indicate it was nothing that the engineer did so far, which is also good input, and means to go in a different direction,” said Olivera Stojanovic, project manager at Vtool. “Still, it is very hard to define all the steps to do debug. It’s something intuitive. You need to figure things out, draw some conclusions, and make some decisions according to the previous experience while you’re doing debug, or go in a different direction. But it’s very hard to define how you came to a certain conclusion.”

Tomusilovic believes that engineers are too biased, and too reliant on their intuition. “Sometimes your gut feeling can help you solve a problem in no time. But on the other hand, that same gut feeling can steer you in a completely wrong direction and get you stuck there for days. Somehow your gut feeling should be utilized better in that sense. Experience helps, but even there your experience may lead the wrong way — especially in overlooking something simple. The tendency is to think that the more experience is gained, the better the system is understood, and that the problems will be harder and more difficult. In this way, if a simple problem pops up, it may be easily overlooked, and that can lead to lost time. By suspecting and investing time in a very complex problem, a lot of time is needed to validate it, to find it, to detect it. Then, if it turns out it was actually something simple, like no clock, no reset, wrong signal connection, or a similar issue, this can be a schedule killer. A good tool would help me easily detect simpler problems, and then let me focus on more complex stuff on my own, trusting my gut feeling and my experience.”

In every scenario, however, verification has a lot of potential to impact not just what you’re doing now, but the future. “We need to find the bugs of the future as the complexity of the CPU increases,” explained Philippe Luc, director of verification at Codasip. “Unless your validation also increases in power, you will not find more bugs. You need to increase the bug-finding power of the verification, because your simulation capacity doesn’t increase as fast as the CPU gets more complex. You have to be smarter to find the best algorithm to be able to find the new RTL design bugs. AI techniques are beginning to be used to boost results. There is a point where a human starts to not understand what the verification method does because too many random generators are mixed together. This is a problem that must be fixed, and it is a place where AI techniques can be applied to take this further.”

At the same time, this doesn’t mean AI/ML/DL are the answer to everything, especially because so much is still evolving in this space.

“While it’s really satisfying to say, ‘I don’t have to work anymore, I just give the machine the data and it will find the next test,’ the drawback is that we don’t know what the machine is doing,” Luc said. “The general idea for applying machine learning in verification is to pre-generate some tests, ask the machine if these tests will be good, meaning it will find a new bug or add more coverage on it. If the machine says it will not be adding coverage, don’t run the job. But if it says it sees the potential for finding new coverage, run it.”
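Luc's pre-filtering idea — ask a model whether a candidate test will add coverage before spending cycles on it — reduces to a simple gate in front of the regression queue. In this sketch the "model" is a stand-in callable; in a real flow it would be a trained predictor, and the test descriptors and bin names here are hypothetical:

```python
def filter_tests(candidates, covered_bins, predict_new_coverage):
    """Keep only candidate tests the model predicts will add coverage.

    `predict_new_coverage(test, covered_bins)` stands in for a trained
    model: any callable estimating how many new bins a test would hit.
    Tests predicted to add nothing are skipped before any simulation
    cycles are spent on them.
    """
    kept, skipped = [], []
    for test in candidates:
        if predict_new_coverage(test, covered_bins) > 0:
            kept.append(test)
        else:
            skipped.append(test)
    return kept, skipped

# Stand-in "model": a test's declared target bins minus what's already covered.
def toy_model(test, covered_bins):
    return len(set(test["targets"]) - covered_bins)

covered = {"bin_a", "bin_b"}
tests = [
    {"name": "t1", "targets": ["bin_a"]},           # nothing new -> skip
    {"name": "t2", "targets": ["bin_b", "bin_c"]},  # hits bin_c -> run
]
run_list, skip_list = filter_tests(tests, covered, toy_model)
```

The structure also makes Luc's worry concrete: whatever `predict_new_coverage` gets wrong is silently dropped from the regression, which is exactly the lucky-bug risk he raises next.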

This approach removes some of the tests from the validation task, however, and some of the most interesting bugs are not intentionally sought out. “The last five bugs you find on a complex design — the last ones that could cause big trouble for the customer — are found by luck,” he said. “I would not predict, as a human, why this test found this bug. So I am concerned that the machine learning algorithm would say, ‘For this test, don’t run it, it looks the same.’ I’m really doubtful you can apply AI techniques that will magically say, ‘Run these tests, and don’t run these tests.’ Today I’m not sure that AI will not remove a test that would lead us to find the next bug, so I’m not confident enough to apply it yet.”

That said, Luc noted that the first step should be to make sure what is being fixed is really a problem, and this is where experience weighs heavily. “Is the strength of the verification adequate versus the complexity of the CPU? So far what I’ve seen is probably yes, but there are areas where there can be improvements. The goal is to be able to spot, from experience, where the next bug will be in a design. When people say they have not thought about it, it means there is something to improve. That’s the difficulty with new verification engineers. They don’t have the mindset to break things, in general, and it’s hard to say, ‘Look deeper here.’ They simply say, ‘It’s alright. It passes.’ It’s a delicate balance of spotting areas where there is definitely a need for improvement and saying, ‘You should really look here because I know from experience that this is a tricky area.’”

Is it a bug or a failure of process?
Others look at bugs as more than just a single occurrence, but rather a failure of process or tools.

“All too often, we look at someone’s tests and see that while they test what they say they do, they are missing any number of other areas that haven’t been considered,” said Simon Davidmann, CEO of Imperas Software. “Some people come to us and ask us to identify the bugs so they can fix them, but that’s a failure of the process. You don’t want to fix the bug. You want to fix the process that allowed the bug to get there. If you fix a bug on its own, you’re failing, because you’re debugging the thing into existence. That’s the wrong approach. Instead, use the bug, analyze it, and work out where the process has failed, such that the bug was allowed to be put in, or allowed to stay there.”

It may be that different tools are needed, or the tools or methodology are inadequate, or the original verification plan didn’t capture things properly. “Maybe the process is related to the tools you’re using,” Davidmann said. “So the concept of finding a bug, and it helping to find others is not quite right. They shouldn’t be looking for other bugs. They should be looking for the process. And if you find a bug, you don’t fix the bug, you go and look at your testing strategy. You change all that, and hopefully when you run it, it doesn’t find one bug, it finds three or four.”

He noted that a lot of time is spent trying to help engineering teams with their processes, something that is particularly challenging with RISC-V processor verification because the industry doesn’t have a verification solution here. “If you want to verify a USB block or an SoC, that’s what the verification industry has done for the last 20 to 30 years. It’s developed a methodology of testing. But with RISC-V, a lot of what we’re trying to help people with is their methodologies and process, not just the bugs. How do you find the bugs? Make sure you’ve got these things in place, starting with the verification plan. Yes, you’re going to find bugs along the way, but they shouldn’t be odd, isolated ones at the end. There should be methodologies and tools and process in place up front. We get involved in a lot of the verification of these processors because we’re becoming the de facto reference there,” he said.

This is far from a solved problem for many designs, though. “There is so much more we can do in verification,” said Neil Hand, director of marketing for IC verification solutions at Mentor, a Siemens Business. “AI/ML/data analytics — pick your favorite buzzword of the day — is going to have a huge impact on it. We’re always trying to find the bugs inside the vast nothingness of the design space, because the design space, for any reasonable design, cannot all be hit. If you look at coverage methodologies and coverage analysis, some people think that’s the be-all, end-all. But the engineering community says it’s important, but insufficient.”

Current approaches to bug hunting are largely brute force, Hand said. “If all you’ve got is a hammer, everything looks like a nail. But when we start looking at some of the newer technologies, and what AI/ML can do, it starts to get really exciting because now you can start to identify where to focus. ML-based data analytics can indicate where to put the attention, because that’s where it’s going to pay off the most. But you can go beyond that. You can start applying all of these machine learning techniques into a design to use bugs to find bugs. If you start to find things like a certain style of block has more bugs, or you find that a check-in that happens at 4 o’clock on a Friday afternoon has a higher probability of bugs, or any number of seemingly random things, this is where AI/ML comes into play. It identifies the relationships that we would not identify, and it can start to say where to look. You also can start to use the history over time and for a particular design. As it gets more mature, the errors go from a certain part of the design to another part of the design. Again, you can use machine learning to identify that because you’ve got an unsolvable problem in that you’ve got way more vectors than you could ever run, and the most effective vector you can run is one you don’t have to. There’s no faster simulation cycle than one you don’t have to run. That’s where we can start to apply these AI and ML techniques.”
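Hand's examples — a block style or a Friday-afternoon check-in correlating with bugs — amount to learning empirical bug rates over check-in metadata. A crude stand-in for such a model can be computed directly from labeled history; the check-in records below are hypothetical:

```python
from collections import defaultdict

def bug_rate_by_feature(history, feature):
    """Empirical bug probability per feature value (e.g. check-in weekday),
    computed from labeled history: a crude stand-in for an ML model that
    learns which seemingly random factors correlate with bugs."""
    counts = defaultdict(lambda: [0, 0])   # feature value -> [bugs, total]
    for record in history:
        bucket = counts[record[feature]]
        bucket[0] += record["had_bug"]
        bucket[1] += 1
    return {value: bugs / total for value, (bugs, total) in counts.items()}

# Hypothetical check-in log: weekday, plus whether a bug was later traced to it.
checkins = [
    {"weekday": "Fri", "had_bug": 1},
    {"weekday": "Fri", "had_bug": 1},
    {"weekday": "Fri", "had_bug": 0},
    {"weekday": "Tue", "had_bug": 0},
    {"weekday": "Tue", "had_bug": 1},
    {"weekday": "Tue", "had_bug": 0},
]
rates = bug_rate_by_feature(checkins, "weekday")
```

A real system would combine many such features and find interactions no single rate table exposes, which is the relationship-finding Hand attributes to ML; but even this table tells the team where extra review attention pays off.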

An unsolved challenge here is sharing data. “Companies don’t want their competitors to benefit from the learning that they have, and that’s understandable,” said Hand. “You can’t actually reverse engineer a machine learning model in most cases to get any meaningful data out of it, but at the same time you don’t want to help your competitors, even though you may be helping yourself. The challenge is how to build up the necessary set of data to be effective. You can do it on a single design. You can do it on a group of designs within a single company. But as with any machine learning problem, the more data you can give it the more effective it gets.”

There are no magic bullets for debug. It remains a painstaking process, although one that is increasingly important as chips find their way into more industrial, mission-critical and safety-critical applications, and as those chips are integrated into more complex systems that need to function for longer periods of time. The challenge is to speed up the debug process while also increasing confidence in the result.

“Most companies have this one engineer who can seem to get to the root cause of a problem really quickly,” said Hand. “They look at it, they start pulling up the signals, and find it, whereas the average engineer may take a couple of hours to get there. What if you can learn from those engineers? What if you can then suddenly start presenting the signals that are most likely to be impacting it? So instead of saying, ‘Here’s your signal, you could add 400 different ones to this,’ we could say, ‘It’s most likely going to be these two or three things,’ or, ‘Start looking at the log files.’ If you find relationships between an error in the log, and an error in the waveforms, this lets you do some really cool things. And the more bugs you find, the better you get at finding the next bugs, because you’ve now got a breadcrumb trail.”

