Can time spent in debug be reduced?
There appears to be an unwritten law about the time spent in debug: it is a constant.
It could be that all gains made by improvements in tools and methodologies are offset by increases in complexity, or that the debug process causes design teams to be more conservative. It could be that no matter how much time is spent on debug, the only thing accomplished is to move bugs to places where they are less damaging to the final product. Or maybe the task simply requires an unusual degree of intuition plus logical thinking.
Regardless of the explanation, data shows that time spent in debug has resisted reduction, no matter the resources expended. So is the industry missing something, or is this just something that we have to accept?
Fig 1: Where verification engineers spend their time. Source: Mentor, a Siemens Business/Wilson Research Group 2016 Functional Verification Study
So how much time is really spent on debug?
“It is difficult to be precise, since debugging is pervasive and an integral part of every aspect of the development process,” says Harry Foster, chief scientist for verification at Mentor, a Siemens Business. “The same study showed that design engineers spend about 53% of their time involved in design activities, and about 47% of their time involved in verification activities. From a management perspective, debugging is insidious in that it is unpredictable. The unpredictability of the debugging task can ruin a well-executed project plan. Clearly, when you consider how debugging is required for all tasks involved in a product development life cycle, anything that can be done to optimize the debugging process is a win for the organization.”
It usually takes an innovation or paradigm shift in an area to have an impact. “We used to have a gap between the number of transistors a day that a designer could create and the available capacity, and synthesis had an impact on that,” points out Larry Melling, director of product management and marketing for verification products at Cadence. “Debug is sitting at that precipice. The fact that it is staying around 50%, and yet people are spanning a broader spectrum, is a contributor to why it stays constant. It is not that they are doing the same kind and amount of debug. It is that what is being done is growing. So the overall number looks static.”
To dig into the subject, we have to consider dividing the problem into two. “First there is the debug of a deep issue in the design,” says Doug Letcher, CEO of Metrics Technologies. “Perhaps the designer is building a new feature that doesn’t quite work correctly, and they need to debug that in a traditional sense. This may involve stepping through code or looking at a lot of waveforms. The second aspect of the debugging effort is debugging in the large where you are trying to stay on top of breaking changes as you are adding more tests or fixing bugs.”
Catching the bug early
It is better that bugs are never created. “Improvements can be made to debug, but a lot more must be done to avoid errors,” says Sergio Marchese, technical marketing manager for OneSpin Solutions. “When we fail with that, we need to be detecting errors sooner and in simpler contexts. Somewhat ironically, the good news is that safety and security requirements are indirectly forcing companies to take this direction.”
Researchers within the software community have been studying bug densities since the mid-1970s. “For example, the number of bugs per 1,000 lines of code,” says Foster. “This metric has remained relatively constant for structured programming languages (on the order of 15 to 50 bugs per 1,000 lines of code). I know of a couple of SoC project managers who have been tracking bug density for a number of years, and they have found similar results. The interesting thing about a bug density metric is that the number of bugs per 1K lines of code is fairly consistent whether you are dealing with an RTL model or a high-level synthesis (HLS) model. This is an argument for moving to HLS when it is possible. Not only are you accelerating simulation performance by one or two orders of magnitude for the HLS code, but you introduce fewer errors by designing at a higher level of abstraction versus designing at RTL.”
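To see why that consistency favors higher abstraction, here is a back-of-the-envelope calculation in Python using the 15-to-50 bugs per 1,000 lines range Foster cites. The line counts below are invented purely to illustrate the arithmetic.

```python
"""Back-of-the-envelope use of the bug-density range cited above.
If the rate holds across abstraction levels, fewer lines means fewer bugs."""

BUGS_PER_KLOC = (15, 50)  # reported range for structured languages

def expected_bugs(lines_of_code: int) -> tuple[float, float]:
    # scale the per-1,000-line range to the size of the model
    return tuple(rate * lines_of_code / 1000 for rate in BUGS_PER_KLOC)

print(expected_bugs(100_000))  # RTL-sized model (invented size): (1500.0, 5000.0)
print(expected_bugs(20_000))   # same design in HLS at ~5x fewer lines: (300.0, 1000.0)
```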
If bugs cannot be avoided, then finding them early is helpful. “Once an error has made it through, it is crucial to detect it as soon as possible and within the simplest possible context,” says OneSpin’s Marchese. “For example, FPGA synthesis errors can be found immediately with formal equivalence checking, rather than during gate-level simulation or, worse, lab testing. Similarly, detecting a memory architecture or IP-level shortcoming, causing a security issue at the system level, is far less efficient than taking a more rigorous verification stance using formal to systematically clean up all fishy functionality with short, easy-to-analyze scenario traces.”
Mentor’s Foster agrees. “The bottom line is that a project needs to constantly re-evaluate itself and ask, ‘Should the bug have been caught by an earlier verification process?’ For example, would that bug have been found using lint before simulation? Could that bug be found using unit-level testing versus full system-level simulation? The farther down the verification process a bug escapes, the more time it takes and the costlier it is to fix.”
Companies must have processes in place for catching bugs. “As you add tests to fill coverage holes, you find new bugs and fixing those creates more bugs,” says Letcher. “So in that process it is important that you stay on top of things and catch changes that cause problems quickly. New methodologies, such as continuous integration, mean running more simulations earlier, as soon as check-ins happen. This helps you to find the bugs at an earlier point in the process when the changes are still fresh in the mind. Compare this to a company that only runs their regression once a week. Now there could be hundreds of changes that happened between where it became broken and now, so you have to sort all of that out.”
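A minimal sketch of what that looks like in practice, assuming a hypothetical `run_sim` launcher and made-up test names rather than any particular vendor's flow: a hook that runs a short smoke regression on every check-in, so breakage surfaces while the change is still fresh.

```python
#!/usr/bin/env python3
"""Sketch of a continuous-integration hook: run a short smoke regression
on every check-in. 'run_sim' and the test names are illustrative stand-ins."""
import subprocess
import sys

SMOKE_TESTS = ["uart_basic", "dma_burst", "axi_handshake"]  # hypothetical tests

def run_regression(commit_id: str) -> bool:
    failures = []
    for test in SMOKE_TESTS:
        # 'run_sim' stands in for whatever launches the simulator in your flow
        result = subprocess.run(["run_sim", "--test", test, "--rev", commit_id])
        if result.returncode != 0:
            failures.append(test)
    if failures:
        print(f"commit {commit_id}: {len(failures)} smoke failures: {failures}")
        return False
    print(f"commit {commit_id}: smoke regression clean")
    return True

if __name__ == "__main__":
    sys.exit(0 if run_regression(sys.argv[1]) else 1)
```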
Expect the unexpected
Continuous integration and frequent regression runs appear to be standard methodology in advanced organizations. “We are seeing good return on the concept of regression-based debug,” says Melling. “You run a lot of tests, and you need to be able to characterize those tests, find the failures, and rank them. How many tests are hitting this first failure condition? Then you make it easy for the engineer to choose the most efficient test to debug, the one with the shortest runtime, for example. We can add more analytics and more science to help with that.”
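The ranking Melling describes can be sketched in a few lines of Python. The record fields below are assumptions about what a regression database might hold, not a vendor API: failing tests are bucketed by their first failure signature, and the shortest-runtime test in each bucket becomes the suggested debug candidate.

```python
"""Sketch of regression triage: bucket failures by signature, then offer
the shortest-runtime test per bucket as the most efficient one to debug."""
from collections import defaultdict

def rank_failures(results):
    # results: iterable of dicts like
    # {"test": str, "passed": bool, "first_error": str, "runtime_s": float}
    buckets = defaultdict(list)
    for r in results:
        if not r["passed"]:
            buckets[r["first_error"]].append(r)
    ranked = []
    for signature, tests in buckets.items():
        best = min(tests, key=lambda t: t["runtime_s"])
        ranked.append({"signature": signature,
                       "hit_count": len(tests),
                       "debug_candidate": best["test"],
                       "runtime_s": best["runtime_s"]})
    # most widely hit failure first: likely the highest-value bug to chase
    return sorted(ranked, key=lambda b: b["hit_count"], reverse=True)
```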
“Tools can apply big data analytics to aggregate and filter simulation log file data to measure coverage progress against pre-set targets across any number of axes, such as line, branch, FSM, code, and functional,” says IC Manage’s executive vice president. “By retaining and analyzing this data across the spectrum of results over time, managers can accurately predict milestone dates to better optimize and apply their resources to accelerate debug schedules.”
A lot of data is available. “We are trying to solve this problem in a way that provides visibility into all of the data that is already there,” says Letcher. “Whether it comes from an emulator or simulator doesn’t really matter. What you want to know is when that test last passed, what changes happened between then and now, who made each change, and why. That data can help you isolate the debug issue a lot more easily than just starting from ‘something is broken.’”
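A rough sketch of that history-driven triage, using assumed record shapes rather than any product's schema: find when a failing test last passed, then list every check-in made since, so debug starts from “what changed” instead of “something is broken.”

```python
"""Sketch of correlating regression history with check-ins."""

def changes_since_last_pass(test_name, results, commits):
    # results: list of {"test": str, "passed": bool, "time": float}
    # commits: list of {"id": str, "author": str, "time": float, "msg": str}
    last_pass = max((r["time"] for r in results
                     if r["test"] == test_name and r["passed"]), default=None)
    if last_pass is None:
        return None, commits  # the test never passed, so every change is suspect
    suspects = [c for c in commits if c["time"] > last_pass]
    return last_pass, suspects
```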
Regressions are not just for functionality. “Power regression has evolved and is seeing increasing adoption,” says Preeti Gupta, director of RTL product management for ANSYS. “Much like functional regression, people have become more methodical about it, running regressions and tracking power consumption with metrics such as percentages. Any time you see a change, you should be able to attribute it to something that you changed in the design. You are no longer looking for a needle in the haystack. Regressions cut debug time quite a bit, not just because results are measured against a golden standard, but because they are measured relative to what they were before.”
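As an illustration of that relative measurement, here is a small Python sketch with invented numbers and a made-up data layout: each test's average power is compared against the previous run, and any shift beyond a threshold is flagged for attribution.

```python
"""Sketch of power-regression tracking: flag relative power shifts per test."""

def flag_power_shifts(previous, current, threshold=0.05):
    # previous/current: dicts mapping test name -> average power in mW
    flagged = []
    for test, power_now in current.items():
        power_before = previous.get(test)
        if power_before is None:
            continue  # new test, nothing to compare against yet
        delta = (power_now - power_before) / power_before
        if abs(delta) > threshold:
            flagged.append((test, power_before, power_now, delta))
    return flagged

# Example: a 5% threshold catches the dma_burst shift below.
before = {"uart_basic": 12.0, "dma_burst": 45.0}
after = {"uart_basic": 12.1, "dma_burst": 52.0}
for test, old, new, delta in flag_power_shifts(before, after):
    print(f"{test}: {old} -> {new} mW ({delta:+.1%})")
```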
Picking the right tool
Not all tools are equally useful for all jobs. “When you move to emulation or prototyping, you shift into a software development mentality,” says Melling. “You are running and debugging software, and that is the focus in that kind of engine. With an emulator, software plays a role, but it is much more about system-level use cases and workloads. I want to get a feel for power consumption during a particular workload. It is all about what kind of test I am running, and that provides the context.”
What happens when a hardware bug is found in an emulator or prototype? “Sometimes the visibility in an emulator is not as good as it would be in a simulation environment, and you have to reproduce the bug where it can be debugged,” adds Letcher. “The need to go to things such as emulation is driven by more complex designs, but you have to reproduce the issues back in a simulator.”
Vendors are rapidly attempting to bridge the divide between engines. “Selection between FPGA prototyping and emulation is a tradeoff between speed and debugging capabilities,” points out Zibi Zalewski, general manager of the hardware division at Aldec. “Prototyping provides higher speeds, since clocks are driven directly from clock oscillators or PLLs. But those clocks run at a constant frequency and cannot be stopped, and stopping is necessary for advanced debugging.”
Emulation runs can be long. “To maximize verification efficiency, companies need to plan their strategies for tackling debug upfront,” points out Rob van Blommestein, vice president of marketing for Oski Technology. “Formal is much faster than simulation for debug. Formal traces are much shorter, and the assertions usually narrow the scope of signals to debug, leading to debugging only tens of cycles of behavior versus thousands of cycles in simulation. Debug time can typically be reduced from three hours in simulation to 30 minutes or less using formal.”
What the engineer really cares about is finding the bug. Melling aptly concludes: “The art of debug is presenting the right visualization and context needed to quickly get to the error.”
New areas for debug
In the past few years, system-level verification has become a lot more visible. Until recently, system-level tests were handwritten, and orchestrating all of the necessary events in the system was complex and tedious.
But there is hope on the horizon. “The emerging Portable Stimulus standard was born out of that need to create automation and make it possible to generate that kind of complex test in a correct-by-construction fashion,” explains Melling. “With that came new paradigms in debug. In our solution, we use UML diagrams to represent complex tests because it is easier to understand the test flow. You can see that these processors are doing these activities in parallel, and that this is happening at this point in the test, rather than trying to sort it out by looking at eight different cores and their programs and figuring out who is talking when and to whom. It becomes a tractable debug problem.”
Verification also has become a flow. “It’s no longer possible to design and verify complex SoCs using only a single verification engine, such as simulation,” points out Foster. “Today a spectrum of engines is employed, spanning virtual prototyping, formal property checking and formal apps, simulation, emulation, and FPGA prototyping. This has forced us to rethink the debugging process. For example, debugging has moved beyond being viewed as a simple waveform viewing tool attached to a single engine. Today, we think in terms of debugging environments that span engines. These environments must provide a full set of synchronized views, including logic simulation waveforms, processor states, source code, internal memory, registers, stacks, and output. This is particularly critical for software-driven test.”
Debug is more than just ensuring that functionality is correct. Every aspect of design that involves writing a description, making a decision, or guiding a tool can potentially introduce errors. For example, describing power intent can create functional errors, and producing a layout can create design rule violations or introduce errors involving electromagnetic coupling.
Power is one area receiving increasing attention. “In the past, designers would run a tool that would provide a power number,” explains Gupta. “How do they know that the number they receive is what they should have expected? A total power number doesn’t tell you a lot. As a result, percentage metrics have been developed. How many clock cycles were really needed based on the number of cycles that had data changes? You are looking for a needle in the haystack, and this helps to locate it. Tools are evolving that make visualization better to be able to spot anomalies.”
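The percentage metric Gupta describes can be illustrated with a couple of lines of Python. The per-register counts below are invented: of all the cycles a register was clocked, how many actually captured new data? A low ratio flags wasted clocking and a clock-gating opportunity.

```python
"""Illustrative computation of a clock-utilization percentage metric."""

def clock_efficiency(clocked_cycles: int, data_change_cycles: int) -> float:
    # fraction of clocked cycles in which the register captured new data
    return data_change_cycles / clocked_cycles if clocked_cycles else 0.0

# A register clocked every cycle but rarely updating stands out immediately.
for reg, (clocked, changed) in {"fifo_wr_ptr": (10_000, 9_400),
                                "dbg_shadow": (10_000, 120)}.items():
    print(f"{reg}: {clock_efficiency(clocked, changed):.1%} of clocks carried new data")
```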
That can require new foundational technologies just to make it possible. “Innovations have happened in the area of fast tracking engines, so that deciding where to focus debug is somewhat automated,” she adds. “Most design companies care about functionality first and time to market. There is usually very little time provided for optimization beyond the creation of the architecture. When you provide an interactive environment, that brings issues into focus and then provides the data to help understand why. You can set thresholds, and you can measure against peak power, peak-to-average ratio, or even a sharp increase or decrease in power.”
New tools
New tools and technologies are constantly emerging. “A lot of big-data techniques are coming into play in the debug world where you want to analyze across large amounts of data and perform data analytics on top of it,” says Gupta. “Take the human factor out as much as possible.”
Ideally, users want root-cause analysis. “Root cause dates back to driver tracing,” says Melling. “Root-cause analysis enables you to trace back further to the cause instead of doing it one step at a time. You want to take it back to the original source and then show all of the influencing things along that path. It can provide a visualization of the path from this source out to this destination, and all of the control signals that could have influenced the result. These tools are getting better, but they do not work for every type of problem out of the box. The analytics available now with machine learning will be brought to bear on these problems.”
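Driver tracing itself reduces to a graph walk. Here is a toy Python sketch under that assumption, with a made-up netlist representation rather than any tool's data model: starting from a failing signal, walk backward through each signal's drivers to collect everything that could have influenced the bad value.

```python
"""Toy driver-tracing walk behind root-cause analysis."""

def trace_root_cause(netlist, start_signal):
    # netlist: dict mapping each signal to the list of signals driving it
    influencers, frontier = set(), [start_signal]
    while frontier:
        sig = frontier.pop()
        for driver in netlist.get(sig, []):
            if driver not in influencers:
                influencers.add(driver)
                frontier.append(driver)  # keep tracing toward original sources
    return influencers

netlist = {"out": ["mux_sel", "reg_q"],
           "reg_q": ["d_in", "clk_en"],
           "mux_sel": ["cfg_bit"]}
print(trace_root_cause(netlist, "out"))
# influencers: mux_sel, reg_q, d_in, clk_en, cfg_bit (set order may vary)
```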
Utilizing big data techniques can enable new kinds of problems to be located. “Moving from Tcl (Tool Command Language) to an object-oriented language like Python, with a distributed database that can access shapes, instances, circuits, and timing paths in a MapReduce system (similar to Hadoop), makes queries sleek and quick,” explains Sankar Ramachandran, area technical manager at ANSYS. “A typical example of a MapReduce job would be to find the clock buffers that pass through high-power regions. MapReduce adds more flexibility, so users can build custom applications on top of a distributed database and architecture.”
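A toy map/reduce-style rendering of the query Ramachandran mentions, with invented instance records and a made-up region-power test; a real flow would run this distributed over a layout database rather than over a Python list.

```python
"""Toy map/reduce query: clock buffers sitting in high-power regions."""

instances = [
    {"name": "ckbuf_01", "cell": "CLKBUF_X4", "x": 120, "y": 340, "region_power": 0.92},
    {"name": "ckbuf_02", "cell": "CLKBUF_X2", "x": 610, "y": 75, "region_power": 0.31},
    {"name": "nand_17", "cell": "NAND2_X1", "x": 125, "y": 338, "region_power": 0.92},
]

# Map: emit clock buffers tagged by whether their region exceeds the threshold
mapped = [(inst["name"], inst["region_power"] > 0.8)
          for inst in instances if inst["cell"].startswith("CLKBUF")]

# Reduce: keep only the buffers that landed in hot regions
hot_clock_buffers = [name for name, in_hot_region in mapped if in_hot_region]
print(hot_clock_buffers)  # ['ckbuf_01']
```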
And there will always be a choice of tools that can be applied to debug. “There is another form of bug density that has been studied within the industry, and that is classifying the number of bugs by type of design,” says Foster. “Researchers have shown that designs that are concurrent in nature have five times the number of bugs compared to designs that are sequential in nature. It turns out that the concurrent class of designs lends itself to formal verification. So to increase verification productivity, it is best to utilize a verification engine optimized for a particular class of problems.”
Time spent in debug is unlikely to change the next time the survey results come in. There appears to be a natural balance within the system, and new debug tools are kept in balance against new problems that need to be debugged. If anything, debug may be winning.
Related Stories
Is Verification Falling Behind?
It’s becoming harder for tools and methodologies to keep up with increasing design complexity. How to prevent your design from being compromised.
How Much Verification Is Necessary?
Sorting out issues about which tool to use when is a necessary first step—and something of an art.
Which Verification Engine?
Experts at the Table, part 1: No single tool does everything, but do all verification tools have to come from the same vendor?
Brian wrote “Regardless of the explanation, data shows that time spent in debug has resisted reduction, no matter the resources expended. So is the industry missing something, or is this just something that we have to accept?”
Debug is all about visibility — the previous state, input changes, events, current state.
Control is also important: being able to stop and see the values at any point in time, without setting up and running a simulation and then pawing through tons of waveforms.
State changes are determined by relatively few inputs and events, and possibly only a single previous state. Therefore it is not necessary to toggle all possible combinations of inputs.
On the software side there are debuggers that can set breakpoints, watch variables, single-step, and so on. On the hardware side there are simulation waveforms — whoopee!
Hardware has inputs, outputs, registers, memories, Boolean nets, data buses, and ALUs.
Simple software classes can be defined that correspond to those hardware facilities; then hardware AND software can run together on the same platform using the same debugger.
Does this sound too hard?
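For what it's worth, a minimal sketch of that idea, assuming nothing beyond the Python standard library: model a register as a plain class, and breakpoints, watches, and single-stepping in an ordinary software debugger then apply directly to "hardware" state changes.

```python
"""Minimal sketch: a hardware register as a software class, debuggable
with ordinary breakpoints and watches. Widths and names are illustrative."""

class Register:
    def __init__(self, name: str, width: int = 32):
        self.name, self.width, self._value = name, width, 0

    @property
    def value(self) -> int:
        return self._value

    @value.setter
    def value(self, new: int):
        # Set a debugger breakpoint here to watch every state change
        # of this 'hardware' register, single-stepping included.
        self._value = new & ((1 << self.width) - 1)

status = Register("status", width=8)
status.value = 0x1FF   # masked to the register width
print(status.value)    # 255
```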
“Normal” software debugging features, including breakpoints and watches, certainly are available with many hardware platforms, and the open standards they use can be implemented if desired. Atmel Studio serves both as an example and as a template for creating a customized hardware-integrated development environment. Whether you take the time to construct that kind of debugging environment depends on how crucial on-hardware software debugging is to the product. If you’re creating a new development platform for your amazing video DSP chip, then that might make sense, or at least it might make sense to create libraries for existing, extensible development platforms such as MATLAB, like a lot of vendors do now. The new concept of the “digital twin” will first find its application in simulated twins of machines and systems where the simulation isn’t first-principles based, but instead learning-data driven. Or not… 😉