Breaking Down The Debug Process

Experts at the Table: Debug is not a monolithic task, and each stage in the process needs a different focus.


Semiconductor Engineering sat down to discuss debugging complex SoCs with Randy Fish, vice president of strategic accounts and partnerships for UltraSoC; Larry Melling, product management director for Cadence; Mark Olen, senior product marketing manager for Mentor, a Siemens Business; and Dominik Strasser, vice president of engineering for OneSpin Solutions. Part one can be found here. What follows are excerpts of that conversation.

Semiconductor Engineering: How important is creating good tests when considering the impact they may have on debug?


Olen: There is an interesting parallel between the growing complexity of tests and debug, and coverage. Coverage has also progressed. There was a time when there was no coverage. Then we looked at it structurally: let's make sure we execute every line of code. Code coverage today is so elementary. You still have to do it, but it is not enough, and it does not ensure that you have verified your functionality. And now we are starting to see that, as difficult as it can be to create a perfect SystemVerilog functional coverage model, even if you cover all of that you are still not covering everything in the design. The coverage models keep advancing to keep pace with the bugs and the failure mechanisms.
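
As a minimal illustration of the gap Olen describes, the Python sketch below contrasts line (structural) coverage with a functional coverage model. The arbiter, stimulus values, and bin names are hypothetical, not taken from any real testbench.

```python
# Minimal sketch: line coverage vs. functional coverage (hypothetical example).
# Executing every line of a function does not mean every interesting
# combination of inputs (a functional coverage "bin") has been exercised.

def arbiter(req_a: bool, req_b: bool) -> str:
    """Toy arbiter: grant A over B; idle when neither requests."""
    if req_a:
        return "grant_a"
    if req_b:
        return "grant_b"
    return "idle"

# Stimulus that touches every line of arbiter() ...
stimulus = [(True, False), (False, True), (False, False)]

# ... but a functional coverage model also wants the contention case.
functional_bins = {
    "a_only":     (True, False),
    "b_only":     (False, True),
    "idle":       (False, False),
    "contention": (True, True),   # never exercised by the stimulus above
}

hit = {name for name, vec in functional_bins.items() if vec in stimulus}
missed = set(functional_bins) - hit
print("Line coverage: 100% of arbiter() executed")
print(f"Functional bins hit: {sorted(hit)}, missed: {sorted(missed)}")
```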

Melling: The system requirement always comes back to what success is. That means meeting the requirements, meeting the standard that was set for that design. With increasing complexity, systems are taking on much more use-case dependence. How is it actually going to get used? What are the critical factors, be they timing, power, bandwidth or latency? Those are the things that people are going to have to verify.

Fish: The big challenge for any company is that the data loads used for verification are wrong. You do not know what your chip is going to see two years from now when it goes into the server or application. You make a guess, and that is why we see on-chip analytics becoming highly necessary. It is a chip, on a board, in a box, in a car or a server room. In real time, you need to be able to sniff within the design and then perform analysis on that data. Eventually people want to feed that back into something like an emulator, to run it in a more controlled fashion and understand what is going on at a deeper level. But you need to be able to do that level of analysis on the fly, because the application workload changes and the firmware changes. When you do an update, your design is different, and the ability to analyze that is critical.

SE: Is shift left compounding the problem, adding back-end effects to the front-end such as variability, timing, aging?

Melling: I believe it is putting on more pressure. I see lots of customers who categorize bugs and say, 'Here are the bugs I am finding in simulation, here is how many in emulation, here are the ones from prototyping, and here is how much I am still seeing in silicon.' They look at that and are constantly attacking that problem. Two things have happened. When they do find something in silicon, they ask how to reproduce it and roll it back into other environments so they can shift left in the future for that class of failures. The mindset is always shift left, because bug escapes at silicon are expensive, so you have to analyze them fully and figure out how to plug the gaps. At the same time, we keep adding complexity, so there will always be bug escapes. That will always be the case.

Fish: Shift left has always been out there. It gets even more interesting when we talk about hardware and software. It used to be that people would design a processor and it got handed off to software. Now people really are simulating, or at least emulating, the combination of the two before they tape out, because the interactions have to be addressed. That is a big shift left. Physicality is also interesting. Can you see aging effects earlier?


SE: That also plays into safety.

Melling: That is one of the things machine learning (ML) brings up, along with an interesting economic analysis of what ML means to the economy. The point was that ML will make prediction a commodity, and the value will be in judgment. Those technologies will allow us to do a better job of predicting what we should look at, but we will still need to add the necessary judgment on top of that to make it effective. This will affect a lot of things, especially things that can be statistically modeled, like metal migration. You can look at the most likely places for hot spots and areas of concern, and then you have to come up with solutions.

Strasser: As you go downstream it is garbage in, garbage out. You don’t have to care about the physical effects if your functionality is not according to the specification. Eradicating the functional bugs is the key to everything. Safety and security — you don’t have to care about them if you have functional bugs.

Melling: There are always table stakes.

SE: You go into the debug process once you have comprehension. Comprehension normally comes through some process of abstraction, and abstraction relies on getting the right kind of detailed data. As an industry, where is it most effective to put our resources, and what are we doing in each of the areas? Debug is at the end of the process chain, but a lot of work leads up to the ‘aha moment.’

Fish: The term debug makes the hair on the back of my neck stand on end. We consider it to be analytics. While that may sound like mincing words, you really are analyzing data trying to find out what is going on. In reference to machine learning, there are huge opportunities. In our case there are stages in the design where you have massive data, be it from emulation or silicon. We look at how you can apply machine learning for anomaly detection. You can see something, but you still have to go back to root cause, and that is a complicated problem. There are inferencing solutions that could be used for anomaly detection. Clearly there is massive data available. We have more data available than you could possibly use, so how can we use that to help us get back to root cause?
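
As a rough sketch of the kind of anomaly detection Fish alludes to, the Python below flags outliers in a stream of measurements (here, hypothetical bus-latency samples) using a rolling mean and standard deviation. The window size, threshold, and data are assumptions for illustration, not any vendor's actual method.

```python
# Minimal sketch: flag anomalies in a stream of trace measurements
# (e.g. bus latency per transaction) with a rolling mean/std z-score.
# Window size, threshold, and the sample data are all hypothetical.
from collections import deque
from statistics import mean, pstdev

def detect_anomalies(samples, window=32, threshold=4.0):
    """Yield (index, value) for samples far outside the recent distribution."""
    history = deque(maxlen=window)
    for i, x in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), pstdev(history)
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                yield i, x          # candidate for root-cause analysis
        history.append(x)

# Synthetic latency trace: steady traffic with one outlier injected.
trace = [100 + (i % 7) for i in range(200)]
trace[150] = 900                      # e.g. a stalled transaction
print(list(detect_anomalies(trace)))  # -> [(150, 900)]
```

Finding the outlier is only the start; as Fish notes, mapping such an anomaly back to root cause is the hard part.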


Melling: You make a great point.

Olen: It doesn’t just happen at the end. It might feel like that when I am done debugging. Then I am good to tape-out. But actually there are so many of these things that you are debugging that you are really debugging throughout the entire process. Sure, there are some that are toward the end, but there are also some that you debug or analyze very early on. You debug your firmware, you debug your design IP, you debug your bus architecture, you debug the memory controller. Debug is always happening. There is no such thing as a monolithic debug tool. You need a suite of tools that do analysis and debug throughout the entire process. Some will be focused on debugging the testbench, debugging my functional safety requirements — it is like back to the old test days with fault simulation. It is not an n2 problem, it is an nn problem. You can’t do everything, so you have to figure out the most important things to do. That is one of the reasons for coverage models, which define what you want to see. You buy the debugging technology, the analysis technology to solve the problem, and they had better work together. We try to provide a common look and feel to them. They need to have context awareness and know what you are looking at for any point in time. It is a suite of tools.

Melling: I have to agree that debug is the wrong term. It is an analytics problem, and we should take this to heart. If you look at time spent, engineers consider the time they spend triaging a regression to be debug. That is what they would label it, but nothing happening there can really be called debug. It is analytics: looking at the available data and information, ranking the failing tests, and deciding which failure represents a class of failures and will be the most effective one to go off and debug. It was analytics that led to the proper selection of the test that will provide an efficient debug vehicle.
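
The triage-as-analytics idea Melling describes can be sketched as grouping regression failures by a shared signature and debugging one representative per group. The checker names, messages, and test names below are hypothetical.

```python
# Minimal sketch: triage a regression by grouping failures with the same
# signature (failing checker + error message) and picking one representative
# per group to debug. All names and signatures here are hypothetical.
from collections import defaultdict

failures = [
    {"test": "dma_burst_17", "checker": "axi_resp_chk",  "msg": "RRESP=SLVERR"},
    {"test": "dma_burst_42", "checker": "axi_resp_chk",  "msg": "RRESP=SLVERR"},
    {"test": "cpu_irq_03",   "checker": "irq_order_chk", "msg": "IRQ lost"},
    {"test": "dma_burst_88", "checker": "axi_resp_chk",  "msg": "RRESP=SLVERR"},
]

groups = defaultdict(list)
for f in failures:
    groups[(f["checker"], f["msg"])].append(f["test"])

# One debug session per signature instead of one per failing test.
for signature, tests in groups.items():
    representative = min(tests)   # any heuristic: shortest run, earliest seed, ...
    print(f"{signature}: {len(tests)} failures, debug '{representative}' first")
```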

Olen: Root-cause analysis.

SE: Is the work being done on automating debug, or comprehension and analytics, the first part of a system that could be used for anomaly detection—something that could be foundational to real-time security? Thus, whenever a system sees an anomaly, is that potentially the initiation of a security breach?

Fish: We do believe that. Our growth market is functional safety and security. In the field, we can see these things while the design is running. We don't do anything about them today, but we can assert something, send a flag that says something strange happened. When you embed the analytics engine itself, it can also be updated. The bad guys will change over time, and you need the ability to do on-chip analytics for functional safety or security. We call that bare-metal security. You have TrustZone with Arm, which does a good job, but if you want a watcher to watch over the watcher…
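
A toy illustration of the 'flag something strange' idea: a watcher compares observed bus traffic against an expected profile and raises a flag on a mismatch. The masters, address ranges, and event format are assumptions for illustration and are not tied to any specific product.

```python
# Toy sketch of a bare-metal-style watcher: compare observed bus masters and
# address ranges against an expected profile and raise a flag on a mismatch.
# The profile, master names, and addresses are hypothetical.
EXPECTED = {
    "cpu0": [(0x0000_0000, 0x3FFF_FFFF)],   # DRAM
    "dma0": [(0x4000_0000, 0x4FFF_FFFF)],   # peripheral buffers
}

def check_access(master: str, addr: int) -> bool:
    """Return True if the access matches the expected profile, else flag it."""
    for lo, hi in EXPECTED.get(master, []):
        if lo <= addr <= hi:
            return True
    print(f"FLAG: unexpected access by {master} to {addr:#010x}")
    return False

check_access("cpu0", 0x1000_0000)   # within profile, no flag
check_access("dma0", 0x0000_2000)   # DMA reaching into DRAM -> flagged
```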

Melling: That is a cool application and could be a good approach to addressing security and its dynamic, ever-changing nature.

SE: Do we understand enough about artificial intelligence and ML to apply it first to the debug problem and then, later, potentially to the security problem?

Olen: I would stay on the 'how do you apply it to the analysis' problem, which is trying to locate the root cause of bugs. This is no longer a process that starts with, 'Show me a waveform display and let me start back-tracing on a schematic.' That is traditional debug. In this case you are looking more at the system level.

Fish: Traffic flow problems.


