Debugging Complex SoCs

Experts at the Table, Part 1: Why time spent in debug is increasing, underlying trends, and what surveys do not reveal.


Semiconductor Engineering sat down to discuss the debugging of complex SoCs with Randy Fish, vice president of strategic accounts and partnerships for UltraSoC; Larry Melling, product management director for Cadence; Mark Olen, senior product marketing manager for Mentor, a Siemens Business; and Dominik Strasser, vice president of engineering for OneSpin Solutions. What follows are excerpts of that conversation.


SE: The trend data shown in the Wilson Research Group/Mentor study is not encouraging when it comes to debug. In the latest survey, verification engineers spend 44% of their time in debug, and that figure has been growing. What is happening?

Olen: Harry Foster manages this study. It is a double-blind study that covers the whole industry, not just Mentor customers. It is worldwide. We try to keep the questions as consistent as possible every year so that we can do trend analysis. It has been fairly consistent over the past eight years (four surveys) that debug is ranked as the number one, most time-consuming challenge of a verification engineer. Two questions get asked. One is what is the most time-consuming, and the other is what is the most critical. Debug is the most time-consuming and is also rated as the least predictable. You could get lucky, but you could also get very unlucky when it comes to debug. The time spent in debug keeps getting worse, in spite of Portable Stimulus, in spite of verification IP, in spite of UVM and constrained random testing and many other new technologies. Things still fail in the lab, and you need debugging technologies that cover simulation, emulation, formal verification, clock domain, power domain, reset domain, caching systems, arbitration schemes, all kinds of different things that add to debug complexity. Many of us are investing a lot in this problem.

Fig 1. Where IC/ASIC verification engineers spend their time. Source: Wilson Research Group and Mentor, a Siemens Business.

Melling: Debug is one of the places where it is easy to have a conversation with a customer, especially if you even give a hint that it might have a significant impact on cost. You will have their attention and you will keep their attention while you work through it. No doubt there is a problem, and everyone is investing in it. What is going to happen, and where will the solution come from? We will broaden the scope in terms of data analytics and take more information into the equation so that we can do more prediction and use more formal technology. There are a lot of different things we can bring to bear with more data to give a stronger indication of root cause and get you there faster. If you have more targeted stimulus, it gives you a better chance of debugging, but that alone is not enough.

Strasser: We need a complete analysis of the problems. What do we know? We know that we need to get the quality level up at the block level. That is our mission. But in the study, are they really analyzing what the problems are?

Fish: Do you mean that it doesn’t quantify the root cause in some manner?

Strasser: Exactly. What is the root cause? Where are the majority of the problems?

Melling: That is interesting. We have been looking at these technologies and it is a question that I have asked customers. For example, Verifyter has a tool that enables you to find out which commit caused a problem. So you can ask, ‘Why isn’t everyone buying that?’ I asked a few customers that question and they say it only addresses one category of problem, which is associated with a degradation in my design or testing. As I am adding tests and finding new bugs, it doesn’t help. So you have to look at the classifications of bugs, as you talked about. That is where data analytics can help. Customers are just starting to look at maintaining those kinds of data lakes and having the kind of information needed to know whether this is a test that has always passed or a new test. There is some interesting information that Arm has published about their data lake and some of the analytics that they have done on tests, bug failures, where they are found, etc.
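To make the distinction Melling draws concrete, here is a minimal sketch of how a regression (a test that used to pass and now fails) might be separated from a failure in a newly added test. It assumes a hypothetical pass/fail history per test, ordered by commit. The record layout, the labels, and the function name `classify_failure` are illustrative assumptions; they do not describe Verifyter's tool or Arm's data lake.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record: test name -> results ordered oldest to newest,
# each result being (commit_id, passed).
History = Dict[str, List[Tuple[str, bool]]]


def classify_failure(history: History, test: str) -> Tuple[str, Optional[str]]:
    """Label the latest result of `test` and, for a regression,
    return the first commit at which it started failing."""
    runs = history.get(test, [])
    if not runs or runs[-1][1]:
        # No data, or the most recent run passed.
        return ("not_failing", None)

    # Index of the most recent passing run, if the test ever passed.
    last_pass = None
    for i, (_, passed) in enumerate(runs):
        if passed:
            last_pass = i

    if last_pass is None:
        # The test has never passed: a new test exposing a new bug.
        return ("new_test_failure", None)

    # The test passed before: the run right after the last pass is the
    # first failing commit, which is the suspect for a degradation.
    return ("regression", runs[last_pass + 1][0])


history = {
    "dma_burst": [("c1", True), ("c2", True), ("c3", False), ("c4", False)],
    "cache_evict": [("c4", False)],
}
print(classify_failure(history, "dma_burst"))
print(classify_failure(history, "cache_evict"))
```

Only the first category is something a commit-level search can help with, which matches the customer feedback quoted above.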

Olen: I don’t remember every detail about the survey because it is a pretty long survey, but in general, what you are asking for is delicate information. We are all involved in verification in various parts and the last thing a respondent wants to do is to highlight the details of their problems, such as if they have had problems with their cache or arbiter. We can see the general trends, but we do not have a lot of detailed data about the types of problem that they had. If you found it and solved it, you don’t want to share that with your competition broadly.

Melling: That is true.

Fish: We are a little different because we are an IP company that provides IP that helps with the debug problem. We get used heavily post-silicon and in emulation. It is nice that we are cohesive through the life of the SoC from RTL through system into mission. But the definition of a bug, to be pedantic, for us is quite often not a functional failure. It is a DMA and queuing sub-system. When ‘this processor’ is running, something weird happens over here and it creates a performance degradation. Or it takes real-world data, as in cell phones handing off between towers before certain conditions are triggered and we see these performance problems turn into a functional failure in an operational sense. You really need in-circuit monitoring IP in the world of heterogeneous chips with complex memories, and they do require real-world data.

SE: Is it possible that the time spent in debug is not actually getting worse, but instead the industry is beginning to look at more aspects of an SoC and calling it something that needs to be debugged—like throughput or power? Are we expanding the scope of what it is that we mean by debug, and is that why the numbers are increasing?

Melling: That is a good question for the survey. I don’t know how they define debug or the details.

Olen: It is a combination of that and the fact that things are not designed in a monolithic way anymore. You buy your IP from one company, and some IP from another company, and then you bring it all together in an SoC or FPGA platform. Even your own teams internally are distributed around the globe. So if you have a bug at the SoC level, how do you triage that? How do you isolate it? Is it a bad block? Presumably the blocks you buy are pre-verified, so it is something about the inter-block communications across the chip or fabric. We are also all seeing the impact of increasing amounts of software in this world. You did hit it on the head. We are now seeing things that used to be classified as performance analysis being treated as bugs.

Fish: I kind of see this with our IP. Hardware people make some of the decisions up front—the architects. Once it is in silicon, it is owned by the firmware teams. They want to pump data through, and they are trying to figure things out. They work at a bus transaction level and do not care about the ones and zeroes. They are just trying to monitor the flow of data and determine where the bottlenecks are.

Olen: This area is ripe for a data lake, machine learning type of application—not so much at the block level, where they are using functional coverage, but perhaps a little. When you get to the SoC-level, now you are looking at emulation runs, or FPGA-prototype runs, with tons of traffic. And now you have the ability to start looking at things on a repetitive basis. We have some technology in data mining that was able to look at simultaneous executions of instructions/operations and found that the majority of the time things were finishing out of order. Out-of-order execution is part of the characteristics of this design, but the system architect was able to look at the plot that was generated by a data mining technique and identified that he should take a look at that. He saw that there was an arbitration problem. It was providing an unfair advantage to XYZ or DMA request this or that. Was that a bug? He looked at it as a bug. But it returned the right data with poor performance.
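The kind of analysis Olen describes is not detailed in the conversation, but a rough sketch of one way to surface an unfair arbiter from emulation traffic is shown here. It assumes a hypothetical log of grants, each recorded as a requester name and the number of cycles that requester waited. The log format, the function names, and the `ratio` threshold are assumptions for illustration and are not Mentor's data mining technology.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def arbitration_summary(
    grants: Iterable[Tuple[str, int]]
) -> Dict[str, Tuple[float, float]]:
    """Map each requester to (share of all grants, mean wait in cycles)."""
    count: Dict[str, int] = defaultdict(int)
    wait: Dict[str, int] = defaultdict(int)
    for requester, wait_cycles in grants:
        count[requester] += 1
        wait[requester] += wait_cycles
    total = sum(count.values())
    return {r: (count[r] / total, wait[r] / count[r]) for r in count}


def starved_requesters(
    summary: Dict[str, Tuple[float, float]], ratio: float = 2.0
) -> List[str]:
    """Requesters whose mean wait exceeds `ratio` times the best mean wait.
    The threshold is an arbitrary assumption, not a recommended value."""
    if not summary:
        return []
    best = min(mean for _, mean in summary.values())
    # Floor the baseline at one cycle so a zero-wait requester
    # does not make every other requester look starved.
    baseline = max(best, 1.0)
    return sorted(r for r, (_, mean) in summary.items() if mean > ratio * baseline)


# Hypothetical grant log: (requester, cycles waited before the grant).
grants = [("cpu", 2), ("cpu", 2), ("dma", 10), ("cpu", 2), ("dma", 14)]
summary = arbitration_summary(grants)
print(summary)
print(starved_requesters(summary))
```

A report like this does not say whether the behavior is a bug. As in the example above, the data returned is correct and only the performance suffers, so the judgment stays with the architect.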

Strasser: We do see new containers of bugs, like security and side-channel attacks. These are new concerns, and nobody has thought about these in the past. Suddenly, things are seen as being malicious. Out-of-order execution suddenly becomes a side-channel attack and someone is listening to what your processor is doing. It is eavesdropping. Is that a bug? Where is the bug? The bug is in the specification. The bug is in the way the system is constructed.

Fish: It is hard to quantify security. It is hard to optimize anything that cannot be quantified. It is like playing whack-a-mole. Every time there is a breach you learn how to stop that class of problem.

SE: Even if we do not extend all the way to security, safety is a new concern for many industries. That has brought new issues to the table. It clearly raises the question: when a bug is injected, is it detected, and if not, why not?

Melling: What is classified as debugging has broadened in scope, and chances are the people answering these surveys are reflecting that. They are having to debug a lot more than in the past, and new types of issues. How do you look at coverage for performance? These things may be critical to the system. There are workloads that have to be created. There are measurements that have to be taken and analytics that have to be performed to be able to sort through these issues.
