First of three parts: Multi-contextual debugging; IP integration issues and who to call when you’ve got a problem; why it’s taking more time to debug; limitations of tools; attitudes of IP development teams to customer issues.
By Ed Sperling
Semiconductor Engineering sat down with Galen Blake, senior verification engineer at Altera; Warren Stapleton, senior fellow at Advanced Micro Devices; Stephen Bailey, director of solutions marketing at Mentor Graphics; and Michael Sanie, senior director of verification marketing at Synopsys. What follows are excerpts of that conversation.
SE: What are the big issues with debug?
Blake: One is multi-contextual debugging, which is related to software in SoCs. You’re dealing with a memory space where embedded software is accessing things, the memory gets messed up, the hardware is trying to access things, and you don’t know why it’s not working. The most powerful tool in your arsenal is a really good multi-contextual debug environment where you can set breakpoints in your RTL, in your testbench and in your embedded software, so you are able to see all the pieces moving back and forth and determine why things get lost. We use tools from Cadence, Mentor and Synopsys, as well as those from other vendors, and being able to get all of them to give us the same capabilities would be really nice.
Stapleton: In addition to software, it’s an integration problem. I got involved with a fairly sophisticated internal proprietary debug system that would allow you to propagate debug information from the IP level to the SoC level. What I found over time is that it didn’t get a lot of traction. The IP team knew how to debug their IP. They’re very comfortable looking at raw data and can’t understand why no one else is looking at that. But even if they captured something at a higher level, where they would have automatic debug capability and could trace and link things, that information doesn’t get propagated up to the SoC level. When you’re at the SoC level you’re in the dark and wondering what’s going on here. I bring this up as an integration problem because it’s an area where I see very few SoC-level standards and methodologies that would allow tools to work across the SoC. This is another place where we need a common standard. Information has to be linked from the IP to the SoC level so the EDA tools can present a complete picture of what’s going on.
Blake: Vertical integration, where you take information from the block level and propagate it to the system level, is done poorly by most teams and tools. We see opportunities to do this, but they’re being missed. It needs to be explored much better than it has been.
Sanie: Debug is growing in terms of dollars and time. About 35% to 50% of the design process is debug, and it’s expanding. What’s really making it explode is the SoC—software, multicore and different levels of abstraction. Debug itself is becoming very specialized. Fifteen or 20 years ago you did a design and the verification engineer was doing debug. Now you have SoCs with software issues. And there is debug in software and debug across RTL and integration. There is all this specialization for each area.
Bailey: When we ask people how much time they spend in different parts of verification, debug is the largest slice. Our survey showed 36%.
Sanie: We say 35% as a lowball number, but it could be as high as 50%.
Bailey: These are the customers’ own perceptions. Most people don’t log their hours. But it’s still a big problem area, and the reason is that debug remains automation-resistant. How can you automate a process that requires an inherent understanding of what the design is supposed to do when it’s impossible to capture that design intent? If you use assertions, that helps. But it’s not an end-all. If you use UVM and some other methodologies it will help. There is a different team working on the design and integration of an SoC, and when a bug occurs they know conceptually what a block is supposed to be doing but they don’t know it intimately. It’s hard to figure out where the source of the problem is. On top of that, the SoC environment is extremely complex. You have multiple distributed finite state machines interoperating with each other, and understanding what’s going on is very difficult, especially when you’re looking at RTL signals. The transaction level isn’t sufficient for that. You have power states, different modes of operation and coherency. If you can narrow it down to a block, you probably have access to people who designed that block. But understanding what’s going on at a system level is very difficult.
Sanie: Then they pass it on to an IP integrator.
Bailey: We aren’t seeing them passing it on, but they are asking for assistance.
Sanie: Still, there’s a level of communication that needs to be enabled.
Bailey: There’s also an expectation that the folks at the SoC level do as much as they can before they call in someone to help. And software brings in a whole new level. With software debuggers, you have multiple clock cycles between when a breakpoint happens and when you actually stop what’s going on. By then, any of the effects you were trying to debug are gone. You need a different solution to that.
SE: There’s no disagreement that it’s a huge problem, even though it appears everyone here is looking for someone to blame. But can this be broken down for divide and conquer, or is that one of the problems with debug?
Stapleton: It can be. A design is hierarchical, with an SoC integration layer and an IP design layer, and the problem is primarily focused on integration of the design. This is where we’re missing the extra piece that makes the design easier to integrate and test.
SE: But each design is very different, right?
Stapleton: It is, but the information you need is a transaction-level representation of what you see going on in there and how it’s transaction-related. Then the engineer can know that this transaction causes something else. But you don’t always have engineers to go back to.
Bailey: It’s all the legacy you’ve got.
Stapleton: What’s changing in SoC integration is that even some of the big companies that used to be monolithic now have to integrate things from other vendors. So you’re getting designs where you don’t have the legal access to information or access to the engineers.
Blake: It’s more about talking through the design because we have no idea who the person was who designed it. It’s supported, but when you get down to certain issues you can’t just pick up a phone and call someone.
Sanie: Is it on the issue of debugging the core itself or the integration?
Blake: It’s more the integration, but occasionally it’s the core itself. If you’re pushing a standard that you thought the core supported, but it doesn’t quite support it, now it becomes an arm-wrestling contest over what’s the right interpretation of the spec. They’re not sitting in the next cube. They’re in a different city or country.
Bailey: If it’s a bug in the core, that’s a problem with the IP. But if it’s in the integration, hopefully it’s easier to resolve.
Sanie: We don’t just look at debug as finding the source of the problem. Maybe in this case, VIP would be the appropriate solution. But there’s a broader way of looking at this.
Bailey: In the test world, there’s the IJTAG standard that helps allow propagation of the JTAG interfaces. There’s a need for a standard approach to design verification and validation, where the IP provider can choose how much of this kind of information they export and how much verification, validation and instrumentation they can put into the IP. Even if it’s just reported back to the IP vendor, it should shorten the time it takes to resolve an issue. It’s not just the pre-silicon environment. You need this to work in an FPGA prototype and potentially even in silicon.
Stapleton: You look at something you don’t understand and your immediate response is that the IP is busted and you need to go back and talk to the designer, and the outcome often is that you configured it wrong. As a result, there are lots of requests going back to the designer that are a waste of time, and there is a barrier where they’re unwilling to respond to additional requests. Part of this is a bad classification of information that goes back to a small team, which limits how much they’re going to help you in the future. They assume that if they wait a few weeks the problem will get resolved by itself.