Experts At The Table: Debug

Second of three parts: The challenge of IP; what gets automated; false positives; the need for context; divide and conquer strategies; the benefits of reuse.


By Ed Sperling
Semiconductor Engineering sat down with Galen Blake, senior verification engineer at Altera; Warren Stapleton, senior fellow at Advanced Micro Devices; Stephen Bailey, director of solutions marketing at Mentor Graphics; and Michael Sanie, senior director of verification marketing at Synopsys. What follows are excerpts of that conversation.

SE: The amount of IP is increasing and interacting in ways people didn’t always expect. But it’s also black boxes. What happens when you can’t see inside?
Blake: So you have 1 million lines of code from your FPGA tool, 500,000 lines of code in your design and another 500,000 lines in your testbench. If you look at where your bugs are, that’s roughly 25%, 25% and 50%. And a lot of times you get tricked. What looks like a bug is actually a problem someplace you weren’t looking. It isn’t always the IP block. Sometimes it’s the VIP, sometimes it’s the tools you’re using that give you a false report. You have to go through detailed logs to figure out what’s wrong. Maybe it misinterpreted something.
Stapleton: Part of the analysis is whether you’ve configured IP in a valid way.
Blake: That is definitely one of the challenges.
Sanie: You write code in C++ and there are no tools to do the debug. That’s a whole new way of adding problems into the design.
Stapleton: If only 25% of the problems were in verification, that would be a goal for us to achieve. There are no processes in place to make sure the verification code is good. Part of the problem may be the high-level model. There are not techniques and tools to verify that.
Blake: The SystemVerilog code is half hardware and half software. The object-oriented portion of Verilog is software. It’s not RTL.
Bailey: And it brings different debug paradigms. You can look at transaction stuff, but it’s higher-level message passing sequences of events and being able to follow those that becomes difficult.
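
As a rough illustration of Blake's opening figures, the short Python sketch below turns the quoted line counts and bug shares into a relative bug density per line of code. Mapping 25%, 25% and 50% to tool, design and testbench in that order is an assumption based on the order in which he lists them, and the names `lines`, `bug_share` and `relative_density` are made up for this example.

```python
# Line counts quoted by Blake. Assigning the 25/25/50 split to
# tool/design/testbench in that order is an assumption.
lines = {"tool": 1_000_000, "design": 500_000, "testbench": 500_000}
bug_share = {"tool": 0.25, "design": 0.25, "testbench": 0.50}


def relative_density(lines, bug_share):
    """Bug share per line, scaled so the lowest-density category equals 1.0."""
    raw = {name: bug_share[name] / lines[name] for name in lines}
    base = min(raw.values())
    return {name: raw[name] / base for name in raw}


density = relative_density(lines, bug_share)
for name, value in density.items():
    print(f"{name}: {value:.1f}x")
```

Under that assumption the testbench carries about four times as many bugs per line as the tool code and twice as many as the design, which fits the panel's concern about the quality of verification code.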

SE: How much of debug can be automated?
Sanie: There are methodologies out there to verify software code. There are RTL and formal. There is code coverage for testbenches, too, although I’m not sure how effective that is. There’s a whole new paradigm that doesn’t get automated, too, and it’s hard to find those bugs.
Bailey: I don’t know of any automation in debug. I know of ways of presenting better views of what’s happening.
Blake: Michael has a point. We’ve got code coverage tools to take a magnifying glass to RTL. We have some RC code. It would be nice if our EDA partners would give us the same kinds of things our customers have. That kind of analysis would help us with our testbenches so we’re not wasting time thinking we have an IP bug when actually it wasn’t one.
Stapleton: There are some carrier tools that inject something into your RTL to see if they can catch a bug, but they don’t tell you if you’ve messed up your verification environment.
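
The fault-injection idea Stapleton describes can be sketched with a toy Python model. This is not how any particular commercial tool works. The `design` function, the three mutants and both testbenches are hypothetical stand-ins for RTL and a verification environment.

```python
def design(a, b, sel):
    """Reference model of a tiny block: a + b when sel is set, otherwise a."""
    return a + b if sel else a


# Hypothetical injected faults: each is a deliberately broken copy of the design.
mutants = {
    "add_becomes_sub": lambda a, b, sel: a - b if sel else a,
    "sel_inverted": lambda a, b, sel: a if sel else a + b,
    "else_returns_b": lambda a, b, sel: a + b if sel else b,
}


def make_testbench(vectors):
    """Build a testbench that compares a device under test with the reference."""
    def run(dut):
        return all(dut(*v) == design(*v) for v in vectors)
    return run


def surviving_mutants(testbench):
    """Names of injected faults that the testbench fails to detect."""
    return sorted(name for name, dut in mutants.items() if testbench(dut))


weak_tb = make_testbench([(1, 0, 1), (2, 0, 1)])    # b is always 0, sel always 1
strong_tb = make_testbench([(1, 2, 1), (3, 4, 0)])  # exercises both branches

print(surviving_mutants(weak_tb))    # all three faults slip through
print(surviving_mutants(strong_tb))  # every fault is caught
```

A surviving fault shows that the stimulus or checking is too weak to notice it. In line with Stapleton's caveat, the exercise says nothing about whether the reference the testbench compares against is itself correct.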

SE: What are EDA customers saying?
Sanie: It’s very consistent. There are new elements that get involved here. One is analog/mixed signal. Another is low power, which is a big issue. It’s the same kind of complexity with different flavors of it. So think about protocols. These are all very complex. If you’re working with USB, it’s very complex and you can’t call the guy next to you and say, ‘How do I deal with this problem?’
Bailey: If you look at trying to find ways to automate debugging, the one thing that we’ve seen that’s successful is being able to trace back automatically. You can trace back values to the point where it’s no longer that value. There have been attempts to use formal to debug, but you run into problems—especially at the SoC level—because of the state space explosion. It becomes very limited very quickly.
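
The trace-back Bailey mentions, following a value back to the point where the signal no longer holds it, can be sketched in a few lines of Python. This is a minimal illustration over a list of per-cycle samples, not a description of any vendor's implementation. The function name `trace_back` and the sample data are assumptions.

```python
def trace_back(samples, t):
    """Return the earliest cycle from which the value seen at cycle t
    has been held continuously up to t."""
    if not 0 <= t < len(samples):
        raise IndexError("t is outside the recorded trace")
    value = samples[t]
    # Walk backwards while the previous cycle still shows the same value.
    while t > 0 and samples[t - 1] == value:
        t -= 1
    return t


# Hypothetical per-cycle samples of an error flag.
err_flag = [0, 0, 0, 1, 1, 1, 1]
origin = trace_back(err_flag, 6)
print(origin)  # cycle at which the flag first took its final value
```

A real debugger would repeat this step across the drivers of the signal, moving from the cycle returned here to whichever input changed just before it.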

SE: Everyone is building some level of formal verification into their platforms, right?
Bailey: Yes. We’ve had a lot of success using a formal engine in conjunction with verification IP. But that’s not our main debug. You have to understand what it was supposed to do. Where we’ve seen the best use of formal is assertions narrowed down to hard-to-find bugs. Often we hear about bugs that show up and people can’t recreate them in simulation reliably, so we come up with a set of narrow assertions to keep focusing and focusing. That way when they find out what the problem is, they can use the same set of assertions to verify the bug is fixed. But that’s pretty extensive. Most people never go to that kind of effort up front. Even if you just do this up front with the IP interface blocks to make sure you integrate correctly and that you use it in the way it was intended to be used—most people don’t even do that. Third-party IP providers should do that. The other option is raising the abstraction to provide better ways of understanding what’s happening. So if you have a finite state machine, it’s hard to use formal technology or even synthesis-level technology to understand a distributed finite state machine. At least for UPF you can specify the power states in there. But there’s nothing there to more generally describe these things at a system level, which would be very useful. We’re looking at ways to do this so you can understand coherency and coverage metrics at an SoC level. The more we go in that direction, the more the user will have to provide information to help the tool understand the intent, or it will be a broader class of verification IP where you understand the application domain like coherency or the AMBA protocol around it.
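
Bailey's point about reusing a narrow assertion, first to pin down a bug and then to confirm the fix, can be shown with a small Python checker run over recorded traces. The request/acknowledge rule, the two-cycle limit and both traces are invented for illustration. In practice this would be written as a SystemVerilog or PSL assertion and handed to a simulator or formal engine.

```python
def req_gets_ack(trace, max_latency=2):
    """Return the cycles of every req not followed by an ack within
    max_latency cycles. Each trace entry is a (req, ack) pair for one cycle.
    A req too close to the end of the trace is flagged as well."""
    failures = []
    for cycle, (req, _) in enumerate(trace):
        if req:
            window = trace[cycle + 1 : cycle + 1 + max_latency]
            if not any(ack for _, ack in window):
                failures.append(cycle)
    return failures


# Hypothetical traces captured before and after a fix.
buggy_trace = [(1, 0), (0, 0), (0, 0), (0, 1)]  # ack arrives three cycles late
fixed_trace = [(1, 0), (0, 1), (0, 0), (0, 0)]  # ack arrives one cycle later

print(req_gets_ack(buggy_trace))  # failing cycles before the fix
print(req_gets_ack(fixed_trace))  # empty once the fix is in
```

The same check runs unchanged on both traces, which is the property Bailey highlights: the assertion that exposed the problem doubles as the regression check for the fix.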

SE: It sounds as if the problems in debug are directly proportional to the rise in complexity. Are we looking at the problem wrong? Does a modular approach, such as stacking die, simplify this problem?
Blake: People get too crazy and fancy. You have to find ways to deal with this so you’re not confusing yourself. You may have to turn off a bunch of features so you’re not confusing some other part of the system. People tend to be a little too ambitious and not focus on the real objective of the test.
Sanie: So the recommendation is to do less design?
Blake: Not less design. But it’s a matter of what feature in the design you’re trying to test.
Sanie: You’re taking a more modular approach.
Blake: The key is not too many distractions.
Bailey: That works to an extent. The traditional way of managing complexity is to break it down into manageable pieces. But at some point, with the complexity of today’s SoCs, that involves ways these things interact. You still need to provide those scenarios, because if you don’t your customers will.
Blake: That isn’t the only thing you have to do, for sure. But it’s one step you should do. The problem is debug, not system-level validation. For debug, you’re trying to pick away complexity.
Stapleton: Related to that, I’ve seen a lot of effort going into ways to architect away that complexity. These chips already have enough performance, so now the shift is toward more customization and efficient design. Because of that, some of the complexity can be removed.

SE: Isn’t that application-dependent? In the data center they want as much performance as they can get.
Stapleton: Even there it’s more about low power. So now you can take your foot off the accelerator and start worrying about other problems.
Sanie: But now you have additional things to debug.
Bailey: Getting back to breaking down the problem and 3D, that’s just adding another level of hierarchy. The benefits you get in terms of reduced time in debug come from the reuse factor. The more something is reused, the more stable it is. If you can get to the point where you have a wafer in 3D silicon that’s completely reused over multiple generations of chips, you’re saving a lot in terms of design and verification debug. If you take reuse today from the USB peripheral block all the way up to a wafer on a 3D stack, it’s a huge savings. You still have to do it, but you do it once.
Sanie: IP guys are looking at this. Reusing subsystems is already happening.

SE: This is pointing toward a platform strategy. Where are we with that?
Blake: We certainly do pull together subsystems. The challenge in debug is when you have a combination of those offerings on the FPGA side, and the SoC side where you have a hardened SoC block in the middle of the FPGA. You have soft logic in the IP and the SoC itself. When you look at how these things interact, the complexity skyrockets. If you look strictly within the SoC, we do less of that.
Stapleton: Over time people develop subsystems and debug them, and what we find is the boundaries they use aren’t always appropriate for the new world. Subsystems have to be broken down and reassembled into a different subsystem. If you have the boundary in the wrong places, it’s worse than having none.
Blake: And when you look at overall power and performance, they almost have to be redone. You’re trying to take advantage of technology that wasn’t involved in the subsystem.
Stapleton: On the methodology side, subsystems create areas where people add their own methodology within that subsystem. It becomes more difficult to integrate with a different subsystem that comes out of a different team. You can still take individual pieces and integrate them. You don’t necessarily have to integrate the whole. That’s often a better approach.

To view part one, click here.