Experts at the table, part 2: Reducing duplication in development, debug and verification; differences between hardware and software methodologies; why now?
Semiconductor Engineering sat down to discuss whether changes are needed in hardware design methodology with Philip Gutierrez, ASIC/FPGA design manager in IBM’s FlashSystems Storage Group; Dennis Brophy, director of strategic business development at Mentor Graphics; Frank Schirrmeister, group director for product marketing of the System Development Suite at Cadence; Tom De Schutter, senior product marketing manager for the Virtualizer Solutions at Synopsys; and Randy Smith, vice president of marketing at Sonics. What follows are excerpts of that conversation.
SE: How does Agile really look on the hardware side? Does it work?
Gutierrez: One of the fundamental notions of Agile is that the team commits to what they’re going to get done in the sprint. On the software side, assumptions, methods and files are more easily interchanged among team members. As one team member finishes up his or her task, they can jump on another task and help close out the sprint. On our side, we do have a team but they’re all assigned their own modules or their own IP blocks. It’s not that easy for one engineer to jump over to another engineer’s task. So when we reach the end of a sprint and we’re not able to close everything out, it gets carried over to the next sprint. That’s a little different from the software guys. The other thing I’ve noticed is that when people talk about Agile they talk about continuous integration. It’s becoming an important part, but technically it’s not part of Agile. With Pivotal, they were showing the status of their builds, and they do multiple builds per day. We’re certainly trying to do something similar, but we’re having a hard time getting through the flow. Our regimens can run for days. We don’t have the throughput to do continuous builds on our side.
Schirrmeister: The difference between Agile and continuous integration is an important point.
Gutierrez: Traditionally, from a chip design standpoint, the DV team would kick off a weekly regression. It might run a daily regression, which is smaller, but at the end of the week it would run the full weekly regression. By the time you find the bug, though, you’re running on code that’s almost a week old. If you can run those regressions more frequently, you can catch those bugs sooner. And if you push that to the limit, you can run daily regressions. Engineers typically are not allowed to check in their files directly. They’re only allowed to do a regression run. Once they complete their regression and pass it, then the file gets checked in.
Schirrmeister: In a lot of cases we’re seeing, if the change is significant there needs to be a formal review before you’re allowed to submit it to regression. On the hardware side, you have all your tests and you see what gets updated every night. They don’t have the bandwidth to run all of the regressions in the time they need, and that’s where it comes down to how fast does your regression run. The very big loop to the hardware may be six months. The smaller loop for continuous integration and regressions may be overnight. That’s where hardware acceleration comes in. It helps to merge all of that data into front-end verification management.
Smith: We’re stuck in a methodology today, though, where we don’t do enough unit testing. Then, in a very silly fashion, we test the same things all over again that we did in the unit test. There’s a lot of wasted effort. In the world we live in, which is network on chip, the customer says they have a problem. That problem often turns out to be some block that was producing a bad protocol. You could have found that at the unit level, way before you ever got to the performance-level analysis you’re doing for the network. And you should have. Part of it is that we’ve told system engineers to go run this big system test, and then run a big regression, as opposed to figuring out what you can test locally and then not bother to test that later because you’ve already tested it.
Schirrmeister: There is room for improvement. But there is a difference in scope between the unit test, the subsystem test and the system test. If it runs through to emulation and system-level test, and you find a bug that should have been caught at unit test, then you go back and beat up those guys because that needs to improve. It’s a different level of scope. If you find bugs in the units, something has gone wrong.
Brophy: And the challenge is to find something that isn’t working, as opposed to coming out of the unit test where they’ve proven something does work. That’s a mindset change. They’ve coded it and it’s ready to go, and unless someone proves it’s not working, it works. That’s where Agile helps. You’ve tested it from one test to another and you can say, as a team, this does what it was spec’d to do.
Schirrmeister: I fully agree, but the scope is different. That’s where this notion of multi-verification comes out. You need to have system-level tests that clearly define the scenarios that are driving the system integration and how the systems are supposed to perform. And you need to take those down into the IP, so that if you find an issue at the protocol level, say at the PCI level, then something went wrong at that level, and you need to either question your team or talk to your IP provider.
SE: Is this what’s happening?
Gutierrez: Years ago we would allow the IP developers to go off and develop their IP blocks. At the chip level, we would integrate those IP blocks knowing they were done or almost complete. Now, with shift left, we’re being asked to provide a platform to the software guys earlier and earlier. That means we’re developing our IP at the same time we’re doing top-level integration. By the time we integrate these blocks at the top level, the IP is not done yet. It’s not that easy to go back and beat up on the IP guys because it’s not done yet. It’s very challenging. We’re doing IP development, top-level development and software development at the same time.
Schirrmeister: So you’re talking about the new IP development, because very seldom do you have a complete greenfield development without any reuse. One of the discussions at DAC was that the system-level tests help the IP-level tests, for the new cases especially. The three cases of reuse there were vertical reuse from IP to subsystem to system level. The system level tests can provide insights into how to test the IP, and vice versa. The second notion of reuse is horizontal. That’s virtual RTL, emulation, FPGA and even the whole shift left there. The last level of reuse was reuse between disciplines. The power guy had his view of shutdowns. The coherency guy had his notion of cache coherency. The software integration guy had his own. And these guys have a hard time talking at the chip level, so Agile is an advance to the next level of continuous integration. That becomes so fast at some point that it’s becoming Agile.
Gutierrez: As we integrate the IP blocks into the top level and they get to run their stuff, the goal is to get this thing up and running. But the problem is that if you do get it up and running in the lab, there’s a tendency to cut back on verification. If it’s been running in the lab for a couple days, do you really need to run all these tests? That’s something we need to monitor closely.
De Schutter: If you look at these unit tests and information on smaller items, it is contradicting what I’m hearing from customers. You can look at individual pieces, but software is getting very complex. Most of the issues are found when you get to the system or subsystem. The other problem with software is that it needs a lot of components to run. So you can verify an IP block in the context of some verification IP. But the real problem is when you put things together and boot an OS on it and you look at initializing IP and the direction of the IP. That’s where you get to the real issues. The problem there is that all the pieces need to be assumed to work. If you’re being pushed to work on that system because the software needs it but the IP has not been fully verified, I’m not sure the Agile approach of doing more unit tests really helps. It’s contradictory a little bit. You want it as early as possible and the software needs all the components to run.
Schirrmeister: It’s all scope-dependent. Agile for the block level works. UVM works well together with verification management and all these things to keep things together. It gives you the dashboard. But it has real difficulty in extending to the system level. Scope is an issue. That’s what we’re dealing with on the horizontal reuse. It’s so big that some of the software cannot be tested with the majority of the blocks. Then if you think about all of the re-integration of your verification environment, that’s very complex. At the block level, the Agile piece works. But at the system level, you have to make sure you haven’t broken anything.
Smith: We still need to figure out a hierarchical methodology that works. There’s a distinct difference between subsystems and blocks, and systems with software. Long ago, there were hierarchical input/output approaches, where at least when you were defining a block you knew you had to test the inputs and outputs and check to see if the functions are right. You could do that hierarchically, which defined the scope of what you need to work on. We understand that well at the block level. As you come up to a higher level, you can’t test coherency using one CPU. That’s not going to help. So you have to come up with what is the subsystem you need to analyze, what are the dependencies, and what are the blocks that need to be included in that test. And that’s how you do it in an Agile environment. You do think about things in terms of subsystems, and what’s unique about a particular subsystem and what tests need to be designed for that subsystem.
Schirrmeister: You said block, subsystem, system with software. I would argue there’s always a pair of hardware and software. Even with a block it’s block plus driver. It’s subsystem plus some of the software. And then you have SoC with all kinds of software pieces, even where some of the pieces don’t know the others exist. That’s what we’re seeing in horizontal reuse. If all these things are integrated early, it puts more pressure on the hardware guys.
Smith: Those things are inputs. When you’re talking about software drivers, that’s going to set some flag that your block is going to pay attention to. It’s an input. If you define all of the inputs and outputs well at the unit test level, you need to be able to verify the inputs, the outputs, and that the process was correct. The process part is the easy one because it’s self-contained.
Brophy: We’ve been spending a lot of time talking about processes and tools over individuals and interactions, which are part of the Agile Manifesto. Bringing groups and people together in ways that are not as natural on the hardware design side is one of the big experiments for Agile and hardware. I look at the amount of wasted verification cycles using current tools and technologies that do nothing to advance working designs. Constrained random, which we invented in the early days, generates tons of stimulus that doesn’t tell you whether you have a working design or not. Are there other softer elements, such as linking first just to get that right? I still come back to, ‘What have we learned?’ and ‘What are we watching?’ We’ve been talking about two groups. If we continue on the same arc we’re on now, keeping the same level of interaction is not going to cure that or help it. One thing that might stabilize it is making those two groups into teams up front. What are the criteria by which you know you’re done? Is it code coverage? 100% coverage closure? It ran in a lab for 10 days? I subscribe to the idea that if it ran well in the lab, you may be further along to something that works in the end than trying to hit these other metrics. Those point to an area of under-exercised design, but if a design won’t be used for a particular configuration, it’s okay to under-exercise. Are we willing to change the way individuals work together? It’s worked for the software guys, and I’ve seen some efforts where the hardware guys are willing to get work done adopting those techniques.