Creating Better Models For Software And Hardware Verification

Experts at the Table: Rethinking approaches for more complex systems, ISAs, and chiplets.


Semiconductor Engineering sat down to discuss what’s ahead for verification with Daniel Schostak, Arm fellow and verification architect; Ty Garibay, vice president of hardware engineering at Mythic; Balachandran Rajendran, CTO at Dell EMC; Saad Godil, director of applied deep learning research at Nvidia; Nasr Ullah, senior director of performance architecture at SiFive. What follows are excerpts of that conversation. Part one of this conversation can be found here. Part two is here.

SE: We’ve been heavily reliant on models in the past. Can we get by with those same kinds of models in the future, or do we have to start rethinking everything?

Garibay: For chiplets, everybody has to be on the same page. They all have to be interoperable. In the industry, I’m not aware of any merchant product that has silicon from different vendors in one chiplet-ish type of package, other than HBM memory. It’s kind of a business issue. We’ve never solved the problem of what happens if the integrated package fails and who pays for it. The lack of some form of sign-off is an opportunity for the interconnect IP vendors and the EDA companies to step in and become some form of honest broker and say, ‘Okay, I’ll log your interface, validate some VIP for it as a service, and have that available for other companies.’ So there’s an opportunity for a third party who’s able to validate that these two things should work together. The whole liability issue is what has really constrained the market for chiplets up until now. I’ve tried three times to get something like this off the ground and talk to other companies. Everybody’s excited until the business guys and lawyers get involved and say, ‘Well, how are you going to solve this?’ That’s a real problem. Popping it up a level, where we get to the models, we have to figure out how to define these models in the same way we have for generations. There’s probably a big opening here for formal verification, to be able to say all these chiplet interfaces have to be formally verifiable. And some EDA company that’s doing formal can step in and say, ‘Yep, here it is.’

Ullah: I’m a big fan of emulation. However, it is still too expensive. It costs hundreds of thousands of dollars. We made a lot of progress in a previous company, where we had to build Android software and run a bunch of applications — including stuff with Java on them — to make sure things were working. That’s an extensive amount of software, and it is very hard to do that on a purely functional basis. So I was a big fan of having either a functional model or a performance model in a hybrid scenario. We used Mentor’s emulation platform and were very successful with RTL that was coming from a specialty chip, and then worked with a third party to make sure an OS got booted — Linux or Android. And then we worked on getting the software running. That was very successful for performance, power and functionality. We need to leverage that kind of hybrid model a lot more, where we can focus on functional performance, RTL, and mix and match these things. But not every company can afford to buy an emulator and set all this up. So if the companies can work out some method like Amazon does, where you can get time on an emulator to put your system in and mix it up with your software, that is where we have to go for the future.

Godil: Emulation was such a game changer when it came to verification methodology. Having the ability to use your software suite as a testing platform was powerful, especially as we built more and more complex chips. That has an interesting tie-in with models as we build new chips with new functionalities. Being able to leverage the software for them is essential. The software team needs to go develop their software, so that once your hardware is ready, they have the software ready to go. There’s this sort of dependency, and one way to remove that dependency is to provide the software team with higher-level models on which they can develop their software. There are different levels of models. Functional models may have a lot of details, but the software team doesn’t need that kind of fidelity. There are lower-fidelity models that are much faster, and the software team can do all of their development on those. We’re going to see more of that to enable software to shift left, so that as soon as we can get it on the emulator platform we can run the software and have that ready to target new features.

SE: What’s different about these models?

Godil: Every model that you build has a big cost to it. Somebody has to go build it, and then you have to maintain it, so you have to be very careful in finding out what the value is and trading that off with the cost. There is a lot of value in developing newer models that can pull in software, and we’re going to see more of that as we go forward. As you build more and more hardware features that need targeted software, the software teams are going to need some kind of platform for their development before the design is ready. Functional models take too long sometimes, especially high-fidelity functional models. There are a lot of changes, so you can’t really build the functional models too far ahead of the designs. There are all these code development aspects. But the software team doesn’t really care about all those details. They need something higher-level, if you will — an ISA simulator, if you’re thinking about processors. You have these ISA simulators that are developed very early on that don’t have any of the microarchitecture details or a lot of the functional details that the other teams care about. You’re going to see that for the SoC as well. We’re going to borrow those concepts from processor development, and you can apply them to other hardware designs, as well.

Ullah: For models from processor work, I agree we need a high-level ISA model for software development. We also need a higher-fidelity functional model for doing verification more accurately, a low-fidelity cycle-approximate model (maybe functional with table-based timing), and then a high-fidelity, cycle-approximate performance model to do microarchitecture work. And then those should be able to tie to RTL or SystemC, or other models, in a hybrid fashion.
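As a rough illustration of the table-based, cycle-approximate timing Ullah mentions, the sketch below sums per-instruction-class latencies from a lookup table over a trace produced by a functional run. The instruction classes and cycle counts are invented placeholders, not numbers from any real core.

```python
# Sketch of a table-based, cycle-approximate timing estimate: each
# instruction class gets a fixed latency from a table, and the model
# simply accumulates those latencies over an executed trace.
# The classes and cycle counts here are invented for illustration.

LATENCY_TABLE = {
    "alu":    1,    # simple integer ops
    "mul":    3,
    "load":   4,    # assumes an L1 hit; a real model would split hit/miss
    "store":  1,
    "branch": 2,    # assumes a flat, misprediction-free penalty
}

def estimate_cycles(trace):
    """trace: iterable of instruction-class strings from a functional run."""
    total = 0
    for insn_class in trace:
        total += LATENCY_TABLE.get(insn_class, 1)   # default to 1 cycle
    return total

# Example: a short trace produced by a functional/ISA model.
trace = ["load", "alu", "alu", "mul", "store", "branch"]
print(estimate_cycles(trace), "cycles (approximate)")
```

A higher-fidelity performance model would replace the flat table with pipeline, cache and branch-predictor state, which is exactly the fidelity-versus-speed trade-off being described here.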

Godil: What I’m arguing for is a low-fidelity fast model, like the ISA simulator, but for the rest of the design. That’s what’s missing right now. So many of our hardware designs are now exposing things to software, and the software team really needs something to be able to go target those features. Simulators are not new. We know how to build them. What we don’t have is that kind of an abstraction-level model available for the rest of our hardware chips. We’re going to see more and more of that as you try to pull in the software for testing.
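To make the idea of a low-fidelity, instruction-level model concrete, here is a minimal sketch of an ISA-style simulator for a made-up three-instruction machine. The instruction set, encodings and register file are invented for illustration and do not correspond to any real ISA; the point is that only architectural state is modeled, with no pipeline, cache or timing detail.

```python
# Minimal sketch of an instruction-level (ISA) simulator for a made-up
# three-instruction machine. No microarchitectural detail: no pipeline,
# no caches, no timing -- just architectural state updated per instruction.

class ToyISASim:
    def __init__(self, program):
        self.regs = [0] * 8          # architectural register file
        self.pc = 0                  # program counter (index into program)
        self.program = program       # list of decoded instruction tuples

    def step(self):
        op, *args = self.program[self.pc]
        if op == "addi":             # addi rd, rs, imm
            rd, rs, imm = args
            self.regs[rd] = self.regs[rs] + imm
        elif op == "beq":            # beq rs, rt, target
            rs, rt, target = args
            if self.regs[rs] == self.regs[rt]:
                self.pc = target     # taken branch: redirect the PC
                return
        elif op == "halt":
            return False
        self.pc += 1
        return True

    def run(self, max_steps=10_000):
        for _ in range(max_steps):
            if self.step() is False:
                break
        return self.regs

# Example program: count r1 down from 5 to 0, then halt.
prog = [
    ("addi", 1, 0, 5),      # r1 = 5
    ("beq", 1, 0, 4),       # if r1 == r0 (always 0), jump to halt
    ("addi", 1, 1, -1),     # r1 -= 1
    ("beq", 0, 0, 1),       # unconditional branch back to the test
    ("halt",),
]
print(ToyISASim(prog).run())   # register file after the run; r1 ends at 0
```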

Schostak: One of the things formal forces people to do is think more about how to actually define interfaces and everything else. With simulation and emulation, it’s too easy to say, ‘This RTL is good enough. I’m now going to run some tests and see whether it works. If it works, that’s fine.’ I’m not saying that’s standard practice. But it’s a very easy trap to fall into, whereas the formal tool is definitely going to tell you if there are things that aren’t fully tied down in your specifications. That provides a good opportunity to get well-defined interfaces for communicating across boundaries inside the model. We’ve talked about time to market. You’re bringing forward the RTL schedules, and then the window you’ve got in which to build the model while it’s still useful is shrinking, because you haven’t got RTL that can run in an emulator. If I’ve got a high-level specification of what I’m going to build, what can I actually generate from it? Maybe I can generate some aspects of my ISA model. That will give that person a head start, as well.

Garibay: A lot of what everyone is talking about is very processor- or SoC-centric. In AI chips we have a C model. But in order to do a full image analysis, running one image through the C model takes teraOps. That’s one of the reasons I believe the verification of true AI chips is different and slightly more challenging. If you’re going to be verifying at a level where you’re exposing each multiply and each add, the number of operations is so large that, in Verilog, we can’t even do one image in a reasonable amount of time. So you’re reduced to saying, ‘Okay, let me get this corner.’ Without an emulator, we’d be dead. We couldn’t develop our compiler.

SE: In the past, when we were developing advanced-node chips, they mostly were going into consumer devices or servers. Some of the most advanced designs now are being done for automotive and robotics, and they’re supposed to work for more than a decade. What can be done from a verification standpoint to make this possible? Is verification shifting left and right?

Garibay: Safety and security historically have been perceived more as architecture, and then architecture defines tests that can be used to verify them. I’m not excited about riding around in or driving cars that are based on 5nm. The functional safety side of this is super interesting. It really forces a decomposition. You have to break everything down to, ‘This is the safe part, this part is not safe, and this is the interface between them. This is how I judge those things.’ And then for AI, there is no functional safety definition. How do you judge something that is 72% accurate as functionally safe? Some vendors are implementing a scheme where they have a processor from Company A and a processor from Company B, you run them both in parallel, and you vote one off. It’s like avionics, where you have a voting system among three results. This is a huge, huge open space, and it’s not clear how we even structure these problems such that they are verifiable, or amenable to techniques for verification. And it’s constraining how we move forward with it.
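The redundancy-and-voting arrangement Garibay describes can be sketched in a few lines: run the same computation on independent units and accept the majority result. The units below are placeholder functions with one injected fault, purely for illustration.

```python
# Sketch of redundancy voting: run the same computation on redundant
# units and accept the majority result. With two units you can only
# detect a mismatch; with three you can also mask a single faulty unit
# (classic triple modular redundancy, as used in avionics-style voting).

from collections import Counter

def vote(results):
    """results: list of outputs from redundant units (assumed hashable)."""
    value, votes = Counter(results).most_common(1)[0]
    if votes > len(results) // 2:
        return value                       # a strict majority agrees
    raise RuntimeError("no majority -- raise a functional-safety fault")

# Hypothetical redundant units computing the same function.
unit_a = lambda x: x * x
unit_b = lambda x: x * x
unit_c = lambda x: x * x + 1               # injected fault for illustration

x = 7
print(vote([unit_a(x), unit_b(x), unit_c(x)]))   # 49: the faulty unit is outvoted
```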

Godil: If you work in the ISO 26262 standard, one thing that becomes apparent really quickly is the need to document every single step of your verification flow. Every time I hear documentation I think of overhead. It’s more bureaucracy that engineers have to deal with. But my hope is that we can build tools to eliminate that overhead, and I’m actually a bit positive about it. The more we invest in documenting, the more it will help with the other side, where we want to instrument our processes to get the data so we can feed it into and train intelligent systems. Formalizing our processes and documenting every step that we do will take us a step in that direction. I view functional safety as helping with that effort. It aligns well with the needs for both things. But we do need better tools to help eliminate a lot of the overhead that those functional safety flows impose on the teams.

Schostak: With functional safety and security, there is that documentation overhead. If a bug escapes out into the field, you can’t really fix the hardware. The hardware already will have gone through the certification process to be functionally safe, etc., and that takes years to go through. So you can’t rely on bugs being found in the field. It’s not like you can re-spin it and fix it within a year. It’s not that simple. It will still require several years of rework to fix the hardware and the software that runs on it. That forces a higher level of assurance that there isn’t anything that is going to affect functional safety-critical or security-critical systems, where the protections are generally standard things like ECC or redundant devices.

Ullah: Getting better tools is important, but really it depends on testing, testing, testing. I worked on the Motorola RISC chip, the 88100, which was in the nose cone of a Patriot missile, and every time a Patriot missile missed a Scud during the Iraq War I wondered if it was some software I wrote. The F18 had an 88110 from Motorola, too. When an F18 crashed, we had to figure out why. The cause turned out to be in the nested interrupt controller. I had to debug that.

Rajendran: One of the challenges we didn’t talk about from a tool point of view is that the tools are all evolving with ISA standards, and there are new things being added. But from an organization point of view, the organization itself isn’t there. They don’t have frameworks and infrastructure in place, so if a bug is found on a product 15 years from now, they cannot simulate it and say this is a bug. There is no methodology in place. Can I still use the Synopsys or Cadence tools that I used 15 years ago? Will they still be supported if I have issues? Those are unanswered questions. And for all the verification that I did, now I have to retain data for 15 years. So there is a data explosion within organizations. Some organizations are thinking ahead and looking into it, but others are still clueless at this point. They’re focusing on the design part of it, but they’re not looking at what they’re going to do if a chip comes back.




Peter C Salmon says:

I’m not a testing professional, so my idea may be flawed. However, after reading the above comments, I think the following makes sense. Provide a small test chip in the system. Run the application software on just-built systems, attempting to exercise all the functional corners. Capture the behavior by monitoring one or more system buses, sampling only at strategic points in the program. Test multiple systems and compare the results to find a “known good system”. Store the behavior of the known good system as test vectors. The software to do this should be less than for modeled approaches. Continue over time to monitor the buses. When a change occurs, isolate the cause down to one of the chips, or a segment of one of the chips in the system. Maintain a map of good and bad devices, and swap out any newly failed device or segment. This approach is conceptually similar to one that has evolved for flash memories.
Any takers for this concept?
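One way to read this proposal is as a golden-signature compare: reduce the bus samples captured at strategic points on a known good system to a signature, then flag later runs that diverge. Below is a minimal sketch under that interpretation; the sample points and bus values are invented for illustration.

```python
# Rough sketch of the commenter's idea: capture bus samples at strategic
# points while running the application, reduce them to a signature, and
# compare later runs against the signature of a "known good system".
# Sample points and bus values here are invented placeholders.

import hashlib

def signature(bus_samples):
    """bus_samples: iterable of (sample_point, bus_value) tuples."""
    h = hashlib.sha256()
    for point, value in bus_samples:
        h.update(f"{point}:{value:08x}".encode())
    return h.hexdigest()

# Signature captured once from the known good system.
golden = signature([(0, 0xDEADBEEF), (1, 0x00000042), (2, 0xCAFEF00D)])

def check(bus_samples):
    if signature(bus_samples) == golden:
        return "match"
    return "divergence -- isolate the failing chip or segment"

print(check([(0, 0xDEADBEEF), (1, 0x00000042), (2, 0xCAFEF00D)]))  # match
print(check([(0, 0xDEADBEEF), (1, 0x00000043), (2, 0xCAFEF00D)]))  # divergence
```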
