Experts At The Table: Verification Nightmares

Last of three parts: Killer bugs, synchronizing TLMs with RTL, and where the verification headaches have moved (they certainly haven't gone away).


By Ed Sperling
Low-Power Engineering sat down with Shabtay Matalon, ESL marketing manager in Mentor Graphics’ Design Creation Division; Bill Neifert, CTO at Carbon Design Systems; Terrill Moore, CEO of MCCI Corp., and Frank Schirrmeister, director of product marketing for system-level solutions at Synopsys. What follows are excerpts of that conversation.

LPE: How important is a high-level model in verification?
Matalon: If you have a reference model that is a TLM and you have a good way to find equivalency between the TLM and RTL, then why not give the TLM to software designers? For many applications a TLM without timing will be sufficient, except for certain timing-critical tests. You also need a TLM that models timing accurately for the approximately timed level. But you can create hundreds or thousands of replicas of a TLM platform, and the replication is free, or almost free. For the software guys, that’s a very powerful solution.
Schirrmeister: And that’s the challenge. We haven’t quite figured out how to do equivalency checking against the TLM.
Neifert: The average SoC has tens to hundreds of blocks. If you’re starting from scratch and want to generate your TLMs, that’s a great approach. But what we’re seeing is that companies are only developing 20% to 40% of this IP internally and the rest they’re getting from outside. Who knows what form that stuff is in.
Moore: Isn’t equivalence checking intrinsically hard?
Schirrmeister: Yes, but from a verification perspective everything we do has to add up to less than what we do today. If adding TLMs to your software isn’t helping you to reduce the time you spend on verification, people will be hesitant.
Matalon: Allow me to disagree. First, I’m not talking about using formal methods to validate equivalency between a TLM and RTL. But inherently when you build an OVM environment to validate a block, you need a reference model. What does it mean to do verification of the RTL? It’s a comparison. We always validate RTL by comparison. You can use simulation techniques to say this TLM is functionally equivalent to the RTL that’s getting implemented. If your TLMs allow you to model the registers correctly and things that maybe in the past weren’t done, we can assemble TLMs. It will be the standard practice of every IP provider to provide a TLM 2.0-compatible model. For the re-used IP, which constitutes 80% of the design, I think this opens the door for replication of the TLM.
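Matalon’s point — that RTL is always validated by comparison against a reference model — is the scoreboard pattern from OVM/UVM testbenches. The sketch below illustrates the idea in plain Python; the class names, the saturating-add behavior, and the stand-in "RTL" function are all hypothetical illustrations, not any vendor’s API or methodology library.

```python
# Scoreboard pattern: validate a DUT by comparing each transaction's
# result against a golden TLM reference model (illustrative only).

class TLMReference:
    """Untimed golden model: computes the expected result per transaction."""
    def process(self, txn):
        # Illustrative spec: a saturating 8-bit add
        return min(txn["a"] + txn["b"], 255)

class Scoreboard:
    """Collects mismatches between the DUT and the reference model."""
    def __init__(self, reference):
        self.reference = reference
        self.mismatches = []

    def check(self, txn, dut_result):
        expected = self.reference.process(txn)
        if dut_result != expected:
            self.mismatches.append((txn, dut_result, expected))
        return dut_result == expected

def run_comparison(stimulus, dut):
    """Drive every stimulus transaction through the DUT and score it."""
    sb = Scoreboard(TLMReference())
    for txn in stimulus:
        sb.check(txn, dut(txn))
    return sb

# Stand-in for the RTL simulation: deliberately buggy, it wraps on
# overflow instead of saturating.
def buggy_rtl(txn):
    return (txn["a"] + txn["b"]) & 0xFF

stimulus = [{"a": 10, "b": 20}, {"a": 200, "b": 100}]
sb = run_comparison(stimulus, buggy_rtl)
print(len(sb.mismatches))  # the overflow case is flagged: 1
```

In a real flow the reference model would be the TLM 2.0 model of the block and the DUT side would be an RTL simulation driven through transactors, but the comparison structure is the same.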
Schirrmeister: I agree on the equivalence verification. We’re not quite there with IP providers. But the challenge with TLM models is that, because you don’t have synthesis and formal techniques, they are not an ordinary part of every design flow. It’s an additional effort. And what happens at the end is someone changes the RTL before tapeout and people don’t keep the TLM models in sync with what ends up being implemented.
Neifert: They should be generating this automatically from the RTL. Then you solve that bottom end.
Matalon: If the RTL has changed without updating the reference model, then you haven’t validated your RTL in the context of the system. It’s all about bridging between the transaction level and the RTL. To change functionality, you change two lines of code.
Schirrmeister: For us on the TLM side, it’s always an investment decision.
Moore: It all comes down to economics. Why do people do something stupid like changing the RTL without changing the model? It’s because they think they’ll make money by doing so.
Schirrmeister: And if you have a set of hundreds of them it’s hard to keep them in sync.

LPE: Let’s talk about economics. Verification used to be 70% of the NRE. Is it going up? Or is it now blurred between what’s verification and what isn’t?
Neifert: It’s getting blurred. It’s as much an integration issue as anything else. You’re obviously spending more money now, but the integration task is taking over some of verification because people are using software to drive some of this. Is it a software budget or a verification budget? I don’t think you can draw that line as definitively anymore because you probably re-use some of that stuff in your software once you verify things work.
Schirrmeister: Verification definitely is going up overall. The question is where it’s done. Hardware verification has gone down and the hardware verification manager is thrilled. But the verification nightmare has shifted to software.
Moore: The classic example is a baseband processor in a cell phone. That processor contains boot code and it has to operate the USB and operate the software during mass production. If that doesn’t work you don’t have a product. And because it’s sitting in the mask, that boot routine has got to be right.
Matalon: Verification doesn’t go down as a whole. How can it go down with increased complexity from multicore and functions implemented across hardware and software? It really depends on who the verification manager is. If I’m the verification manager and I’m confined to SystemVerilog and my life is to carve out verification for hardware blocks, my life is easier. Now there are off-the-shelf transactors and you can just add them in. But if you’re the verification manager who has to validate that your design is correct and meeting spec at the system level, and you’re also responsible for meeting performance and low power, then the load is not going down. And if you’re not keeping up with advanced methodologies, you will be in trouble.
Moore: And as each node comes along the absolute cost of failure is escalating.
Matalon: If you don’t validate early and catch what you call stupid bugs, or in some cases nasty bugs, you are in trouble. That’s where the shift is happening. The kind of verification people do will shift from the block level to the newer ESL space where there isn’t maturity yet.
Schirrmeister: Verification is never complete. It’s a question of when you are comfortable enough. But the sword of Damocles is always hanging over you. If you mess up the chip, it’s $3 million for a new mask in non-recurring costs. If it’s software, there’s always service pack two. But in the case of Toyota, the impact can be devastating.
Moore: The economics of Moore’s Law were such that shipping fast was imperative.
But it’s not just Toyota. There’s a strong suspicion there have been numerous glitches in drive-by-wire. If you look into it, there are lots of situations where software problems are present. Verification is a hard problem, and you have to set up your workflow so you’re throwing every tool at it that’s economically justified.
Matalon: I don’t see any tool being taken off the table. It’s all methodology and which tool you use when. Not everyone is using the more innovative technologies. For someone used to waiting for the silicon to come back, then sticking it on a board and validating it, this is a huge transition. Validating by writing a model at the SystemC level and dealing with virtualization is much different. You need to know when to use each tool, use the right tool as early as possible, and take advantage of what you developed earlier during the downstream phase. If you use TLMs early, then you re-use those TLMs when you verify. And if you use silicon validation, which is not going away, then use all the tools you used before as a reference for debugging, running system-level scenarios where you replicate problems you see in silicon and validate your original assumptions. If you have to debug a problem on your silicon, that’s very hard. You can use emulation as a reference for debugging. You can use a transaction-level model with timing and power information to compare against what you received from silicon, see where you made the mistake and how to fix it.
Neifert: That TLM framework can be used throughout the debug cycle, which is where you get the dollars to justify it. Initially you can look at it as an incremental expense. But when you look at how it scales and you realize you don’t need to generate an independent model for this and an independent model for that, that’s where the real value is.

LPE: What are the bugs that are fatal? Are they power? Design?
Neifert: If you look at the stuff that makes the news, it was the division problem in the Pentium. That was a pure hardware bug. If you applied the verification tools of today, it would have been caught. Today it’s not just hardware. Most engineers think there’s some aspect of software involved.
Schirrmeister: The fatal bugs are the ones that cannot be corrected in software today and which end careers and are not reported on. Those are the ones you never read about.
Moore: The fear for most companies in our space is the bugs that kill companies. What causes those? Mask spins. And what typically causes those are system-level problems. You go to hook it up to a critical system and it doesn’t work. And it’s down at the RTL level, and it’s not accessible because all the protocols are running too fast.
Matalon: If I look at a functional bug that can be overcome by software, it’s not a fatal bug. A functional bug that cannot be fixed by software is fatal. But the reality is that if you have a performance issue or a power problem, where does it stem from? It’s probably because you’ve validated hardware in isolation, but not in the context of the software. Those are the things that are fatal and scare users away. We have a customer that designed IP and wanted to find out if it would meet performance in the context of the system. They couldn’t simulate at the gate level so they abstracted to the TLM. There are ways to fix functionality sometimes with software. But I’m not aware of any way to fix a design that hasn’t met performance or power in the context of the software by fixing the software.
