Verification Scorecard: How Well Is The Industry Doing?

The functional verification task keeps growing. How well is the industry responding to its expanding and changing demands?

Semiconductor Engineering sat down to discuss how well verification tools and methodologies have been keeping up with demand, with Larry Lapides, vice president of sales for Imperas Software; Mike Thompson, director of engineering for the verification task group at OpenHW; Paul Graykowski, technical marketing manager for Arteris IP; Shantanu Ganguly, vice president of product marketing at Cadence; and Mark Olen, director of product management at Siemens EDA. What follows are excerpts of that conversation. Part 2 of the discussion is here.

SE: The functional verification survey conducted by Wilson Research and Siemens EDA has pointed to some disturbing trends. Increasing numbers of problems have been showing up in analog, and now the number of designs achieving first-pass silicon success is dropping. It would appear that verification tools and methodologies are not keeping up. What is going on?

Lapides: As an outside observer of the big three, I’ve seen a real focus — especially over the last five years — on the big boxes for verification, on the emulators. They have their place in the verification process and verification methodology, and that’s great. But they are the high-margin product for these companies, and there’s an emphasis on selling them. That emphasis may take away from a more balanced verification flow that covers everything. I don’t have data to back that up from a verification perspective, but I certainly know from a sales perspective that’s where the focus and energy is.

Thompson: I would like to know the types of bugs that are causing these re-spins, because from my view, verification is typically done hierarchically — first at the block level, then maybe at a super-block level, and then at the system level. I believe we have a pretty good handle on block-level verification, using simulation and formal. The definition of a block used to be 10,000 gates. Now it’s closer to a million gates, and that has been scaling. System-level verification might be where we are not doing such a good job. The big-iron emulators are a good approach for doing that system-level verification. But at the system level we lack, or don’t use, the metrics that we have at the block level. At the block level we have functional coverage, code coverage, assertion coverage, cones of influence, all that great stuff to measure the quality of the verification. But in emulation, where we put all the blocks together and integrate them with the software, we run the tests without a lot of coverage. That’s one reason why bugs are creeping in, or at least one key aspect. When integrating the blocks to form the system, we’re not applying any metrics.

Ganguly: First, a comment about analog. I’ve talked to a lot of customers, and everybody will talk about problems with analog design, but nobody wants to put real money down on solving that problem. I don’t think it is a really big problem. The analog bugs that creep in are gaps in modeling, where the analog behavior isn’t correctly encapsulated at the digital level. If I look at a modern product, the analog/digital interfaces are in PHYs, in SerDes, around the PLL boundaries, and so on. It’s a finite and relatively small amount of stuff. There’s always been a chasm between analog designers and digital designers, and many times they shortchange the modeling. Moving on to the hardware piece, that is more interesting. I disagree with you on the need to run coverage on hardware. That’s about the last thing I want to do. I want to run as much content as I can, as fast as I can. I want to run the apps, benchmarks, and games that a user will run on the phone. If I look at network switches, people will run synthetic content using test generators, but eventually they want a speed bridge to plug it into the wall, because at that point you’re going to see real traffic. You will never be able to program that with network packet generators. So the idea is to run as much realistic content as you can, early on. Coverage at the block level is fine. One company had a bug that escaped into silicon, and this happened despite very good verification technology. It happened after a sequence of events that would take weeks or months to simulate. You will see a concurrency bug, with certain traffic patterns, that may not happen for weeks on an emulator. But it happens in four or five minutes when somebody plugs the chip directly into a PC. You really want to run as much content as possible. There is a place for emulation, a place for prototyping, and a place for virtual platforms. The overall strategy is very important.

Graykowski: When you look at the block level, for the most part that’s pretty well handled, at least for the smaller blocks. And all the traditional techniques, writing UVM, doing functional coverage, code coverage, seem to scale well. There are two things I want to point out. In a past life, when working on PCI Express, which I hate to call a block because it’s so big, there was a lot of pain. When you run the test suites, they are very long sequences and they involve a lot of randomization. When something fails, it can be very hard to figure out what happened. Just diagnosing it is an area for improvement. We need smarter debug tools. How do you bring smarts into the debug capability and abstract things out so that you can debug something quicker? Secondly, I’m seeing more system-level issues. When you are putting big chips together, you can really beat them up at the lower levels, but when you move to the system level, it’s challenging. I’m talking more from the simulation perspective. There are so many combinations. You’re focusing on data flow, you have to make sure that things can interplay, there are issues with timing between blocks, different clock domains, different power domains, all those types of things. And that’s an area where I see that folks just cannot spend enough cycles to make sure it is perfect.

Ganguly: That’s an interesting point. Subsystem assembly around the fabric, and being able to do both functional and performance tests, is a very useful intermediate step to prevent bug escapes. Historically, when people do an SoC assembly, they make sure that the IP behaves and that they are able to read and write. Then they just dive into the deep end of things. But this intermediate step can be a very reasonable way to define products, which are essentially subsystems around a fabric. We can define the CPU and memory complex, and then you can run recommended traffic and measure performance. You can even do some stress testing around areas prone to bugs, or interesting scenarios, with PCIe and some other peripherals as well. So this is definitely an area where I think we can make things easier for all, or many, of the building blocks. The other point is very interesting. If you look at debug, 5% of people are able to debug quickly, and then productivity falls off rapidly. How do we raise that level? Debug is actually the last frontier: making sense of something that happened half a million clock cycles ago, then went through a bunch of events and eventually caused something to fail. How do we debug that?

Olen: I’d like to tie into the debug part, but I want to start with your original question about analog. I don’t think we claim to know any of the answers completely. But in analog, if there’s anything that’s more CPU-intensive than digital RTL simulation, it’s SPICE — analog simulation, and there are only so many CPUs in the world that you can leverage. And it doesn’t really scale. The reason I want to tie this into debug is that this is an area where machine learning can provide a 10X or 100X benefit in analog simulation. We’re starting to see this, and debug ties into that. We are talking to customers and doing research indicating that debug is an area that’s ripe for machine learning types of benefits. You are doing something over and over again, making ECO changes to a design. There’s plenty of opportunity to train it and save a lot of time. On the emulation side of things, the comment was made about the big three pushing emulators, but here is the real question. Are we just responding to demand, or are we pushing it because it is a lucrative business?

Ganguly: We are responding to demand. It’s very clear that nobody is going to write such a large check unless they are absolutely convinced.

Olen: When going from the block level to the system level, maybe people don’t want to do coverage in exactly the same way at the system level. No one wants to take the cover points from the block level and run them at the system level. But I do want to be able to take advantage of the infrastructure. If there are coverage models showing things I’ve done in block-level simulation, or in formal verification, I want to be able to re-use that information at the system level. And from what we’re seeing in the market, I don’t think anyone has solved that problem really well. In the area of emulation, our customers are telling us they’ve got 10 times as many software developers as hardware developers, or more. Emulation and FPGA prototyping serve multiple purposes. One is system-level verification, where you want to re-use as much as you can from the block level. The other is software development, being able to do your software development in parallel with your hardware development, not having to wait until you’ve got silicon.

Thompson: Absolutely, and if you can leverage some of that software to verify the silicon that’s been designed, that’s even better.

Olen: Now add all of that in, and you want to know why two out of every three designs require a re-spin?

Thompson: One of the big ones is the use model — getting all of your post-silicon content clean before you tape out for manufacturing. Recently we’ve seen an increasing number of design companies doing this. Some have been doing it for many years, while others are just picking up on the idea. The idea is that when your silicon comes back and you have a failure, you know what is happening inside the silicon, because the debug infrastructure you put into the silicon to test it actually works. This is a very important exercise, and as more and more companies figure out they need to do it, we will start seeing the number of re-spins go down. You will get back healthier first silicon, where you can go through your first round of bare-metal validation very, very quickly. This process is very important.

Lapides: You were talking about the number of cycles it could take to expose bugs. This is a place where virtual platforms can really help. To illustrate my point, a customer was trying to bring up Linux. They did get to the Linux boot prompt, and that’s great. But they continued using the virtual platform and running a lot more tests – not just applications on Linux, but multiple applications at once. After a few billion instructions they found a problem with preemptive task switching in Linux. But you had to execute billions of instructions, and you had to be running multiple applications. So there’s a testbench quality issue at the system level that comes into play.

Ganguly: Virtual platforms absolutely have their place. And then eventually you can take that virtual platform and create a hybrid, where you have a piece of IP that is the actual RTL, in either an emulator or a prototyping form, connected up. This adds a level of realism. Virtual platforms are very good. The piece they don’t capture is some of the lower-level interaction, like interrupt sequences, the very low-level hardware behavior that’s abstracted out by the software model.

Lapides: The flip side is there are teams that are being successful. There is a company building an advanced AI accelerator that plugs into the data center. They have a very rigorous DV methodology for their silicon. They’ve got a very rigorous virtual platform methodology for the software. And they got silicon back and they were up and running within days. So good teams, with a really rigorous methodology, a plan that’s well documented, and with the right metrics, can be successful.

Thompson: OpenHW deals with the open-source community a lot, and I see a lot of straight-out naivete within that community about verification. We shouldn’t be too smug, because this is the same naive position that the whole industry had 30 years ago. Back then we would tape out 10,000-gate designs with tons of bugs. We can now do 10-billion-gate designs that actually work the first time. So the established players have paid the price, learned their lesson, gone back and bought the emulator, and are now successful. They understand what it takes and are willing to spend the money. The new entrants haven’t learned that yet. They’re getting a very accelerated crash course.
