Verification Fails To Keep Up

Design complexity may be growing faster than verification tools and methodologies are evolving. This is resulting in increased delays for chip success.


Experts at the table: Semiconductor Engineering sat down to discuss the state of functional verification with Mohan Dhene, director for architecture and design at Alphawave Semi; Andy Nightingale, vice president for product management and marketing at Arteris; Dinesha Rao, senior group director for software engineering at Cadence; Chris Mueth, new opportunities business manager at Keysight; Gordon Allan, director of verification IP products at Siemens EDA; and Frank Schirrmeister, executive director for strategic programs and system solutions at Synopsys. What follows are excerpts of that discussion.


(L-R) Dinesha Rao, Mohan Dhene, Gordon Allan, Chris Mueth, Andy Nightingale, Frank Schirrmeister.

SE: For many years ASIC verification has kept up with increasing design complexity, as indicated by first-time success numbers coming from the Siemens/Wilson Research study results. New tools and methodologies both contributed. But the last couple of surveys (see figure 1) have shown that this relationship has broken down. More chips are failing. What is going on that’s making verification more difficult today than it was just two years ago?


Fig. 1: First-pass success rates. Source: Siemens EDA and Wilson Research

Nightingale: Exploding SoC complexity is one thing we keep seeing and writing about. We’ve got dozens of heterogeneous compute elements in every design, and more recently die-to-die connections with chiplets. What we’re seeing is this rising complexity is not being matched by the complexity of the verification IP that goes around that. This is a big challenge for the entire industry, and we need the DV component for the individual elements to expand to the system level. Everywhere there’s a protocol change, or everywhere there’s a die link, that’s where we get challenges. We are seeing customer designs with thousands of interconnects, with thousands of initiators and target devices. All of these are packaged up as individual elements that arrive as goods into the customer, and they unpack the individual test benches. They run those test benches, and maybe everything reports, ‘OK, the heartbeat is fine, green, green, green, good coverage.’ Then they plug it into something else and there’s an issue. They find the corner case they haven’t previously anticipated. Maybe things happen in a different order. Then we have to try and find the root cause and fix it. The complexity issue is going up and up.

Mueth: Complexity is growing exponentially. You can graph it with Moore’s Law, the number of specs, or the data speeds — however you want to do it. It’s growing very quickly. The problem with exponential growth is that it’s easy to miss something. All you have to do is miss one thing within that exponential bubble, and your design fails. When you have a lot of interdependent physics, interdependent specs, or one electrical spec affecting another, if you missed just one thing — or some noise that’s developed somewhere along the lines that you had not anticipated, haven’t thought of — the integration will kill you.

Dhene: It’s not just first-pass silicon. The failure rates are going up, but the project timelines are being impacted, too. The main reason is design complexity. More logic is being packed in these days. For the first time, hardware is not able to catch up with software requirements, and that gap is increasing. Secondly, everybody is doing more chips. Earlier, a company would probably do one chip every one to two years. Now we see the rate growing to maybe 2X to 3X more chips in that same time period. But engineering resources are not growing, or not scaling up in that fashion. Finally, we see that the specs are constantly evolving because of the dynamic nature of the design cycle. Any changes in the spec result in a lot of downstream challenges. For example, a simple change in design has to be worked into coverage, and so on. These are really contributing to challenges and first-pass success rates.

Rao: Since the start of the AI boom, we have seen an increasing number of workloads that need to be run. If you really want to see system behavior, you don’t have a way to run the system-level test cases before tape-out. The best way of running those is by having silicon. That’s the fastest way, but using a traditional simulation methodology, which starts with architecture exploration, followed by RTL simulations, followed by gate-level simulation, and then the post-silicon validation — this cycle needs to incorporate a mixed model because the complexity is increasing. At the same time, there are different workloads that we need to run. As the spec continues to evolve, one change in the spec, or one wrong identification of a feature, or less understanding of that feature in the spec, becomes a bug that was not thought of. We see more designs that are almost at the reticle size limit, which means we are seeing more chiplets. That means there are multiple dies talking to each other, and they are executing their own workloads. With all those workloads simultaneously processing their own work, it’s very difficult, extremely difficult, to have an environment that can identify flaws in this model. That’s what we are missing today, and that is where we should focus on going forward, especially with AI workloads, which are exponentially getting more complex.

Schirrmeister: Verification has four components, and they have all been mentioned with one exception. First, software complexity. Depending on the application domain, you look at half the cost going into software. Software is increasing in automotive by a factor of 6X. Hardware complexity is second. For hardware complexity, we need to be more differentiated because verification happens in phases — IP, subsystems, chip, SoC/chiplet, and then multi-die within the system. Hardware complexity is now two orders of magnitude, or even three orders of magnitude more complex than just a couple of years ago. The other two elements are interfaces, in terms of the interfaces between specs and different interfaces, and the last one is the one that hasn’t been mentioned yet. Together with the workload execution, there is the situation that we are simply getting much more architecturally clever. If I look into the architecture space for those workload-optimized systems, I’m still digesting whether hardware really doesn’t catch up with software. Workload optimized architectures make it a lot more complicated, because you have an architecture space that has many configurations of the interconnect, many options for compute, many options for accelerations, and many options for interfaces of various generations. This is like an M times N times K space to explore and verify, which becomes really difficult because it’s possible now to do all those options, and in some cases, financially relevant to do so.
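[Editor's illustration: To put rough numbers on the "M times N times K" point, here is a minimal Python sketch. The option lists are hypothetical placeholders, not taken from any panelist's design. It uses the four dimensions named above (interconnect, compute, acceleration, interfaces) and shows only that the number of configurations is the product of the option counts in each dimension.]

```python
from itertools import product

# Hypothetical option lists, for illustration only.
interconnects = ["mesh", "ring", "crossbar"]
compute = ["cpu", "gpu", "npu", "dsp"]
accelerators = ["none", "crypto", "video"]
interfaces = ["gen4", "gen5", "gen6"]


def config_space(*dimensions):
    """Return every combination that picks one option per dimension."""
    return list(product(*dimensions))


configs = config_space(interconnects, compute, accelerators, interfaces)
print(len(configs))  # 3 * 4 * 3 * 3 = 108
```

[Adding one more option to any single dimension multiplies the total rather than adding to it, which is why the space becomes hard to cover exhaustively.]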

Allan: As well as all the exponential curves that we all agree on, there’s a cyclical effect in our innovation and economic cycle. I hesitate to guess how long that cycle is. Maybe it’s 15 years in some cases, or 5 years in other cases. We’re seeing a perfect confluence of events on the up cycle. We’re seeing economic accessibility to advanced nodes and economic accessibility to chiplet and 3D-IC technology that was not accessible to the whole industry until now. Software is driving the workloads, driving the clever architectures that are now possible with these economic changes. And then EDA. These are all feeding this ecosystem, and we are all trying to slow it down so that we can make money. In some cases, we need to stop and recognize that now is the upswing where we need to think differently and invest more.

Mueth: Are we slowing down?

Allan: We are being complacent about the technologies that we introduced 10 or 15 years ago. I’m a UVM guy. We should have moved on from that to the next great thing. But that’s not really the topic. The topic is, how do we recognize economic changes and innovation changes, and as EDA and IP providers, how do we prepare the industry for what comes next?

Schirrmeister: But we haven’t been complacent. We have been inventing new stuff all along. I want to defend our industry. Have we reached the point where verification is ahead of the complexity growing? No. But have we introduced new things continuously? Absolutely. When I did my first chip, we did IP verification. And then this little company in Cambridge came with this pre-designed processor. We have invented IP reuse. Things like IP-XACT. We have invented UVM. But then we also did things like the Accellera Portable Stimulus for system-level verification. And there’s so much innovation going on, which perhaps hasn’t come out yet. We have invested in test generation automation for verification. I would defend us a little bit, but yes, UVM, I hear you. I was on panels a long time ago discussing whether UVM goes from IP to system and all that. We have definitely innovated. Have we innovated fast enough?

Allan: The survey says no.

SE: The drop-off in first pass success rates was so dramatic that it can’t just be about those few very leading-edge designs that are having problems. Everybody must be having problems. Is this a sign of the other systemic problems within the industry of not having enough engineers? Companies can’t find enough verification engineers. Do semiconductor companies expect more from an engineer now than they ever did in the past?

Schirrmeister: Can we question the survey first? We did our own survey, and what we see is not quite as bad. But the trend is definitely in that direction. We also have to ask where these failures are coming from. If you look into the list of items that actually cause these issues, functional verification is still number one. But then you have maybe 20 other things. You have power, you have yield, you have performance, you have analog issues. You have a lot of things that cause a re-spin. Part of it is the first-time right definition. You have a lot of things that cause a re-spin that are not saying, ‘When I switch it on, it doesn’t work.’ You don’t consider it first-time right because you don’t meet performance, you don’t meet power, and you need to do a re-spin to make it financially viable.

SE: Are you suggesting the requirements have tightened?

Schirrmeister: Exactly. The requirements have become higher. That’s the outcome of all this complexity. It is no longer binary like it was in the ’90s, when it either worked or it didn’t. And it’s not that we didn’t care about performance and things like that, but things were much simpler. Semiconductor companies have almost failed because a chip wasn’t correct, but they were able to convince the software guys to not use certain combinations that would make the chip fail. The errata is important. I’ve been joking in the past, saying I’ve never actually seen a functionally correct chip. I’ve seen functionally correct chips within the context of the errata, of which there are many, that define what not to do. There is a hardware/software workaround. The requirements have heightened. We didn’t innovate fast enough, and there are issues. But with all the things we do now, with agents designing things, and agents now verifying things, the question is, ‘Which agent is faster and better?’ We need to really look at the heightening requirements around this.

Mueth: The number of requirements has increased exponentially. But so have the dimensions. Performance comes into play, power comes into play. You have all these dimensions you have to deal with now, so it’s a multi-dimensional problem. If your workforce has not been scaling, you have an issue with your experts. Before you would have specialists. You would have a power guy, maybe a timing guy, or an analog signal expert. You’d have these teams that swarm on this thing. Now everything’s interdependent. Something could go wrong over here, you fix it, but you destroy something on the other side. That’s hard to wrap your head around. It’s easy to not have full coverage, and the workforce isn’t scaling. So how do you deal with that? If the workforce isn’t scaling, then you can’t have a team of specialists working on your chip. You have to make each person more effective at multi-dimensional design, and that has not happened. We don’t have the technology. It’s probably where we’re going with AI assistive engineering.

Schirrmeister: This is similar to the reason why American phone numbers have seven digits. It’s what people can remember. It’s not that we, as people, can’t expand our minds and just verify more and larger scopes. It’s really about the cleverness of how the verification phases are set up.

Mueth: An individual can’t be an expert in five different things. But you equip them with tools so they can be semi-specialists. Maybe with AI assisting an engineer, they can get by, versus having an army of 100 specialists. The gap has to be closed by automation.


