Depending on design complexity, memory allocation, or a host of other issues, a number of verification approaches simply can run out of steam.
Verification efficiency and speed can vary significantly from one design to the next, and that variability is rising alongside growing design complexity. The result is a new level of unpredictability about how much it will cost to complete the verification process, whether it will meet narrow market windows, and whether quality will be traded off to get a chip out on time in the hopes that it can be fixed after tape-out with software patches.
Even in the best of cases, verification is never completely done. But there are a number of factors that can bog down the verification process. Among them are an increasing number of requirements, some of which are poorly defined, which makes it difficult to provide closure. And in many cases, schedules are becoming unrealistic, and interactions between design and verification teams are not always well thought out.
“What’s breaking verification today is the growing complexity, and it’s not just the growing complexity in what everyone says about chips getting bigger and bigger,” said Harry Foster, chief scientist verification at Mentor, a Siemens business. “That’s really not the issue. The issue is that we have these exploding requirements that are happening. For example, years ago we used to verify the design and we’d focus on the functional aspect of the design. That has changed in the sense that we’ve added these new layers of requirements. Now it’s not only the functional domain, it’s the clocking domain, it’s the power domain. We’ve got security and safety on top of that. And we’ve got software. In order to address that you’ve got to get your hands around that in terms of the planning process.”
Fig. 1: Verification structure and engines. Source: Mentor
That’s not so simple, though. Despite an industry-wide effort to shift verification left, the logistics alone are extremely complex. It requires new levels of planning at the system level, rethinking of methodologies and flows, and a much deeper understanding of what can go wrong across multiple groups.
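As a rough sketch of what that kind of multi-domain planning can look like, consider tracking each verification requirement with its domain and closure status. The structure and field names below are hypothetical and not drawn from any particular tool; only the domain names follow the list Foster gives above.

```python
from dataclasses import dataclass, field
from enum import Enum

class Domain(Enum):
    # Requirement layers named in the article: functional plus the newer ones.
    FUNCTIONAL = "functional"
    CLOCKING = "clocking"
    POWER = "power"
    SECURITY = "security"
    SAFETY = "safety"
    SOFTWARE = "software"

@dataclass
class Requirement:
    rid: str                 # hypothetical requirement ID, e.g. "PWR-003"
    domain: Domain
    description: str
    closed: bool = False     # has this requirement been verified and signed off?

@dataclass
class VerificationPlan:
    requirements: list = field(default_factory=list)

    def closure_by_domain(self):
        """Fraction of requirements closed per domain -- a crude progress metric."""
        progress = {}
        for d in Domain:
            reqs = [r for r in self.requirements if r.domain == d]
            if reqs:
                progress[d.value] = sum(r.closed for r in reqs) / len(reqs)
        return progress

plan = VerificationPlan([
    Requirement("FUN-001", Domain.FUNCTIONAL, "DMA completes a descriptor chain", closed=True),
    Requirement("PWR-003", Domain.POWER, "State is retained across a domain power-down"),
    Requirement("SEC-002", Domain.SECURITY, "Debug port is locked after the fuse is blown"),
])
print(plan.closure_by_domain())   # e.g. {'functional': 1.0, 'power': 0.0, 'security': 0.0}
```

The point of such a structure is simply that closure can be measured per requirement layer rather than argued about at the end of the project.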
Frank Schirrmeister, senior group director for product management in the System & Verification Group at Cadence, pointed to two main axes for verification—speed and completeness. Speed is obvious to design teams. Completeness, on the other hand, may not be apparent until months or years after a chip has been taped out. But both are potential breaking points.
Speed is becoming problematic because even with smaller designs and larger capacity, more elements need to be simulated. Throw in mixed-signal circuitry and functionality, and the execution time rises significantly.
“It’s not that you just put in your full chip and see what it does,” Schirrmeister said. “You have the capacity pieces. But while it has become better, there is a natural tradeoff when you want to see the dynamic portion. That’s where simulation comes in for how the design actually responds. You have to scale that with capacity, and it gets slow. Software also breaks it because you want to see things happening at a time which you would never reach in simulation. That’s when you switch to emulation or you parallelize simulation more.”
Emulation and parallel simulation add necessary capacity, and FPGA-based prototyping adds speed when those cannot keep up. But the bigger problem may be figuring out when to sign off on verification, because it’s impossible to check everything in a complex device.
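Schirrmeister’s point about never reaching interesting software activity in simulation is easy to see with rough arithmetic. The throughput numbers below are illustrative assumptions rather than benchmarks of any product (a few hundred cycles per second for full-chip RTL simulation, on the order of a megahertz for emulation, tens of megahertz for FPGA prototyping), but the orders of magnitude are what push teams from one engine to the next.

```python
# Back-of-envelope comparison of verification engines.
# The throughput figures are assumptions for a large SoC, not measured numbers.
throughput_hz = {
    "RTL simulation":   500,          # cycles per second, full chip (assumed)
    "emulation":        1_000_000,    # assumed
    "FPGA prototyping": 20_000_000,   # assumed
}

# Booting an OS to a shell prompt might take on the order of 10 billion cycles (assumed).
target_cycles = 10_000_000_000

for engine, hz in throughput_hz.items():
    days = target_cycles / hz / 86_400
    print(f"{engine:17s}: {days:10.2f} days to reach the boot prompt")
```

With those assumptions, simulation would need more than seven months to reach a point that emulation hits in a few hours, which is exactly the switch-over the quote describes.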
“You can continue verifying and look at it from all angles,” Schirrmeister said. “That’s the myth of ‘first time right’ silicon, which needs to be renamed ‘first time right silicon with a set of errata of what doesn’t work and a set of proper software routines to work around the issues.’ We sometimes use the term ‘congruency’ to basically say, ‘Between the engines, it’s congruent.’ Well, there are catches to that, too. A purist will tell you that 100% congruency doesn’t exist, and he is probably right. For example, in simulation and emulation, you switch from event-based to cycle-based—four-state to two-state—so by definition they can’t be congruent. Sometimes algorithms even change between simulator versions, and you won’t get ‘purist congruency.’ Instead we have what I would call ‘practical congruency,’ and users have to be aware of the exceptions. The interesting thing is, if congruency means the same behavior on both engines, there are actually cases where you want it to break. We have cases where emulation and FPGA prototyping come to different results, which is by design. It’s not all that rare that a design works in emulation but does not work in prototyping. There can be a multitude of causes for this, such as the user’s software didn’t initialize part of the memory correctly, while in emulation that was shielded by different default values. Figuring that out can be a challenge in itself, and different engines find different issues, like software issues in this case.”
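The uninitialized-memory example in that quote is worth unpacking. A four-state, event-driven simulator hands back X for a location that was never written, while two-state engines and FPGA boards have to supply some concrete value, and that value can differ between platforms. The toy model below is only a Python sketch of that behavior, not a model of any specific simulator or emulator, and the boot-flag check is an invented example of the kind of software bug such defaults can hide.

```python
# Toy model of reading uninitialized memory under different engine semantics.

class FourStateMemory:
    """Event-driven simulation style: unwritten locations read back as 'X'."""
    def __init__(self):
        self.mem = {}
    def read(self, addr):
        return self.mem.get(addr, "X")   # X propagates and tends to trip checkers
    def write(self, addr, value):
        self.mem[addr] = value

class TwoStateMemory:
    """Cycle-based engines and FPGA boards: unwritten locations get a concrete default."""
    def __init__(self, default):
        self.mem = {}
        self.default = default
    def read(self, addr):
        return self.mem.get(addr, self.default)
    def write(self, addr, value):
        self.mem[addr] = value

def software_boots(mem):
    # Invented software bug: boot code checks a flag it never actually wrote.
    return mem.read(0x1000) == 0

print(software_boots(TwoStateMemory(default=0)))     # True  -> passes where the default happens to be 0
print(software_boots(TwoStateMemory(default=0xFF)))  # False -> fails where the platform powers up differently
print(software_boots(FourStateMemory()))             # False -> the X read exposes the bug in simulation
```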
Time is money
Verification always has been the most time-consuming and expensive part of chip design. But even with faster tools and more capacity, time to sign-off isn’t shrinking noticeably, particularly at advanced nodes with complex designs.
“The reason behind that is the unbounded task,” said Rajesh Ramanujam, product marketing manager at NetSpeed Systems. “You’re chasing after the end goal, but you really don’t know what that is. It’s unbounded because as you get closer you feel like there’s something more to be done. The industry has tried to solve this challenge. Previously there were directed test cases to enumerate every possible scenario that you could because the systems were so much simpler. And then we moved to a time when that got complex and we started using random test cases. Then there was a mixture of directed-random, where the architects had some directed test cases, but would still randomize around them. But that doesn’t really solve the problem. It’s still a much bigger space than what you would think with so many more IPs being integrated and the dependencies between them. Then what came in were things like formal verification, which is a way to verify the functional correctness in a much more mathematical way. So instead of a designer or verification engineer having to enumerate every possible scenario, this was a much more mathematical way to approach it. But the unbounded challenge is one of the biggest that we see in the industry, and that breaks the verification.”
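Ramanujam’s progression from directed tests to constrained random can be sketched in a few lines. The transaction fields and ranges below are invented for illustration; the point is only to show how a directed case becomes a constrained-random one, and how quickly even a small stimulus space outgrows anything that can be enumerated by hand.

```python
import random

# Invented bus-transaction stimulus space for illustration.
OPCODES   = ("READ", "WRITE", "ATOMIC")
BURSTS    = range(1, 17)      # 1 to 16 beats
ADDR_BITS = 32

# Directed test: one hand-written scenario.
directed = {"op": "WRITE", "burst": 4, "addr": 0x8000_0000}

def random_txn(rng):
    """Constrained random: randomize within the legal space instead of enumerating it."""
    return {
        "op":    rng.choice(OPCODES),
        "burst": rng.choice(list(BURSTS)),
        "addr":  rng.randrange(0, 1 << ADDR_BITS, 4),   # word-aligned constraint
    }

rng = random.Random(0)
for _ in range(3):
    print(random_txn(rng))

# Even single transactions in this toy space number in the tens of billions,
# before ordering, concurrency, or cross-IP dependencies are considered.
print(f"{len(OPCODES) * len(BURSTS) * (1 << (ADDR_BITS - 2)):,} legal single transactions")
```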
Fig. 2: Complexity continues to grow: Nvidia’s 15 billion transistor 16nm Pascal chip. Source: Nvidia
Chirag Gandhi, verification manager at ArterisIP, pointed to several issues that can draw out verification and make it less reliable:
1. Detailed specifications are not available or are too ambiguous. Interpretations by RTL engineers and verification engineers can vary widely, which causes a lot of thrashing. Detailed specifications are required, and they must be kept up to date during the lifecycle of a project.
2. The design is too complicated. There are too many corner cases to take care of, and the design ends up as a bug farm, with little confidence that all bugs have ever been found. To remedy this, verification engineers should be involved early in the design process to make sure that designs are done in a way that increases verifiability, decreases the number of corner cases, and forces simplicity, with a low number of ‘special cases.’
3. Designs don’t have well-defined interfaces. There is not enough visibility into the design for verification to plug into. This can make failures harder to debug and can lead to a blind spot where a verification engineer is not confident that all cases have been exercised.
Not all of this can be determined prior to verification, though. There are always new requirements that are added during the verification cycle. Dealing with them becomes more difficult because not all of the results from one verification technology are available to another. And it’s far worse in markets where safety is involved, such as automotive, industrial and medical.
“In the next five years, the customer pressures coming from automotive, especially with ISO 26262 compliance, will exercise pressure like we have never seen before,” said Prakash Narain, CEO of Real Intent. “At the end of the day it’s about resources and schedule. You can only do so good a job in that amount of time with that many resources. You will cut corners, you’ll take risks, and it’s never perfect.”
One of the less-obvious casualties of complexity is that valuable information gets lost. That can slow down the verification process in unexpected ways.
“Somebody knew something, they knew a fact somewhere,” said Drew Wingard, CTO of Sonics. “One of the reasons I don’t like it when my designers create their own testbenches, which the verification people then throw away, is that there’s knowledge in there that the designer had about their design that the verification folks don’t end up benefiting from. So everything that we can do to essentially re-use the knowledge makes it more additive. Many of the bugs that we track down are ones that were covered at one point in the system. So it’s all about trying to build systems that make it more natural and easier for the handoff of this knowledge to be continuous, without gaps. This is also why the process and methods matter. There’s a process, and there’s a methodology, and there’s some scaffolding. And whether the scaffolding comes from the verification people, whether it comes from the designers, or both, it doesn’t matter. They both need to be able to interact with it well to keep it fluid. That’s the most important part.”
Problem solving
There are a number of ways to reduce time to sign-off. ArterisIP’s Gandhi said it’s critical that specifications are clear and that verification testbenches are developed alongside RTL so bring-up of both is done together in a tightly coupled loop.
“Verification engineers should code functional coverage properties as they develop testbenches, and run code and functional coverage early and often,” he said. “There also needs to be strong check-in qualification criteria for every design and verification check-in to find any new bugs as early as possible. And verification requires diversification, doing formal verification along with simulation-based verification.”
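In a SystemVerilog flow that advice usually translates into covergroups sampled by the testbench. As a language-neutral sketch of the same idea, the example below defines a small set of hypothetical coverage bins, samples them as transactions complete, and reports which bins were never hit, which is the signal that tells a team what has not yet been exercised.

```python
from collections import Counter

# Hypothetical functional coverage model: which (opcode, burst bucket) pairs were exercised?
BINS = {(op, size) for op in ("READ", "WRITE", "ATOMIC") for size in ("short", "long")}

def bucket(burst):
    return "short" if burst <= 4 else "long"

hits = Counter()

def sample(txn):
    """Called by the testbench every time a transaction completes."""
    hits[(txn["op"], bucket(txn["burst"]))] += 1

# In a real flow these samples come from the running tests; here they are made up.
for txn in ({"op": "READ", "burst": 2}, {"op": "WRITE", "burst": 8}, {"op": "READ", "burst": 16}):
    sample(txn)

covered = set(hits)
print(f"functional coverage: {100 * len(covered) / len(BINS):.0f}%")
print("bins never hit:", sorted(BINS - covered))
```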
Companies also should do a post-mortem at the end of the project, said Mentor’s Foster. “They’re going back and analyzing what worked, what didn’t work, and what we need to change next time around. It’s that continual review to improve the process. The problem I see with a lot of projects is that the first thing they think about is the ‘how.’ ‘How are we going to do this?’ Simulation, formal, whatever. It’s the ‘what’ that needs to be answered. ‘What do we want to do?’ And part of that ‘what’ is identifying which features need to be verified before you talk about how you are going to verify them.”
Context counts
NetSpeed’s Ramanujam noted that one of the key missing pieces in a number of projects is a detailed understanding of the functional dependencies between various IP components that are included in a design.
“When you start integrating all the IPs together, even though every IP is formally verified and sound from a functional perspective, there are dependencies between them that can cause deadlocks,” he said. “You can think about verification as two different parts. There is functional verification, which makes sure that every IP functions correctly. Then there is verification to make sure that every IP in the system is doing its job correctly. But there can still be a deadlock where the dependencies have not been taken into consideration. Even if you look at it from the SoC perspective, it would still only make sure the SoC is sound functionally, not from the dependency standpoint, because it doesn’t have a concept of time, such as when IP1 is waiting on IP2, and IP2 is waiting on IP1, and you get into this loop.”
Once there is a deadlock, it is time-consuming to figure out exactly what caused it and where it came from. “That takes hours or days—if it is even possible to find out where the deadlock was—because there is no way to debug it when everything is deadlocked,” he said. “The reason you don’t hear about deadlocks a whole lot is because nobody wants to come out and say they had a deadlock. Then it’s like, ‘What kind of chips are you building?’”
All of this adds to delays and costs money.
“You could do functional verification by using random and directed test cases, use multiple cycles and hours of emulation, but it gets expensive,” said Ramanujam. “This is why the industry went to formal verification, which is a mathematical way of solving it. Similarly, we need something for deadlocks. The only way to solve the deadlock is by having directed and random test cases, but again, the problem is still unbounded, just like functional verification. The deadlock verification is still unbounded. What you need is a way to approach the deadlock also in a very mathematical way so that it’s not unbounded anymore.”
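One way to see what a more mathematical treatment of deadlock might look like is to model the waits-on relationships as a graph and check it for cycles, the classic wait-for-graph formulation of deadlock detection. The sketch below does exactly that for an invented set of dependencies; real interconnect and protocol deadlock analysis, formal or otherwise, is far more involved, so treat this purely as an illustration of the idea rather than as NetSpeed’s method.

```python
# Minimal wait-for-graph cycle check: a cycle in the graph means the modeled system can deadlock.
def find_cycle(wait_for):
    """wait_for maps each agent to the agents it is blocked on; returns a cycle or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in wait_for}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in wait_for.get(node, ()):
            if color.get(nxt, WHITE) == GRAY:             # back edge: we found a cycle
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                cycle = dfs(nxt)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in wait_for:
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

# Invented SoC-level dependencies: IP1 waits on IP2, which waits on the interconnect,
# which in turn waits on IP1 -- the loop Ramanujam describes.
deps = {"IP1": ["IP2"], "IP2": ["interconnect"], "interconnect": ["IP1"], "DMA": []}
print(find_cycle(deps))   # ['IP1', 'IP2', 'interconnect', 'IP1'] -> potential deadlock
```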
Conclusion
The design industry has been battling growing verification time and complexity for years. This is hardly a new problem. In the past the focus was on getting the job done in less or equal time at each new rev of Moore’s Law. But as more functionality continues to be added into designs—from multiple power domains that need to be switched on and off to a variety of features in a device that support a growing list of I/O protocols and specifications—it is becoming more difficult to be satisfied that a design has been fully verified within a tight market window and at a reasonable cost.