A widening gap in the verification flow will require improvements in tools, and AI may play a bigger role.
Verification engineers are the unsung heroes of the semiconductor industry, but they are at a breaking point and desperately in need of modern tools and flows to deal with the rapidly increasing pressures.
Verification is no longer just about ensuring that functionality is faithfully represented in an implementation. That alone is an intractable task, but verification has taken on many new responsibilities. Some of them come from technology advancements that create additional issues, such as thermal effects. New application areas, such as automotive, add demands for safety and security, as well as a dramatic increase in parametric issues that go way beyond simple power assessment. On top of that, the chip industry is approaching another inflection point, this one triggered by the migration to 2.5D and 3D packaging technologies.
Existing verification tools and methodologies were developed 20 years ago. Since then, there have been only incremental improvements in tool capacity and performance, while design sizes have increased rapidly. And while Portable Stimulus — an Accellera standard created to decouple verification intent from the engine it’s executed on — provides some relief, adoption has been slow and comprehensive flows are lacking.
In addition to the technical aspects that contribute to increased strain on the verification process, there is a human aspect that needs to be considered. Teams are expected to do more in less time, and the industry is constrained by a talent shortage.
Approaching limits
The tools in use today were developed when blocks were much smaller and system sizes were similar to a single block today. It has never been possible to say that verification is 100% complete, which means teams have to decide carefully where to place their efforts and where they are willing to take risks.
“Design complexity is increasing like never before due to the AI/ML revolution adding a new dimension to the type of designs we are building,” says Ashish Darbari, CEO of Axiomise. “These systems have stringent power, performance, and area (PPA) requirements. The adoption of better processes and advanced verification methods, such as formal, is not keeping pace. The industry is still very heavily reliant on stimulus-dependent, incomplete dynamic simulation methods, which not only allow easy-to-catch bugs to leak into silicon, but also have no chance of catching complex bugs that manifest due to concurrent interactions in deep-state machines in single- or multi-clocked domains.”
To make matters worse, IP is increasingly modal. “A block might have 1,500 specification items,” says Chris Mueth, business development, marketing, and technical specialist at Keysight. “A lot of them are interdependent and tied up with operating modes, but you also have different voltages, different temperatures, and things like that. In a 6G module, you have a myriad of modes and bands that you’re transmitting out of, and it’s all interdependent. They are hitting the limits of what they can do from a frequency, bandwidth, and data transfer rate perspective. You might think you have the design licked, but you can still miss one of the modes. It could end up being a problem. Even today with digital, if you’re not hitting performance requirements, you have a fail. Everything has become a performance simulation.”
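To get a sense of the scale Mueth describes, consider a toy enumeration of operating conditions. The axes and counts below are invented for illustration, but the combinatorics are the point.

```python
from itertools import product

# Hypothetical operating-condition axes for a multi-mode block.
# All names and counts are invented for illustration only.
modes        = ["idle", "rx", "tx", "loopback", "low_power"]
bands        = [f"band_{i}" for i in range(12)]
voltages     = [0.72, 0.80, 0.90]        # supply corners (V)
temperatures = [-40, 25, 85, 125]        # junction temperatures (C)

corners = list(product(modes, bands, voltages, temperatures))
print(f"{len(corners)} mode/band/voltage/temperature combinations")  # 5*12*3*4 = 720

# Adding data rates, process corners, and mode interdependencies multiplies
# this further, which is why exhaustive simulation of every combination is
# impractical and coverage has to be prioritized.
```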
Sometimes parametric failures are allowed to escape. “An increasing number of failures are soft failures, sometimes called parametric failures,” says Marc Swinnen, director of product marketing at Ansys. “The chip works, but it was supposed to run at 1.2 gigahertz and it only gets up to 1.0 gigahertz. When you look at any serious-sized chip, the number of parasitics explodes into the hundreds of thousands.”
That increases the risk of failure. “When verifying an IP, they will ask about the context in which it is going to be used,” says Arturo Salz, fellow at Synopsys. “They cannot afford to verify all the permutations that are possible. Instead, they wait for the system to be ready and defer much of the verification to the system level. That’s usually a mistake because IP-level bugs are hard to find at the system level. It’s a much bigger problem. With multi-die you will not have that option, because that chiplet IP may have been manufactured, and you must verify it and test it before you start to integrate it into the next system.”
Beyond limits
Constrained random, great advance though it was when first introduced, is struggling. “I often use the analogy that constrained random is like a pool cleaner,” says Synopsys’ Salz. “You don’t want to program the shape of the pool. While it is random, the constraint is to stay within the perimeter of the pool. Don’t go up the walls. And yes, this is inefficient. It will go through the center of the pool many more times than the corners, but given enough time, it does cover the whole thing. Sticking with that analogy, would you want to sweep the Pacific Ocean? No, it is too big. You need to pick the right-sized methodology that fits. At the block level, you can effectively deploy the methodology. The same for formal. It is probably not going to have the capacity to do formal checking at the system level.”
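The pool analogy can be sketched in a few lines of plain Python, independent of any simulator or methodology library and purely conceptual. Stimulus is drawn at random and anything outside the legal ‘pool’ defined by the constraints is rejected, so given enough samples the space gets covered, but the corners are visited far less often than the center.

```python
import random

def legal(addr, length):
    """Constraint: stay inside the 'pool' -- a legal address window,
    and a burst must not cross a 4KB boundary. (Hypothetical rules.)"""
    return 0x1000 <= addr < 0x2000 and (addr % 4096) + length <= 4096

def constrained_random(n_samples):
    """Naive constrained-random generation: draw at random, then reject
    anything that lands outside the constraints (the walls of the pool)."""
    stimuli = []
    while len(stimuli) < n_samples:
        addr = random.randrange(0x0000, 0x3000)
        length = random.choice([1, 2, 4, 8, 16])
        if legal(addr, length):
            stimuli.append((addr, length))
    return stimuli

if __name__ == "__main__":
    samples = constrained_random(10_000)
    # A 'corner of the pool': bursts ending in the last few bytes before
    # the 4KB boundary. They are legal but rarely hit by random draws.
    corner_hits = sum(1 for a, l in samples if (a % 4096) + l > 4090)
    print(f"corner hits: {corner_hits} of {len(samples)}")
```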
It is not only the wrong-sized approach, but it also consumes a lot of the most valuable resource available — people. “It is hardly surprising that so many bugs get missed by simulation testbenches,” says Axiomise’s Darbari. “UVM testbenches take a lot more time to bring up than formal test environments, even for a moderately complex design. UVM is people-heavy because its foundations require an enormous amount of human investment in writing sequences, which do not end up rigorously testing the DUT. That shifts the burden to functional coverage to see where the gaps are. In many cases, simulation engineers simply do not have time to understand the design specifications, and many verification engineers have no education in RTL design. Expecting them to know the details of micro-architecture and architecture is asking a lot.”
Put simply, the problem has outgrown the tool. “I don’t think UVM is running out of steam,” says Gabriel Pachiana, virtual system development group at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “It remains a great tool for its intended purpose. What we need is to leverage it by building more verification software on top of it, for example, to address hardware-software verification complexity.”
Shift left
The term ‘shift left’ has been bandied around the industry, but verification desperately needs to shift left. That means doing verification much earlier, on high-performance abstractions of the design. Existing simulators and emulators do not have the necessary performance, and waiting for RTL is too late in the process. “Shift left in this process means applying verification to SystemC algorithm models or virtual platforms,” says Dave Kelf, CEO of Breker. “This greatly simplifies specification-to-design verification. As such, developing a system verification plan on a virtual platform, and then re-applying it during system verification on an emulator or prototype, may well provide enough methodology streamlining to make effective system verification a reality.”
Still, the entire process has not been fully developed. “If the virtual prototype is the golden model, how do you go from that all the way down to a chip and know that the chip is still correct?” asks Tom Fitzpatrick, strategic verification architect at Siemens EDA. “Things like physical prototyping, FPGA prototyping or emulation — having the same view of the system, regardless of the underlying engine — is going to be really critical for that. Verification engineers are going to have to start looking at the infrastructure in that way. They need to make the underlying environment invisible to everyone in the team. That’s where something like Portable Stimulus comes in. Because of its abstraction, you can think about tests in terms of the algorithms, what you want to have happen, and where the data goes without worrying about the underlying implementation.”
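One way to picture that separation of test intent from engine is sketched below, written as illustrative Python rather than actual Portable Stimulus syntax. The action names and backend stubs are made up for the example; the point is that the same abstract scenario is replayed unchanged on whichever engine realizes it.

```python
# Illustrative only: abstract test intent decoupled from the execution engine.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    name: str        # what should happen (e.g., a DMA copy)
    params: dict = field(default_factory=dict)   # where the data goes, sizes, etc.

# The scenario is pure intent: no reference to any engine or testbench.
scenario = [
    Action("configure_dma", {"channel": 0, "burst": 16}),
    Action("dma_copy",      {"src": "ddr", "dst": "sram", "bytes": 4096}),
    Action("check_crc",     {"region": "sram"}),
]

def run(scenario, realize: Callable[[Action], None]):
    """Replay the same scenario on any backend that supplies a realization."""
    for action in scenario:
        realize(action)

# Hypothetical backends standing in for real engine integrations.
def on_virtual_platform(a: Action):
    print(f"[virtual platform] fast functional model: {a.name} {a.params}")

def on_emulator(a: Action):
    print(f"[emulator] cycle-approximate run: {a.name} {a.params}")

run(scenario, on_virtual_platform)
run(scenario, on_emulator)
```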
It has to start with the virtual prototype. “We need better architectural exploration early on,” says Salz. “We need to look at power, throughput, and latency with a factorial number of permutations. Do you keep the cache with the CPU, or do you move it out to a separate chip and make it bigger? These are tough questions. It used to be that every company had one guy, the architect, who could do this on a napkin. We have reached the limits of that. It’s just not humanly possible. Everything will get thrown into the virtual prototype. You will throw in virtual models, emulation, simulation, and possibly even chiplets that have already been built. You can have post-silicon connected. Virtual prototypes run at three to four gigahertz. Emulators don’t get close to that. You will get a lot more throughput, but the tradeoff is that you lose some timing accuracy.”
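The kind of sweep Salz describes can be illustrated with a deliberately simple sketch. The cost model and numbers below are entirely made up; a real exploration would pull power, latency, and throughput estimates from the virtual prototype itself.

```python
from itertools import product

# Hypothetical architectural choices (illustrative values only).
cache_placement = ["on_cpu_die", "separate_chiplet"]
cache_size_mb   = [8, 16, 32]
link_width_bits = [256, 512]        # die-to-die link width

def estimate(placement, size_mb, width):
    """Toy cost model: off-die caches add latency but can be larger."""
    latency_ns = 4.0 if placement == "on_cpu_die" else 9.0 + (512 - width) / 128
    hit_rate   = min(0.98, 0.80 + 0.05 * (size_mb // 8))
    power_w    = 0.4 * size_mb / 8 + (0.6 if placement == "separate_chiplet" else 0.2)
    throughput = hit_rate / latency_ns          # arbitrary figure of merit
    return latency_ns, power_w, throughput

results = []
for p, s, w in product(cache_placement, cache_size_mb, link_width_bits):
    lat, pwr, thr = estimate(p, s, w)
    results.append((thr / pwr, p, s, w, lat))

# Rank by throughput per watt -- the question that used to be answered
# on a napkin, now swept exhaustively across every permutation.
for score, p, s, w, lat in sorted(results, reverse=True)[:3]:
    print(f"{p:16s} {s:3d} MB  link {w:3d}b  latency {lat:.1f} ns  score {score:.3f}")
```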
Some of the engines already are tied together. “The ability to do hybrid modeling is growing,” says Matt Graham, product management group director at Cadence. “Our ability to bring in C models and fast models and connect that platform up to a simulation or emulation is getting better. Further up the chain is the idea of digital twins. Simulation is not going to have 100X capacity, or become 100X faster. We’re going to have to be intelligent about this. We must find new and interesting ways to do abstraction. Virtual platforms are one of them. We need to embrace the digital twin concept by moving prototyping and emulation earlier in the flow, and find different ways of providing abstraction.”
Reuse within the flow is important. “Another promising direction is shifting design and verification to the left, meaning identifying errors early in the development process,” says Fraunhofer’s Pachiana. “SystemC and UVM-SystemC are useful for this task. While this adds another layer to development and consumes project time, the key is to re-use early-stage efforts and demonstrate the benefits.”
The industry does not like revolution. “Nobody is going to completely overhaul the way they do things,” says Siemens’ Fitzpatrick. “That’s just a fact. That’s part of why it’s been incremental thus far, because there’s only so much you can do. That’s where things like Portable Stimulus come into play. That was intended to be a revolutionary step in an evolutionary framework. It lets you take advantage of existing infrastructure while adding capability that you can’t get with UVM. That’s the way it’s going to be successful.”
The challenge of building the model remains, however. “We’ve gotten better at verifying the models when you build them,” says Cadence’s Graham. “There’s more availability of models, certainly for processors, and models of protocols, models for stuff within a CPU subsystem like coherency and performance. That is the next level of abstraction, but to build a proper digital twin you need a reliable way to build models.”
That requires some clear thinking. “We need to be bold and just accept it – more and more simulation cycles and mindless functional coverage will not find all your bugs,” says Darbari. “I love formal, and I truly believe it gives the maximum ROI compared to any other verification technology, simply because it can provide exhaustive proofs and reason about the what rather than the how. However, I’ve also seen mindless application of formal resulting in a poor yield. Thinking about requirements, interface specifications, and the relationship of the micro-architecture with the architecture and with software/firmware will make it easier for everyone to look at the big picture while also grasping the finer details, leading to a better verification methodology.”
AI to the rescue?
AI is being integrated into many aspects of design and verification. “Verification has latched on to the excitement of breakthroughs with AI,” says Graham. “Customers are asking what we are doing with AI and how they can leverage it. We need to catalyze all of our engineers, because we haven’t got enough of them.”
There is some low-hanging fruit. “You don’t have time to simulate all the things that you want to simulate in a reasonable amount of time,” says Keysight’s Mueth. “You could utilize AI assistance by drawing correlations in the simulation data to say, ‘Based on a, b, and c, you don’t need to simulate x, y, and z.’ That’s a typical AI type of problem. But you need lots of data to fuel machine learning.”
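One simple proxy for that idea, assuming nothing more than a table of previously simulated corners, is to skip pending runs whose conditions sit very close to a completed run in parameter space. A production flow would learn such a model from simulation data rather than rely on a hand-tuned threshold; the sketch below is only conceptual.

```python
import math

# Hypothetical corners: (voltage V, temperature C, clock MHz).
simulated = [
    (0.80, 25, 1000),
    (0.72, 85, 1000),
    (0.90, -40, 1200),
]
pending = [
    (0.80, 27, 1000),    # nearly identical to a completed run
    (0.72, 125, 800),    # genuinely new territory
]

def distance(a, b, scale=(0.1, 50.0, 200.0)):
    """Normalized distance in parameter space (the scales are assumptions)."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(a, b, scale)))

SKIP_THRESHOLD = 0.5   # tuning knob; a real flow would learn this from data

for corner in pending:
    nearest = min(distance(corner, done) for done in simulated)
    verdict = "skip (covered by a prior run)" if nearest < SKIP_THRESHOLD else "simulate"
    print(corner, "->", verdict)
```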
There are several ways in which regression can be optimized. “When you make a change in the design, which tests target that area?” says Salz. “You only need to run a subset of the tests. Then there is reinforcement learning, which can prune a particular test. If this test is very similar to a previous one, don’t run it. That way you can maximize test diversity. Before, all you had was different random seeds to create test diversity.”
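Both pruning ideas can be shown in a conceptual sketch, with hand-written test footprints standing in for a real coverage database: first select the tests that touch the changed area, then drop near-duplicates so the subset stays diverse.

```python
# Illustrative only: real flows would derive footprints from coverage data
# and use learned models, not hand-written maps and a fixed threshold.
tests = {
    "dma_burst":     {"dma", "bus"},
    "dma_unaligned": {"dma", "bus"},      # nearly identical footprint
    "cache_evict":   {"cache", "bus"},
    "irq_storm":     {"irq", "cpu"},
}
changed = {"dma", "bus"}                  # blocks touched by the design change

def jaccard(a, b):
    """Similarity of two coverage footprints (1.0 means identical)."""
    return len(a & b) / len(a | b)

selected = []
for name, footprint in tests.items():
    if not (footprint & changed):
        continue                          # does not target the changed area
    if any(jaccard(footprint, tests[s]) > 0.9 for s in selected):
        continue                          # too similar to a selected test: prune
    selected.append(name)

print("regression subset:", selected)     # e.g. ['dma_burst', 'cache_evict']
```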
Time pressure is increasing, which has put a spotlight on efficiency. “Twenty years ago, there was a quality gap,” says Graham. “Broadly, the industry knows how to close that quality gap today. Now, it is an efficiency gap. That’s why everybody is talking about shift left, and productivity, and time to market, and turnaround time. That’s leading the push into looking for ways to leverage AI. Productivity improvements are not going to come from the same things they came from 10 or 20 years ago.”
Large gains will come from very different approaches. “Many problems are caused by ambiguity in the specification,” says Salz. “I have hopes that we can use GenAI large language models to parse the specification and use it as a co-pilot. It can then ask if this is what we meant. What’s missing from GenAI is the ability to generate a timing diagram, or generate UML, so that the designer or the architect is in the loop. There’s hope that we can have tools go from language specification to more formal forms of specification and automate some of that. This is not using AI to write the design. I don’t believe that’s there yet.”
But this could fill the model creation gap. “I’ve seen a couple of papers about using AI to build these models, whether it is top-down — where I read a specification and generate a C model from it — or a bottom-up, observational-type modeling technique, where you observe what an RTL model does and then statistically build a model at some higher level of abstraction,” says Graham. “We’re not there yet. But I think that’s one of the potential usages of AI. It might actually help solve a very tangible problem.”
Conclusion
There is a growing tool gap in verification. Existing tools are not capable of dealing with system-level issues, and that is where the most complex problems hide. While some new languages and tools are being developed to fill the void, they have been slow to see adoption. The industry appears to be stuck on the RTL abstraction, which is causing bottlenecks in model execution.
To encourage development teams to migrate to higher levels of abstraction, new tools are needed to fill the modeling gap in either a top-down or bottom-up manner. While AI may be able to help, that capability does not exist today.
Related Reading
Trouble Ahead For IC Verification
First-time silicon success rates are falling, and while survey numbers point to problem areas, they may be missing the biggest issues.