Trouble Ahead For IC Verification

First-time silicon success rates are falling, and while survey numbers point to problem areas, they may be missing the biggest issues.


Verification complexity is roughly the square of design complexity, but until recently first-time silicon success rates had remained fairly consistent. That’s beginning to change.

There are troubling signs that verification is collapsing under the load. The first-time success rate fell in the most recent survey, conducted by Wilson Research on behalf of Siemens EDA in 2022 (see figure 1). A new survey is planned for early next year, but the industry is not hopeful that the numbers will improve, or that these surveys are capturing the whole problem.

Fig. 1: Number of spins before production. Source: Siemens EDA/Wilson Research

The survey asks about the reasons for those respins, as shown in figure 2.

Fig. 2: Cause of ASIC respins. Source: Siemens EDA/Wilson Research

There are two sets of reasons for the increased failure rate, only one of which is told by the numbers. Figure 2 contains new categories that have been appearing over the years, but which were not concerns in the past. These new problems have emerged for several reasons: technology advancements create additional issues, such as thermal; new application areas, such as automotive, add demands for safety and security; and a dramatic increase in parametric issues is making digital circuits behave more like analog blocks, which are almost impossible to fully evaluate.

In addition to the numbers, there is a human aspect that needs to be considered. Teams are expected to do more, and schedules are being tightened. At the same time, the industry is going through a dramatic talent shortage, and new players are entering the semiconductor design industry with requirements that may not be fully met by existing tools and flows.

It is a complex mix of issues that cannot be addressed in a single story. This one will address some of the human issues and leave the more technical issues to a subsequent story.

Schedules and staffing
Statistics never really tell the whole story. “Something like 60-plus percent of respins were due to specification issues,” says Arturo Salz, fellow at Synopsys. “This could include misreading of the specification, human error, and understanding of the spec. It could also include rapid changes to the spec as the project moves along. This signals to me that people are under schedule pressures, and rather than wait for the next version of a product, they change it midway through the development cycle. All of the plans laid out by the verification team and designers go out the door.”

If this is the way things are now being done, then tools and flows must change to adapt to it. “Changes in requirements and specifications are a common source of production bugs,” says Gabriel Pachiana, virtual system development group at Fraunhofer IIS’ Engineering of Adaptive Systems Division. “One way to reduce this burden is to build consistent tooling from requirements to the final device. This ensures that requirements are met by tracking them, implementing them in design, and confirming intended behavior through testing, all linked together and easily followed.”
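To make that idea concrete, here is a minimal sketch of requirements-to-test traceability in Python. The requirement IDs, module names, and test names are hypothetical, and a production flow would use dedicated requirements- and verification-management tooling rather than a script; the point is only that untraced requirements can be flagged automatically.

```python
# Minimal traceability sketch: link each requirement to the design units that
# implement it and the tests that confirm it, then flag anything left unlinked.
# All IDs and names below are hypothetical examples.
from dataclasses import dataclass, field

@dataclass
class Requirement:
    req_id: str
    text: str
    design_units: list = field(default_factory=list)  # modules claiming to implement it
    tests: list = field(default_factory=list)         # tests claiming to confirm it

def untraced(requirements):
    """Return requirements missing an implementation link or a confirming test."""
    return [r for r in requirements if not r.design_units or not r.tests]

reqs = [
    Requirement("REQ-001", "FIFO asserts full when 16 entries are occupied",
                design_units=["fifo_ctrl"], tests=["test_fifo_full"]),
    Requirement("REQ-002", "Reset clears all status registers",
                design_units=["csr_block"], tests=[]),  # no confirming test yet
]

for r in untraced(reqs):
    print(f"{r.req_id} is not fully traced: {r.text}")
```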

Engineers are always asked to do more with less. “While complexity is going up, the development window is going down, and when they have that kind of pressure, you have to start looking at where you take shortcuts,” says Chris Mueth, business development, marketing, and technical specialist at Keysight. “What am I going to simulate? What will I skip? People can’t simulate everything. They don’t have the capacity, the compute capacity, the time, and so they have to make educated guesses on what they’re going to do and where to take risks.”
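Those educated guesses can at least be made systematically. The sketch below, with invented test names, runtimes, and risk weights, ranks candidate regressions by a simple churn-times-importance score and keeps whatever fits in the available compute budget; it illustrates the triage, not any particular team’s methodology.

```python
# Illustrative triage of what to simulate under a fixed compute budget.
# Names, runtimes, and scores are invented for illustration, not a standard metric.
tests = [
    # (name, sim_hours, recent_rtl_changes, importance 1-5)
    ("pcie_link_training", 40, 12, 5),
    ("power_state_walk",   30,  8, 5),
    ("ddr_refresh_corner", 25,  2, 4),
    ("gpio_basic",          2,  0, 2),
]

budget_hours = 70
# Rank by risk per simulation hour: (churn x importance) / cost.
ranked = sorted(tests, key=lambda t: (t[2] * t[3]) / t[1], reverse=True)

selected, used = [], 0
for name, hours, churn, importance in ranked:
    if used + hours <= budget_hours:
        selected.append(name)
        used += hours

print(f"Simulating {selected} ({used}h of {budget_hours}h budget)")
```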

This is coming at a time when there are not enough engineers. “The big problem is that the talent pool, in terms of verification engineers, hardware engineers, is struggling to keep up,” says Matt Graham, product management group director at Cadence. “As a result, teams are either understaffed or have a lot of new staff. As an industry, we didn’t do a great job for a lot of years attracting new people. We have a talent shortage, in general, and that’s going to show up especially in verification. There is no algorithmic way to determine that we are absolutely done. When you don’t have enough people, the ‘not done’ list gets longer.”

The industry has been expanding rapidly, putting more pressure on staff. “It’s not just true in the IC design industry,” says Prakash Narain, CEO of Real Intent. “It’s true in EDA, it’s true in all the tech industries. The perception of the opportunities is a lot larger than the amount of resources available to service those opportunities. There’s a lot of competition, and it takes a lot of discipline to come out successfully. We go through cycles of perception of huge opportunity, many players jump in, there are some winners and losers, then things consolidate and settle down. Then the cycle repeats itself. We are in the middle of this big perception cycle, and for the time being, or for the immediate future, this problem is just going to persist.”

It takes time and the right resources to train new people, too. “I’ve been spending the last year or so trying to get universities to teach this stuff,” says Tom Fitzpatrick, strategic verification architect at Siemens EDA. “Unfortunately, there aren’t a lot of people who can do that. This is not an easy thing to do. We’ve been doing it for a long time, but to a college sophomore thinking about what they are going to spend their electives on, they need to know hardware design, software design, and then they need to know SystemVerilog, UVM, PSS. It’s very difficult to get them to understand, at that age, this is a good career path. The problem we’re seeing is that most of these people want to write machine learning algorithms and go work for Google. What they don’t understand is that machine learning implies that there is a machine.”

The staffing shortage creates a secondary problem. “In the Bay Area, people leave every two years and go into another company,” says Keysight’s Mueth. “Even on an analog chip you might have an RF expert, you might have an EM expert, you might have a thermal expert, a circuit expert — just on one particular thing. All these disciplines need to work together, concurrently and efficiently. And if they’re able to do that, then you can have more time to do more simulations, or you can figure out how to get your confidence higher collectively. It becomes a collaborative flow. It can’t just be a collaborative environment. Everybody’s going to talk to each other. Magical things are going to happen. It needs to be tuned to the application at hand to get those efficiencies. People are scrambling today.”

But if people keep shifting to new companies, it becomes increasingly difficult to build synergistic teams and productivity suffers.

Process immaturity
New problems emerge with each technology node and with advanced packaging technologies such as 2.5D. In addition, systems companies are now building chips, and chip companies are building systems.

“We are seeing a lot of first-time systems companies that are starting to do pretty significant designs right off the bat,” says Siemens’ Fitzpatrick. “Their processes may not be as mature as others. Even though they’re trying to use all these technologies, they may not be doing it efficiently or effectively yet.”

Different economics are driving these companies. “Every systems company has said to me, at one point or another, ‘If we can fix it in software, it’s not a bug,'” notes Cadence’s Graham. “Their version of first-pass success may be slightly different than someone who is developing a chip. If the chip is the product, that chip has to work. There is a different qualification for that. They may be less focused on that first pass success, or they may have a different version of what constitutes first pass success.”

There are many new things that have to be learned and mastered. “Multi-physics is being driven by 2.5D,” says Marc Swinnen, director of product marketing at Ansys. “Multi-physics are novel to a lot of designers. Where do they get the expertise from? A lot of these physics were used at some level in the flow. For example, thermal was used in packaging. They did a quick estimate of thermal to see that it was not going to exceed any limits. The expertise, if it is somewhere, is often in a different group. For a lot of these multinational companies, that group could be in Bangalore, Israel, or wherever. The problem is that the expertise is not in the team. What’s happening is an organizational problem. This person is assigned to a manager who’s in a different group and who has different responsibilities. And now you need to rejigger your organization managerially, which is often more of a problem than physics.”

The changes in technology are both pervasive and perpetual. “Design paradigms, or methodologies, are not static, because every time you do the next-generation design you’re trying to do something new, something different,” says Real Intent’s Narain. “New and different means that your past experience, which allowed you to cover for everything that you have done in the past, may not completely cover for everything that you’re attempting to do in the future. The very fact that designs are continuously pushing the envelope means there is a sufficient amount that changes in the design and the design methodology, and that leaves holes.”

When given a new problem, it takes a while for the right solution to be fully integrated into an existing flow. “It is not because we don’t have the right tools or the right methodologies,” says Synopsys’ Salz. “It is often improperly executed, or the immaturity of the process. Teams that start to use constrained random sometimes don’t have good functional coverage. They don’t have a good measure of how good the verification is. At the system level, we’ve got systemic complexity coming in. Then you have software, firmware, all sorts of things. With multi-die, it’s going to get even worse. It’s going to get gigantic.”
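A toy Python model of Salz’s point about constrained random without functional coverage: stimulus is randomized within legal bounds, but only an explicit coverage model, with bins invented here for a hypothetical packet interface, tells the team whether the interesting corners were ever actually exercised.

```python
# Toy functional coverage model for constrained-random stimulus.
# The packet-length bins are hypothetical; real flows track this with
# SystemVerilog covergroups or a verification-management tool.
import random

coverage_bins = {"min": False, "typical": False, "max": False}

def classify(pkt_len):
    if pkt_len == 1:
        return "min"
    if pkt_len == 1024:
        return "max"
    return "typical"

random.seed(1)
for _ in range(500):
    pkt_len = random.randint(1, 1024)   # constrained to the legal range
    coverage_bins[classify(pkt_len)] = True

missed = [b for b, hit in coverage_bins.items() if not hit]
print(f"Functional coverage: {sum(coverage_bins.values())}/{len(coverage_bins)} "
      f"bins hit; missed: {missed}")
```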

Methodologies that involve any type of software remain immature. “System-level verification might include firmware,” says Dave Kelf, CEO for Breker. “It is immature compared to block test, and UVM is hard to apply at this level. A realization of the importance of system verification and the fact that it cannot be accomplished simply by running real workloads has become critical, and many companies are now exploring this.”

Some teams are failing to keep up. “So many times I encounter projects where the team has decided on using simulation, decided on the resourcing effort, and plans have been made on the final tape-out date without a clear set of specifications being documented describing key features, configurations, changeset from the previous project, and key risks being identified,” says Ashish Darbari, CEO for Axiomise. “When we say methodologies, it does not mean just how a particular verification technology would be used. Instead, it means how the combination of different verification technologies, tools, and methods should be combined to ensure the best outcome. We can bring in formal, portable stimulus, emulation, or the next big thing. But unless we have the courage to accept what is wrong, we cannot change for the better.”

None of this is a criticism of the engineer or the companies that employ them. “These are very good engineers, they are very well motivated, and they cover their methodologies very well,” says Narain. “But the amount of new in some cases is a lot higher than more traditional design houses, where the amount of newness is relatively less. But in terms of the skill set, and the quality of methodologies and work and engineers and everything — if you look at that, there’s really not much difference between these two. It’s just an element of newness.”

More to verify
Semiconductors are finding their way into new application areas that often have new requirements.

“The drop in success rates is not solely due to increasing design complexity,” says Fraunhofer’s Pachiana. “Designs are now exposed to relatively new hazards, such as safety and security issues, which are evolving over time. Particularly in the case of security, new and different types of attacks are constantly being developed.”

Safety and security are not the only areas that are evolving. “There are three axes of verification,” says Axiomise’s Darbari. “First is functional, second is PPA, and third is safety and security. Whether we look at automotive, where functional safety is paramount both from reliability as well as liability (conformance to ISO 26262), or IoT, where security is the main driver, you cannot wish safety and security verification away. The challenge we have is that the verification community is still very heavily focused on, ‘Let’s build our testbenches, send in some sequences, and think about the micro-architectural specs later.’”

Often, multiple factors build on top of each other. “One of the ways that we address safety is things like triple voting flops,” says Graham. “If you triple the number of flops, you are driving gate count higher, and that is part of complexity. It is also driving the need for things like 3D-IC and chiplets. These require a broad array of experts. It’s no longer just a digital designer. We need a digital and analog designer, a digital verification engineer. Now you need a safety verification engineer or a safety designer. And then the system-level expert. You need all of your teams to have some expertise in this, and you need these vertical experts, which is driving up the requirement for more talent. You’re either taking a good verification, or good design engineer, good place-and-route person, and saying you’re going to be our expert in this vertical. Now you need to replace that person. Or you need to bring that in, and suddenly your team is that much larger.”
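For readers unfamiliar with the technique, triple voting (triple modular redundancy) stores each protected state bit three times and takes a majority vote so that a single upset is masked. The behavioral Python sketch below is purely illustrative; real designs implement the triplicated flops and voters in hardware.

```python
# Behavioral sketch of triple-voting (TMR): a bitwise 2-of-3 majority vote
# masks a single-event upset in one of three redundant copies of the state.
def majority(a: int, b: int, c: int) -> int:
    return (a & b) | (a & c) | (b & c)

stored = 0b1011
copies = [stored, stored, stored]
copies[1] ^= 0b0100            # flip one bit in one copy (single-event upset)

assert majority(*copies) == stored   # the voter recovers the original value
print(f"voted value: {majority(*copies):04b}")
```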

Functional verification has become totally wrapped into the entire development flow. “Power is a good example,” says Salz. “You need to be able to estimate power early on, and make sure, at the system level, that you’re staying within that envelope. Some things are going to get worse. Thermal is a big thing with multi-dies. You need to start thinking about all of these things. You need to think about how you are going to test it once you’ve integrated the chiplets. Are you going to test as you integrate, or are you going to just build the entire SoC and then test it?”

Some point tools may not yet be ideally integrated into flows. “Thermal is a new consideration, but many of the tools available use a brute-force dumb approach of doing synthesis to get everything to gates, just to figure out the power,” says Swinnen. “Synthesis takes a long time, a lot of resources, and is very capacity-limited. A better approach is to use heuristics that are based on many years of experience. These look at the circuit and divide it into different function types. This is random logic, this is memory access, this is clock logic, this is data path logic, and apply different heuristics to each of these regions. It builds up an estimate of the power in the system. It is often within 5% of the final gate level numbers. The point of this is optimization. ‘I’ve got this architecture. What if I change it? Would that increase the power or lower the power?’ You’re not worried about the last watt. You’re worried about if your optimization is increasing or decreasing the power. It is also the time in the process where power and thermal are being understood, and which cycles need to be identified for use in other parts of the flow.”
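A rough sketch of that style of heuristic estimate, using made-up region sizes, activity factors, and per-gate coefficients rather than values from any real tool, shows why it is fast enough for architectural what-if loops: the whole calculation is a weighted sum rather than a synthesis run.

```python
# Heuristic pre-synthesis power estimate: partition the design into region types
# and apply a per-type power coefficient. All numbers are invented placeholders.
regions = [
    # (region type, estimated gate count, switching activity factor)
    ("random_logic",  2_000_000, 0.15),
    ("memory_access",   500_000, 0.30),
    ("clock_tree",      300_000, 1.00),
    ("datapath",      1_200_000, 0.25),
]

# Hypothetical dynamic power per gate at full activity, in microwatts, per type.
uw_per_gate = {
    "random_logic": 0.020,
    "memory_access": 0.035,
    "clock_tree": 0.050,
    "datapath": 0.030,
}

total_mw = sum(gates * act * uw_per_gate[kind] for kind, gates, act in regions) / 1000
print(f"Heuristic power estimate: {total_mw:.1f} mW")

# Changing an architectural knob (say, halving datapath activity) re-runs
# instantly, which is what makes this useful for early optimization loops.
```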

Conclusion
There is an increasing amount of newness in many designs today, driven by technological advancements, domain requirements, and changing views of systems and the methodologies required to design them. It often requires an influx of new expertise that can be difficult to properly integrate. The need for larger, more diverse teams is hampered by an industry-wide lack of talent, which is causing significant instability within companies. A dip in first-time success rates is perhaps unsurprising. The real question is whether it is the beginning of a longer decline, or a wakeup call for change.

Related Reading
Formal Verification’s Usefulness Widens
Demand for IC reliability pushes formal into new applications, where complex interactions and security risks are difficult to solve with other tools.
New Concepts Required For Security Verification
Why it’s so difficult to ensure that hardware works correctly and is capable of detecting vulnerabilities that may show up in the field.


