Inevitable Bugs

What differentiates an avoidable bug from an inevitable bug? Experts try to define the dividing line.


Are bug escapes inevitable? That was the fundamental question that Oski Technology recently put to a group of industry experts. The participants are primarily simulation experts who, in many cases, help set the verification direction for some of the largest systems companies. To promote free discussion, all comments have been anonymized and distilled into the participants’ primary points. The list of participants can be found at the bottom of the post.

At last year’s discussion, a formal capability maturity model was presented. This year, Oski built on top of that by presenting a formal signoff playbook. One of the first questions they asked was, “Are functional bug escapes inevitable?” They were somewhat taken aback by the responses they received. While it is generally accepted that it is impossible to remove every bug from a design, the inevitability of bugs has to be tied to the complexity of a design and the team’s familiarity with it. Bugs in general may be inevitable, but that is not necessarily true of any particular bug.

Once a bug is found, possibly very late in the verification process or in silicon, is it possible to look back and draw any conclusions about how it escaped to that point? Participants were quick to point out that all verification is finite in the time and resources that can be expended on it. That in itself implies that bugs are inevitable. You accepted that when the timeline or budget was set. In addition, rooting out some bugs may be beyond the ability of formal or simulation. At some point you have to accept that bugs will escape into silicon, and the hope is that those problems can be handled by software workarounds. So the scope of the question was narrowed to high-impact bugs that need to be found within project time.


Fig. 1: Categorizing bugs. Source: Oski Technology

Another question was posed: “When one of these bugs is found, how many blocks are impacted?” The general consensus was that the vast majority of cases come down to a single block of RTL that has to be modified. This says nothing about how many blocks are involved in the manifestation of the bug, but in general only one block is modified to resolve the issue. There may even be a choice as to which block is changed.

The reason is that these bugs tend to be found late in the project, or in silicon. Tracking down the root cause of the problem may involve multiple blocks, but because of where you are in the project timeline, there is a desire to minimize the disruption to the system. There was acceptance that this type of fix is often not a true fix but a hardware workaround. That translates into making the fix as localized as possible, even if it means giving up a small amount of performance. In some cases, the decision may be made based on the availability of spare cells, or restricted to a metal-only change if it is very late in the process.

Can we predict where this type of bug will exist? In many cases we can. It is likely the most complex control block, the one fielding asynchronous events with interfaces on both sides running on different clocks. While most teams know this block will likely cause trouble, and they would always want their most experienced team members working on it, they never have enough time to explore all of its corner cases.
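As a concrete illustration of the kind of corner case such a block hides, here is a minimal Python sketch, with clock periods and pulse width chosen purely as assumptions, of a single-cycle pulse crossing from a fast clock domain into a slower one. If the destination domain only samples on its own clock edges, a short asynchronous pulse can be lost entirely, a failure that appears only for particular alignments of the two clocks.

```python
# Minimal sketch of a clock-domain-crossing hazard: a pulse asserted for one
# fast-clock cycle can fall entirely between two sampling edges of a slower
# clock. All periods and offsets here are illustrative assumptions.

def slow_domain_sees_pulse(pulse_start, fast_period=3, slow_period=7, horizon=100):
    """Return True if the slow domain ever samples the pulse while it is high.

    The pulse is high during [pulse_start, pulse_start + fast_period).
    The slow domain samples at t = 0, slow_period, 2*slow_period, ...
    """
    pulse_end = pulse_start + fast_period
    for t in range(0, horizon, slow_period):
        if pulse_start <= t < pulse_end:
            return True          # sampled while high: pulse observed
    return False                 # pulse fell between two sampling edges

# Sweep alignments of the pulse against the slow clock. Most are caught,
# but certain alignments silently lose the event.
lost = [s for s in range(21) if not slow_domain_sees_pulse(s)]
print("pulse start times whose pulse is lost:", lost)
```

In real RTL the cure is a proper synchronizer with a handshake or pulse stretching, but the point of the sketch stands: the failure occurs only for specific clock alignments, exactly the kind of condition that is rarely hit by chance.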

This is where use cases become important. There might be a bug buried somewhere in a weird combination of states, but if that combination will never show up in the use cases exercised in real life, you may not care about it. Some people may call these corner cases, and if they can never happen they are not important. But things that can happen in the real use cases are important, and it is equally important that those use cases be well defined.

The discussion identified three fundamental questions:

  1. Can we predict the blocks where the high-impact bugs of a design will be located?
  2. Can we find all bugs in those blocks?
  3. How do we know when we are done?

When put this way, there was immediate dissent. The framing is basically a fallacy because it implies that bugs are findable against some golden source, meaning you are verifying to an architecture definition. But the architecture definition itself would have to be perfect for there to be no bugs, and it never is. Therefore, there are bugs.

The second question was quickly revised to, “Can we locate all meaningful bugs in those blocks?”

Oski’s Shirley provided a recap of how the industry got to where it is today, and the important element is that when frequency scaling and power-efficiency scaling ended, we pivoted to parallelism. That, in turn, caused an explosion in state space. Parallelism can exist in the processor space, in communications lanes, in all aspects of the system. This received some pushback, because adding more cores does not necessarily increase complexity. It is the out-of-order, superscalar machine that adds complexity, and that complexity is in the pipelines of those machines. Squeezing out a 1% increase in single-thread performance grows the state space far more than simply adding more cores does.
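A back-of-the-envelope calculation, with assumed numbers rather than anything cited by the panel, makes that distinction concrete. Truly independent cores can be covered by verifying one representative core, while entangled units multiply the state space:

```python
# Back-of-the-envelope state counts. The per-unit state count and the number
# of units are assumptions chosen purely for illustration.

s = 1000        # states per unit (assumed)
k = 8           # number of units (assumed)

# If k replicated cores are truly independent, verifying one core covers
# them all, so the interesting state count stays s. Once their states
# interact (shared queues, out-of-order completion), the composite space
# approaches s**k.
independent = s
entangled = s ** k

print(f"independent cores: {independent:,} states to cover")
print(f"entangled units  : {entangled:.3e} state combinations")
```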

Other people agreed with the original statement, saying that we have added different levels of complexity by putting in hundreds or thousands of cores and trying to control the parallelism and synchronization of those cores, while also sharing resources and distributing packets across multiple lanes and ports. In the end, everything has to come together.

But the focus of the discussion remained on single cores and the complex blocks within those cores. The feeling was that these blocks are still a tougher verification problem than SoC-level verification, even though it is possible to have IP bugs that only show up when the IP interacts with the full SoC. That raises the question, “Should you have been able to detect those at the IP level?” This points to the use cases for the blocks not having been defined correctly. Alternatively, it is a specification problem in which two or more blocks do not work properly together even though there are no implementation bugs in either block.

The discussion again came back to the notion of “inevitability” of bugs. That word was troubling to many. It was suggested that “justifiable,” or “disappointed but not surprised,” might be more appropriate terms. A typical reaction is that, given the amount of work, we thought the bug would not have escaped, but it did. At the same time, we cannot be surprised, because we did not put in unlimited work. And it is not inevitable, because if we had kept at it, we would have found it.

This comes down to having defined a good verification methodology, one that is most likely to find the bugs you care about within project time. When bugs do escape, you conduct a post-mortem and decide whether the methodology needs to change in the future. Perhaps it was a coverage hole, or the wrong tool was being used to tackle a class of potential problems.

There are some areas of a design where we would be shocked to find bugs because they are so simple, and those are where we are most disappointed. This can sometimes be traced to staffing decisions. Teams do identify the most likely places for bugs to exist, the thorniest issues they are going to face on a chip, and they put their best people on those. If they forget about the simple stuff, that is when avoidable bugs occur.

Can we say there are certain bugs that simulation would not find? Even deploying emulation, which can provide more thorough checking, may be unable to reach them. If a bug is 20 levels deep and requires many combinations of things to happen, it will escape even if you suspect it is there. Emulation may get you to a prototype faster, but it is not as good when used as a simulation enhancer. There are times when you have to use a more advanced methodology, like formal, which has a chance of finding it.

Bugs are inevitable if you do not have the right set of tools or the right set of people. So if a simulation team could not have found a bug because it is such an extreme corner case, should it be considered inevitable? Some teams have found issues using formal that were missed during simulation. This brings up ROI, which can be a thorny issue because the analysis is done after the fact. It is fairly easy to show how simulation could have caught a bug, given more time or resources. But how much more?

This requires a careful examination of the simulation methodology. Could a constrained-random solver have gotten you to that particular scenario? This is very different from just saying that, given infinite time, we would have found it. When formal is used to find a bug, it generates a concise counter-example that can be recreated in simulation. When you know where the bug is, you can hit it.
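The difference is easy to demonstrate on a toy state machine; this hypothetical Python sketch is an illustration, not an example from the discussion. A random walk almost never strings together the 20 specific inputs needed to reach a deep bug state, while a breadth-first search, loosely what an explicit-state formal tool does, finds it immediately and returns the exact input sequence:

```python
import random
from collections import deque

DEPTH = 20   # the bug needs 20 specific inputs in a row (toy assumption)

def step(state, bit):
    """Toy FSM: counts consecutive 1s; any 0 resets it. The bug is at DEPTH."""
    return state + 1 if bit == 1 else 0

def random_sim(trials=1_000, length=100):
    """Constrained-random flavor: random input streams, watch for the bug state."""
    for _ in range(trials):
        state = 0
        for _ in range(length):
            state = step(state, random.randint(0, 1))
            if state == DEPTH:
                return True
    return False   # with this budget, the roughly 2**-20 event is almost never hit

def bfs_counterexample():
    """Exhaustive breadth-first search, loosely what an explicit-state formal
    tool does: returns the shortest input trace reaching the bug state."""
    seen, frontier = {0}, deque([(0, [])])
    while frontier:
        state, trace = frontier.popleft()
        if state == DEPTH:
            return trace                      # a concise counter-example
        for bit in (0, 1):
            nxt = step(state, bit)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, trace + [bit]))

print("random simulation hit the bug:", random_sim())
print("formal-style counter-example :", bfs_counterexample())
```

Replaying the returned trace through the same model reproduces the failure deterministically, which is the “recreated in simulation” step described above.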

Simulation methodologies are highly established within companies, and there are far more resources available for them. The ecosystem is better, and the simulation infrastructure is already there. But when you look at a new company with a brand-new design, is there an opportunity for something different?

You now have similar startup costs for both simulation and formal. The cost equation usually favors simulation only because it’s already there. When you start from scratch, the equation is different.

But there are problems with this, as well. Few systems are 100% verified with formal. There may be some core blocks in a design that are formally verified, but for most designs, simulation also will be used. A complete verification methodology will require both infrastructures to be built, but the split between the two may be very different.

Oski went through several examples, trying to refine the distinctions between bug categories. Complexity again featured very prominently in the discussion.


Fig. 2: Case study involving a GPU pipeline. Source: Oski Technology

Unless pipelines are of sufficient depth and a bug requires the injection of asynchronous events to appear, bugs should be considered disappointing, not inevitable. This is because constrained random should be capable of providing good coverage. There is a set of conditions that a constrained-random instruction sequence generator should be able to generate in the amount of time, and with the simulation resources, you should expect to have for a project like this.
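For reference, this is the flavor of generator being described. The mini-ISA, the weights, and the dep_bias knob below are invented for illustration; real generators constrain operands, dependencies, and ordering in far richer ways:

```python
import random

# Hypothetical mini-ISA and bias weights, assumed for illustration only.
OPCODES = ["add", "mul", "load", "store", "branch"]
WEIGHTS = [4, 2, 3, 2, 1]          # bias toward ALU/memory traffic (assumed)
REGS = [f"r{i}" for i in range(8)]

def gen_sequence(length, dep_bias=0.6, seed=None):
    """Generate a constrained-random instruction sequence.

    With probability dep_bias, the next instruction reads the previous
    destination register, creating the back-to-back dependencies that
    stress forwarding and hazard logic.
    """
    rng = random.Random(seed)
    seq, last_dst = [], None
    for _ in range(length):
        op = rng.choices(OPCODES, weights=WEIGHTS, k=1)[0]
        dst = rng.choice(REGS)
        src = last_dst if (last_dst and rng.random() < dep_bias) else rng.choice(REGS)
        seq.append(f"{op} {dst}, {src}")
        last_dst = dst
    return seq

for insn in gen_sequence(8, seed=42):
    print(insn)
```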

Another example involved a multi-way cache. There was more acceptance that this example did have some very tricky aspects to it, and that it would be simulation-resistant. However, it also was felt that some of the techniques used to assist with formal verification could equally have been applied to the simulation case, making it a lot more likely that the bug would have been found.


Fig. 3: Case study of an instruction cache. Source: Oski Technology

Methodologies are constantly in flux. At the lowest level, it is adjusting the coverage model as the team becomes more familiar with the design. Nobody ever gets 100% functional coverage, but you have to constantly assess whether the goal is what you need it to be. When bugs are found, at any stage in the development process, you are constantly assessing whether you need to make adjustments. The bar is getting higher and higher. Your methodology has to adapt to the bugs that your constrained random does not find. What may have been a superbug just became a regular bug.

Should we be looking at the coverage model as a line that marks the boundary of what you are not going to find? When we find one of these supposedly inevitable bugs, it is not that it was really inevitable. It is just that we drew the boundary of our coverage model in the wrong place.
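One way to make that boundary tangible is a minimal functional coverage model, sketched here in Python with invented bins. Whatever the bins do not describe is, by construction, on the far side of the line:

```python
from itertools import product

# Minimal functional coverage model (the bins are illustrative assumptions).
# A cross of FIFO occupancy and request type: any (occupancy, request) pair
# missing from `hit` after regression is a coverage hole, and anything not
# expressible in these bins lies outside the model entirely.

OCCUPANCY_BINS = ["empty", "mid", "almost_full", "full"]
REQUEST_BINS = ["read", "write", "flush"]
all_bins = set(product(OCCUPANCY_BINS, REQUEST_BINS))

hit = set()

def sample(occupancy, request):
    """Called from the testbench monitor on every transaction."""
    hit.add((occupancy, request))

# Pretend these transactions came out of a regression run.
for txn in [("empty", "read"), ("mid", "write"), ("full", "flush")]:
    sample(*txn)

holes = sorted(all_bins - hit)
print(f"coverage: {len(hit)}/{len(all_bins)} bins")
print("holes:", holes)
```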

All methodologies have an Achilles’ heel. They all assume the verification environment is perfect, which is never the case. The stimulus may be perfect, but there could be bugs in the checkers, or inadequate coverage models. Formal verification is based on defining a certain set of properties and assumptions. But one important element in the ROI case is how quickly a suspected bug can be analyzed. Is it a bug in the design or the testbench? How long does it take to understand what is happening and to get to the root cause?

Using the right tool is an important aspect of a methodology. For example, some people have tried to wrap their heads around the notion of using simulation to detect deadlock, while others look to solve that with formal, and still others strive for better design principles that avoid the issue altogether.
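As a sketch of what the formal approach amounts to at its core, here is a small cycle check over a wait-for graph; the agents and edges are invented examples. A cycle of agents each blocked on the next is precisely a deadlock, and exhaustive analysis can prove its absence, whereas simulation can only fail to stumble into it:

```python
# Deadlock as a cycle in a wait-for graph (agents and edges are invented).
# An edge a -> b means "a is blocked waiting on a resource held by b".
# A cycle is a deadlock; proving no cycle is reachable is a formal-style check.

def find_cycle(waits_on):
    """Depth-first search for a cycle; returns one as a list, or None."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in waits_on}
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for nxt in waits_on.get(node, []):
            if color.get(nxt, WHITE) == GRAY:        # back edge: cycle found
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_on):
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None

waits_on = {
    "dma":         ["bus_arbiter"],
    "bus_arbiter": ["mem_ctrl"],
    "mem_ctrl":    ["dma"],        # closes the loop: classic three-way deadlock
    "cpu":         ["mem_ctrl"],   # blocked, but not part of the cycle
}
print("deadlock cycle:", find_cycle(waits_on))
```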

If you have expected concurrency, you can handle that fairly well. It is when there’s unexpected concurrency, which normally comes from asynchronous events, that you are likely to have problems. When that happens deep inside a design it becomes increasingly difficult to generate all of the combinations, and these can change when minor changes are made to the design. This also can create unexpected issues if the design is reused with small modifications. The impact of those changes has to be carefully considered.

In conclusion, the group felt that there is a class of blocks that is easier to close with formal than with simulation. However, companies have been having success with simulation, and they should carefully consider how they continue to verify the highest-risk blocks. ROI is an important consideration. Companies that already have accepted the value of formal are more likely to be aggressive with its adoption, pushing the line between simulation and formal to achieve a higher verification “bang for the buck.”

It takes time, but when formal finds a bug and provides a concise counter-example that identifies the root cause, designers tend to be impressed. But they need to see something that will build that belief. So long as formal does not produce false negatives, it will gain credibility. Oski asserts that its false-negative rate is just 9.1%. When compared to typical simulation false-negative rates, which can be up to 60% in the early stages of the verification process, this number is very low.

Present in this discussion were Wayne Yun, AMD; Mark Glasser, Cerebras; Anatoli Sokhatski, Cisco; Shuqing Zhao and Hari Krishna Reddy, Facebook; Erik Seligman and Vinod Bhat, Intel; Ty Garibay, Mythic; Ambar Sarkar, Nvidia; Jay Minocha, Provino; Jacob Chang, SiFive; Carlos Basto, Synopsys. Presenting from Oski were Craig Shirley and Roger Sabbagh.



3 comments

Bernard Murphy says:

Great question and good discussion

Steve Hoover says:

If you had asked me 10 years ago, I would have answered, “of course it is not feasible to have bug-free silicon.” I no longer believe this. Even at today’s scale, I believe bug-free silicon could be possible. We just don’t have the tools to support it. What’s missing is convenient mechanisms to bridge abstraction levels in incremental, localized steps, to divide and conquer the verification problem. I believe it is possible to bridge all the way from ISA to CPU RTL and even system to RTL with the right tools and methodology. That’s what I am working toward, one step at a time, with Redwood EDA. You can watch it play out over the next decade, or be a part of it.

Kev says:

You really shouldn’t get bugs in (digital) things properly specified, the bugs should just be in the analog domains of power and RF where things are less binary.
