Bug-Free Designs

Does anyone really care if a design is bug-free? The cost probably would be prohibitive.

It is possible in theory to create a design with no bugs, but doing so is impractical, unnecessary, and extremely difficult to prove, even for the bugs you actually care about.

The problem is intractable because the potential state space is enormous for any practical design. The industry has devised ways to handle this complexity, but each has limitations, makes assumptions, and employs techniques that abstract the problem.

Industry best practice is to identify the bugs that cause problems and to fix those. Three primary techniques are used:

  • Writing directed tests that target particular behaviors, and for which the outcome is known;
  • Employing constrained random test pattern generation that attempts to explore behaviors of the design and look for anomalous results; and
  • Deploying verification techniques that can prove that certain properties hold true for a design.
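The second of these techniques can be sketched in miniature. The following is a hypothetical Python illustration (the device model `dut_add`, the reference `ref_add`, and the stimulus constraints are all invented for this example, not taken from any real testbench): generate random but deliberately biased stimulus, drive it through the design model, and compare every result against a known-good reference.

```python
import random

# Hypothetical device-under-test model: a saturating 8-bit adder.
def dut_add(a, b):
    return min(a + b, 255)

# Golden reference model the results are checked against.
def ref_add(a, b):
    return min(a + b, 255)

def constrained_random_test(trials=1000, seed=1):
    rng = random.Random(seed)
    for _ in range(trials):
        # Constraints: bias stimulus toward the saturation corner,
        # where anomalous results are most likely to hide, while
        # still sampling the full input range.
        a = rng.choice([0, 1, 254, 255, rng.randrange(256)])
        b = rng.choice([0, 1, 254, 255, rng.randrange(256)])
        assert dut_add(a, b) == ref_add(a, b), (a, b)
    return trials

constrained_random_test()
```

Real constrained-random environments (SystemVerilog/UVM, cocotb) add solvers, sequences, and scoreboards on top, but the core loop is the same: constrained stimulus in, checked results out.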

It is important to decide the best technique to use for each problem. “Some organizations have very advanced use models for formal, simulation and emulation,” says Ashish Darbari, founder and CEO of Axiomise. “Those companies actually do that very well. They have teams of people from simulation, formal, and emulation that sit together and routinely partition the verification concerns.”

Defining the process
That requires a process. “You want to write down all the different behaviors that you want to make sure work correctly,” says Mark Glasser, member of technical staff at Cerebras Systems (Glasser has since moved to Elastics.cloud). “We call that a coverage model. You need to develop a comprehensive coverage model, but you also need to look at workloads. What exactly are you going to be doing with this device? One particular workload may work well on this device, and another one may not on the same device. It is a combination of workloads and coverage model, and then just seat-of-the-pants.”

That may sound very untechnical. “Seat of the pants is directly tied to experience,” says Larry Lapides, vice president of sales at Imperas Software. “You also need to identify what’s new in the design. If you’re iterating a design, looking closely at the pieces that are new is important. It is now over 20 years since we first introduced the idea of formally written verification plans. Some people still don’t actually write them down, or at least not in enough detail. This is where you get the different teams together, each with their different methodologies, and partition things out as to who’s going to do what.”

The existence of a verification plan means that a methodology is being adopted. “We have a fairly specific checklist for grading of features,” says Ty Garibay, vice president of engineering at Mythic. “First, we look at non-programmable state machines, and most importantly, non-programmable state machines that talk to other non-programmable state machines. The key is to make sure that the architecture did specify those control objects as state machines rather than just randomly coding RTL.”

There are many factors to consider. “You have to start by identifying where your highest risks are,” says Darbari. “Then you can look at your timeline, when are you going to tape out, the team that you have at your disposal, their skill level, the tools, the compute infrastructure that you’re going to throw your workload on, everything would count.”

There are different kinds of risk. “If you’re talking about business risk, you are considering if you are able to take this part and sell it to your customer,” says Glasser. “When talking about state machines, is the customer going to use all the states? If not, then all that extra work may make you more confident, but it doesn’t necessarily mitigate the risk of getting the part out the door and selling it to the customer. You need to look at what the customer is going to do with that chip and how it’s going to be used in the real world. That’s where you can start analyzing your risk from there.”

Are incremental designs easier?
Most designs are incremental, but that may be changing. “When doing a design from the ground up, it is the architect’s job to conceive of how to hook the pieces together, and to specify an architecture that can be verified,” says Mythic’s Garibay. “They specify the architecture and document it, define it in such a way that it has clean edges. The problem that flows through the whole sequence of design stages is then more tractable. If you don’t, you end up with a whole bunch of weird corner cases that people didn’t think about. Some people talk about design for verification, but this is even architect for verification.”

That on its own may not be enough. “RISC-V is a great example of an open-source architecture where so many clever people are developing these specifications in the RISC-V Foundation,” says Darbari. “And yet with all of this combined intellect, people still make mistakes. RISC-V’s weak memory model is a classic example. You only have 13 axioms, and yet it took a Princeton professor to find some inconsistencies. Now, RISC-V isn’t the only processor specification that has had issues. The literature is full of memory consistency issues across all processor vendors. This is a hard problem, so somebody needs to take a step back and write it down. A lot of the challenges are in not knowing.”

All of these techniques rely on being able to correctly specify what we wanted to design, that we know how to define completion of the task, and that we have unlimited time and money to spend on the problem. None of these reflects reality.

The problem is summarized by Cerebras’ Glasser. “In the verification world, we talk about making sure that the design matches the specification. That’s the golden reference. We have some model of the specification, and we see if our RTL, our implementation, matches that. I would argue though that we don’t know if the specification is correct. This shows up in the formal world, too. You never know if you have the right set of properties.”

Verification and validation
This brings up a second issue. Verification tells you whether an implementation matches a specification. Validation tells you whether the specification itself is correct. Both are poorly defined in practice. “We need all of the key technologies to come together, but we need to ask the right question at the right time and use the right technology for the right reason,” says Darbari. “Yes, I think verification is asymptotic, and we will never build a completely bug-free chip. But we will come very close to that, then be able to ignore any of the minor defects that may still leak. We can’t verify exhaustively. It’s a risk management exercise.”

Is that good enough? “We are building tens of billions of transistor devices that mostly work, and that’s almost verging on the level of magic,” says Garibay. “We have made a tremendous amount of progress, but we happen to be in a business that tends to grow the horizon every 18 months to 2 years. The challenge gets bigger and more interesting, which is why we keep doing it.”

There are certainly more things that have to be considered today. “It needs to not only work, it needs to work well for the use model, for the use case that the customers are going to use,” says Glasser. “Whatever application it’s going to be used for, you need to check it.”

Performance verification
Can formal be used for performance analysis? “While formal is not considered to be the de facto technology for performance verification, there are a lot of cases where you could apply formal,” says Darbari. “A lot of performance analysis is counter-based, making sure things happen within a certain bound, or exceed a certain bound. I know people might look at this with skepticism, but if you really were trying to establish a solid proof of whether those counters can exceed a certain bound, then you should consider formal.”
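For a state space small enough to enumerate, proving that a counter never exceeds a bound reduces to exhaustive reachability analysis, which is essentially what a formal tool does at scale with symbolic techniques. A toy Python sketch (the credit-counter protocol and its transition function are invented for illustration):

```python
# Exhaustively explore the reachable states of a toy credit counter
# and prove the bound holds in every reachable state -- the kind of
# counter-based property described above, in miniature.
MAX_CREDITS = 4

def next_states(credits):
    # Hypothetical protocol: grant a credit (+1), consume one (-1),
    # or stay idle; credits should remain within 0..MAX_CREDITS.
    succ = {credits}
    if credits < MAX_CREDITS:
        succ.add(credits + 1)
    if credits > 0:
        succ.add(credits - 1)
    return succ

def prove_bound(init=0, bound=MAX_CREDITS):
    seen, frontier = {init}, [init]
    while frontier:
        s = frontier.pop()
        if s > bound:
            return False  # counterexample found: bound exceeded
        for t in next_states(s):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return True  # property holds in all reachable states

print(prove_bound())  # prints True
```

A real formal tool would express the same property as an assertion over the RTL and discharge it with model checking rather than explicit enumeration, but the guarantee is the same kind: the bound holds in every reachable state, not just the ones a simulation happened to visit.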

Performance is not just a hardware concern. It also involves software, and that assumes that real software is available. “My most recent project is a ground-up development,” says Garibay. “There was no software. There wasn’t even a compiler when we were doing the hardware development. There was barely a compiler by the time we got to silicon. We took our best shot. Now we’re trying to use that learning to develop a second generation. Even so, there won’t be software to drive true performance analysis. We work with modeling, with high-level models, to try to get some feel for how the performance is coming along.”

Resilient design
With all of these pressures, is it possible to keep up? “As we increase complexity, does increased use of design IP actually help to reduce the number of bugs?” asks Lapides. “It’s the same from a verification IP perspective. Chips that are going to be built at advanced nodes, are they going to have more resilience built in just by the very nature of the design?”

That may depend on the domain you are working in. “Resilience might work for mobile phones, for non-critical chip infrastructure, but if you are supplying chips to safety-critical domains, we can’t tolerate the failure,” says Darbari. “We could do more software updates, we could activate chicken bits, but that can’t be the starting premise of verification. When we verify this next-generation chip, do we start by saying most of it was working because it was deployed in the field for 20 years so it’s silicon-proven? It might have been silicon proven because it was deployed in another domain, but does it meet the requirements of this domain?”

These days, when system-level optimization is becoming more important, companies want software to become an integral part of the verification methodology. “Many companies have been using Agile for their software development,” says Lapides. “Some of those are starting to use Agile for their hardware development. What they were doing is bringing along the RTL and a virtual prototype, so that they could bring along their software at the same time. Now, when they run into a bug or an ambiguity in the specification, everybody is on the same page at the same time. You don’t have the hardware guy context switch and go back to something they worked on three months ago, because now the implementation, and the driver, and the virtual prototype being used for verification are all in sync. That makes the process a lot more efficient.”

Not everyone is convinced about that. “That sounds like hell to me,” says Garibay. “Trying to negotiate with the software team in real-time during design sounds daunting, but I can see what you’re saying. I might have to rethink that one.”

Building a design that is bug-free is probably impossible, and certainly impractical given time and resource constraints that every company faces. The companies that do the best job have well-defined processes and practices in place, attuned to ensure their designs are less likely to have debilitating issues while operating within their intended domains. Anything more than that is considered wasteful.
