Engineers Or Their Tools: Which Is Responsible For Finding Bugs?

As chips become more complex, the tools used to test them need to get smarter.


Experts at the table: Finding and eliminating bugs at the source can be painstaking work, but it can prevent bigger problems later in the design flow, when they are more difficult and expensive to fix. Semiconductor Engineering sat down to discuss these issues with Ashish Darbari, CEO at Axiomise; Ziyad Hanna, corporate vice president R&D at Cadence; Jim Henson, ASIC verification software product manager at Siemens EDA; Dirk Seynhaeve, vice president of business development at Sigasi; Simon Davidmann, formerly CEO of Imperas Software (since acquired by Synopsys); and Manish Pandey, fellow and vice president R&D at Synopsys. What follows are excerpts of that discussion.

L-R: Axiomise’s Darbari, Cadence’s Hanna, Siemens’ Henson, Sigasi’s Seynhaeve, Imperas’ Davidmann, Synopsys’ Pandey.

SE: Stopping bugs at the source is always a burning question. What’s your 50,000-foot view on this?

Seynhaeve: The best way to avoid bugs is to try to avoid them while you’re making the code, while you’re doing the specification and architecture. This is essentially the introduction of the Shift Left methodology. We go extreme left. While you’re still creating and writing your RTL is when we are going to assert that you are doing a whole bunch of things correctly.

Henson: The answer to eliminating bugs is exhaustive verification at the C-source level, so your designers will create test benches and scenarios, assemble whole systems with large data flows and large data sets, and exhaustively test that and record the coverage. We have tools for coverage metrics, and we think that recording coverage metrics and meeting your coverage goals really are essential to eliminating bugs upfront when you shift left.
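
To make the coverage recording Henson describes concrete, here is a minimal, hypothetical SystemVerilog sketch of a functional covergroup. The interface and bin names (pkt_valid, pkt_len, pkt_kind) are illustrative assumptions, not taken from any tool or design in this discussion.

```systemverilog
// Hypothetical example: functional coverage on a small packet interface.
// Names (pkt_valid, pkt_len, pkt_kind) are illustrative only.
module pkt_coverage (
  input logic       clk,
  input logic       pkt_valid,
  input logic [7:0] pkt_len,
  input logic [1:0] pkt_kind
);
  covergroup pkt_cg @(posedge clk iff pkt_valid);
    cp_len : coverpoint pkt_len {
      bins small  = {[0:15]};
      bins medium = {[16:127]};
      bins large  = {[128:255]};
    }
    cp_kind    : coverpoint pkt_kind;       // one bin per packet kind
    len_x_kind : cross cp_len, cp_kind;     // exercise every size/kind combination
  endgroup

  pkt_cg cg = new();  // sampled automatically on each valid packet
endmodule
```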

Davidmann: Moving verification into the design is absolutely spot on. One of the things I’ve always seen as important is trying to get people to put better-quality code in at the beginning. Then you’ll get fewer bugs. That to me is all about having better solutions at higher levels of abstraction, and better verification technologies built into the input, so that you do it almost correct by construction — which is not an easy thing to do in hardware design. You can use more abstraction, but you still need a lot of control. I’m not a great fan of high-level synthesis and C and C++ because they don’t give hardware designers the control they need. For some algorithms, sure, abstraction works and you can synthesize things that function, but a lot of hardware design is very precise and very detailed. However much we as an industry, and the whole world, want to move to more abstract hardware design, it still comes down to a lot of very detailed work. You need better abstractions in the hardware languages themselves.

Hanna: There are many categories of bugs, and we cannot treat all of them the same. There are functional bugs, implementation bugs, timing bugs, layout bugs, and DRC bugs. The common way to avoid such bugs in every iteration is for designers or architects to be careful with the specification. Spec-based design is the way to capture the intent as early as possible, review it in a methodical manner, and make sure that what needs to be implemented is being understood and reviewed. It can be done with high-level models. People talk about abstraction. It’s just a way to minimize details of that step. And high-level models, reference models, and virtualizations depend on where they are used in your design flow. Basically, you need to capture the spec, review it, be able to implement it, and to measure the quality in terms of coverage and other metrics of PPA (performance, power, area). There is one method to avoid bugs — think about the process, capture what you need to implement, and make sure that the implementation is measurable and can avoid such consequences. The cost of bugs can be dramatic if you find them later in the design flow. It’s better to find them as early as possible. It’s the prevention methodology through spec-based design capabilities.

Pandey: This is a timely question, given the complexity of designs today. There are two parts to the answer. One is that you want to raise the level of abstraction. We talk about bugs in RTL, but RTL is too late. In fact, some of the hardest bugs you’ll see in designs today are architecture-level issues — algorithmic issues. For example, if you look at the way chips are being done today, these are heterogeneous systems. It’s not just uniform cores. You can have processor cores and different accelerators, and everything is stuck together in one place. You have to worry about cache coherence and the memory consistency model. These are amazingly hard to reason about, and you can throw any number of emulation or simulation cycles at them, but many times these algorithmic issues are discovered too late. You’re trying to close the RTL, and, ‘Oh no, I made this error, and this is something that I should have known about three years back.’ That’s one thing. The second part is that there are many basic algorithms. If you look at systems today, there are a lot of different numerical computing algorithms. You’re worrying about different integer, floating-point, and other numerical representations, about where the error is and how you manage it. You have to think about the C++ representation and how we verify some of these numerical algorithms. But things do have to be realized at the implementation level. What do you do out there? Even before you run a single cycle of simulation, look at the many different structural issues. In a modern system there are hundreds of different clock domains all over the place, and data being transferred from one domain to the other. Power is an important part. How do you make sure that you’re functionally correct in the presence of clock domains and in the presence of power? There are actually a lot of amazingly simple static checks that help there. And then, as you create these designs, there are on-the-fly checks. So before you run simulation, you do formal verification at the RTL level as you’re authoring it and catch as many bugs as you can. And the last part, which one of my co-panelists mentioned, is that you need to be able to measure all of this. Continuous measurement is very important in knowing where you are going.
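
As a small illustration of the clock-domain issues and simple structural checks Pandey mentions, below is a hypothetical sketch of a single-bit, two-flop synchronizer. The module and signal names are assumptions; multi-bit crossings would need a different scheme (handshake, gray-coded pointers, or an asynchronous FIFO).

```systemverilog
// Hypothetical sketch: single-bit, two-flop synchronizer for a signal
// crossing from the clk_a domain into the clk_b domain. This only
// illustrates the idea; buses require a different CDC structure.
module bit_sync (
  input  logic clk_b,     // destination clock domain
  input  logic rst_b_n,   // destination-domain reset, active low
  input  logic d_a,       // signal launched in the clk_a domain
  output logic q_b        // synchronized version in the clk_b domain
);
  logic meta;  // first stage may go metastable; second stage filters it

  always_ff @(posedge clk_b or negedge rst_b_n) begin
    if (!rst_b_n) begin
      meta <= 1'b0;
      q_b  <= 1'b0;
    end else begin
      meta <= d_a;
      q_b  <= meta;
    end
  end
endmodule
```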

Darbari: Bugs are a natural consequence of code development. The challenge is to catch them closer to design bring-up so we can fix the issue in a timely manner, and this is where formal verification shines. From automated linting catching floating wires and uninitialized registers, to catching dead and unreachable code, saving precious time when closing code coverage many months later, to finding sophisticated bugs such as deadlock, livelock, and X-PROP issues in the presence of user constraints, formal technology encourages designers to think in a more structured manner. It provides guided feedback, either from automated formal technology or from user-written covers and asserts. With safety, security, and PPA being as important as functional verification, the beauty of formal is that you think about what your design is meant to do without thinking about stimulus generation, and you get exhaustive coverage. With formal you potentially get almost infinite stimulus, which allows formal tools to show you rare-case bugs within the first hour of design bring-up. Those are bugs that could easily leak into silicon with UVM-based simulation, where hand-coded stimulus is the weakest aspect.
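
As an example of the user-written covers and asserts Darbari refers to, here is a hypothetical SVA handshake checker. The request/grant signal names and the 16-cycle bound are illustrative assumptions, a bounded stand-in for the kind of deadlock concern a formal tool can explore exhaustively.

```systemverilog
// Hypothetical handshake checker: every request must be granted within
// 16 cycles, grants never occur without a request, and a cover property
// confirms an interesting scenario is actually reachable (not vacuous).
module req_gnt_checker (input logic clk, rst_n, req, gnt);

  // Request is eventually granted (bounded to keep formal runs tractable).
  property p_req_granted;
    @(posedge clk) disable iff (!rst_n)
      req |-> ##[1:16] gnt;
  endproperty
  a_req_granted : assert property (p_req_granted);

  // Grant never fires without an outstanding request.
  a_no_spurious_gnt : assert property (
    @(posedge clk) disable iff (!rst_n) gnt |-> req);

  // Back-to-back requests followed by a grant must be reachable.
  c_back_to_back : cover property (
    @(posedge clk) disable iff (!rst_n) req ##1 req ##[1:16] gnt);
endmodule
```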

SE: When it comes to entering the design correctly, is it a combination of the engineer putting the code in properly and the tool itself? How much of this is the tool versus the engineer’s input? With something like glitch power, there is a suspicion that some of it may be due to how the tool is implementing the design.

Darbari: Power is becoming a very important aspect now, especially in the context of energy-hungry AI chips. While most of the early work on power estimation was on static power, dynamic power measurement is now the most important task. But like everything else that gets simulated, the underlying model of the DUT becomes significant. Are we measuring the power at the software level, the system level, the RTL level, or even lower at the transistor level? That determines the scope of what can be measured. I’m not surprised to hear the tool can have bugs that allow faulty measurements. Glitches appear due to incessant switching activity. The question to ask in the context of glitch power is whether we are measuring it at a system level, such as in an AI chip operating at petaflops, or at the micro-architecture level by monitoring the switching activity of gates and transistors, where electrical and IR-related issues come in. Each abstraction level will offer different insights and serve different purposes, but will also vary in accuracy. So how much tolerance in measurement can we build into the tools to get a better idea of the skew? If your point is more generic, about whether tools can allow bugs to be implemented, the answer is yes. Even with formal tools it is possible. The bugs can be in parsers, clock and reset analysis blocks, as well as in algorithms performing optimization during state-space search. I mentioned certified SAT solver work I did. The reason was to catch bugs in the SAT solvers that simulators, emulators, and formal tools rely on.

Pandey: Glitch power is something you see more in these modern technologies, as you’re going toward these angstrom-level designs. If a single tool by itself is introducing glitch power, that’s straightforward to deal with. It’s really a tool builder’s issue, an algorithmic issue. The trouble happens when you have designs built with multiple tools, each building one layer of abstraction over the other. These issues happen when things fall at the boundaries.

SE: For bugs in general, is it possible that a tool would allow a bug to be implemented?

Seynhaeve: We need to take a step back here. We’re talking about verification, and what the example of glitch power illustrates is that the terminology of verification has changed. If we go back to the days when we had to handcraft our layout on a table, very well aware of manufacturing issues, the way things got verified was that you created a prototype, and then you had the old-fashioned computers — because real computers didn’t exist yet — emulate it, verify that the functionality was as desired and that the specification matched what came out of it. Then we started going into real computers with real software, where the first thing we did was to raise the abstraction level. We started introducing hierarchy in order to be able to keep up with manufacturing, because manufacturing allowed us to do 10, then 1,000 transistors. Now we’re at 150 billion transistors on a device. So the tools need to keep up, and that means the verification aspect of those tools is going to change. What we considered to be verification a long time ago is still verification, but it’s deeply hidden underneath a mountain of other issues that come up. When we talk about power verification, it used to be something that was dynamic and static, and that was it. Then, as the nodes changed, that became a factor that influenced the overall evolution of the tools. All of a sudden, we had leakage power popping up. The tools that used to work just fine didn’t do the job anymore, so we had to create new tools where we started measuring the leakage power. Now, with finFET and gate-all-around, the leakage power is under control and we’re struggling with glitch power. Again, we need to change our tools. And as Manish mentioned, we do have the measurement under control, but now we’re trying to do a Shift Left on that, as we do with every generation of verification tools. The hard part here is that you actually do need a little bit of knowledge of the synthesis results, the gate infrastructure, in order to detect those glitches. But what we really want is to make it correct by construction even earlier, while we do RTL. Many people are going to say you will always need synthesis technology built into the tool you use to create RTL in order to be able to locate those glitches. That’s a topic of discussion, because I believe, maybe wrongfully so, that AI can notice patterns in the RTL description that help us make a glitch-free description at the RTL level. Maybe it’s a dream, but it does go to your question of whether our tools are keeping up. And the answer is yes, they absolutely have been keeping up. Otherwise, we wouldn’t have this beautiful chain of abstractions, where every time you introduce an abstraction, you introduce a transformation that needs to be verified. And then we spend the rest of our time asking, ‘Can we pull this verification more to the left so it becomes correct by construction?’ You will always have the specification, but you have to prove the specification is correct with verification, and then you have to maintain that verification. The tool chain is going to evolve and keep evolving.

Davidmann: When it comes to HDLs and design entry, there is this notion that we’ve got to try to raise the abstraction levels. It’s one of the things we did with SystemVerilog, with a lot of input from people like Intel, adding really simple abstractions. That meant you wrote less code, with more things being correct. And then we added things like assertions, putting into the design this notion, ‘This is what I’m expecting to see here. And if that’s not the case, tell me about it as soon as possible.’ As designers are entering the code early on, you can get some quality level to see if it’s what you expect. If you use a formal tool to find it out, you get assertions, you get warnings, you get reporting. In the technology we built for our processor verification, we’ve tried to raise the abstraction of that, too. We have a reference model that can sit behind the RTL right from the start, and you can turn off most of the reference model so it’s just watching the PC, or a few of the registers, or instructions. So as you’re evolving your design from the early days, you’re getting full verification all the way along. It’s about bringing the verification as close as you can to design entry. That’s how I’m looking at it from an HDL point of view around processors, and systems around processors. At the design entry for the HDLs, and for the modern AI and LLM types of hardware that are coming out, it’s all processor-centric. It’s about how well you can verify the processor. We’re focusing on how to model it abstractly and how to get the verification right at the beginning of the design process.
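
A minimal sketch of the reference-model-behind-the-RTL idea Davidmann describes might look like the following, where only the retired PC is compared each cycle. The ref_model_step() DPI import and the port names are hypothetical, not a specific product API.

```systemverilog
// Hypothetical lockstep comparison: whenever the DUT retires an instruction,
// step a behavioral reference model and compare the retired PC. In a real
// flow the reference model is typically an external ISS reached through DPI.
module pc_lockstep_checker (
  input logic        clk,
  input logic        rst_n,
  input logic        retire_valid,   // DUT retired one instruction this cycle
  input logic [63:0] retire_pc       // PC of the retired instruction
);
  // Assumed DPI hook into the reference model; returns the next expected PC.
  import "DPI-C" function longint unsigned ref_model_step();

  always_ff @(posedge clk) begin
    if (rst_n && retire_valid) begin
      automatic longint unsigned expected_pc = ref_model_step();
      if (retire_pc !== expected_pc)
        $error("PC mismatch: DUT=0x%0h ref=0x%0h", retire_pc, expected_pc);
    end
  end
endmodule
```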

Hanna: You need an integrative toolset. When the designer or formal expert runs formal verification, they think about all the aspects of the design, not only the assertions from the SVA perspective. The tools have to scale from the system level, SystemC and RTL, down to gates and layout. They have to work on multiple dimensions, and the expectations from these tools have grown dramatically because designs are doubling and becoming more complex. They will be 10 times more complex in the next five or 10 years. We cannot continue handling issue after issue in a local scope. You have to think about the big picture. The tools have got to be smart. Somebody mentioned AI. AI is great for interleaving the human and the machine in order to guide the designer across the multiple aspects the AI can find by crossing this data. To cope with this complexity and produce better designs, it is not just about the level of abstraction. The tool has to be multi-dimensional, focusing on timing, functionality, security, safety, and all the other aspects. This is the challenge, and we see it’s happening. I’m responsible for multiple tools in my organization, and the guidance to everyone is to think beyond the immediate scope of the product. Try to bring things closer together, such as making the tool think about low power while doing formal verification. Think about timing. What’s the timing path? The critical bugs come from crossing multiple aspects: security bugs in functional logic, timing issues in multi-cycle paths, or a protocol that may be functionally complete but implemented in a way that will cause lots of deadlocks. The RTL may have more resources but deadlock, while the protocol at a higher level of abstraction is deadlock-free. Integration of these tools, driven by data across multiple dimensions, is key for avoiding bugs and making things cleaner going forward.

Seynhaeve: You’re absolutely correct. We are paying attention to that even at the RTL capture level or the RTL integration level, but we need to realize that PPA needs to be elevated to that same abstraction level. Even with Shift Left, there are things you can do by verifying that the synthesis design constraints are correct, and that you’re creating reasonable buckets that go over all of your hierarchy. At the same time, we need to realize that the synthesis constraints are not suitable for working with power domains and clock domains. We need to use the UPF constraints to make sure we don’t ask for unreasonable power implementations from what we write in RTL. Then, with clock domains, there’s a new standard coming up, defined by Accellera with multiple members, to annotate in the RTL, in a Shift Left way, so that we don’t introduce errors where we have clock domains talking to each other without isolation, or whatever else can be discovered early. I also like to think about the parallel with the exercise in which you have a bunch of wooden boxes with white marbles in each to represent the software. Then you put some red marbles in each box to represent the bugs. Each of 10 teams gets a box and they try to remove the red marbles. That is the equivalent of verification. The people who removed the most red marbles are rewarded, and the people who removed the fewest red marbles are punished. Then the teams discuss the verification techniques they used to get the red marbles out. But they all missed the point. The point is not how many marbles you can remove effectively. It’s how many marbles are remaining, and they all had marbles remaining. The chip went into production with bugs in it. That’s unacceptable. The problem is you need to figure out a way to design by construction so that only white marbles go into the box. That is exactly what we’re trying to do with Shift Left, but we can’t forget about the constraints. It needs to work in real life. Power, performance, and area must be taken into consideration. We’re trying, even if we’re not totally there. But it’s a mindset we need to establish. It is possible, and we can do as much as possible even for your vulnerability issues and your safety issues. For safety, we are doing a reasonable job. We have ISO 26262, we have ISO 21434, and that can be reflected in coding style.

Henson: What I’m hearing is there are two sources of problems. One is design issues. Have I really purged the design of all functional bugs? Then there are implementation issues. As you go down the chain, the various departments do not talk to each other. There’s a wall between design and, perhaps, RTL synthesis. Then there’s another wall between RTL synthesis and place-and-route. What you have to do in your corporation is break down the walls so that the people at the front of the chain know what’s coming out of the back end. They know what the technology is, they know what the challenges are. They know you can’t work with huge memories, that you have to have a bunch of small memories, or whatever the constraints are for the footprint of the final device. Those are all implementation issues. It’s actually easier to shift left and get your specification correct. It’s conceptually easier to purge your system of functional bugs right at the top of the system, but then you need to trickle through and avoid the implementation bugs by having a thorough knowledge of how you get from the beginning to the end.

Pandey: There are two classes of bugs, as Jim mentioned. You have the functional issues; safety, security, and other functional non-compliance issues fit into the first category. The second part is really the physical bugs, like timing, electrical, and other things. Can we do synthesis? Many of the tools I’ve worked on say that if you want to do formal verification, you really have to have a synthesized model of the design. Then, if you’re working with a synthesized model, how accurate is it and how much information can you get into it? With some of these models, it’s surprising that you can detect electrical bugs. You can mark some of the UPF logic to catch power-intent bugs, and if you add some additional information, you can potentially even get some physical bugs out of there. That’s absolutely needed, and we are already moving in that direction in many tools, with some of our efforts in getting implementation information early on, in the planning stage, and catching bugs right at the RTL stage.
