Communication Is Key To Finding And Fixing Bugs In ICs

AI will be increasingly needed to make sense of huge amounts of data, allowing teams to work more effectively.


Experts at the Table: Finding and eliminating bugs at the source can be painstaking work, but it also can prevent even greater problems from developing later on. To examine the best ways to tackle this problem, Semiconductor Engineering sat down with Ashish Darbari, CEO at Axiomise; Ziyad Hanna, corporate vice president R&D at Cadence; Jim Henson, ASIC verification software product manager at Siemens EDA; Dirk Seynhaeve, vice president of business development at Sigasi; Simon Davidmann, formerly CEO of Imperas Software (since acquired by Synopsys); and Manish Pandey, fellow and vice president R&D at Synopsys. What follows are excerpts of that discussion. To view part one, click here.

L-R: Axiomise’s Darbari, Cadence’s Hanna, Siemens’ Henson, Sigasi’s Seynhaeve, Imperas’ Davidmann, Synopsys’ Pandey.

SE: When it comes to sharing data, what needs to be changed in the structure of design teams?

Darbari: Drafting information (specifications) and sharing that information are two key elements of any verification task that relies on requirements. In formal verification, the gap between specification and implementation is thrown wide open very early in the verification phase, because formal is brittle. With an almost infinite stream of stimulus coming into the DUT, much of which can be illegal stimulus, well-crafted specifications become all the more important. The problem is that, for a lot of functional verification of micro-architecture, we rely on micro-architecture specification documents, and it is not easy to describe the side-band information. Consider, for example, verifying a command processor in a GPU or an AI chip end-to-end with formal. We expect to know the behavior of every top-level pin of the processor, and there may be a few hundred of them. Typically, these are driven via the GPU sub-system, firmware/drivers, or software, and the processor is verified in-situ in emulation. How on earth can a designer know the end-to-end behavior of every pin of the command processor we are verifying with formal? We work with design teams to help them develop this information, then use it in formal and validate it in simulation and emulation. In general, however, this is not a solved problem, and it can only be made better if verification teams have an audience of architects, designers, firmware, and software teams, where everyone collectively agrees on the interface contracts between software and hardware, and within hardware. One thing we often see in formal verification-based work is the treatment of exceptions and error handling, which is frequently not well specified or understood, and where there is a wide gap between software and hardware teams. We tend to hit these very early in the project, and upon debug the schisms are well exposed. Then we try to bring the teams together, get clarity, refine the specification via assertions and covers, and then prove that it matches the implementation. In addition, another feature of this information structure is important to highlight: configuration-dependent design behavior. There are too many configurations to be tested, and clearly scoping out which ones are important and urgent is key to identifying bugs and addressing time-to-market.
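To make the "refine the specification via assertions and covers, then prove it" step concrete, here is a minimal sketch in Python using the z3 SMT solver. The req/ack handshake, the contract clause, and the unrolling depth are all hypothetical, invented for illustration; in a real formal flow the transition relation is extracted from the RTL rather than written by hand.

```python
from z3 import Bools, Solver, Implies, And, Or, Not, unsat

DEPTH = 8  # bounded unrolling depth

# One Boolean variable per cycle for each pin of the toy interface.
req = Bools(" ".join(f"req_{t}" for t in range(DEPTH)))
ack = Bools(" ".join(f"ack_{t}" for t in range(DEPTH)))

s = Solver()

# One clause of the agreed contract: a request, once raised, holds
# until it is acknowledged. (In a real flow this transition relation
# would come from the RTL, not be written by hand.)
for t in range(DEPTH - 1):
    s.add(Implies(And(req[t], Not(ack[t])), req[t + 1]))

# Derived property: with no ack for two consecutive cycles, the request
# is still pending two cycles later. Assert its negation and search for
# a counterexample within the bound.
violations = [
    And(req[t], Not(ack[t]), Not(ack[t + 1]), Not(req[t + 2]))
    for t in range(DEPTH - 2)
]
s.add(Or(*violations))

# unsat means no counterexample exists within the bound: the property holds.
print("holds up to bound" if s.check() == unsat else "counterexample found")
```

The same loop, run with designers and software teams in the room, is what turns a disputed corner of the spec into an agreed, machine-checked contract.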

Davidmann: There are a lot of bugs between the software and the hardware teams. Almost every system we deal with, because we're dealing with processors now, is all about software sitting on top of all these processors. A fabric of many processors can be 100 cores, 500 cores, and there are all of these software issues. The bugs are not in the software, and they are really not in the hardware. They are in the way these two are specified to work together, or in the implementation of that interaction, and there are lots of issues related to that in modern systems. Yes, there's all of the unbelievable complexity in hardware implementation and verification, but silicon doesn't work without software. So you've got to take into account that the system has to function, which means you have to get the bugs out of the software, the bugs out of the hardware, and the bugs out of the interaction of these two things. That is one of the biggest challenges going forward, because in hardware we've evolved a lot in how much we can do 'correct by construction,' whether you go down the high-level synthesis route to try to make it easier, or you add more formal to the RTL. But these combined hardware/software systems are the challenge we face going forward. So it's not all about the HDL and the silicon and that sort of stuff.

Seynhaeve: The first thing you'll notice with the teams is that they're getting bigger. That is because the manufacturing side introduces an enormous amount of new complexity at the back end. We're going down in feature sizes, so all kinds of new problems pop up, such as glitch power. At the front end, the complexity of the algorithms is increasing to the level where they no longer can be comprehended by one person. So the design teams are growing to deal with the complexity at the front end, and the verification teams are growing to deal with all kinds of weird things at the back end. Apple was talking about their power modes, and how in certain power modes their registers become transparent. How the heck do you deal with that in your existing timing engines? How do you predict timing? How do you handle the power management if they stop the thing from toggling, or if they stop the register from being a register? So as the teams get bigger, are they going to blur? I don't think so, because the skills you need at every level of verification are so unique and so different. In the front end, you need domain knowledge. Even in the domains of artificial intelligence and communication, the algorithmic knowledge you need is so complex and detailed that you can't also take on verification issues. And the verification issues have their own ways of being solved, down to the mundane skills. A functional person is a person who knows how to write makefiles, and they will build their design, keep building it, keep linting, keep making sure that functionally it is doing the right thing. A verification engineer is going to be a scripting kind of guy. You're going to say that's the same thing? Well, no, it's a slightly different skill. But are the teams going to see different types of verification and design handled by the same person? I don't think so. We will have sharper and sharper lines between design engineers and verification engineers, based purely on skill set, and there will be more of them.

Hanna: The complexity is going to grow in terms of the implementation, and in the size and complexity of the design itself. The key thing here is the communication of knowledge: between the software stack people and the designers, and inside the design team between the validation engineers, the architects, the circuit designers, and all the functions. They all need to be in harmony. It's hard to do that, and the breakthrough that could happen will be a result of improved communication. Observe how fast startups with 5, 10, 15 people are able to get an SoC out to compete with big companies. It's because they sit together, communicate, and fix it together. In a big company, there are silos between different organizations. When I was at Intel, we came up with the concept of a dungeon. Basically, you put 15 people in a conference room to work together for a week and try to nail down the complex things: prove the protocol, do better synthesis, improve the methodology. You cannot imagine how much productivity such a team can deliver. The other challenge is the amount of data being produced by the tools, by the collateral information, by the design itself, by many, many sources. It's huge. AI thrives here because it can aggregate these things in one place and communicate them. There is a lot of data in design automation and in the whole electronics domain, and we can leverage that. This will improve communication, along with whatever analysis we need to do on top of it in terms of planning, execution, timing information, functional information, protocols, and software interrupts. All the data exists, but today it is scattered, and some of it is repetitive, generated every night with nobody using it. Distilling wisdom from that aggregated data can dramatically improve the way we address more complex designs.

SE: Isn’t that dungeon concept still being used in the hyperscaler community for chiplets and 3D-ICs?

Darbari: The ability to co-locate and work face-to-face has incredible benefits for closing out gaps quickly and mitigating unnecessary bureaucracy. Having said that, if you bring designers and verification people into the same room, chances are the bugs will be fixed before they are ever filed in the ticket system, which may not be great news for verification folks.

Hanna: I cannot believe how much productivity those guys can deliver when sitting together — different architects, the designer, the synthesis person. This is huge.

Davidmann: AI is going to have a dramatic impact on the way we do design and verification. Every now and again my team tries out some of this AI stuff to see if it can improve something or analyze a bit of code we've written, and we are astonished at how well it can assimilate things for us, even from the tiny little bit we give it. My belief is that we need to raise abstraction for design and verification. I don't know how to do it exactly. But I see that AI can become my assistant, telling me how to do things better. I don't know quite how to do that yet. I'm still learning about it. But I'm absolutely convinced it's going to change the way we do design and verification. For my specific challenges around processor modeling and verification, I believe in a few years it's going to be very different and much more efficient, because we can get these huge engines to help us in a constrained way.

SE: Doesn't that open a can of worms, in that we're going to need more verification until we know what we are drawing upon to train models that can be trusted?

Davidmann: Yes, they hallucinate all the time. You have to get used to where they go wrong to understand how much you can trust them.

Darbari: It depends on what verification targets we are talking about. To use AI/ML, we need large sets of data, and these may not be available, at least for formal verification, though there is certainly a sweet spot for legacy simulation testbenches. Having said that, when running formal regressions on the same design over a period of a few months, formal tools do have a way of harnessing enough intelligence from the data to learn what can be done better and faster on the same property set, i.e., if the design and the test environment are not changing significantly. For things like linting and autochecks there is a potentially huge opportunity to leverage AI/ML, but I'm not sure this is the biggest challenge for verification.
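As a hedged illustration of that regression point (not any vendor's actual implementation), a scheduler can use data from previous runs on a stable property set to order the next run. The property names, statistics, and ranking rule below are invented for the sketch.

```python
from dataclasses import dataclass

@dataclass
class PropertyStats:
    name: str
    last_failed: bool        # did the property fail in the previous run?
    avg_solve_seconds: float # historical average prove/disprove time

# Hypothetical history harvested from earlier nightly regressions.
history = [
    PropertyStats("fifo_no_overflow", last_failed=False, avg_solve_seconds=420.0),
    PropertyStats("req_held_until_ack", last_failed=True, avg_solve_seconds=35.0),
    PropertyStats("arb_no_starvation", last_failed=False, avg_solve_seconds=12.5),
]

# Recent failers first (they give feedback soonest), then cheapest proofs,
# so the regression reports useful partial results as early as possible.
schedule = sorted(history, key=lambda p: (not p.last_failed, p.avg_solve_seconds))

for p in schedule:
    print(f"run {p.name} (expected ~{p.avg_solve_seconds:.0f}s)")
```

The learning a formal tool does internally (solver orchestration, engine selection) is far richer than this, but the principle is the same: past runs on an unchanged design are data worth mining.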

Pandey: When you build these tools, at least with the current generation of LLMs and generative AI models, there has to be a human in the loop. EDA tools and verification tools have a big role to play in automatically checking what's being generated. This definitely is going to increase our productivity, and change how designs are altered and perhaps combined. But the bigger problem is that once you alter the RTL, the IPs and AIPs, and other things, the result still goes down the chain of implementation, synthesis, physical design, and mask creation. There are localized algorithms that can help there, but how to solve those problems with the current technologies we have is still a challenge. When we talk about bugs, we can have 'perfectly' working hardware, but the hardware is designed to run an application. You have a software application that runs out there, and then we have to think about the whole stack. If we really want to shift bugs left, we have to think in terms of whether there are errors in the firmware, in the application software, or in the OS that runs on it. For the majority of bugs, today's design teams cannot think in terms of the entire hardware/software stack, and that is where the next advance has to come so we can build systems faster and with fewer bugs.

Davidmann: Just to add to that, when we started our company some 15 years ago, it was because we kept seeing people put small processing elements in the data path instead of RTL. When we asked, 'Why are you doing that rather than using a state machine, since you've got a little processor in there?' they said it was because then they could put the state machines in software. So if they have bugs or changes, they can just change some firmware. More and more of the chips became small processors dedicated to the data paths, and that's where a lot of the RISC-V activity has come from. Instead of building custom processors, people are now trying to use a more standard approach of extendable custom processors. I absolutely believe a way of having fewer bugs in the RTL is to have more of it in firmware and software, where the various state machines become dedicated processors, effectively little hardware accelerators. That methodology is one we're seeing a lot more of now, where there are a lot of processors and a big application, but also all these other things around it. So you're essentially pre-verifying that block, because it's a pre-verified core you're putting in, and then you can change the firmware and the software later to change the functionality or to fix bugs. It's easier to fix a bug in software than it is after tape-out. That is a change we've been seeing over the last 10 to 15 years. We've moved to putting more into firmware and software rather than RTL.
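A minimal sketch of the pattern Davidmann describes: the data-path control logic lives in firmware as a patchable state-transition table rather than as fixed RTL. The states, events, and post-silicon patch below are hypothetical.

```python
# (state, event) -> next state; this table ships as firmware, not gates.
transitions = {
    ("IDLE", "start"): "LOAD",
    ("LOAD", "done"): "PROCESS",
    ("PROCESS", "done"): "IDLE",
}

def step(state: str, event: str) -> str:
    """Advance the control FSM; unknown events leave the state unchanged."""
    return transitions.get((state, event), state)

# A post-silicon "firmware patch": insert a drain state found to be
# necessary after tape-out, without touching the hardware.
transitions[("PROCESS", "done")] = "DRAIN"
transitions[("DRAIN", "empty")] = "IDLE"

state = "IDLE"
for ev in ["start", "done", "done", "empty"]:
    state = step(state, ev)
    print(ev, "->", state)
```

The trade-off is the usual one: a table interpreted by a small pre-verified core is slower than dedicated logic, but its bugs can be fixed with a firmware update instead of a respin.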

Hanna: I believe there will be a big change in CAD tools going forward in the presence of AI. Today the majority of CAD tools are gigantic tools that run in batch, spend hours spinning, and produce huge log files. Having the tools and the AI in the loop means the expectations of the tools will change: they will have to be much leaner, and give direct feedback to the AI so it can make the right decisions. I believe we will see a wave of pocket-like tools, such as small synthesis, small formal verification, and small simulation, tiny enough to help the AI and to work in an interleaved way to steer the decisions of the design going forward. So there is an evolution in the way CAD tools are hooked together, and in the level of openness they deliver in terms of APIs, interactivity, and feedback. This is going to be a challenge, especially for old and gigantic tools, as they are leveraged in the AI domain, but I believe most of us are adapting to the new reality. AI will be a significant and very important tool, but it is not going to replace the engineers and the hardcore algorithms for synthesis, timing, formal, and so on. Rather, this knowledge will be integrated together, with AI as the orchestrator of the paths, the workflows, and the oracles. The wisdom of the engineers and the tools will fit in between, to help deliver much more productive workflows for verification, simulation, formal, or whatever design flow you would like to talk about.

Henson: EDA tools have exceeded the capacity of the engineer to digest what we're telling them. We've got all these log files and all these details popping out, such that, 'Early in the process, I can ignore this, but at the end of the process, I can't.' Things can get missed all the way down the chain. Given this, what we're going to see is AI wrappers on these tools, or even on two tools at once, that will coach the engineer and say, 'No, you can't move forward until you actually resolve this problem and clean it up.'
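As a rough sketch of such a wrapper (the message tags, the signoff rule, and the log format are all invented for illustration, not taken from any real tool), a script can scan a tool's log and refuse to advance the flow until the must-fix messages are resolved:

```python
import re
import sys

# Hypothetical tags that may be ignorable early but block at signoff.
MUST_FIX_AT_SIGNOFF = {"ERROR", "LATCH-INFERRED", "MULTI-DRIVEN"}

def blocking_messages(log_text: str, signoff: bool) -> list[str]:
    """Return the log messages that block the next step at this flow stage."""
    hits = []
    for line in log_text.splitlines():
        m = re.match(r"^\[(?P<tag>[A-Z-]+)\]\s*(?P<msg>.*)$", line)
        if not m:
            continue
        tag = m.group("tag")
        if tag == "ERROR" or (signoff and tag in MUST_FIX_AT_SIGNOFF):
            hits.append(f"{tag}: {m.group('msg')}")
    return hits

# A made-up log fragment in the invented format above.
log = """[INFO] elaboration complete
[LATCH-INFERRED] ctrl_reg in decode.v:112
[WARNING] unused signal spare_0
"""

blockers = blocking_messages(log, signoff=True)
if blockers:
    print("cannot proceed until resolved:")
    for b in blockers:
        print("  ", b)
    sys.exit(1)
```

An AI layer would go further, explaining each blocker and suggesting fixes, but the gating logic is the part that keeps things from being missed down the chain.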

Read part one of the discussion:
Engineers Or Their Tools: Which Is Responsible For Finding Bugs?
As chips become more complex, the tools used to test them need to get smarter.


