Finding the root cause of problems is becoming more difficult as systemic complexity rises; methodology and different approaches play an increasingly important role.
Debugging and testing chips is becoming more time-consuming, more complicated, and significantly more difficult at advanced nodes as well as in advanced packages.
The main problem is that there are so many puzzle pieces, and so many different use cases and demands on those pieces, that it’s difficult to keep track of all the changes and potential interactions. Some blocks are “on” sometimes, some are “on” all the time, and some of them are sharing resources in ways that are difficult to understand. In effect, there is a combinatorial explosion of quasi-independent subsystems, and each of those blocks interacting with the others creates a different set of opportunities for misunderstandings, miscommunication, and a slew of other problems.
“The issue is one of systemic complexity,” said Rupert Baines, CEO of UltraSoC. “In the days when we might have one core running one piece of software, issues were very bounded. But when you’ve got a hundred different IP blocks all interacting, and dozens of processors of different flavors from different vendors—all running software written by different teams, all interacting, all tied together by interconnects and NoCs of staggering complexity—that’s where we now have a whole new class of problems. It’s a world of different issues that never existed in the past, and the verification and validation tools that exist today simply cannot find those things because they are not looking at the system as a whole and are not looking at those interactions.”
Zibi Zalewski, general manager for Aldec’s Hardware Division, agreed. “Advanced nodes and design complexity significantly change the way debug is approached and performed. It is no longer software OR hardware debugging; it is more software AND hardware debugging, usually running in parallel and synchronized. The complexity grows even further since the number of nodes is much bigger, all coming from different sources and vendors. It is not only about software code debugging or hardware module probing, since these complex systems are the integration of different elements and domains using interconnects. As a result, it becomes necessary to use bus monitors and analyzers to record the transactions and detect protocol misbehavior. It is now standard procedure to analyze the AXI bus, for example, to actually track the problem in different sections of the system and narrow down its source. The increased complexity of the debugging process affects the speed of the verification environment, which leads to emulation platforms that offer simulation-like debugging capabilities but at much higher execution speeds, along with various operational modes. This is worth considering especially for the most complex projects — not only in size, but also in architecture.”
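As a rough illustration of what that kind of post-processed bus analysis looks like, the following Python sketch checks a recorded read channel for requests that never complete and for responses that arrive with no matching request. The one-line log format and field names are hypothetical, invented here for illustration; commercial monitors track bursts, ordering rules, and the write channels as well.

```python
# Minimal post-processing sketch of a bus-transaction check, assuming a
# hypothetical log format: one "time channel id" record per line, e.g.
#   1200 AR 3      (read-address request, ID 3)
#   1340 R  3      (read-data response, ID 3)
# Real monitors track far more, but the idea is the same: replay recorded
# transactions and flag protocol misbehavior.
from collections import defaultdict

def check_read_channel(log_lines):
    outstanding = defaultdict(list)   # transaction ID -> request timestamps
    violations = []
    for line in log_lines:
        time_str, channel, txn_id = line.split()
        t = int(time_str)
        if channel == "AR":           # read request issued
            outstanding[txn_id].append(t)
        elif channel == "R":          # read response observed
            if outstanding[txn_id]:
                outstanding[txn_id].pop(0)
            else:
                violations.append(f"t={t}: response for ID {txn_id} with no request")
    for txn_id, times in outstanding.items():
        for t in times:
            violations.append(f"t={t}: request ID {txn_id} never completed")
    return violations

if __name__ == "__main__":
    trace = ["100 AR 1", "160 R 1", "200 AR 2", "450 R 7"]
    for v in check_read_channel(trace):
        print(v)
```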
Debug itself is changing to handle this, starting with the way test is performed and what engines or platforms are used for that testing.
“It’s a set of dominoes that are basically falling,” said Larry Melling, product management director at Cadence. “That first one is how we test. The latest developments with Accellera, putting forward the new Portable Stimulus standard, are really looking at some of these system-level complexities that have to be tested. How do I do testing? You have a bunch of concurrent behaviors and coherency and power and security. There are system-level concerns that you don’t really address at the IP testing level that people are now looking at because they have such impact at the final product level. That naturally falls into the more complex testing and the need for stronger platforms. This impacts the simulation, and brings in the use of emulation and FPGA platforms in order to be able to complete this testing in your lifetime.”
The problem is not the tools. It’s the size and complexity of the designs and the methodologies used to create them.
“The design tools that we use are getting phenomenally good at electrical and circuit design, and the verification in that domain is now outstanding, so it’s very unusual that a bug will be an electrical issue or a circuit issue or a logic issue,” said Baines. “What we put down on a chip is what we intended, and it will work in the way that we intended.”
But even with well-understood tools such as emulation and FPGA platforms, things quickly can go awry because of the complexity and size of the chips being tested.
“Interconnecting all of these pieces has become very challenging,” said Krzysztof Szczur, hardware verification products manager at Aldec. “After you partition, you may not have enough FPGA I/O because you must add chip-to-chip into this. And with general-purpose chips, part of the design is software and hardware. On top of that, if you look at automotive and AI designs, the design size is too big for one FPGA. A neural network from a hardware point of view is different structures.”
And this is where different approaches to these problems really begin to make sense.
“If you look at what’s changing in the way we debug, it’s artifacts of this,” said Melling. “Because it’s fast platforms, because it’s long test, and because you need performance, you’re also looking at the cost of recording information in order to be able to do debug. There are those kinds of issues, and we’re looking at more record versus compute, in terms of getting the behaviors and the data out from the tests in order to be able to debug.”
The debug process also is moving from an interactive approach toward more post-processing because no one wants to tie up these fast platforms. If the data can be taken offline and worked with in order to do debug, that’s a better use of the platform. It’s also less intrusive than an interactive debug approach and prevents the problems from being moved around.
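A minimal sketch of that offline style of debug, assuming a hypothetical per-cycle dump and a simple reference model, might look like the following. The point is that the fast platform only records; the comparison runs later on an ordinary workstation, so the emulator or prototype is free for the next job.

```python
# Offline trace analysis sketch: compare a recorded per-cycle dump against a
# reference model and report the first cycle where they disagree. The dump
# format and the model are stand-ins invented for this example.
def first_divergence(recorded, reference_model):
    """Return (cycle, observed, expected) for the first mismatch, else None."""
    for cycle, observed in recorded:
        expected = reference_model(cycle)
        if observed != expected:
            return cycle, observed, expected
    return None

if __name__ == "__main__":
    # Toy stand-ins: the "design" output should equal cycle * 2.
    dump = [(c, c * 2 if c != 9000 else 0) for c in range(20000)]
    result = first_divergence(dump, lambda c: c * 2)
    if result:
        cycle, got, want = result
        print(f"first mismatch at cycle {cycle}: got {got}, expected {want}")
```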
“You can use this information in a closed loop sort of way,” Baines said. “Just like the shift left in the pre-silicon world, you can use these tools in the integration and bring-up phase as a shift left for getting your chip to the customer faster and getting ramp into volume faster. So there’s a mirror in that sense, the shift-left aspect. But there’s also a closed-loop aspect. Those two domains don’t have to be decoupled, so you can be using a lot of the ideas—things like Portable Stimulus in the pre-silicon design phase, the virtual world—and extend them into the test world when you’ve got your chip back. But you also can take real-world measurements and genuine information and feed them back into your system models and the architecture stage of your chip to improve the modeling and analysis for the next generation.”
Fig. 1: Searching for problems post-production. Source: Intel
What is the best approach?
The rule of thumb is that the larger and more complex a design, the harder it is to debug.
“This is primarily due to sequential depth—the number of cycles between the design’s inputs and outputs,” said Sasa Stamenkovic, senior field application engineer at OneSpin Solutions. “A design may have long pipelines, deep queues/FIFOs, or internal memory where data can sit for many cycles before being used. In such cases, a change to the design’s inputs can take hundreds or thousands of cycles to propagate changes to the outputs. In a traditional simulation testbench, in which only the inputs and outputs of the design are connected, tracing an incorrect output value back to the source design bug can be challenging.”
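The effect of sequential depth can be illustrated with a toy Python model of a deep pipeline. The depth and values below are arbitrary, but they show how far apart in time the cause and the observed symptom can be.

```python
# Toy illustration of sequential depth: a value injected at the inputs is not
# visible at the outputs until it has marched through the whole pipeline, so an
# incorrect output observed at cycle N may trace back to an event hundreds of
# cycles earlier. Numbers here are arbitrary; real designs can be far deeper.
from collections import deque

DEPTH = 500                              # pipeline/FIFO depth in cycles
pipe = deque([0] * DEPTH, maxlen=DEPTH)  # models the registers between input and output

corruption_cycle = None
for cycle in range(2000):
    stimulus = cycle                     # normal input value
    if cycle == 300:                     # a bug corrupts the input once
        stimulus = -1
        corruption_cycle = cycle
    pipe.appendleft(stimulus)
    output = pipe[-1]                    # value emerging after DEPTH cycles
    if output == -1:
        print(f"bad output seen at cycle {cycle}, "
              f"but it entered the design at cycle {corruption_cycle}")
        break
```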
How to best approach this problem isn’t always obvious, and it frequently varies from one design team to the next.
“System-level debug used to be more a validation team’s concern, but the continued push to Shift Left is causing verification teams to look more at the system-level or SoC-level requirements,” Melling said. “And because complexity is saying this stuff’s going to be a lot harder to test, you need the kind of discipline that verification engineers have for doing metric-driven approaches—really having a test plan, knowing what you’re testing, what kind of results you’re getting, and what level of quality that you’re going to be able to produce. A more structured approach to testing at the system level would help tremendously here, which is one of the big drivers behind Portable Stimulus.”
Another aspect is having cleaner tests. Putting an entire OS and application software onto a platform and trying to find a bug in that is much harder than directed coherency testing or power testing while doing many things in parallel. “It’s a directed test environment that you understand,” Melling said. “If you can reproduce the bug in that kind of environment, you have a much better chance of debugging. So that’s one of the other things people are doing to debug these things—getting a view of the characteristics of the bug and then trying to write directed tests to reproduce it so they can get to root cause much faster.”
This gets harder at each new node, though, and as more heterogeneous elements are added into a design.
“As we got to finFET and double patterning, it looked like the sky was falling,” said Joe Davis, product marketing director at Mentor, a Siemens Business. “There are companies already working on designs at 5nm, so the challenges keep going up. There is a major shift from 16 to 7nm, not just from the technology standpoint but what semiconductor companies are trying to do with that area. The amount of area that you have available is fixed, but with each technology node you can put twice as much stuff in there. So just the sheer scale of what they’re putting into these chips today is causing them to make changes in their design flows.”
Design cycles basically remain constant because those are set by what consumers are willing to pay. “Those things are set,” Davis said. “You can’t hire an infinite number of people, so you have to look at your methodology. Where we see changes are a couple of areas, including reliability, ESD, electrical overstress, all aspects of reliability, which is accelerated by the automotive trend with the autonomous vehicle and ISO 26262. This is driving a lot more rigor and redundancy into electronic systems. So at the same time we’re increasing the amount of electronics in the cars, we’ve increased the requirements on those electronics for reliability. People used to do a lot of these ESD checks by hand. The foundries had rules in their manuals that they couldn’t enforce because there were no EDA tools to check them, so they were done manually. Well, there’s only so much you can do when you’re talking about billions of transistors and trillions of polygons. There’s not enough rulers out there.”
That also makes it much harder to find the cause of problems.
“When you’re debugging an issue, you’re trying to get to the root cause of a problem,” said Vaishnav Gorur, product marketing manager for the verification group at Synopsys. “This involves multiple factors, including the design team’s expertise, and the expertise in the software that you’re using to verify your designs. This is complicated by the fact that designs are getting bigger, more things are being put in, there is a lot more functionality, and a lot more diversity in the design—such that it’s not completely built within the company. You have IP that may have been developed by other teams within the company. You purchase IP. There are a lot of different components that go in. And now, as designs get bigger, there is more consolidation happening. What was potentially two completely different designs with two clock or power domains, those are coming together along with mixed signal. This means the failure signature is getting longer, so it may go across multiple areas, some of which may be unfamiliar to the person doing debug. Therefore, it takes much longer to get to the root cause of the problem.”
As a result, engineering teams are using different verification approaches to find the root cause more quickly.
“Simulation used to be the workhorse of verification in the past, but now more and more companies are adopting static techniques, formal techniques, so the number of different approaches you can take there is increasing,” Gorur said. “Some of them help get to that root cause faster. You can catch some of these bugs early on without even having to get to a testbench. You just run your design through your static checker. Maybe it’s a lint checker. Or maybe it’s a CDC or UPF checker. Whatever it is, these static checkers are able to catch a lot of bugs faster. The moment you take out the low-hanging fruit, the less time you spend on debug later on when you’re doing simulation because you’re not encountering those types of bugs anymore. You’re catching more of the corner-case issues and things like that.”
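The flavor of such a static check can be sketched in a few lines of Python. The netlist representation and synchronizer list below are made up for illustration and are far simpler than what a production CDC tool works with, but the idea is the same: a structural rule flags the problem without running a single simulation cycle.

```python
# Toy sketch of a structural check performed without any testbench: flag
# clock-domain crossings that do not land on a synchronizer. The dictionary
# "netlist" here is invented for this example, not any tool's real data model;
# production CDC checkers work on elaborated RTL and recognize many more
# synchronization schemes.
clock_domain = {"a_reg": "clk_a", "b_reg": "clk_b", "sync_ff": "clk_b"}
connections = [("a_reg", "b_reg"), ("a_reg", "sync_ff"), ("sync_ff", "b_reg")]
synchronizers = {"sync_ff"}

def unsynchronized_crossings(conns):
    issues = []
    for src, dst in conns:
        crosses = clock_domain[src] != clock_domain[dst]
        if crosses and dst not in synchronizers:
            issues.append(f"{src} ({clock_domain[src]}) -> {dst} ({clock_domain[dst]})"
                          " crosses domains without a synchronizer")
    return issues

for issue in unsynchronized_crossings(connections):
    print(issue)
```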
Another approach can include formal verification, where corner case issues can be identified instead of running exhaustive simulations. But debug must be tailored to each of these verification approaches, he noted.
In the past, design and verification teams sometimes added temporary monitors within the design to detect errors closer to the source and reduce debug time. “Detecting a FIFO overflow when it happens is more efficient than waiting for corrupted data to appear on the design outputs,” OneSpin’s Stamenkovic said. “Assertions are now used to perform a similar function. Assertions are automatically excluded when the design is synthesized, so they are more convenient than hand-inserted monitors. Both automatically generated and user-written assertions are checked continually during simulation, reporting any problems earlier and easing debug.”
Formal verification reduces debug time even further, he said. “If a formal tool finds a way to violate an assertion, it presents a minimal-length test case showing how the violation occurs. Most formal tools can export this test case to simulation. Debug happens in a familiar environment, but with less effort due to the focused nature of the test. Similarly, formal equivalence checkers can show precisely any disagreement between two designs that are supposed to be functionally equivalent. This is easier to debug than two sets of simulation tests with different results, and also exhaustive.”
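For the monitor-at-the-source idea Stamenkovic describes, a behavioral sketch in Python shows the intent. In a real flow this check would be written as an assertion in the RTL; the event trace and FIFO depth below are hypothetical and exist only to make the example runnable.

```python
# Behavioral sketch of catching a bug at its source: flag a FIFO overflow the
# moment a push arrives while the FIFO is full, instead of waiting for
# corrupted data to reach the outputs many cycles later.
def check_fifo(events, depth):
    occupancy = 0
    for cycle, op in events:          # op is "push" or "pop"
        if op == "push":
            if occupancy == depth:
                return f"overflow: push into full FIFO at cycle {cycle}"
            occupancy += 1
        elif op == "pop" and occupancy > 0:
            occupancy -= 1
    return "no overflow detected"

if __name__ == "__main__":
    trace = [(c, "push") for c in range(10)] + [(10, "push")]
    print(check_fifo(trace, depth=8))   # flags the overflow at cycle 8
```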
Other issues
Chipmakers are using more commercial IP and interconnect fabrics in complex designs, with standard interfaces they use to connect to their own secret sauce. Not all of the pieces go together so well, however.
“They tweak that interconnect to extract the maximum performance out of it,” Gorur said. “But guess what? If you don’t actually measure the performance and make sure that, in addition to functionality, performance targets are being met, then there’s no way of telling whether it will meet those targets. You want to be able to measure that performance.”
This is particularly important for interconnects and protocols. “At the subsystem level and at the SoC level, once the chip comes up to that point, there is significant interest in being able to define certain metrics, measure those metrics, and make sure that those thresholds are constantly being met as part of the design,” he said. “For example, within certain tools, performance analysis gives engineers a way to define and measure, visualize, analyze, and debug those performance metrics. They can say, ‘I need the bandwidth on this particular channel to be such and such,’ and they set those thresholds. What the tool will do is run the simulation and collect all the data and say, ‘You have a thousand transactions running through this fabric. The latency was always below the threshold that was specified—except for these five transactions.’ So out of the thousand, five of them exceeded that threshold, and now the verification team can dig deeper into what those are to figure out what the root cause of those performance-related failures is, which are not true failures, but failures as defined for performance.”
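A stripped-down version of that kind of post-run triage, assuming a hypothetical list of per-transaction latency records rather than any particular tool's output, could look like this:

```python
# Minimal sketch of post-run performance triage: given (transaction_id,
# latency_cycles) records collected during simulation, list the transactions
# that exceeded the latency threshold. The names and threshold are illustrative.
LATENCY_THRESHOLD = 120   # cycles, set by the performance requirement

def find_outliers(records, threshold):
    return [(txn, lat) for txn, lat in records if lat > threshold]

if __name__ == "__main__":
    import random
    random.seed(0)
    records = [(i, random.randint(40, 110)) for i in range(995)]
    records += [(995 + i, 150 + i) for i in range(5)]      # five slow transactions
    outliers = find_outliers(records, LATENCY_THRESHOLD)
    print(f"{len(outliers)} of {len(records)} transactions exceeded "
          f"{LATENCY_THRESHOLD} cycles: {outliers}")
```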
How much of this can be headed off up front has been debated for years. “It’s not just building an IP and letting the SoC guys build it together, but having some kind of a team or tools or infrastructure that builds systems with the knowledge of how all the various IPs work together,” said Rajesh Ramanujam, product marketing manager at NetSpeed Systems. “Otherwise, if you just debug everything by itself, it becomes a harder problem to solve at the system level.”
Debug at 7nm and below
At 7nm and beyond, the debug challenges are not entirely clear, especially when it comes to trying out new technologies. “Nobody really wants to be the guinea pig,” Ramanujam said. “Nobody really wants to be the first person trying out something new. One might think they had known all the issues that there are to uncover, but only when you get in do you find there are more unknowns. As you move to lower nodes, a lot more complex things get added in terms of domains having to cross. And some applications are more prone to them than others, such as AI and automotive. Those customers need a lot more performance, a lot more real-time constraints, and it just makes the job that much more difficult. If something goes wrong or needs to be debugged, or you have to figure out what the issues are, how do you do that at a system level? The question really is what’s required in the industry to solve things like these?”
Being able to observe the various pieces and how they interact is critical here. “If you want to see every little piece, there’s going to be a lot more information for you to digest, but then you have to present that in a very methodical and thoughtful way,” he said. “Then, having a modularized approach—as opposed to a haphazard way of putting IPs together, or even building IP so that it’s very structured—will provide scalability and a consistent methodology. If you take that approach, whether it’s a 10-IP SoC or a 200-IP SoC, once your methodology is modularized, then the debug approach will automatically become scalable.”
While there isn’t one debug approach for every design and every design team, a variety of technologies can be employed today. But it’s still getting more difficult to identify the root cause of problems at each new node and to optimize performance in a sea of complexity.