The push to 7nm and beyond, as well as into safety-critical markets, raises the stakes and the hurdles for finding design issues. New approaches may be necessary.
By Ann Steffora Mutschler & Ed Sperling
Debugging a chip always has been difficult, but the problem is getting worse at 7nm and 5nm. The number of corner cases is exploding as complexity rises, and some bugs are not even on anyone’s radar until well after devices are already in use by end customers.
An estimated 39% of verification engineering time is spent on debugging activities these days, and verification as a whole accounts for roughly 70% to 80% of total NRE. Yet it's getting harder to find these bugs as multiple processors and memory types are used to reduce power and optimize performance. Add to that more functionality and an increasing number of possible use cases and usage models, and predicting all of the things that possibly can go wrong becomes almost impossible.
As a result, chip architects and engineers are now looking at new approaches to speed up and simplify debug, including continuous monitoring, error correcting strategies, and developing SoCs and ASICs that are inherently easier to debug.
“There are more processor cores, more power domains, and hardware deep-learning neural networks,” said Larry Melling, product management director at Cadence. “Debug of these systems will require more sophistication in root cause identification. Take low-power design for example. It introduced isolation and retention to the driver tracing problem, where a driver is triggered by a domain power down. That can obfuscate the root cause of a wrong value propagating.”
Indeed, as silicon geometries continue to shrink, SoC platforms on single devices become larger and more complex, noted Dave Kelf, vice president of marketing at OneSpin Solutions. “The debug complexity of these devices increases at an exponential rate with design size. Furthermore, the error conditions that can occur may be due to complex corner-case problems, which are hard to track down. Innovative debug techniques are required, and these might make use of unexpected alliances between different tools. For example, a fault that becomes apparent during SoC emulation can be debugged using bug-hunting techniques applied with a formal tool, with assertions created that exhaustively analyze the specific condition. The continued shrinkage of geometries essentially results in inventive and diverse combinations of tools, stretching their capabilities to meet unexpected requirements.”
Other considerations at leading-edge nodes include multiple processors and neural networks—in short, distributed computing—which increase the number of possible sources for errors such as memory corruption, while also making it difficult to find paths between effect and cause.
“All this will require a more transactional or programmer’s view of test execution grounded in the context of time to help identify hard-to-find race conditions between concurrent transaction execution,” Melling said. “Debug will have to rely less on tracing signals in favor of providing overall context of design state, transaction views, and timed views of messages, power state, and concurrent transactions. And presenting context from these different sources of information will demand smarter debug, with more sophisticated data analytics and visualization to find root cause of the observed misbehavior.”
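To make Melling's point concrete, consider a minimal sketch of a timed, transactional view: merge timestamped transactions from several sources into one timeline and flag overlapping accesses to the same address. The record format, source names, and cycle counts below are invented purely for illustration.

```python
# Hypothetical sketch: find racing transactions in a merged, timestamped view.
from dataclasses import dataclass

@dataclass
class Txn:
    source: str   # which core or engine issued the transaction
    addr: int     # target address
    start: int    # start timestamp, in cycles
    end: int      # end timestamp, in cycles
    kind: str     # "R" (read) or "W" (write)

txns = [
    Txn("cpu0", 0x4000, 100, 140, "W"),
    Txn("dma",  0x4000, 120, 180, "R"),   # overlaps the write above: a race
    Txn("cpu1", 0x8000, 150, 160, "W"),
]

def find_races(txns):
    """Flag pairs that touch the same address, overlap in time, and include a write."""
    races = []
    for i, a in enumerate(txns):
        for b in txns[i + 1:]:
            same_addr = a.addr == b.addr
            overlap = a.start < b.end and b.start < a.end
            conflict = "W" in (a.kind, b.kind)
            if same_addr and overlap and conflict:
                races.append((a.source, b.source, hex(a.addr)))
    return races

print(find_races(txns))   # [('cpu0', 'dma', '0x4000')]
```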
Fig. 1: Debugging a complex chip. Source: Cadence.
Complicating matters is the increasing interdependency between chips, packages, boards, and even other connected or sometimes-connected systems. The result is that there is no single task known as “debug” anymore. Harry Foster, chief verification scientist at Mentor, a Siemens Business, said debug now spans everything from architectural design and RTL to timing, power, security, software interactions, the verification testbench, and manufacturing. So while engineering teams that added assertions into their RTL were able to cut RTL debug time by 50%, that addresses only one piece of the problem.
Indeed, there has been much rethinking of the entire debugging flow for SoC design, given the increasing amount of embedded IP. “I either go out and buy third-party IP or else I’m creating my own internal IP, and I start integrating it into this huge SoC with a lot of embedded processors,” said Foster. “Traditional debugging approaches no longer work. Part of the problem is that we are now faced with complex interactions between these IPs that quite often we didn’t expect, and traditional ways to get insight into these interactions don’t work.”
This includes more than just functionality. In the past, debug was about making sure a system functioned correctly. Increasingly, though, what is considered functional is more of a sliding scale that is defined by the IP, the end market, and what are the most important features within a design. So for a smartphone, not all features have to run at optimal power or performance. Some might even be disabled. That’s not acceptable in safety-critical markets such as automotive, medical and industrial, however, which add a whole new set of challenges for debug.
“If you re-use IP, it may have low, mid- or high performance,” said Kurt Shuler, vice president of marketing at ArterisIP. “This gets more difficult as you add more subsystems, or if you change how they are connected. Re-use for a lot of companies is based on the idea that, ‘If it ain’t broke, don’t fix it.’ The reason is that if you change the IP significantly, you have to change the software. But as we get into safety and security, the needs of customers are changing, so now they’re going back and changing the internals.”
In avionics, which has been the model for safety-critical design for automotive and other markets, third-party IP has played only a very minor role because it doesn’t come with the necessary documentation. It’s easier to test and debug IP developed for a specific system. Development times in avionics also are much longer than in semiconductors, which has made this less of an issue in that sector.
“One of the main approaches in this market is to reverse-engineer IP,” said Louie de Luna, director of marketing at Aldec. “That requires a rewrite, but it also requires more test, more documentation and more design. The big avionics companies typically don’t buy commercial IP. They make their own.”
Big data and machine learning
This shift to more customized solutions as defined by market segments is accelerating, and it has big implications for debug. Devices sold into safety-critical markets have to function for much longer than a consumer or mobile device, often within defined parameters of what is considered acceptable performance. That greatly increases the need for better coverage rather than relying on an after-market software patch. And increasingly it requires what, in effect, amounts to debugging of the debug process.
“Coverage is done to see if we actually tested this,” said Foster. “If we didn’t, we have to debug and figure out why we didn’t. Is there a bug in our design or a bug in our test? Those approaches fall apart and don’t work in very complex SoCs, so we’ve had to develop new solutions in that space. Basically, it has meant coming up with new metrics, and these new metrics are statistical in nature and leverage data mining techniques.”
Big data techniques are playing an increasingly important role here, both with data mining and machine learning techniques that can utilize those findings across a wide swath of designs. The goal is to identify which approaches work best in order to reduce the overall time spent on test, verification and debug—basically, prioritizing how to get the most return for effort spent.
“The goal is to apply the best tests to get the most coverage, and then to use tests with incremental coverage later in the development process,” said Bill Neifert, senior director of market development at Arm. “You need models for this because there is never enough testing. There is always an acceptable number of outstanding bugs and an acceptable number of verification cycles. However, if you can free up more cycles you can run more tests, providing more effective coverage. So basically, you’re developing better tests, and using the extra time to do more verification.”
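A minimal sketch of the kind of test ordering Neifert describes might look like the following: greedily run the test that adds the most new coverage first, then schedule the rest by incremental coverage. The test names and coverage points here are hypothetical.

```python
# Hypothetical sketch: order tests by incremental coverage (greedy set cover).
# Each test maps to the set of coverage points it hits (invented data).
test_coverage = {
    "smoke_boot": {1, 2, 3, 4, 5},
    "dma_stress": {4, 5, 6, 7},
    "irq_storm":  {2, 8, 9},
    "pwr_cycle":  {9, 10},
}

def order_tests(tests):
    """Return tests ordered so each adds the most not-yet-covered points."""
    covered, ordered = set(), []
    remaining = dict(tests)
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        gain = remaining[best] - covered
        if not gain:   # remaining tests add no new coverage
            break
        ordered.append((best, len(gain)))
        covered |= remaining.pop(best)
    return ordered

print(order_tests(test_coverage))
# [('smoke_boot', 5), ('dma_stress', 2), ('irq_storm', 2), ('pwr_cycle', 1)]
```

The payoff is exactly what Neifert describes: the high-value tests run first, and the cycles they free up can be spent on additional verification.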
There are a number of ways to approach this problem. One is to extract the needed data out of emulation or simulation in a smart way, with newly developed technologies addressing the problems that come with very complex designs integrating large numbers of IP blocks.
“When we start going to 7 and 5nm, the complexity explodes,” said Mentor’s Foster. “It’s not just the manufacturing aspect, but the fact that now in design we can integrate more and more. Now we have more interactions, so again, it’s fundamental to rethink this thing from more of a statistical way of looking at coverage at a system level by using data mining techniques to extract and do the analysis. In the process of doing this deep analysis and getting this insight we never had, we actually can uncover architectural bugs that we would never find using traditional approaches.”
Rethinking the problem
For example, with many SoCs today the tests are not done in the traditional way, where vectors are applied at the inputs. Those tests actually are embedded software running on one of the processors, and debug must be rethought accordingly.
“We need a new debugger that is synchronized across the processors in there, across the traditional waveform, the memory image, the register map,” Foster said. “All these things have to be rethought, and the information has to be presented in such a way that it doesn’t overload the user, so there has to be some intelligence to abstract out what is needed to present to the users how they can efficiently debug it.”
Research is underway in academia and industry to automate root cause analysis. David Patterson, the Pardee Professor of Computer Science Emeritus at UC Berkeley, stressed that the project team needs visibility into what’s going on in the chip, which means building in standard interfaces such as JTAG or debugger interfaces for instruction sets. “You’d want to build that in because you want to try and understand what’s going on.”
Patterson pointed to a soon-to-be-presented paper from Berkeley he believes is novel in terms of trying to understand what’s happening here. “One of the really great pieces of technology if you’re building computers is an FPGA, because you can have a soft design in an FPGA. People are reluctant to put debugging features in their chips because they take up area, power and design time. In an FPGA, it doesn’t matter if you leave it partially full or fill it all the way. You still can take advantage of it. The novel idea that some Berkeley students [recently wrote about] is to put two RISC-V processors on the chip in an FPGA, with the first one running ahead of the second one by, say, 1,000 instructions. In Verilog you can write an assertion that signals a flag if something happens. The bad part of that is you’ll just say something went bad, but you don’t know what happened. And you can’t record everything that happens all the time because you’ll run out of memory. It goes too slowly. The novel idea is to have two processors on the chip, one 1,000 clock cycles behind the other, and the job of the first one as a scout is to say, ‘Something is going to happen, so start recording.’ And the other one, which is in lockstep behind it, will catch up and the trigger will happen, but you’ll have recorded the last 1,000 instructions right before the event happened. That’s a novel way to add very detailed debugging information by seeing deeply what’s going on inside a chip using FPGAs.”
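The scout idea can be approximated in software. The toy model below is not the Berkeley implementation; it is a hypothetical sketch in which two copies of the same execution run offset by 1,000 instructions, and the leading copy arms a rolling trace buffer so the trailing copy captures the instructions just before the assertion fires.

```python
# Hypothetical sketch of the scout-processor idea: the leading copy of an
# execution arms a rolling trace buffer before the trailing copy hits the bug.
from collections import deque
import random

LEAD = 1000  # how far the scout runs ahead, in instructions

def make_program(n=5000, seed=7):
    """Invented instruction stream: random increments to an accumulator."""
    random.seed(seed)
    return [random.randint(1, 10) for _ in range(n)]

def assertion(state):
    """Stand-in for a Verilog assertion flag on architectural state."""
    return state > 20000

def run_with_scout(program, lead=LEAD):
    scout_state, main_state = 0, 0
    trace = deque(maxlen=lead)   # rolling record of recent instructions
    recording = False
    for i, insn in enumerate(program):
        scout_state += insn                  # scout executes instruction i
        if assertion(scout_state):
            recording = True                 # scout: "something is coming, record"
        j = i - lead                         # trailing core is `lead` steps behind
        if j >= 0:
            if recording:
                trace.append((j, program[j]))
            main_state += program[j]
            if assertion(main_state):        # trigger hits on the trailing core
                return j, list(trace)        # last `lead` instructions are captured
    return None, list(trace)

idx, window = run_with_scout(make_program())
print(f"assertion fired at instruction {idx}; recorded {len(window)} prior instructions")
```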
Another approach is to constantly test chips and to add in self-correcting measures.
“With automotive, you need to test chips in the beginning and in-system,” said Dave DeMaria, corporate vice president at Synopsys. “So you’ve basically got the equivalent of ECC on buses, and if you flip a bit, it can correct itself. There are also passive things that might not get corrected, such as whether a signal is valid or not. That might put the signal into the main processor, so it might be operating in a safe state. Every safety-critical system has failover mechanisms. If the middle of the die has contacts that are only 25%, it could break down after three or four years. If the chips are self-testing, that should flag itself that the contact is broken and the car puts itself into a safe state.”
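The “ECC on buses” mechanism DeMaria describes can be illustrated with a textbook Hamming(7,4) code, which corrects any single flipped bit. This is a generic sketch of the principle, not any vendor’s implementation.

```python
# Generic Hamming(7,4) sketch: a bus flips one bit; the receiver corrects it.
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit single-error-correcting codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]   # parity over codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]   # parity over positions 2, 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]   # parity over positions 4, 5, 6, 7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(bits):
    """Correct a single flipped bit, then recover the 4 data bits."""
    p1, p2, d0, p3, d1, d2, d3 = bits
    s1 = p1 ^ d0 ^ d1 ^ d3
    s2 = p2 ^ d0 ^ d2 ^ d3
    s3 = p3 ^ d1 ^ d2 ^ d3
    syndrome = s1 | (s2 << 1) | (s3 << 2)   # 1-based position of the bad bit
    if syndrome:
        bits = bits[:]
        bits[syndrome - 1] ^= 1              # the link corrects itself
    _, _, d0, _, d1, d2, d3 = bits
    return d0 | (d1 << 1) | (d2 << 2) | (d3 << 3)

word = hamming74_encode(0b1011)
word[4] ^= 1                                  # transient fault flips one bit in flight
assert hamming74_decode(word) == 0b1011       # receiver still sees the correct data
```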
The push to customize
This effort to build some of the testing and debug into the design is being done partly for a different reason. The trend at the leading edge of design, which includes AI, server and machine learning systems as well as mobile devices, is toward more customized silicon. Even so-called standard IP is tweaked slightly, and debugging incompatibilities in many cases involves starting from scratch.
“At 28nm companies would ask whether IP was proven in silicon,” said Anush Mohandass, vice president of marketing at NetSpeed Systems. “But at 7nm, most IP blocks are not proven. And each different market segment wants something else. What you require from one memory is different from one market to the next.”
This plays into how companies view debug, as well. In markets such as servers and automotive AI, for example, debug is primarily about reducing bottlenecks and anything that might impact performance.
“From an EDA point of view, you may be looking for an extra 2% or more out of timing from your design,” said Mike Gianfagna, vice president of marketing at eSilicon. “But if you look at this holistically, you can get much more. If you look at data centers, the challenge there is that they want to monetize the data, and to do that they need to decrease the risk and increase the efficiency at which they execute. If they run out of disk space, that can’t happen. So we’re starting to see software-defined data centers, where all the network and disk are under software control. They can change the network bandwidth and connectivity, but they still need to see the problem coming—and that’s where the challenge is.”
Fig. 2: IBM’s newsroom. Source: IBM
In effect, this moves debug up to the system level, and this is where the challenges and opportunities are greatest.
“The problem with heterogeneous SoCs, when trying to debug the whole thing holistically, is that it’s hard to find tools that allow you to get visibility into what’s going on, and control it,” said Simon Davidmann, CEO of Imperas. “People don’t have the tools because the ARC tools will be based on ARC, the Tensilica tools will be based around Tensilica, the MIPS tools around MIPS, and the Arm tools around Arm. Put a chip together with two different vendors on it, and the tools are in different buckets, which means you have to start one processor, and then start the next one. Then, you stop this one, and then stop the other. You end up with a can of worms in terms of synchronization. It’s technically quite difficult, but it’s made impossible by the fact that the vendors have different tools, and they talk different languages, effectively.”
Davidmann believes the core providers have no interest in supporting heterogeneous design in this way. “If you’re selling an Arm core, you worry about having the world’s best debug environment for the Arm core. If the customer goes and puts a Tensilica processor next to it, what do you care? Technically you should care if you’re a service company providing solutions, but they’re not. Arm licenses the IP. It’s slightly different for Synopsys, which says it is the silicon design partner and can do everything. But what happens if it’s got a Tensilica in it? The problem is that for the user designing a chip that is heterogeneous by vendor, and therefore heterogeneous by architecture, the vendors’ tools don’t all work in unison, and it’s a nightmare for them.”
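The synchronization problem Davidmann describes can be sketched abstractly as a single run-control layer that starts and stops every core together, regardless of vendor. The classes below are invented for illustration and do not correspond to any real debug tool or vendor API.

```python
# Hypothetical sketch: one run-control facade over per-vendor debug agents.
class VendorAgent:
    """Stand-in for a vendor-specific debug connection (invented interface)."""
    def __init__(self, name):
        self.name, self.halted = name, True
    def resume(self):
        self.halted = False
    def halt(self):
        self.halted = True

class CrossTriggerDebugger:
    """Start and stop all cores together, so one core cannot run on for
    thousands of cycles while another sits halted."""
    def __init__(self, agents):
        self.agents = agents
    def run_all(self):
        for a in self.agents:    # in real hardware, the skew between these
            a.resume()           # releases is exactly the synchronization problem
    def halt_all(self, reason=""):
        for a in self.agents:
            a.halt()
        print(f"halted {[a.name for a in self.agents]} ({reason})")

dbg = CrossTriggerDebugger(
    [VendorAgent("arm0"), VendorAgent("tensilica0"), VendorAgent("mips0")])
dbg.run_all()
dbg.halt_all(reason="watchpoint on shared buffer")
```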
Conclusion
Debug is becoming much more complicated at a time when it is also becoming much more critical. Designs are more customized, and they are being used in markets where safety and security are critical.
Tool engines and methodologies are getting faster. They are scalable across data centers these days, which helps. But not every bug can be caught in a reasonable amount of time, and the price of post-manufacturing failures is too high in markets such as automotive or even smartphones. As a result, chipmakers are beginning to look at error correction approaches and continuous testing even after devices are in use in the market. Put simply, they are debugging their approaches to debug, and how that ultimately shapes up could well redefine this process in the foreseeable future.