The Problem With Post-Silicon Debug

Rising costs, tighter market windows and more heterogeneous designs are forcing chipmakers to rethink fundamental design approaches.

Semiconductor engineers traditionally have focused on trying to create ‘perfect’ GDSII at tape-out, but factors such as hardware-software interactions, increasingly heterogeneous designs, and the introduction of AI are forcing companies to rethink that approach.

In the past, chipmakers typically banked on longer product cycles and multiple iterations of silicon to identify problems. This no longer works for several reasons:

  • Costs continue to rise along with complexity, but the markets that support billion-unit designs are flattening and becoming more competitive. The result is that chips are being produced in smaller volumes, often with higher reliability requirements.
  • Designs are increasingly heterogeneous, incorporating multiple compute elements and memories rather than a single processor.
  • Time-to-market pressure is rising in all segments, including automotive and industrial designs.

Put simply, chipmakers are under pressure to do more, in less time, for the same or less money. That arithmetic does not work with the approaches they have used in the past.

“Many years ago, it was maybe $10 million to get to market, and then it was $50 million, and now on an advanced node you might be talking about $200 million,” said Rupert Baines, CEO of UltraSoC. “People say it’s the cost of the node or it’s the cost of the processing, among other things. But actually, while the mask costs have gone up, it’s not by that much. On top of that, if you look at those as mask costs or EDA costs on a dollars-per-transistor basis, it’s actually trending down on a Moore’s Law-esque curve. The reason the product cost is shooting up is integration and software.”


Fig. 1: Verification costs are growing exponentially with increasing complexity. Source: Accellera

Software is one of the main culprits. “In these bigger and bigger systems, we’re doing more and more,” Baines said. “The amount of software in them is going up, and that costs money. The amount of integration is increasing unbelievably. Everyone talks about tape-out as though tape-out were the end of the story. It’s not. It’s an internal milestone. Tape-out doesn’t mean anything to your customers. It doesn’t mean anything to your revenue stream. It’s purely a marker on a project, and it’s roughly halfway through the project. We’ve had years of discussion around ‘Shift Left,’ which all of the EDA vendors talk about as a way to accelerate tape-out, but nobody has been talking about the second half of the project, which is where the costs are increasing astronomically. From a financial point of view, that’s where you need to focus all your energy. The side before tape-out, where, on a constant basis costs are falling, that problem has been solved to a degree. It’s the side after tape-out where costs are rising, and schedules are slipping. So it’s all about the half of the project after silicon comes back until time to revenue. It’s not time-to-tape-out, it’s time-to-revenue that matters.”

There is widespread agreement that post-silicon debug challenges associated with finFET-class ASICs and advanced packaging such as 2.5D are daunting. Moreover, they are exacerbated by increasingly heterogeneous designs, which are seen as a necessary way to compensate for diminishing power/performance improvements from node shrinks, as well as to improve performance for AI/ML/DL.

“The two most important ingredients to success are preparation and collaboration,” said Ajay Lalwani, vice president of global manufacturing operations at eSilicon. “There are many potential, subtle interactions between a collection of IP from multiple suppliers, the semiconductor process that made them, the 2.5D package and associated HBM memory, and the system firmware. Successfully bringing up a design like this demands a multidisciplinary team from all the key ecosystem players: IP, fab, packaging and customer.”

Lalwani notes that waiting until silicon arrives from the fab to assemble this team won’t work because the team needs to begin analyzing scenarios and preparing for the complex task ahead months before the chip arrives. “That’s the best way to stay ahead of the process and its challenges. If all members of the team are prepared and willing to share information to solve problems, things usually go well.”

The business of technology
From a business standpoint, though, getting a chip to work isn’t necessarily the only consideration. The economics of chip development are dramatically different for billion-unit mobile phone chips versus chips that are developed for narrower market segments and produced in smaller volumes.

“Thirty years ago there was a controversy about BiST,” said Baines. “Obviously, it is worth doing. Nobody even thinks about that now. But in order to do it right, you want to have those discussions right at the very, very beginning. And you have your DFT experts contributing to the design, contributing to the architecture very early in the design process. If you do that, then the DFT gets integrated efficiently, BiST will work brilliantly, and you get all the benefits (economics, yield, and all the rest of it) because you took the decision early in the design process. We need to be doing something similar about integration, verification, validation and the post-silicon stages.”

He said that post-silicon debug is broken, as evidenced by the explosion in costs. “If we had tools and techniques that were working well, then just like in the pre-silicon phase, we’d be seeing costs falling on a normalized basis. Obviously, a bigger chip costs more than a smaller one, but [rather than] dollars per transistor, we should be seeing dollars per thousand lines of code or dollars per transistor post silicon falling, and they are not. That proves existing methodologies are broken, which comes from the fact that they make lots of assumptions. Nearly all the design methodologies at the moment basically assume there’s one processor, or if there are multiple processors, they assume that they are independent and you can treat them as though they were only one processor. They don’t take any consideration about scalability of the number of processors or the interactions of the number of processors. That’s a problem that’s left for the human engineer to solve using his or her brain cells. It’s not in the methodology or the architecture.”

This becomes more difficult as chips become more heterogeneous, and it begins to stress some of the tried-and-true approaches for design.

“Simulation has long been the primary verification workhorse, first for directed tests, which test expected functionality and as many out-of-specification behaviors as the test creator can conceive,” said Pete Hardee, director of product management at Cadence. “Then simulation evolved to cover variations of these scenarios using constrained-random testbenches.”

Now verification engineers broadly supplement simulation with formal verification. Formal has the advantage of trying every possible combination of inputs to mathematically prove or disprove a facet of functionality captured as a property—most commonly as a SystemVerilog Assertion (SVA).
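The difference between sampling inputs and covering all of them can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the 8-bit saturating adder, its planted bug, and the function names are hypothetical, and brute-force enumeration only stands in for the symbolic reasoning a real formal engine performs. The property plays the role an assertion would play in a real flow.

```python
import itertools
import random


def saturating_add(a: int, b: int) -> int:
    """Hypothetical 8-bit saturating adder with one planted corner-case bug."""
    total = a + b
    if a == 0xFF and b == 0xFF:
        return total & 0xFF  # planted bug: wraps instead of saturating
    return min(total, 0xFF)


def holds(a: int, b: int) -> bool:
    """Property under test: the output equals the saturated sum."""
    return saturating_add(a, b) == min(a + b, 0xFF)


def exhaustive_check():
    """Try every input pair and return all counter-examples."""
    pairs = itertools.product(range(256), repeat=2)
    return [(a, b) for a, b in pairs if not holds(a, b)]


def random_check(trials: int, seed: int = 0):
    """Sample random input pairs, as a constrained-random test might."""
    rng = random.Random(seed)
    samples = ((rng.randrange(256), rng.randrange(256)) for _ in range(trials))
    return [(a, b) for a, b in samples if not holds(a, b)]


print("exhaustive counter-examples:", exhaustive_check())
print("random counter-examples:", random_check(1000))
```

The exhaustive pass is guaranteed to report the single failing pair out of 65,536. Each random sample has only a 1-in-65,536 chance of landing on it, so a 1,000-sample run will most likely report nothing.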

“Traditionally, trying every combination in order to achieve full proofs of these assertions has limited formal verification’s application to smaller and ‘formal-suitable’ (often control-dominated) blocks with smaller state spaces, where formal could replace simulation for unit-level tests,” Hardee said. “But now, new levels of capacity and techniques to drive much deeper into the state space of bigger IPs and subsystems, and to find counter-examples that show where assertions fail, are being widely deployed in parallel with or after completion of simulation. These ‘deep bug hunting’ techniques are able to uncover corner-case bugs even in designs where simulation has completed with good coverage metrics.”
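The idea of searching the state space for a counter-example can be sketched on a toy sequential design. This is a minimal illustration under stated assumptions, not a description of any vendor's tool: the 4-bit counter, the property, and the function names are all hypothetical, and a plain breadth-first search stands in for the far more scalable engines used in practice.

```python
from collections import deque


def next_state(count: int, inc: int) -> int:
    """Hypothetical design: 4-bit counter, increments when inc=1, clears when inc=0."""
    return (count + 1) & 0xF if inc else 0


def assertion_holds(count: int) -> bool:
    """Hypothetical property: the counter must never reach 12."""
    return count != 12


def find_counterexample(max_depth: int):
    """Breadth-first search from reset.

    Returns the shortest input trace that violates the assertion,
    or None if no violation is reachable within max_depth cycles.
    """
    start = 0
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, trace = queue.popleft()
        if not assertion_holds(state):
            return trace
        if len(trace) == max_depth:
            continue  # depth bound reached on this path
        for inc in (0, 1):
            nxt = next_state(state, inc)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, trace + [inc]))
    return None


print("bound 8: ", find_counterexample(8))
print("bound 20:", find_counterexample(20))
```

With a bound of 8 cycles the search finds nothing, because the failing state sits 12 cycles from reset. Raising the bound exposes a trace of twelve consecutive increments, which is the kind of deep corner case that short random runs tend to miss.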

Different approaches
Formal verification is a proven way to uncover high-value bugs in what are considered to be fully verified designs. The trick in formal is to narrow down exactly what you are trying to do, and that isn’t always easy.

“One of our customers was trying to verify 1 million connections at 7nm,” said Raik Brinkmann, CEO of OneSpin Solutions. “The problem is that it took 24 hours to verify one connection. What we found was that it was better to rebuild the application from scratch and re-think the problem, rather than trying to use what was already available. As a result, they were able to get the time it took to verify connections down to 23 seconds. They were able to narrow down the problem to specify the connection, and they used patterns to do that verification.”
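The scale of that improvement is easier to see as arithmetic. The sketch below uses only the figures quoted above and assumes, purely for illustration, that connections are checked one at a time on a single machine; the article does not say how much parallelism the customer used.

```python
CONNECTIONS = 1_000_000
SECONDS_PER_DAY = 24 * 60 * 60

SECONDS_BEFORE = 24 * 60 * 60  # 24 hours per connection (from the quote)
SECONDS_AFTER = 23             # 23 seconds per connection (from the quote)


def serial_days(connections: int, seconds_each: int) -> float:
    """Total runtime in days, ASSUMING strictly serial execution."""
    return connections * seconds_each / SECONDS_PER_DAY


before = serial_days(CONNECTIONS, SECONDS_BEFORE)
after = serial_days(CONNECTIONS, SECONDS_AFTER)
speedup = SECONDS_BEFORE / SECONDS_AFTER

print(f"before: {before:,.0f} days (~{before / 365.25:,.0f} years)")
print(f"after:  {after:,.1f} days")
print(f"per-connection speedup: ~{speedup:,.0f}x")
```

Under that serial assumption, the original flow would need a million days, or roughly 2,700 years, while the reworked flow needs about 266 days. That still requires parallel compute to be practical, but it is within reach of a compute farm.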

And this is where some of the biggest changes are required. Building chips is no longer just about using a classic von Neumann, single- or multi-core processor architecture.

“There’s more attention spent on figuring out the optimal architecture,” said Aart de Geus, chairman and co-CEO of Synopsys. “If you have a bottleneck in the architecture, you can’t just put all this data into cache. These architectures can be explored. If you change these ratios or buses and architectures and the amount of computation and external access to memory, what happens depends on all the parameters. Architectural tuning is becoming more important. But that’s another way of saying the traditional von Neumann machine is evolving to the next generation of machines, based on creativity of the people designing it.”

That creates its own set of issues, though, because problems need to be addressed in context. “You have to optimize systemic complexity, which is multi-dimensional on a matrix,” de Geus said. “In software, there are the same problems, but it’s governed less by verification. If you do a tape-out on a chip and you find some bugs, and it comes back not working, you’re out $5 million to $7 million in mask costs and time. For software, you send out a patch. The discipline has been less rigorous. But as software complexity increases, the statistics are against you.”

By the time a design reaches the post-silicon phase, it is a system with many moving parts, not all of which are precisely characterized or understood. Bugs can result from mismatched or misinterpreted specifications between hardware and software, or from incomplete consideration of the system use cases.

“Other verification solutions are becoming much more broadly applied, and new ones are emerging, to address these challenges,” Hardee said. “One example of this is much broader use of hardware emulation to thoroughly test the hardware-software interactions, with enough performance and capacity to run sufficient software on a sufficiently accurate model of the hardware. A second example is verification of a broader set of system use cases, captured with the emerging Portable Stimulus Specification (PSS) and executed with new tools.”

Dave Kelf, chief marketing officer at Breker Verification Systems, agreed. “Post-silicon validation has been a classic silo in the verification process, where separate teams work on silicon diagnostics, often completely cut off from the verification flow. This is not surprising given that current verification techniques are not easily ported to the post-silicon world. It is well-known that Portable Stimulus is targeting these issues, allowing for verification tests to be reused post-fabrication. In reality, plenty more is needed to provide a complete flow for the validation engineers. The PSS tool also must incorporate post-silicon debug and coverage, which requires higher degrees of visibility into the silicon. The PSS tool must allow the engineer to watch tests run on the design and, by noting which tests fail from the verification suite, identify the internal component creating the issue. With the right tooling, post-silicon can resemble more general verification, with the debug power that goes with it.”

Post-silicon debug tooling is not a new problem. Stephen Bailey, director of strategic marketing at Mentor, a Siemens Business, has been watching this space for nearly a decade, but he said chipmakers are often reluctant to adopt new tools and methodologies. “It’s easier to get people to use anything in a prototype, but a lot harder to get people to use something in their final silicon. The big challenge in coming up with a commercial offering is getting companies to go from what they developed internally to something that’s off-the-shelf. That remains a significant challenge. As far as having a commercial product, the challenge in the market overall is what kind of visibility you can provide in silicon, because when people get back their first silicon, or second or third silicon, if it’s not completely brain-dead but isn’t production-ready, you need to have some ability to go in there and figure out what the heck is going on. This requires more than just doing a system-level scan dump, which has limited value. It’s important, but it’s limited in value in trying to figure out what’s happening, especially if you’ve got performance issues that you didn’t expect. It’s really a challenge because clearly you can’t change what you have visibility of when you add in some kind of visibility capabilities without re-spinning the chip.”

That’s where post-silicon debug differs from the FPGA and the prototyping market, where the design can just be recompiled. “It might take a day or two to get new results, but that’s a lot quicker than three to six months to get a new chip back just for the purpose of getting better visibility,” Bailey said.

Providing that visibility requires a system-level strategy that includes a growing list of variables, including real estate costs, possible performance impacts that could be introduced by instrumentation, and even various types of variation. Debugging is affected by all of these variables, and providing visibility into these types of issues can have a big impact.

“Debug has always been thought of as the ‘red-headed stepchild,’” said UltraSoC’s Baines. “It’s always an afterthought. It’s always, ‘We’ll get Fred to deal with that,’ and ‘Can’t we just use JTAG?’ It’s never been properly considered as part of the business model and part of the cost driver. Also, because debug traditionally has been done by the core vendor, it’s been thought of as a processor problem, not as a system problem. Of course, the core vendors only care about their own core, and because it’s being thought about as a core-based problem, not a system problem, a lot of the tools and methodologies simply don’t exist.”

Conclusion
Even with the best planning, bugs can and do escape into silicon. Post-silicon debug, the traditional way, is no fun. It involves wading through very long JTAG traces and trying to perform root-cause analysis to find what caused the problem. The amount of data and poor debug visibility make it akin to finding a needle in a haystack.

Given the significance of the costs associated with a semiconductor design post-silicon, this is an area of the industry that will continue to evolve as designs grow in complexity and heterogeneity.

—Ed Sperling contributed to this report.


