Experts At The Table: Who Takes Responsibility?

First of three parts: Preventing problems; integration issues; where bugs slip in; supply chain shifts and their impact; what the foundries are seeing at the end of the design process.

By Ed Sperling
Semiconductor Engineering sat down with John Koeter, vice president of marketing and AEs for IP and systems at Synopsys; Mike Stellfox, technical leader of the verification solutions architecture team at Cadence; Laurent Moll, CTO at Arteris; Gino Skulick, vice president and general manager of the SDMS business unit at eSilicon; Mike Gianfagna, vice president of corporate marketing at Atrenta; and Luigi Capodieci, director of DFM/CAD and an R&D fellow at GlobalFoundries. What follows are excerpts of that conversation.

SemiEngineering: As we start integrating more IP into SoCs and stacked die, who’s responsible when something goes wrong?
Koeter: IP is getting very complex. We try to prevent the problem from happening in the first place by developing quality IP and testing it out on shuttles. But it’s also about holding design reviews with our customers to see how they’re integrating that IP. We can advise them of any gotchas. We also offer signal integrity services. It’s not just the on-chip PHY. It’s also the package and the board traces, and channel characterization for SerDes. So we do try to prevent problems, but they do occur when the silicon comes back and they’re bringing up the chips. We’re a big enough organization that we can put people on site locally as we need to and help people debug problems. A lot of times it has nothing to do with the IP. It may be software issues or integration issues.
Moll: We don’t see issues with our IP, but we do see a lot of IP integration issues. The reason is that when you integrate many IPs, which most people do, you’re bound to have integration issues. You don’t see it with the IP, but as you start assembling your system you see it. Whether that’s the security address mapping, the performance, the power management—we see all of these things. The good thing about living in an ecosystem is that, at least for the main IPs, you get to know them. Application engineers know which IP does what and how it behaves. But there are small IPs we don’t see as often or internal IPs that are harder to know, so we try to make sure early on that all the parts are covered with verification and performance modeling. But we do see issues, and we do get called often even if it’s unrelated to us because we tend to live between IPs. On your path to the memory controller you usually encounter us. So we can figure out on which side there’s an issue, or whether it’s the whole system that doesn’t work as planned. We also have instrumentation to look at what’s going on. There’s a lot of debug visibility. Even if we’re not involved in the problem, we have the tools to look at it.
Stellfox: Looking at it from the verification perspective, it’s a big issue. Whether it’s internally or externally developed IP, the IP developers have no specific information about the SoC context they’re going into. And then the SoC guys have very little detail about the IP operation. They’re buying the IP because they want to use something off the shelf. From the IP side, the kind of issues you typically see are, ‘You weren’t able to model in your verification environment at that level certain things. You took assumptions that weren’t actually true and you plugged it into a specific system context.’ Or, you just didn’t do a thorough enough verification job. And then a lot of times the IP isn’t optimized for integration, so when you deliver the verification with it, the verification was built around the IP but it wasn’t built in a way that the SoC integrator can leverage. From the SoC side, people underestimate integration. They think it’s just integrating a bunch of IPs, but every SoC has specific power, concept, resetting, clocking and other logic around that IP that makes it very customized for that context. Somebody has to verify that. With this gap in knowledge, bugs slip through there. And the IP is usually not tested for specific use cases of the system. That’s the SoC person’s job. So there can be SoC performance bugs and power bugs related to requirements of the SoC. The problem at the SoC level is that integration is very much a manual effort. It’s a lot of specs and connecting things. There’s an opportunity there for EDA to provide more solutions around integration.
Gianfagna: In terms of who takes responsibility, you follow the chain down to the least common denominator. That’s the customer building the chip, or in some cases the foundry manufacturing the chip. I wish I could say that everyone would share the financial responsibility if something goes wrong, but that’s not the case. So how do you avoid a problem? We’re not an IP provider. But we do use more and more IP. We do a lot of analysis on the RTL to determine what may go wrong—power, clock utilization, timing constraints. We can look at routing congestion and find obstructions. We can predict testability. And we do all of that at the soft level to figure out the baseline for quality management of IP that’s put on the market. We work with a lot of folks on that. We work with TSMC. There’s a lot of work to be done. But the good news is the industry is starting to come together on this.
Skulick: Who owns the problem depends on whether we’re dealing with the fabless semiconductor companies or an OEM. We’re the integrator and we cover for any of the issues that we create. Now we’re talking about $2.7 million mask sets. Ten years ago that wasn’t a big risk item. Today it’s a huge risk item.

SemiEngineering: At what node is it costing $2.7 million?
Skulick: 28nm. It depends on how many layers of metal and all the options. We’re doing some pretty large devices. Our problem has been more with the hard IP, which creates all-layer issues, not just metal fixes and RTL. The OEM customer says, ‘That’s your problem. I don’t know anything about that. I’m used to dealing with an ASIC company.’ It’s a challenge for them when we bring our model to the table and they understand they may have to share some of the burden with IP. A fabless company understands that. We may bring some services they don’t have, but they know the IP relationship. We don’t get the IP providers to pay for a mask set if we have to do a respin. They’re all pass-through terms. They’ll fix it, but they won’t pay for any hard costs. We’re in the middle, and explaining that to an OEM is much more difficult than explaining that to an FSC. We can’t afford to cover it if there’s an IP problem. We can get to the root cause of the problem. We’ve found some examples where it’s in the IP itself. We’ve had issues where it’s been our integration and how we connect it. Fortunately, what’s typical is the customer also has issues and they want to make changes and there’s a lot of divvying up of the cost.
Capodieci: What we see, and what we’re seeing with concern, is that at the physical level everything is late or impossible to respin. We have to fix a lot of issues that were discovered. The same issues with power appear at a physical level: IP that has been designed completely out of context with where it will go, either with fill issues or next to other things that mess it up. We’re not in the blame game. We’re in the fixing game. More and more we have to implement ad hoc custom fixes. Our engineering is getting tired of putting in patches and is going ahead and developing automated fixes. There is an opportunity, and some of the EDA vendors are providing physical engines that are helping. But this is raw engineering to take blocks that were developed in isolation, put them into context and guarantee coverage for multiple applications. That’s what we see in the trenches. Who’s to blame? I don’t know. I want to be blamed for the solution. But it’s certainly today’s problem at 28nm, and it’s going to go all the way to 10nm and beyond.