SoC Integration Mistakes

Experts at the Table, Part 2: Why making assumptions about use cases and thinking outside the box can create disasters—and where the pain really hits.


Semiconductor Engineering sat down to discuss integration challenges with Ruggero Castagnetti, distinguished engineer at LSI; Rob Aitken, an ARM fellow; Robert Lefferts, director of engineering in Synopsys’ Solutions Group; Bernard Murphy, chief technology officer at Atrenta; and Luigi Capodieci, R&D fellow at GlobalFoundries. What follows are excerpts of that roundtable discussion.

SE: Margin has an overhead in terms of power and performance that becomes unacceptable at advanced nodes, but with IP some of that is creeping back in. How do you justify that?

Lefferts: We have to understand the margins. You can’t just afford to throw up a number and say, ‘We’ll test for this condition, which is greater than the normal condition, because we’ve always tested there.’ You have to justify it, especially with low voltage because the margins can be very difficult to define and understand. For example, the lowest voltage a bit cell will run at is very difficult to determine without running a lot of silicon. Extrapolating from that to the rest of the memory and to the logic around it, and then figuring out which voltage conditions are legal and illegal, is critical to making your silicon work. But in addition to that, there are standard margins that you still need to have, where you allocate set-up margin for every problem you can think of. You’re going to have this many picoseconds for jitter and this many for clock skew, and then you’re going to add a couple more because you’re worried. Ideally you’d like to add more margin because we’re all engineers and we worry about such things, but it’s getting hard to do that. It’s a lot of work.

Castagnetti: I agree with the fact that margins are not free. They add area, they add power and they impact performance, and maybe all three of them. Ideally what would be nice to have would be an understanding of where else you can pay for it. So if you make your power system very robust and add decaps (decoupling capacitors), the challenge today is that I don’t think there is a good understanding of what that buys you. Can you have a more variable margin that you can play with? That’s missing. If I had that, I would be able to make tradeoffs myself.

SE: So you’re talking about exploratory tools for margin?

Castagnetti: Yes. Sometimes IP providers are not forthcoming in telling you where they added margins. Sometimes it’s in their own interest in case things don’t work later. Margining is a challenge.

SE: And it becomes more of a challenge as we move forward, right?

Castagnetti: Yes. Silicon is more expensive today.

Lefferts: There are a lot of things that used to be done down the pipe that you have to do now, before you get started. For example, with fill, you used to build the IP and then fill it. And if there were any places where you had problems, you’d fix them. You can’t do that anymore. You start doing fill on the first cell. For decap cells, you have to know the density even before you get started and how much margin you’re going to put in there, and that’s a tradeoff in terms of cost. If you’re going to be really loose on the poly density and way under the spec, that’s a huge amount of area for your decap. It used to be that you could do the design and at the top of the block do EM checking. You can’t do that anymore. You start with EM checking on the first inverter. In fact, the layout of the first inverter is totally dependent on the EM. Everything that you used to worry about at the back end you’re now worrying about up front, and you have to take that into account at the very beginning. It’s more challenging and it’s more work, but it’s not like you have a choice. There are tradeoffs on how much decap is being put in for supply noise. A lot of the IPs are mixed-signal and have their own supplies. It’s hard enough to control our own environment. It’s much easier to divide and conquer than to take into account the worst-case software stack causing a droop that impacts the analog. Because the analog blocks have their own supplies, that causes cross-domain issues, ESD and noise. But we also work with customers to show how these cross-domain issues impact the design because our IP may work in noisier environments.

SE: There is a lot of sharing of information, but is it the right information that’s being shared?

Murphy: It’s insurance. And that’s not just about trade secrets. The closer I let you go to the edge, the more likely it is to blow up in my face.

Lefferts: It is in everyone’s interest to cut down the number of use cases. If someone is doing something that is not a regular use case, and they’re doing it not for a very good reason, what ends up happening is they force the IP provider to work in that use case. The IP provider should resist.

Murphy: One interesting test of how far this can go is whether there is a compelling reason to get down to ‘gravity logic.’ So you’re running very close to threshold and you’re running at low speed, and you really need to understand margins there.

Aitken: That’s true, especially hold margin, because you can’t fix that by changing the clock speed. If you’re going to run close to threshold you have to either run at a fixed voltage with either a really low frequency or a wide range of frequencies, or you have to pick a frequency and drop the voltage on each chip until it just meets that frequency. By definition, you’re so close to the threshold that minor differences in millivolts matter a lot and your performance goes all over the map. The worst manifestation of that seems to be on hold times, because they’re short paths, they have a lot of statistical variability, and all of the nastiness in near-threshold compute comes and catches you there.
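
To make the point about hold margin concrete, here is a minimal sketch using the standard textbook slack expressions for a single register-to-register path. The function names and all picosecond values are hypothetical, chosen only for illustration. The clock period appears in the set-up check but not in the hold check, which is why slowing the clock cannot repair a hold violation.

```python
def setup_slack_ps(period_ps, max_data_delay_ps, setup_ps, skew_ps, jitter_ps):
    """Set-up slack: the data must arrive before the next capturing edge.

    The clock period is part of the budget, so a slower clock adds slack.
    """
    return period_ps - (max_data_delay_ps + setup_ps + skew_ps + jitter_ps)


def hold_slack_ps(min_data_delay_ps, hold_ps, skew_ps):
    """Hold slack: the data must not change too soon after the same edge.

    There is no period term here, so clock speed has no effect on it.
    """
    return min_data_delay_ps - (hold_ps + skew_ps)


if __name__ == "__main__":
    # Hypothetical path numbers, in picoseconds.
    for period in (1000, 2000):
        s = setup_slack_ps(period, max_data_delay_ps=900, setup_ps=50,
                           skew_ps=40, jitter_ps=30)
        h = hold_slack_ps(min_data_delay_ps=30, hold_ps=25, skew_ps=40)
        print(f"period={period} ps  setup slack={s} ps  hold slack={h} ps")
```

In this toy case, doubling the period turns a failing set-up check into a passing one, while the hold slack stays negative. A hold fix has to come from the path itself, for example added delay on the short path or reduced skew.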

Murphy: Or you have very clever voltage scaling where you can adjust this up and down depending on some monitor.

Aitken: There are tricks like that you can play, as well.
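
The per-chip approach described above (pick a frequency, then lower the voltage on each chip until it just meets it) can be sketched as a simple downward search. This is an illustration under stated assumptions, not a description of any vendor’s flow: `fmax_mhz` is a made-up alpha-power-law-style model standing in for an on-chip monitor or tester measurement, and the constants `k`, `alpha`, the threshold voltages and the step size are all hypothetical.

```python
def fmax_mhz(v_mv, vth_mv, k=2.0, alpha=1.5):
    """Toy model of maximum frequency versus supply voltage (not silicon data)."""
    if v_mv <= vth_mv:
        return 0.0
    v = v_mv / 1000.0
    vth = vth_mv / 1000.0
    return k * 1000.0 * (v - vth) ** alpha / v


def lowest_passing_voltage(target_mhz, vth_mv, v_start_mv=900,
                           v_floor_mv=300, step_mv=10):
    """Step the supply down until one more step would miss the target.

    Returns the lowest passing voltage in millivolts, or None if the chip
    cannot reach the target even at the starting voltage.
    """
    v = v_start_mv
    if fmax_mhz(v, vth_mv) < target_mhz:
        return None
    while (v - step_mv >= v_floor_mv
           and fmax_mhz(v - step_mv, vth_mv) >= target_mhz):
        v -= step_mv
    return v


if __name__ == "__main__":
    # Two hypothetical chips that differ only in threshold voltage.
    for vth in (300, 350):
        print(f"Vth={vth} mV -> lowest passing supply:",
              lowest_passing_voltage(100, vth), "mV")
```

Even in this toy model, a small shift in threshold voltage moves the lowest usable supply by a comparable amount, which illustrates why millivolt differences matter so much near threshold.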

SE: How much does that impact designs in terms of adding unknowns and new use cases?

Aitken: When you’re designing something you have an idea of how it’s going to be applied and you put together all the verification you can think of, and then 15 years later someone uses it for something you never dreamed they would.

Castagnetti: At the end it boils down to whether the IP specification was written in a way that allows a different use case. If it does, and if everyone agrees there is no problem running it a new way, then it’s not a problem. The more fundamental question with IP is how much of what we know is gray rather than black and white. There are certain things we know not to use. Where we get hung up more often than not is that space in between where we interpret something one way and the vendor says, ‘No, this is what we meant.’ Then a discussion begins back and forth. The effort needs to be put in to make sure that we clearly understand everything.

Aitken: When we put out the 130nm generation of memory there were specs on how much redundancy you needed, which was not very much. Ten years later, someone built a chip with 10 times more memory than anyone ever considered. That was unthinkable at the time, so no one ever considered putting it in the manual. All of these interaction problems start showing up because there are more use cases than you can think of. Even with something as simple as standard cells, you can put a million of those together and that’s not too bad. But if you put every triplet together, now you have a billion, and that’s a much harder problem.
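
The jump from a million to a billion is ordinary combinatorial growth. As a rough illustration only, and assuming a hypothetical library of 1,000 distinct cells (a figure not given in the discussion), the number of ordered neighbor arrangements grows as a power of the library size:

```python
def ordered_arrangements(library_size, group_size):
    """Number of ordered groups of cells, allowing the same cell to repeat."""
    return library_size ** group_size


if __name__ == "__main__":
    cells = 1000  # hypothetical library size, for illustration only
    for k, name in ((1, "single cells"), (2, "pairs"), (3, "triplets")):
        print(f"{name}: {ordered_arrangements(cells, k):,}")
```

Under that assumption, pairs come to a million and triplets to a billion, so each extra neighbor considered multiplies the verification space by the size of the library.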

SE: Do subsystems solve these kinds of problems or increase them?

Castagnetti: A subsystem is also a way of describing how IP is being utilized and really understanding the environment. Sometimes it’s hard to predict all the different environments.

Lefferts: One of the things we try to avoid is, ‘That should work.’ That’s a scary thought. We do everything we can to make sure everything does work. More risk leads to problems, and that leads to a significant amount of time and energy spent on fixing them. The other issue is programmability. You can allow the user to change the use model to the point where it isn’t a good idea. There is a risk if the programmability is in the use model. But programmability in general has, in more cases than not, software workarounds that make it easier. There are two sides to that coin. You can do something stupid, but you also can do a software workaround rather than a GDS re-spin. If it’s done well, it’s a huge advantage. In regard to subsystems, the purpose of a subsystem is to reduce the number of ways you can do something. In the end, that’s less risk.

Murphy: One of the interesting points about the gray spaces is they’re not always at the end. If you’re extrapolating off the curve, you’re more likely to end up somewhere you didn’t expect to be. But it also happens in the middle sometimes. You can be in the bounds of a use model, but in a configuration that hasn’t been tested. That can happen because you have an architecture switch as, say, a bus size increases. As for subsystems, they do reduce problems in some ways. But a subsystem is like an SoC. You have the same verification problems as you do with an SoC. How much can you test? The great thing about an SoC is that hopefully it will be only used in one way. But if you say it’s an IP that can be used in the way of any other IP, you may have magnified your verification problem to the point where it is out of control.

Capodieci: One of the characteristics of being at the very end of all of this is that once things come together, there is no time to fix anything. We do fix fill, but we would rather not have to do that. We also fix decomposition. Right now we do not require the IP to be decomposed a priori. Based on our rules, our algorithm should always be magical enough to decompose things. We can always figure out how to fix the fill, but not everything can be decomposed. It’s a massive roadblock, because then you have to go back and rearrange things. It’s a disaster. Testing very early for the final product is critical. We need to create scenarios that provide early prototyping of the final die, even with known functional parts, and then keep testing and testing and testing. That’s the only solution. Otherwise we run the risk, and the associated cost, of doing GDS re-spins.
