Experts At The Table: The Growing Signoff Headache

Second of three parts: Complex solutions require ecosystem support; statistical analysis and solutions; design vs. methodology; margin effects; tradeoffs between early feedback and tool performance.


By Ed Sperling
Low-Power/High-Performance Engineering sat down to discuss signoff issues with Rob Aitken, an ARM fellow; Sumbal Rafiq, director of engineering at Applied Micro; Ruben Molina, product marketing director for timing signoff at Cadence; Carey Robertson, director of product marketing for Calibre extraction at Mentor Graphics; and Robert Hoogenstryd, director of marketing for design analysis and signoff tools at Synopsys. What follows are excerpts of that conversation.

LPHP: How do we, as an industry, improve the adoption of technologies and methodologies to improve signoff confidence?
Hoogenstryd: Synopsys is trying to work with the ecosystem to adopt POCV (parametric on-chip variation). We’re working with ARM and other IP providers to get them to buy in and deliver the collateral, with the foundries so they can deliver the models to the library providers to create that collateral, and with the end customers so they believe in the methodology. That was one of the problems with statistical STA. As an industry—not just EDA, but the entire industry—we did an abysmal job of putting together an ecosystem. We didn’t have a standard format or methodology, we didn’t have total buy-in from the foundries, and the library providers said to cut in half the amount of data they had to collect. This is the big challenge in bringing out new technologies that were invented to solve these problems: it requires a tremendous amount of effort from everyone. Otherwise, people look for a practical fix such as, ‘How much more margin can I add to make this problem go away?’ And then customers get their chips back and complain they’re running 100MHz or 200MHz faster than they intended.
Aitken: And then the paths that start failing are not the paths that the tools flagged.
Rafiq: There are multiple things that can go wrong as you progress with the design. One, of course, is the libraries, which have built-in pessimism. Then there is margin. Along the way you’re adding uncertainties into the design, and by the time you reach signoff all of those uncertainties have built up.
Molina: If you tell engineers there’s a 99% chance their design is going to tape out correctly, they don’t like that 1% failure rate.

LPHP: If you’re making 100 million chips, that’s a huge failure rate, right?
Aitken: There’s a big difference between the CAD organization, which has the recipe, and the end user. There’s a disconnect there. The CAD guys understand the statistics. The end user looks at the whole thing.
Rafiq: The guys defining the methodology are not the guys doing the design. The guys doing the design are the ones feeling the pain.
Aitken: Having been on both sides of that, I can tell you the methodology guys say they built this pristine, beautiful methodology and the designers ruined it. And the designers say they’re busy making chips and the methodology guys have no idea how to do that.
Rafiq: At the implementation level, they definitely feel the pain.
Molina: The missing ingredient with statistical is the optimization piece. You can do all the statistical analysis you want, but how do you fix what you find? We focus a lot on statistical analysis capabilities, but to have a complete solution you need the statistical piece within the implementation tools, as well. Without that, customers will find violations and have to fix everything manually.
Aitken: If we took all of the statistics we have to a university statistics professor and showed him what we do, he would throw up his hands and storm out of the room. We claim three sigma, but we don’t actually measure that. What we measure is the sigma of something different.
Hoogenstryd: What we noticed when we were doing some of the initial work on statistical STA was that some of the effects people wanted to model statistically did not follow nice bell curves. For example, you will have different yield parameters based upon where the chip sits on the wafer because of lithography effects. That doesn’t follow any bell curve. It follows a consistent distribution, but it’s not a bell curve. So how do you model that as a sigma?
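[The problem Hoogenstryd describes can be illustrated with a quick simulation. A non-Gaussian effect—here a hypothetical parameter spread uniformly across the wafer, chosen purely for illustration and not a foundry model—still has a well-defined sigma, but quoting "3 sigma" implies a Gaussian tail probability that the real distribution doesn’t have:]

```python
import random
import statistics

random.seed(0)

# Hypothetical across-wafer effect: the parameter varies with die
# position, spread uniformly rather than as a Gaussian bell curve.
samples = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

mu = statistics.mean(samples)
sigma = statistics.stdev(samples)

# Fraction of dies actually falling outside mu +/- 3*sigma:
outside = sum(1 for x in samples if abs(x - mu) > 3 * sigma) / len(samples)

# For a Gaussian, about 0.27% of samples lie outside 3 sigma.
# For this uniform spread, 3*sigma (~1.73) exceeds the entire range
# of the data (+/-1.0), so nothing lies outside at all: the sigma is
# real, but it implies the wrong tail probability.
print(f"sigma = {sigma:.3f}, fraction outside 3 sigma = {outside:.5f}")
```

[The point is not the particular distribution but the mismatch: a single sigma number only translates into a yield prediction if you also know the shape of the distribution it came from.]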
Aitken: You wind up with a lot of interesting things. Even engineers can get past their fear of the 99% if you say they can take 98% yield to get a particular benefit. They will often agree to that. But if you say it yields 98% of the time and 0% the rest of the time, that’s a very different question. Statistically, though, it looks much the same.
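[Aitken’s distinction can be made concrete with a sketch (the lot counts and probabilities are illustrative assumptions, not figures from the discussion): two yield models with essentially identical mean yield, one uniform across lots and one bimodal, where a small fraction of lots is a total loss.]

```python
import random

random.seed(1)
N_LOTS = 10_000

# Model A: every lot yields 98% of its dies.
yields_a = [0.98 for _ in range(N_LOTS)]

# Model B: 98% of lots yield 100%, but 2% of lots yield nothing.
yields_b = [1.0 if random.random() < 0.98 else 0.0 for _ in range(N_LOTS)]

mean_a = sum(yields_a) / N_LOTS
mean_b = sum(yields_b) / N_LOTS

# The mean yields are (nearly) identical...
print(f"mean yield A = {mean_a:.3f}, mean yield B = {mean_b:.3f}")

# ...but the risk profiles are not: model B occasionally loses an
# entire lot, which a single average-yield number hides completely.
dead_lots_b = sum(1 for y in yields_b if y == 0.0)
print(f"lots with zero yield: A = 0, B = {dead_lots_b}")
```

[Both models report roughly 98% yield, which is why, as Aitken says, they "look much the same" statistically—yet one of them occasionally ships nothing.]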
Robertson: Whether it’s statistical or more corners, it means more enablement for some groups, but the designers typically don’t care. What they care about is what’s important for them. If it’s statistical, they want to know how to fix it. They don’t want more corners, but they do want to know what variation is possible and how important that is. If they have to re-route, that’s okay.

LPHP: The risk is that you end up fixing too much and adding in pessimism. So what’s the solution? Do you fix it on the tool side, the manufacturing side, or somewhere in between?
Robertson: You either fix too much, or you bring up all the violations and the list is longer than you have time for, so you click that window away.
Rafiq: Signoff involves a series of engineering compromises. If you add more margin, you burn extra power. Can you afford that extra power? It depends on the application. For an application where power is not that important, you add the margin. For power-sensitive devices, which is an increasing number of them these days, you need only enough margin that it doesn’t impact your power. Extra margin means extra area, as well. If you have more time, you can make the design more compact; if you don’t, you’ll end up with extra area. At the end of the day you have to make a judgment call. That’s what signoff really is—a judgment call.
Molina: Statistical will help, but it won’t be mainstream for at least another year. But the other thing that contributes to overdesign is mis-correlation between implementation and signoff. Some customers are not doing much fixing in place and route, and they’re pushing on to signoff because they only want to fix those violations that are deemed violations by the signoff tools. They don’t want to over-fix because it causes congestion, it requires additional cells to be inserted, and there are all sorts of problems that result. The problem is that once you get into the signoff tools, you have to have something that does optimization. That’s an area where you can address pessimism reduction.
Hoogenstryd: There’s a balance between fixing things earlier and later in the process. In the past year there’s been a push for people to do more ECO timing fixing in the STA tool. But if they want the STA tool to fix the problems, it’s going to be heavyweight and slow, and it will duplicate effort inside engineering organizations. One thing we’ve been looking at is how to get feedback earlier in the design process, within the place-and-route system. One example is rail analysis. After you do your initial placement, how can you get feedback on potential problems? It’s easier to fix them there than at signoff, where you may have to add more vias but no longer have any room. It’s a delicate balance: get information to the designers as early as possible, but in a way that’s also efficient. You don’t want to throw all your signoff stuff into the place-and-route tool, and you don’t want to wait that long for an answer.
Aitken: Some of that is architecture and engineering, too. For example, you want to make sure that you design a robust power network to start with.
Rafiq: When you build a clock tree, you do static and dynamic analysis. By doing a good enough job up front, you have fewer static and dynamic power issues to deal with. But after you build the clock tree it matters more: the cell positions are different, and your dynamic drop may not be as large as predicted. Right after you build the clock tree, even before you go to routing, you need to do the analysis. You may have to shift cells around, or if they’re clustered together, spread them out. That’s particularly important at 28nm and 16nm. At 40nm we were focused on static power and leakage. At 28nm and 16nm, the dynamic portion is more dominant.