First of three parts: Density and radiation; the effect of multiple power islands and complexity; more rules.
By Ed Sperling
Low-Power Engineering sat down to discuss reliability with Ken O’Neill, director of high reliability product marketing at Actel; Brani Buric, executive vice president at Virage Logic; Bob Smith, vice president of marketing at Magma; and John Sanguinetti, chief technology officer at Forte Design Systems. What follows are excerpts of that conversation.
LPE: Do chips become less reliable as we start adding power islands and multiple voltages into the mix, along with smaller line widths and multiple cores?
O’Neill: As we go down the process curve, one of the effects we have observed is that radiation effects are no longer the domain of designers just working on aerospace systems. We’ve seen radiation effects start to become dominant from a reliability standpoint in commercial ground-level systems. The source is background neutrons in the atmosphere, which can cause single-bit upsets in memory cells, and the consequences can be severe. For most applications, you don’t care about a single bit. In consumer electronics, it may be a pixel changing color. But if you have system-critical data in a server or a router, it could mean enterprise-class equipment having field failures.
Buric: All of those effects started showing up at 90nm, when people started doing a serious analysis of the radiation space. It’s not getting better. It’s getting worse with each node. There are two aspects of dealing with this. One is how the system is designed. The other is how system designers can further ensure that nothing has gone wrong.
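One common way designers guard memory against the single-bit upsets O’Neill describes is an error-correcting code plus scrubbing. As a rough illustration only (not a technique named by the panelists, and far simpler than the wide SECDED codes used in real SRAM macros), a minimal Hamming(7,4) sketch in Python shows how one flipped bit can be located and corrected:

```python
# Minimal Hamming(7,4) sketch: 4 data bits protected by 3 parity bits, so a
# single flipped bit (e.g. from a neutron strike) can be located and corrected.
# Illustrative only -- production ECC uses wider codes plus periodic scrubbing.

def encode(d):                      # d = [d1, d2, d3, d4], each 0 or 1
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4               # parity over codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4               # parity over codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4               # parity over codeword positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def correct(c):                     # c = 7-bit codeword, possibly with one bit flipped
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # recheck each parity group
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3      # syndrome = 1-based position of the bad bit, 0 if clean
    if pos:
        c[pos - 1] ^= 1             # flip it back
    return [c[2], c[4], c[5], c[6]] # recovered data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                      # simulate a single-event upset in one cell
assert correct(stored) == word      # data survives the upset
```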
LPE: Does it show up on the design side?
Sanguinetti: We don’t see it directly, but our customers are putting tighter requirements on the style of RTL that we produce. Most of those are locally developed rules, created to ensure a smooth transition through the back-end flow. Now they’re starting to add rules aimed at producing better results. That’s about our only exposure to the physical side. The other piece of reliability is whether you’ve got the logic right, and that’s just verification.
Smith: On the implementation side, reliability is more around design rules. Presumably some of these effects are covered by those rules. What’s most important in our world are the low-power effects—multiple voltage domains, turning blocks on and off, dynamic voltage frequency scaling. The question is how you model that design so it will be reliable under a bunch of different operating conditions. It’s now multi-load, multi-corner and multicore. We’ve seen designs that have up to 100 different models. That affects reliability.
LPE: With all of these multiples, do chips become less reliable as you keep dropping the voltage?
Buric: You’re talking about a badly implemented design. We have customers today that, because of technology scaling, have thousands of memory instances on the same design. This is a nightmare by itself. There is also a lot of process variation. You need to make sure that the same device in place ‘A’ has the same performance in place ‘B.’ You have to consider all of that. If you do it wrong, you don’t get proper yield. But you also might have a chip that escapes detection in test but has problems in the field. We have customers making hearing aids. They are trying to drop the voltage as low as they can get, and they are really hitting the limit at low voltages both for how to maintain content in memory and how to operate. Then the design misbehaves because they are pushing the lower corners of performance. On advanced nodes you also have a phenomenon where low voltage, combined with other low-voltage conditions, behaves much worse than a typical array. That creates additional reliability problems.
LPE: Finding all the corners and verifying all the various pieces is getting harder. What’s the impact of that on reliability?
Smith: The amount of time that’s being spent on verification, particularly on small-geometry and very complex chips, is rising. Implementation and place and route is still a huge job, but verification is taking more time. Most of the verification flows are 15 years old. On-chip variability is a huge effect. At 180nm we didn’t have to worry about that very much. At 40nm and 28nm, it’s a big concern. Verification takes a lot more time and more compute power. It now takes a CPU farm. The industry is ripe for big change there. In terms of reliability, it depends on whether you’re going to do a marginal design or a really good design. To do a good design you have to do a lot of verification across a lot of operating conditions. And if you’re doing a low-power design and using all these techniques, there are likely to be a lot of operating environments. Then you have to consider all the corners in the process. When you multiply those out you have hundreds of different scenarios.
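Smith’s “hundreds of different scenarios” is simple multiplication of corners and modes. As a back-of-the-envelope sketch (the specific corner and mode lists below are invented for illustration, not drawn from the discussion):

```python
# Why signoff scenario counts explode: corners multiply against operating modes.
from itertools import product

process = ["ss", "tt", "ff", "sf", "fs"]           # process corners
voltage = ["0.81V", "0.90V", "0.99V"]              # supply corners per domain
temp    = ["-40C", "25C", "125C"]                  # temperature corners
modes   = ["active", "sleep", "dvfs-low", "dvfs-high", "test"]  # operating modes

scenarios = list(product(process, voltage, temp, modes))
print(len(scenarios))   # 5 * 3 * 3 * 5 = 225 analysis scenarios
```

Each of those combinations is, in principle, a separate timing/power analysis run, which is why the compute demand grows so quickly.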
Sanguinetti: These issues are less important for some products than others. But the line between what we think of as consumer products and other products is blurry. Is an industrial printer a consumer product? If something goes wrong there, you may get degraded quality. But what if it’s a prototype of a device that prints 3D models? A failure there, or an intermittent failure, can result in something a good deal worse. Is a hearing aid a consumer device?
Smith: Or a pacemaker?
Sanguinetti: Reliability also is a bigger issue if you have to reboot something. Do you really want to reboot your television or your hearing aid?
Buric: But when you talk about power islands and multiple voltages, those techniques are much less used than you might expect. It’s extremely difficult to use them in a reliable fashion. We see people staying away from using them, especially in consumer devices. It’s extremely expensive. If there were an easy way to verify them, that would eliminate some of the problems we face now. But right now, there’s much more noise than reality.
LPE: Is it the same in the verification world?
O’Neill: There are two aspects to that. There’s the verification we as a company do. And then there’s the verification our customers do on their designs when they’re getting ready to program the design into a part. From the customer perspective, the changes we’ve seen are scaling linearly with the fact that they are doing bigger and faster designs. But it’s not an exponential increase in the amount of verification and validation of those designs. A lot of the deep submicron effects are somewhat shielded from the customers by the fact that there’s a level of silicon design and we just give them a tool to place logic into this part. The onus is on us to design our FPGA with as much competence as possible to mitigate the deep submicron effects. From the point of view of the design teams at Actel, they are spending more time using more complex tools to do verification and validation of their designs. In addition to implementing programmable gates and routing structures we’re adding hard IP into our parts. That includes multiply/accumulate blocks, increasingly sophisticated I/Os, different flavors of SRAM cells and various forms of non-volatile memory. All of those things represent different types of what is effectively ASIC design, and it’s now being done by the Actel design team. We, on the other hand, have experienced an exponential increase in validation and verification.