Experts At The Table: The Reliability Factor

Second of three parts: the effects of new process nodes, end-of-life considerations, and special circumstances for breaking the rules.


Low-Power Engineering sat down to discuss reliability with Ken O’Neill, director of high-reliability product marketing at Actel; Brani Buric, executive vice president at Virage Logic; Bob Smith, vice president of marketing at Magma; and John Sanguinetti, chief technology officer at Forte Design Systems. What follows are excerpts of that conversation.

LPE: As we push to the next process nodes, do all of the tricks of power islands and multiple voltages become more common in designs?
Buric: No, because people will look at the cost of implementation. We have customers looking at simulating and figuring out what are the power savings of switching something off and turning it back on. That costs power. They are doing fairly complicated analyses and staying away from these techniques if they don’t have to use them.

LPE: If you implement all of these techniques, though, is a device more reliable or less reliable?
Buric: In my opinion, it becomes more difficult to make it reliable. But it all comes back to your capabilities. If you know how to do it and you’ve done it before, it’s more difficult but you can still do it.
Smith: I don’t think it’s de facto less reliable. But it makes it a heck of a lot more difficult to maintain reliability. That requires a lot of work. The other side of this is that low power used to be just about battery-powered devices. Now it’s part of the overall Green movement. Everybody is searching for ways to cut power. About 1.5% of all the power generated goes into servers, server farms, and the cooling associated with them. We have a lot of wireless customers. Power is a huge deal. Reliability is a huge deal, too. If you have a phone that dies all the time, the manufacturer is going to go out of business. But we’re starting to see more of a focus on low power for things people plug into the grid, and the government is starting to mandate that. Reducing power without giving up reliability is very hard.
Sanguinetti: That’s true even in this industry. Quite a number of companies have verification server farms. You run your regression tests with 10,000 processors. That has a power bill of about $1 million a year. That’s a lot of electricity.
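Sanguinetti’s figure is easy to sanity-check. A back-of-the-envelope sketch, assuming roughly 100 W of average draw per processor and an electricity rate of $0.10/kWh (both illustrative assumptions, not figures from the conversation):

```python
# Rough check of the regression-farm power bill Sanguinetti cites.
# Per-processor wattage and electricity rate are assumed, illustrative values.
PROCESSORS = 10_000
WATTS_EACH = 100            # assumed average draw per processor (W)
RATE_PER_KWH = 0.10         # assumed electricity cost ($/kWh)
HOURS_PER_YEAR = 24 * 365

total_kw = PROCESSORS * WATTS_EACH / 1000            # continuous load in kW
annual_cost = total_kw * HOURS_PER_YEAR * RATE_PER_KWH

print(f"{total_kw:.0f} kW continuous -> ${annual_cost:,.0f}/year")
```

That lands near $900K per year before counting cooling overhead, so the “about $1 million a year” figure is entirely plausible.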
Buric: We have 1,000-plus processors just to do fairly straightforward tests. One of the things driving what’s going on at the lower technology nodes is that memory and logic IP design and EDA tools help solve reliability problems that may be caused by calculation errors from migration overloading certain parts of the design. It’s much more critical at the current nodes than before. You have to properly characterize IP and use synthesis, high-level synthesis and place-and-route tools to avoid potential reliability problems under operating conditions.
O’Neill: We’re all coming at this from a variety of different directions, but in our world we see an intersection between low power and reliability. This is at the system level for high-reliability and space systems. The motivation for achieving low power in the consumer space is battery life and the greater good of the planet. In military and aerospace, battery life can be important in handheld radio systems, but there’s much more interest in reliability. Power becomes a reliability issue for a different reason. There are a lot of very high-performance systems where they can’t have forced cooling. The cooling fan itself is a factor in reliability. Fans can fail, and they can increase the risk of foreign debris, which can cause short circuits and add other reliability risks.

LPE: So it’s imperative to lower the power in the parts?
O’Neill: Yes, because as you increase the performance of these systems, more power is being generated. That means more heat, and heat equals reliability risk. There are very few failure mechanisms that decelerate with temperature. Most of them accelerate with temperature, and some of them accelerate exponentially. So minimizing the heat dissipation inside enclosures that have very complex systems running inside them becomes a primary issue. People often come to us seeking low-power FPGAs because there’s some other power-hogging device inside the enclosure generating a lot of heat. They need to minimize that. They can’t afford another heat-generating device.

LPE: Do devices that use lower power last longer?
O’Neill: Given the same process node, if you run at lower power you’re probably going to have better reliability because your junction temperature is lower. That’s going to result in prolonged life.
Buric: If you look at what people are doing with end of life now, they’re saying lower temperature will extend the life. It’s very clear that if you go to overheated conditions, where a lot of parts become unpredictably fast, then you have a very high chance of a device completely malfunctioning.
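The exponential acceleration with temperature that O’Neill describes is commonly modeled with the Arrhenius equation. A minimal sketch, using an assumed activation energy of 0.7 eV (illustrative only; real values depend on the failure mechanism):

```python
import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_af(t_use_c: float, t_stress_c: float, ea_ev: float = 0.7) -> float:
    """Arrhenius acceleration factor between two junction temperatures (Celsius).

    AF = exp[(Ea/k) * (1/T_use - 1/T_stress)], with temperatures in kelvin.
    Ea = 0.7 eV is an assumed, illustrative activation energy.
    """
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))


# Under this model, dropping junction temperature from 105 degC to 85 degC
# slows the failure mechanism by a factor of roughly three.
print(f"{arrhenius_af(85.0, 105.0):.1f}x")
```

The exponential form is why a modest reduction in junction temperature can translate into a several-fold extension of expected life, which is the point both O’Neill and Buric are making.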

LPE: Where does this get designed in?
Sanguinetti: We don’t have any visibility into this.
Smith: Neither do we. It’s something our customers have to deal with. Our job is to get them from concept to manufacturing. Their job is to figure out the application and the expected lifespan. In some applications, a year is fine. If it’s going into the engine controller of an automobile, a year is not fine. That would be more like 15 or 20 years at a minimum.
Buric: End of life is primarily a process function, and for that reason it is analyzed and characterized at the device level. A lot of people simulate end of life, and those models are typically provided by the process side. These are similar to any other SPICE model.
O’Neill: When we design a chip, we design with a package that comes from the foundry. That package will include things like the design rules, which are decided by tradeoffs between reliability, power, functionality and sheer utilization. You want to cram as much logic into as small a space as possible, but you’d better not exceed the electromigration rules or whatever other rules are in there for reliability.

LPE: In the synthesis world, what’s the biggest reliability issue?
Sanguinetti: Logic errors. It’s getting the design right. The issues that our customers have, aside from the spec being wrong, are interfaces between blocks or sections of the design that are done with high-level synthesis and those sections that are done with legacy RTL or manually. That’s where the opportunity for errors is the greatest. When you’re within the confines of high-level synthesis, the opportunity for error gets reduced and the verification problem is reduced. But at those interfaces there’s a lot of potential for miscommunication.

LPE: As the industry becomes more disaggregated, does it become more difficult to pinpoint the source of reliability problems?
Buric: No. Problems are well defined and everyone has to take responsibility. If you go back to our discussion about radiation and alpha particles, you design to eliminate that problem. That design can be in the memory space or error correction. You know what the problem is and you solve the problem. If the problem is end of life and the technology provider gives you all the guidelines, then you design with those in mind. If you own the design, you own the problem. It’s in designers’ hands and it’s a well-defined boundary.
Smith: It’s a gray area. In the ideal world, we get a set of rules from the foundry. There are thousands of them, and if you do this and this then they’ll stand behind the process in the design. For a big part of the population, that’s the way it works. But the people with the deeper pockets will go back to the foundry and say, ‘Let’s talk about where these margins really are because we need to do something special for this product line.’ They’ll get the data, analyze it, and they may design outside of those guidelines. The rules are the rules except when you break them to get an advantage, and then you’d better have the time and the money to go analyze everything to make sure you don’t end up with a product that fails or doesn’t yield.
Buric: You actually don’t break rules. You define a new set of rules that are mutually agreed upon. If you set those rules unilaterally, you’re shooting yourself in the foot. Memory cells have violated every rule, but they’re so predictable in manufacturing that they can violate the rules. All of those are mutually agreed upon, though.