Chips are becoming more reliable, but not necessarily because things don’t break or because there is more redundant circuitry. It’s getting much more complicated than that.
The definition of reliability hasn’t budged since the invention of the IC, but how to achieve it is starting to change.
In safety-critical systems, as well as in markets such as aerospace, demands for reliability are so rigorous that they often require redundant circuitry—and for good reason. A PanAmSat malfunction in 1998 caused by tin whisker growth wiped out pagers for 45 million users. In automobiles, recalls routinely cost hundreds of millions if not billions of dollars. And in consumer electronics, returns of malfunctioning devices can ruin a manufacturer’s reputation.
But not every device fails completely. In fact, reliability is becoming a relative term, because not every failure is total and some functionality can be restored, at least to a point. If e-mail stops working altogether, the phone is effectively a total loss to its owner. But if it takes a half-second longer to download a message, the user may never notice.
SoC makers have been building self-repair circuitry into most devices for years for just this purpose. When something breaks, the device automatically reroutes signals to redundant circuitry, an approach that usually falls under the category of margining. The problem is that too much margin hurts power and performance at advanced nodes.
A less well-known approach, and one that is just beginning to be used, is to not fix everything. Instead, identifying the problem and understanding where it is may be enough. In a design with 2 billion transistors, the assumption is that not all of the transistors will work, and not all of them will work for the life of a device. Even in an eight-core processor, not all of the cores may function equally well. But that may not be as noticeable to a user as to an engineer with a workbench full of sophisticated tools.
“This is the difference between reliability and resiliency,” said Ruggero Castagnetti, distinguished engineer at LSI. “There is work being done at the university level on this right now. What’s happening at the advanced nodes is that you’re starting to see more area to do additional things at almost no cost, and at those nodes there is more stuff you can do that you might not have needed to do in the past to make the system more robust.”
He said the key shift in thinking is that if something goes wrong, it doesn’t necessarily have to be fixed—as long as it doesn’t hurt the system.
This in no way implies that electronics are becoming less reliable. In fact, the opposite is true. They’re becoming far more reliable—usually painfully so for design teams.
“Reliability test coverage used to be 99.2%,” said Mike Gianfagna, vice president of marketing at eSilicon. “It’s now up to 99.8%. That’s much, much harder to achieve, but people are demanding it. It requires many more tests and more corner analysis.”
Along with that better coverage, though, the methodology for dealing with circuit failures is changing.
“There are a lot of replicated engines on a die,” said Taher Madraswala, chief operating officer at Open-Silicon. “If there is a 16 x 16 array of processing units and three of the nodes don’t work, the software doesn’t recognize those. But there are still enough of them to do the work.”
That’s the key. Designs are expensive, but at advanced nodes there is plenty of silicon left over to do interesting things. It’s not free, per se, but it is readily available. And in complex SoCs there are enough processing units that when one fails, others can pick up the slack. Taking advantage of those extra cores or processors requires that dynamic failover capabilities be programmed into the software, which is not trivial work. But it has already been proven in solid-state drives, among other places, where the wear-leveling function of the flash controller is a critical part of the design, said Ulrich Schoettmer, solution product manager at Advantest.
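In rough terms, the idea of working around dead engines in a replicated array looks something like the sketch below. The names and the 16 x 16 array are hypothetical, not any vendor's actual firmware: a bitmap of known-bad units is consulted at dispatch time, and work simply flows around them.

```c
/*
 * A minimal sketch of scheduling around failed engines in a replicated
 * array. All names are hypothetical; real SoCs handle this in drivers,
 * runtimes or the hardware job scheduler itself.
 */
#include <stdbool.h>
#include <stdio.h>

#define ARRAY_DIM 16
#define NUM_UNITS (ARRAY_DIM * ARRAY_DIM)

static bool unit_bad[NUM_UNITS];   /* marked by self-test or runtime monitors */

/* Dispatch jobs round-robin, silently skipping units marked bad.
 * Assumes at least one unit is still healthy. */
static void dispatch_jobs(int num_jobs)
{
    int unit = 0;
    for (int job = 0; job < num_jobs; job++) {
        while (unit_bad[unit % NUM_UNITS])   /* walk past failed engines */
            unit++;
        printf("job %d -> unit %d\n", job, unit % NUM_UNITS);
        unit++;
    }
}

int main(void)
{
    /* Three of the 256 engines failed test; the software never uses them. */
    unit_bad[7] = unit_bad[42] = unit_bad[200] = true;
    dispatch_jobs(8);
    return 0;
}
```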
“On the logic side it is very commonplace to grade microprocessors, GPUs, application processors and so forth via upfront fusing to deal with market ‘binning,’ as well as with functional defects, such as downgrading to smaller caches, or from a four-core unit to a two-core unit,” said Schoettmer. “Dynamic functional redundancy is not yet commonplace, but it may come sooner or later.”
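From software's side, that kind of upfront fusing amounts to reading a fuse register at boot and enabling only what survived test. The register layout below is hypothetical, not any real part's register map; it is only meant to show the shape of the idea.

```c
/* Hypothetical fuse-register layout used purely for illustration. */
#include <stdint.h>
#include <stdio.h>

#define FUSE_CORE_MASK   0x0Fu   /* bits 0-3: one bit per surviving core */
#define FUSE_CACHE_SHIFT 4
#define FUSE_CACHE_MASK  0x30u   /* bits 4-5: cache size code */

/* Count how many of the four core bits are set. */
static unsigned count_good_cores(uint32_t fuse_word)
{
    unsigned n = 0;
    for (int i = 0; i < 4; i++)
        n += (fuse_word >> i) & 1u;
    return n;
}

static void configure_from_fuses(uint32_t fuse_word)
{
    unsigned cores    = count_good_cores(fuse_word & FUSE_CORE_MASK);
    unsigned cache_kb = 256u << ((fuse_word & FUSE_CACHE_MASK) >> FUSE_CACHE_SHIFT);

    printf("enabling %u core(s), %u KB cache\n", cores, cache_kb);
    /* ...boot code would then power up only the fused-in cores and cache ways... */
}

int main(void)
{
    /* A die fused down to two good cores and the smallest cache bin. */
    configure_from_fuses(0x05u);   /* cores 0 and 2 good, cache code 0 */
    return 0;
}
```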
He noted that while processor makers have been dealing with these kinds of issues for years, they are now migrating out into the broader mobile market where margins are so tight that the budget for test and the demand for quality and reliability are falling out of sync.
He’s not alone in seeing a shift. Drew Wingard, CTO at Sonics, views the growth in transistors as essential to maintaining reliability. “The underlying silicon is less reliable, which means you need more transistors just to achieve the same reliability. You may have no choice but to use extra transistors simply to detect problems. And errors are okay in some situations as long as you can find them. But what’s changed is that you’re not necessarily adding extra circuits to fix the problem. These are not parallel gates.”
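Detection without correction, as Wingard describes it, can be as simple as a parity bit that flags a corrupted word and leaves the response to higher-level software. The sketch below is purely illustrative; in practice this lives in hardware parity or CRC logic rather than in C.

```c
/* A detection-only sketch: one parity bit per word flags corruption
 * without attempting to repair it. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* XOR-reduce all 32 bits down to a single parity bit. */
static bool parity_odd(uint32_t w)
{
    w ^= w >> 16; w ^= w >> 8; w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;
    return w & 1u;
}

/* True if the stored parity no longer matches the data.
 * The error is reported, not fixed. */
static bool word_corrupted(uint32_t data, bool stored_parity)
{
    return parity_odd(data) != stored_parity;
}

int main(void)
{
    uint32_t data = 0xDEADBEEF;
    bool p = parity_odd(data);

    data ^= (1u << 9);             /* simulate a single bit flip */
    if (word_corrupted(data, p))
        printf("error detected in word: logged, not corrected\n");
    return 0;
}
```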
Reliability over time
While hardware engineers think in terms of functionality, reliability is actually a measurement of functionality over time—and generally after a design is in use out in the real world. It’s up to the design teams to build a robust design, but just how robust it will be depends on a variety of factors ranging from individual use models and length of service to where devices are used. Tin whiskers have been a problem for mil/aero sectors, particularly in outer space, but zinc whiskers have caused damage inside climate-controlled data centers.
“One side effect of system integration on a chip is that you have to pay more attention to potential failures,” said Pranav Ashar, CTO at Real Intent. “There is a positive side to this. Integration of everything onto a single substrate means that electrical issues are better understood. That’s why we’ve been able to develop static solutions to problems. The flip side is that the cost of failure is also greater. But if you deadlock, typically you can work around the problem in software that controls the timing of state machines. And you don’t have to build that into the process.”
At least part of what’s driving this change is a proliferation of processors and memory around an SoC.
“Not every feature may be working,” said Open-Silicon’s Madraswala. “But in the architecture there are enough features loaded in to offset that. If you still have a bus that works, you can reroute signals. There is a lot of software being written to make use of multiprocessing, and there are a lot of replicated engines on the die. So the definition of how to achieve reliability changes.”
Shifts in older approaches
The old way of guaranteeing reliability was pioneered first by the military, then by aerospace, and more recently by the automotive industry. Rather than rely on a single circuit, designs use triple modular redundancy: every critical module has two duplicates, and the three outputs are voted so that a fault in any one copy is masked. In the case of mil/aero, the initial concern was radiation damage that could flip a memory bit from a one to a zero, or vice versa.
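The voting at the heart of triple modular redundancy is simple enough to show in a few lines. The sketch below is generic, not any particular mil/aero implementation: three copies compute the same result and a bitwise majority vote masks an upset in any single copy.

```c
/* Triple modular redundancy in miniature: three copies of a computation
 * plus a bitwise 2-of-3 majority vote. */
#include <stdint.h>
#include <stdio.h>

/* A bit is set in the result if it is set in at least two of the inputs. */
static uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Stand-in for whatever the replicated module actually computes. */
static uint32_t compute(uint32_t x) { return x * 2654435761u; }

int main(void)
{
    uint32_t a = compute(42);
    uint32_t b = compute(42);
    uint32_t c = compute(42) ^ (1u << 13);   /* one copy takes a radiation hit */

    printf("voted result 0x%08X matches the clean copies: %s\n",
           vote3(a, b, c), vote3(a, b, c) == a ? "yes" : "no");
    return 0;
}
```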
“With device geometries shrinking, there is an even bigger risk of radiation damage,” said Angela Sutton, Synopsys staff product marketing manager for FPGA implementation. “Not everything is radiation hardened, so it’s more susceptible to damage. Companies have built monitoring into the design for error detection and correction, and they build redundancy into the circuitry. But they also pick and choose where they want to apply these techniques because it’s expensive.”
Sutton said it’s not just about error correction anymore, either. As devices become more complex, with multi-mode operation, various states can be disrupted and need to be reset.
“The monitoring could be done in software,” she said. “But the key is to figure out why something malfunctions.”
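One way to picture that kind of monitoring is a lightweight software watchdog that notices when a block's state machine stops making progress, records the state it was stuck in (the "why"), and resets just that block rather than the whole device. The names below are hypothetical, used only to sketch the pattern.

```c
/* A hypothetical software monitor: if a block's heartbeat counter stops
 * advancing, log the state it was stuck in and reset only that block. */
#include <stdint.h>
#include <stdio.h>

struct block_monitor {
    const char *name;
    uint32_t    last_heartbeat;
    uint32_t    heartbeat;   /* incremented by the block as it makes progress */
    uint32_t    state;       /* current state-machine state, kept for diagnosis */
};

static void reset_block(struct block_monitor *b)
{
    printf("resetting %s (stuck in state %u)\n", b->name, b->state);
    b->state = 0;            /* in real hardware: pulse that block's reset line */
}

static void poll_monitor(struct block_monitor *b)
{
    if (b->heartbeat == b->last_heartbeat)   /* no progress since last poll */
        reset_block(b);
    b->last_heartbeat = b->heartbeat;
}

int main(void)
{
    struct block_monitor dma = { "dma_engine", 0, 0, 7 };

    poll_monitor(&dma);      /* heartbeat never advanced: logged and reset */
    return 0;
}
```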
Betting on better software
Increasingly, that is the job of software, which is being used to manage safety, reliability and security—at the same time.
“The key is how you make sure you are not sacrificing reliability as we move to more powerful SoCs with more functionality,” said Kamran Shah, director of marketing at Mentor Graphics. “We’re starting to see certification standards that we only used to see in the military and aerospace. IEC 61508 deals with functional safety for electrical programmable systems. For medical devices there is IEC 62304, and for automotive there is ISO 26262. A lot of this is due to the increasing role of software, but part of it also is because of the increasing connectivity of devices with the Internet of Things.”
Shah said that while hardware reliability is well understood, software is much more complicated. What if an interaction with other devices inserts an error or bug into the software, for example? That may mean completely different things for a smart phone, a car and a pacemaker. In fact, it may mean different things within each of those devices, depending on which system is affected and why.
Shah said that in software there are ways to isolate functions using Type 1 hypervisors, and to reboot quickly using executable code rather than a full operating system. “But there are other things you generally don’t consider in software that are becoming important, such as what power consumption means to reliability. For a lot of systems, reliability is defined by mean time to failure. Being able to decrease power consumption enough to allow passive cooling can extend that mean time to failure.”
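The link between power, temperature and lifetime can be made concrete with the Arrhenius acceleration model commonly used in reliability engineering: a cooler junction, which lower power makes possible, stretches the expected mean time to failure. The activation energy and baseline numbers below are placeholders rather than measured data.

```c
/* Illustrative Arrhenius scaling of mean time to failure with junction
 * temperature. All numbers are placeholders, not characterization data. */
#include <math.h>
#include <stdio.h>

#define BOLTZMANN_EV 8.617e-5   /* Boltzmann constant in eV/K */

/* Scale an MTTF measured at one junction temperature to another. */
static double mttf_at(double mttf_ref_hours, double ea_ev,
                      double t_ref_c, double t_new_c)
{
    double t_ref = t_ref_c + 273.15, t_new = t_new_c + 273.15;
    double accel = exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_new - 1.0 / t_ref));
    return mttf_ref_hours * accel;
}

int main(void)
{
    /* Placeholder baseline: 100,000 hours MTTF at a 105 C junction, Ea = 0.7 eV. */
    double cooler = mttf_at(100000.0, 0.7, 105.0, 85.0);
    printf("Dropping the junction from 105 C to 85 C -> roughly %.0f hours MTTF\n",
           cooler);
    return 0;
}
```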
The future
As complexity grows, so do the number of tradeoffs. As with power budgets and performance specs, reliability is a fixed number with real dollar consequences if something goes wrong. The $1.2 billion fine paid by Toyota for unintended acceleration is a case in point. So are the recent insulin pump recalls by the FDA, and so is the Target security breach.
“A lot of design concepts like fault-tolerant computing are coming back,” said Steve Carlson, group marketing director in Cadence’s Office of Chief Strategy. “We used to call it fault-tolerant computing, and it’s being pushed down into SoC validation and verification strategies these days. Architecturally, instead of a 100% hardware accelerator you now have a programmable accelerator so you can plan for the higher risk of failure.”
Just handling all of this with brute-force redundancy doesn’t suffice anymore, particularly in an interconnected world. All of it has to be thought out in much more detail and engineered effectively, in both hardware and software.
“This will require much more logic for resiliency,” said Kurt Shuler, vice president of marketing at Arteris. “Redundancy is a way to get reliability, but resiliency is a graceful failover using other paths when something bad happens.”
And something bad certainly will happen—with a very high cost for SoCs, the systems that house them, and the companies that sell them unless they have thought this through in great detail.