How Reliable Is Your IP?

There is still much that can be done to improve the reliability of complex electronics.


Almost everyone who has bought a new smartphone, car, home electronics device or appliance has either experienced technical glitches requiring a replacement or repair, or knows someone who has.

The good news is that only a very small fraction of electronic glitches or failures can be attributed to hardware design. Most are due to manufacturing variability and controls, damage in shipping, and software bugs. But some hardware design problems still slip through, despite significant advances in tooling, experience working through problems, and increasingly rigid manufacturing rules.

While the individual components and IP inside SoCs are much more reliable than in the past, so much is now being crammed onto chips that the chances of something going wrong are higher. That’s basic statistics. No matter how well characterized, commercial IP frequently is a black box going into another black box, surrounded by other black boxes. IP blocks also are connected to more things, making them more prone to malware infections that can render them unusable. And the software is so complicated that it can’t always account for all possible use cases, sometimes running down the battery in a fraction of the advertised or expected time.

Given the rising complexity of SoCs, this isn’t particularly surprising. And given that third-party intellectual property now constitutes an increasing portion of a more complex chip—often more than 50%—it’s probably even less surprising.

“One of the big problems is that the building blocks are being produced by companies that are not building the final product,” said Michael McNamara, CEO of Adapt IP and DAC’s IP chair. “If your IP is 99% reliable, and you multiply that by 100 IP blocks, your overall reliability is pretty low. We need to find a way to make sure that failure of IP is not a catastrophe—a way to gracefully route around problems so that when a piece of functionality is not available you can switch to a backup.”

He said that otherwise people will be left distrusting technology. One example: a car’s infotainment system that can understand names, so when you’re looking for a restaurant it actually recognizes the name, regardless of who’s asking and how thick their accent is.

What can go wrong
Large IP vendors take great pains to characterize their IP effectively, and for the most part they’re very good at addressing the possible use cases and integration scenarios. IP from major vendors is widely used, well characterized and extremely reliable. But they aren’t the only ones building IP for SoCs, and what plays well with other IP or system packaging in one complex design might not work well in another. All IP vendors are very focused on integrating their code into the best possible environment. Reality is somewhat different.

At least part of the problem is the push toward much lower price points for more and more products, particularly in cost-conscious markets such as China, Brazil and in developing economies. “The product cost is dependent on the package people use,” said Navraj Nandra, senior director of marketing for the analog and MSIP Solutions Group at Synopsys. “So they may use a cheap substrate or board materials. They may use a two-layer or four-layer board, where in the past that IP was designed for more layers. That has an implication on the way IP is designed.”

At the other end of the spectrum are SoCs that are being developed at the leading edge process nodes, where none of the processes are firmed up by the time IP needs to be developed. “You make assumptions based upon what you think will happen,” said Nandra. “Starting at 28nm, the whole idea that version 1.0 is the production version was tossed out the window. We began seeing updates of processes for competitive reasons.”

Foundries have been developing new flavors of established nodes to focus on power, performance or cost. At 28nm, which is expected to be a particularly lucrative node for foundries, GlobalFoundries and Samsung have even added support for fully depleted silicon on insulator (FD-SOI) to reduce leakage. TSMC, meanwhile, now offers five different flavors, ranging from low power to high performance and high-performance compact mobile computing. IP needs to be characterized for all of them.

Yet it’s also possible that even the most glaring defects won’t ever cause a problem. “Every single design can be broken under obscure circumstances, or it may not occur because it’s masked by some other function in a chip,” said Chris Rowen, a Cadence fellow. “The reality is that once you have complete specs IP can be extremely reliable. A conversation over a cubicle wall may be fast and efficient in terms of making changes and it evolves rapidly, but it is limited in what it can address and it can be a liability when it’s not sophisticated enough to look at all the ways in which a block is stimulated. If you just write down things that you think it needs to do, it’s probably under-specified.”

He said IP needs to be considered in two directions: inward—was it specified correctly?—and outward to the system, where you need to know whether the system is asking only for things within the spec.

Defining and redefining reliability
So what exactly does reliable mean? “Reliable is a big word with a lot of meanings,” said Bernard Murphy, chief technology officer at Atrenta. “Sometimes it’s difficult to disentangle usability from actual reliability.”

For example, not all sensors have to be on all the time. So in a car, even in the key-off position, many functions continue to draw power from the battery when they really don’t have to stay fully awake. The more sensors, the faster they drain the battery. “That’s where a lot of things can go wrong, because you have these power management options that are also not necessarily comprehensively tested,” Murphy said. “And a lot of semi guys put in CYA options. They put in software management options to say, if it turns out that some kind of power switching option doesn’t actually work correctly, we can disable it. So there you have a tradeoff between something not working at all, or working, but your battery is getting drained a lot faster.”

Reliability also can change by process node. Craig Hampel, chief scientist for the memory and interface division at Rambus, defines IP reliability in terms of durability and endurance, but he added that it also includes the number of corner cases into which IP is being integrated, and the security model for the IP.

“What we’ve seen is planned obsolescence around Moore’s Law, and if scaling slows, the renewal cycle slows,” said Hampel. “So that means you keep devices longer. But endurance characteristics get worse at finer geometries, and that affects reliability.”

In memory, there are concerns about soft bit errors, too. Those rates are higher in SRAM than in DRAM. In addition, DRAM can be repaired once, with pathways to failing memory bits remapped prior to shipping. But that’s a one-time fix to ensure yields are acceptable for memory IP. Once in the field, there is no capability yet for repairing DRAM failures.

Repairing other failures, such as logic cells, is becoming problematic, too. Designs are so complex that it’s difficult to pinpoint the cause of failures. That’s particularly troublesome in areas such as automotive markets, where defects can be extremely costly.

“Fab processes are improving as we continue to shrink features, but the types of defects are changing,” said Ron Press, technical marketing director for silicon test solutions at Mentor Graphics. “We’re seeing more defects inside of cells, which may not be perfectly regular. They’re the kinds of things that regular tests miss. Some of these are intermittent outages, which are hard to diagnose. And on the automotive side, they’ve added warm-up before testing because if you just start running a device and test it that won’t be reliable.”

Difficult-to-fix stuff
There are some problems that are much more difficult to fix, though. Arvind Shanmugavel, director of application engineering at ANSYS, said there are three basic areas that are challenging. One involves functionality operating across all process-voltage-temperature (PVT) corners. The second is electromigration and self-heat, which have to be verified. And the third is electrostatic discharge, which becomes a bigger risk as more and more IP is added to the same system.

“Most IP vendors design the same IP for different process nodes or foundries,” said Shanmugavel. “But the reliability rules are different for one versus another. And at the high end of designs, there is no way to test new products for all operating conditions. If you have one pathological condition it may not be covered by the reliability testing team. We know about timing models. But how about noise models and current models? We haven’t mastered those yet.”

There are other issues, as well. At advanced nodes, single-event upsets are more common because of higher density. Richard Grisenthwaite, chief architect at Arm, said it’s not a linear relationship, but it’s close to that. But he said the challenges with software are much larger.

“Because software has an important role in reliability, the challenge of multiple cores is concurrent operation of the cores where you are sharing data,” Grisenthwaite said. “The timing relationship is more variable and it’s harder to do an exhaustive test. Asynchrony between cores makes software harder to write than a single-core program, and until recently the world’s programming languages didn’t support it. There were updates to C and C++ in 2011 that added a more formal approach to multithreading. Data races are formalized better, and that is the main reason you’re seeing more multiprocessing.”

He said that has significantly reduced the number of errors in the multicore programming world. As for the rest, chip architectures are the source of a “tiny proportion” of failures. For those that do remain—particularly in settings such as data centers—the key is to find them, contain them, and fix them with minimal impact on anything else.