The usual sources of failure still exist, but there are new threats and some unexpected twists for engineers.
By Ed Sperling
Most consumers define reliability by whether a device turns on and works as planned, but the term is becoming harder to pin down as complexity increases and systems become more interconnected.
Adding more functionality in less space has made complex chips harder to build, and harder to keep free of problems. Verification coverage is a persistent and growing challenge, exacerbated by the fact that it’s nearly impossible to predict all the possible use cases for a device. On top of all this, there are some interesting new twists on reliability that have never entered the picture before.
Moore’s Law 2.0
From 30,000 feet, Moore’s Law looks intact even if fewer companies can afford to reach the next node. The reality, though, is that not everything is shrinking at the same rate. Using Gordon Moore’s terminology, the number of transistors being crammed onto a piece of silicon isn’t doubling every couple of years anymore. The number still increases at each new process node, but it’s no longer a two-for-one replacement.
There are good reasons for this, of course. Bigger transistors—particularly when it comes to memory—are more reliable, less expensive, better tested and proven in silicon. In fact, some of NXP’s logic chips have been continuously tested for up to a dozen years to analyze where the failures are, according to Clifford Lloyd, business development director at NXP.
“We even weed out parts that are within the limits of what’s acceptable but which we still consider outliers,” Lloyd said. “And with very low power, power and heat are no longer as important. When we do see failures it’s usually due to overstress—too high a voltage load. And over time, we also see some failures from ESD.”
This is to be expected at the most advanced process nodes, but even mainstream nodes are undergoing some significant shifts.
“Even at the older technologies people are doing more exotic things,” said Carey Robertson, director of product marketing for Calibre extraction at Mentor Graphics. “The first time through you see digital SoCs using a consistent voltage for all devices. Now at 40nm and 60nm there are high-voltage applications for the automotive segment and the chips have to last longer. We expect those chips to last longer than the chips in our phones. There are older technologies with more advanced techniques, newer nodes with more unknowns, and then there are companies trying to mix the two with 3D-IC and 2.5D. How do we do ESD in the presence of multi-die implementations? There’s a perfect storm around that.”
The problem gets worse at advanced nodes, where density-related issues such as heat and electromigration come into play. The typical solution to these kinds of problems in the past has been to add more margin into the design. That’s no longer an option, because margin adds cost, makes the chip harder to verify, and adds more wiring that can slow down a signal and increase heat.
Lithography plays a role in reliability, too. While the manufacturing side typically is hidden from most chipmakers, other than following design rules from foundries for reasonable yield, the two worlds are more intertwined than ever before. In real numbers, memory chips are seeing only about a 30% to 40% area reduction at the new nodes, rather than the previous 50% reduction, according to Prasad Saggurti, senior manager of product marketing for memory and memory test at Synopsys. The reason is that the slightly larger sizes are more reliable.
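As a rough illustration of that trade-off, the area arithmetic is straightforward. The 0.8x linear shrink below is an assumption chosen only to land in the range Saggurti describes, not a foundry figure:

    def area_reduction(linear_shrink):
        """Percent area saved when each linear dimension scales by linear_shrink."""
        return (1.0 - linear_shrink ** 2) * 100.0

    # A classic "full" node shrink scales linear dimensions by ~0.7x.
    print(f"0.7x linear shrink -> {area_reduction(0.7):.0f}% area reduction")  # ~51%
    # Relaxing the shrink for reliability gives up much of the density gain.
    print(f"0.8x linear shrink -> {area_reduction(0.8):.0f}% area reduction")  # ~36%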
Existing 20nm back-end-of-line (BEOL) processes are more reliable, too. With EUV lithography still not commercially viable, chipmakers are reusing that proven 20nm BEOL at 14/16nm, which lets them stay with double patterning rather than the triple patterning a true 14/16nm BEOL process would require.
“Can we do the multiple patterning?” asked Ajit Manocha, CEO of GlobalFoundries. “Yes, but at what cost? Sometimes you have to make decisions so the shrink will not increase your cost. So for 20nm to 14nm, that’s a shrink, but our costs will not go up in the back end and the middle of the line. If you push everything to 14nm the footprint will be slightly smaller, but your defect density will kill you because the number of net dies will be lower. You get some density advantage, but the overall cost will be much higher.”
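Manocha’s point about net dies can be illustrated with a toy yield calculation. The die areas and defect densities below are purely hypothetical, and the Poisson yield model is a deliberate simplification, but they show how a smaller die can still mean fewer good dies per wafer if defect density climbs:

    import math

    def gross_dies(wafer_diameter_mm, die_area_mm2):
        """Crude gross-die estimate ignoring edge loss and scribe lanes."""
        wafer_area = math.pi * (wafer_diameter_mm / 2.0) ** 2
        return wafer_area / die_area_mm2

    def poisson_yield(defect_density_per_cm2, die_area_mm2):
        """Simple Poisson yield model: Y = exp(-D0 * A)."""
        return math.exp(-defect_density_per_cm2 * die_area_mm2 / 100.0)

    # Hypothetical case: 100 mm^2 die at a mature node with D0 = 0.1 defects/cm^2,
    # versus an 85 mm^2 shrink at a newer node with D0 = 0.5 defects/cm^2.
    for label, area, d0 in [("mature-node die", 100.0, 0.1),
                            ("shrunk die, higher D0", 85.0, 0.5)]:
        good = gross_dies(300, area) * poisson_yield(d0, area)
        print(f"{label}: ~{good:.0f} good dies per 300mm wafer")  # ~640 vs. ~544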
Premature aging
Another consideration that has begun surfacing at advanced nodes is premature aging, and chip companies everywhere are grappling with the aging issues caused by power density. While finFETs offer significant improvements for leakage, they don’t solve all the thermal issues.
“Thermal effects cause aging,” said Vassilios Gerousis, distinguished engineer at Cadence. “That eventually affects the timing because things will slow down. So far, we haven’t seen that because the margin takes care of that. But we will. And we also have electromigration and ESD. Electromigration is applied on the power/ground and also on the signal. We used to look at it from a static point of view, but at advanced nodes we also have to look at it from a switching perspective. Those are becoming critical because the width gets smaller, the resistance gets higher, and with finFET the drive gets bigger and the wires cannot handle the current.”
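The scaling behind Gerousis’ warning is easy to sketch. The wire dimensions below are assumptions for illustration, not any foundry’s actual metal stack; the point is simply that the smaller cross-section pushes resistance and current density up by the same factor (about 3x in this example), which is the electromigration pressure he describes:

    RHO_CU = 1.7e-8  # ohm*m, bulk copper resistivity (nanoscale wires are worse)

    def wire_resistance(length_um, width_nm, height_nm):
        """R = rho * L / (W * H), in ohms."""
        area = (width_nm * 1e-9) * (height_nm * 1e-9)
        return RHO_CU * (length_um * 1e-6) / area

    def current_density(current_ma, width_nm, height_nm):
        """J = I / (W * H), in A/m^2."""
        area = (width_nm * 1e-9) * (height_nm * 1e-9)
        return (current_ma * 1e-3) / area

    # Compare a wider, taller wire against a narrower, shorter one (assumed sizes).
    for w, h in [(64, 100), (32, 60)]:
        r = wire_resistance(10, w, h)          # 10um segment
        j = current_density(0.1, w, h)         # 0.1mA of switching current
        print(f"{w}nm x {h}nm wire: R ~ {r:.0f} ohm, J ~ {j:.1e} A/m^2")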
Most of this isn’t a surprise to tool companies. All of the major EDA companies have been working with the foundries at the leading-edge nodes to identify problems and address them.
“There are three different categories to deal with on the implementation side—IR drop, electromigration and device aging,” said Mary Ann White, marketing director for Synopsys’ Galaxy Implementation Platform. “Synopsys’ tools provide monotonic convergence with shared engines between tools that are reliability-aware to deal with the IR drop and EM issues at both the gate and transistor levels. Regarding device aging, foundries and IDMs are running extra test chips to develop models that can be supplied separate from the PDKs, which are used by our SPICE products for analysis.”
The downside is that there are so many unknowns at the leading edge of design, and very little data to work with. In fact, the only company actually producing commercially available finFETs is Intel at 22nm. Everyone else is still working with test chips.
But Intel also uses very regular structures and layouts, and its voltages tend to be more consistent. SoCs are much more complex to design and build, and SoC makers are wrestling with everything from more congestion to just how wide the wires should be. It’s well known that wires can no longer be shrunk at the same rate as transistors, because shrinking them increases resistance, which causes heat. What’s less well known is that chipmakers are now doubling the width of some wires even as they shrink transistors, and that’s creating new issues involving routing congestion.
“In an ASIC design cell we handle uncertainty with margin,” said Drew Wingard, chief technology officer at Sonics. “As uncertainty goes up, margin goes up. The gap between the quality of results you can get from an ASIC design cell and a custom design cell has hit historic highs. What you can do, if you’re willing to do the hard work and all the analysis, versus what you get from the highly margined ASIC—that gap continues to grow. Where we see this becoming an increasing challenge is with some of the other physical effects. As the number of transistors goes up, the cost of communicating across the die goes up within ever-shorter clock cycles, so the penalties of those margins are growing. The number of primary clock cycles just to move a signal across a die is between 5 and 10. That has profound implications on what’s meaningful for design.”
It also has implications for just how reliable a design will be. Margin equals reliability in the minds of many design engineers and chip architects. Take that away and the safety net goes with it.
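Wingard’s figure of 5 to 10 primary clock cycles just to cross a die can be sanity-checked with a crude repeated-wire delay estimate. Every number below is an assumption for illustration only, not his data:

    DELAY_PS_PER_MM = 100.0    # assumed delay of an optimally repeated global wire
    DIE_EDGE_MM = 20.0         # assumed die edge length
    CLOCK_PERIOD_PS = 500.0    # assumed 2 GHz primary clock

    crossing_delay_ps = DELAY_PS_PER_MM * DIE_EDGE_MM
    cycles = crossing_delay_ps / CLOCK_PERIOD_PS
    print(f"~{crossing_delay_ps:.0f} ps to cross the die, about {cycles:.0f} cycles")
    # -> ~2000 ps, roughly 4 cycles before any timing margin is added; layer on the
    #    margins Wingard describes and the 5-to-10-cycle range follows.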
Beyond the SoC
Reliability of an SoC goes well beyond just the silicon these days, though. A system involves software and many other parts. Increasingly, it also involves security issues. When a system crashes, the hardware provider is the first to be blamed, even if it has nothing to do with the hardware. Ironically, the hardware is frequently the most reliable part of the system, but it’s also the most tangible. After all, you can’t kick software.
One of the biggest problems with software is security. It’s the primary entry point for malware, which in a worst-case scenario can kill a device. As a result, chipmakers and IP vendors have begun to build in security features to avoid these problems.
“We’ve already got firewalls in the NoC architecture,” said Kurt Shuler, vice president of marketing at Arteris. “We don’t market it much. But you do want to make sure that when you send stuff from point A to point B that it’s authorized to move from A to B. In some cases you can fire back a message that says something is illegal. But in others, you don’t want to do that because hackers can target error messages to find ways into the system. There also can be a hidden error message, which cannot be perceived by someone who has hacked the chip.”
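A minimal sketch of the kind of address-window firewall check Shuler describes might look like the following. This is illustrative only, not Arteris’ implementation; the initiator names, address windows, and “silent” policy flag are all hypothetical:

    # Hypothetical initiator-to-address-window permissions.
    ALLOWED_WINDOWS = {
        "cpu":  [(0x0000_0000, 0x7FFF_FFFF)],
        "dma0": [(0x4000_0000, 0x4FFF_FFFF)],
    }

    def log_violation(initiator, address):
        # Record the event somewhere the attacker cannot observe (here, just print).
        print(f"firewall: logged illegal access by {initiator} to {hex(address)}")

    def firewall_check(initiator, address, silent=False):
        """Allow the transaction only if the initiator may reach this address."""
        for start, end in ALLOWED_WINDOWS.get(initiator, []):
            if start <= address <= end:
                return True
        if silent:
            # "Hidden" error handling: log the violation without signalling the
            # initiator, so a hacker probing the fabric gets no error message back.
            log_violation(initiator, address)
            return False
        raise PermissionError(f"{initiator} blocked at {hex(address)}")

    firewall_check("cpu", 0x1000_0000)                 # allowed
    firewall_check("dma0", 0x8000_0000, silent=True)   # blocked and logged quietly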
Shuler said this kind of thinking is essential for SoC makers because the chipmaker always gets blamed first. “The key is you don’t want to get blamed, and that blame may get extended. You can brick a phone when it’s stolen, but how long does it take to un-brick a phone—or avoid bricking it in the first place?”
Conclusions
Talk to anyone in the semiconductor ecosystem about reliability and they all claim to take it very seriously. And when it comes to their particular slice of the ecosystem, that’s clearly true. Where problems begin creeping in is around the edges and at the seams—in the integration, the software and the myriad use cases (including security).
Migrating from one process node to the next certainly creates challenges, but so do changes at older nodes. And so does pushing IP from one node to the next and integrating it with something against which it was never tested. The problem is that there is no single way to handle reliability issues, and usually not even a consistent way of dealing with them from one chip to the next, or even from one derivative chip to the next. And even if everything is done right, security issues can still bring a system down.