Are Chips Getting More Reliable?

Maybe, but metrics are murky for new designs and new technology, and there are more unknowns than ever.


Reliability is emerging as a key metric in the semiconductor industry, alongside power, performance and cost, but it is also becoming harder to measure and increasingly difficult to achieve.

Most large semiconductor companies look at reliability in connection with consumer devices that last several years before they are replaced, but a big push into automotive, medical and industrial electronics using advanced SoC designs has raised the bar. In those markets, chips need to function for 10 to 15 years, rather than the two- to five-year lifespan of many consumer devices. Moreover, chipmakers selling into those markets are being held to much tougher standards than in the past, even if their chips are not designed for safety-critical systems, because everything in a car is now connected.

What’s clear about all of these markets is that reliability is now a selling point as well as a requirement. A JD Power study released earlier this month said reliability concerns are one of the reasons car buyers choose a brand or avoid certain models, with 55% of new-vehicle buyers citing reliability as a top reason for buying a car versus 51% in a 2015 study. The reverse is true, as well: 17% said they are now avoiding certain models, compared with 14% last year.

But reliability in chips isn’t so easy to measure for a variety of reasons:

Definitions are murky. Reliability is a measure of functionality over time. If a chip completely stops working, it is deemed unreliable. But in many cases performance degrades over a period of years, whether that’s due to electromigration, layer upon layer of software patches, memory bit failures, or a host of other issues that can crop up. Reliability in these cases may be more subjective than definitive.
Security is never guaranteed. Security is a new measure of reliability, and the longer a device lasts, the more likely it can be breached. The knowledge base of hackers continues to expand, and tools and capabilities improve. If a five-year-old device is hacked using technology that was not invented when the device was created, should it be considered unreliable?
Longevity is guesswork. Most advanced semiconductors are reliable enough for devices that last two years. Some of them could last 50 years. But the reality is that no one knows what problems will erupt with 16/14nm chips, or whether they’re better than 10nm chips, because dynamic power density is a new challenge. Even 28nm has only recently become a mainstream node for development. And as new technologies such as finFETs, gate-all-around FETs, new materials, new interconnects such as through-silicon vias, and new processes are introduced, there are always uncertainties.
Use cases matter. As semiconductors become more complex, it becomes more difficult to find all of the bugs. Some of those may be use-case dependent, which makes the definition of reliability not only unique for each device, but unique for each user. With more components competing for memory, multiple voltage islands, blocks being turned on and off, and much more software and firmware, the usable lifespan of a chip looks more like a bell curve than a fixed number.
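
To make that bell-curve intuition concrete, here is a minimal Monte Carlo sketch. It assumes, purely for illustration, that wear-out life follows a Weibull distribution whose characteristic life depends on how hard a given use case stresses the part; the use-case mix, shape parameter and lifetimes are invented numbers, not measured data.

```python
import random

# Hypothetical use-case mix: (name, share of users, characteristic wear-out life in years).
# Heavier stress (thermal cycling, higher activity) shortens the characteristic life.
USE_CASES = [
    ("light use",   0.5, 15.0),
    ("typical use", 0.4, 10.0),
    ("heavy use",   0.1,  6.0),
]

def sample_lifetime(rng):
    """Pick a use case by its share of users, then draw a Weibull wear-out life (years)."""
    weights = [share for _, share, _ in USE_CASES]
    _, _, scale = rng.choices(USE_CASES, weights=weights)[0]
    return rng.weibullvariate(scale, 2.0)   # shape 2.0: failure rate rises with age

rng = random.Random(1)
lifetimes = sorted(sample_lifetime(rng) for _ in range(100_000))
p10, median, p90 = (lifetimes[int(f * len(lifetimes))] for f in (0.10, 0.50, 0.90))
print(f"10th pct ~{p10:.1f} yrs, median ~{median:.1f} yrs, 90th pct ~{p90:.1f} yrs")
```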

None of this is relevant in a set-top box or a smartphone. But it is relevant in an industrial control system, a data center storage farm, or a car. In fact, there is an entirely new chapter in the ISO 26262 specification that applies functional safety concepts to semiconductors.

“Before this the ISO 26262 spec was focused on electrical boxes,” said Kurt Shuler, vice president of marketing at Arteris. “Now semiconductors are a key part of the specification. The thinking is that for functional reliability, it’s less risky to add that into hardware. And if you implement safety reliability in software, the software changes later in the flow so version control becomes more difficult.”

This is reflected in the resurgence of self-test, an older technology that the automotive market is adopting to gauge whether chips are functioning correctly.

“We’re seeing a huge push by these companies into power-on self-test,” said Joe Sawicki, vice president and general manager of Mentor Graphics’ Design-To-Silicon Division. “This is a whole second life for this technology. When the chip starts up, the system can determine if it is alive or dead. That doesn’t make the chip more reliable. For that we’re also seeing a big push for built-in self-test (BiST) in logic. Everyone uses memory BiST, but logic BiST is coming into its own.”
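
For readers who have not looked at what a memory self-test actually does, the sketch below models a simplified march-style test over a software-simulated RAM. This is only an illustrative model of the algorithm in Python; real memory BIST runs in hardware or low-level firmware, and the stuck-at fault injected here is a made-up stand-in for a physical defect.

```python
def march_test(mem):
    """Simplified march-style memory test: write and read back complementary
    patterns in ascending and descending address order; return failing addresses."""
    size = len(mem)
    failures = set()

    for addr in range(size):            # element 1: ascending, write all 0s
        mem[addr] = 0x00

    for addr in range(size):            # element 2: ascending, read 0s, write 1s
        if mem[addr] != 0x00:
            failures.add(addr)
        mem[addr] = 0xFF

    for addr in reversed(range(size)):  # element 3: descending, read 1s, write 0s
        if mem[addr] != 0xFF:
            failures.add(addr)
        mem[addr] = 0x00

    return failures


class FaultyRAM(list):
    """RAM model with a hypothetical stuck-at-0 fault on bit 2 of address 42."""
    def __setitem__(self, addr, value):
        if addr == 42:
            value &= ~0x04              # bit 2 can never be written to 1
        super().__setitem__(addr, value)


ram = FaultyRAM([0] * 256)
print("failing addresses:", march_test(ram))   # expected: {42}
```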

Build it right
Testing a chip while it’s in use is one thing. Building it right in the first place and making sure it’s fully functional is quite another. The new wrinkle in this is making sure it’s functional for specific market segments.

“We’re getting a lot more questions regarding reliability and safety considerations,” said Bill Neifert, director of models technology at ARM. “So how do you do fault injection for safety, reliability and security? That’s a common topic, and engineers are looking for complete visibility so they can inject and monitor those faults.”
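
As a deliberately simplified illustration of the fault-injection idea, the Python sketch below flips bits in a data word protected by a parity check, then monitors whether the fault is detected or slips through as silent corruption. Real flows inject faults into RTL or gate-level models inside a simulator or emulator; the parity scheme and flip counts here are assumptions chosen only to show the mechanics.

```python
import random

def parity(word: int) -> int:
    """Even parity over a 32-bit data word."""
    return bin(word & 0xFFFFFFFF).count("1") % 2

random.seed(0)
outcomes = {"detected": 0, "silent_corruption": 0}

for _ in range(100_000):
    data = random.getrandbits(32)
    stored_parity = parity(data)            # check bit written alongside the data

    # Inject one or two random bit flips into the stored word.
    for bit in random.sample(range(32), random.choice([1, 2])):
        data ^= 1 << bit

    # Monitor: does the protection mechanism observe the fault?
    if parity(data) != stored_parity:
        outcomes["detected"] += 1           # single flips are always caught
    else:
        outcomes["silent_corruption"] += 1  # double flips slip past simple parity

print(outcomes)
```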

Neifert noted that the need for reliability extends into other markets, as well, that might not be so obvious. “With the IoT, security and accuracy are critical. The last thing you want to do is turn out a device that’s a hacker’s dream. As we’ve seen with some of the recent IoT device security issues in the popular press, an insecure IoT device can make your entire home network insecure.”

Increasing complexity doesn’t help matters, and no matter what process technology is being used, chips everywhere are getting more complex.

“Modern chips are hardware and software masterpieces, but putting all the elements and functions into one device makes the quality assurance process a real challenge,” said Zibi Zalewski, hardware division general manager for Aldec. “The scope of testing is very wide and requires integrated verification solutions to actually test the chip at the SoC level. Even submodules of the SoC are big ‘chips’ already. Testing modules in isolation will detect and resolve module-level issues, but an integrated verification platform introduced early in the process will increase the reliability of the whole chip and help to manage interfacing dependencies. Since new hardware for ASIC projects is practically driven by software running on the chip, it becomes a must to start the verification of firmware as early as possible in the chip design process. Simulation and co-simulation are no longer enough, or are simply too slow, to handle the software aspects. UVM methodologies are helping with that process, but it is necessary to run thousands of hardware-dependent software tests to complete the testing. Integrated testing covering the software and hardware elements of the new chip increases quality and shortens the overall quality process.”

Still, quality over time is an unbounded problem with no proof points other than what happened in past designs. Predicting the reliability of designs that are more complex, or which use new technology, is a best-guess scenario, especially when there are so many variables.

“There are three aspects to reliability,” said Anupam Bakshi, CEO of Agnisys. “One is that the design needs to be correct by construction, with automation of the ‘specification-to-design’ process. Second, the tests need to be portable and scalable so that the designs can be verified and the silicon can be validated to be correct. Third is the reliability of the foundries. Assuring reliability gets harder as we move to advanced nodes due to the newness of the process.”

Reliability also means different things in different markets. In the data center, the shift from tape to rotating disk storage was considered an improvement in reliability. The shift from spinning disks to solid-state drives is less certain.

“They’ve been reticent to use SSD because there are only so many flips between ones and zeroes before it’s dead,” said Arteris’ Shuler. “Now you have multi-terabyte NAND connected to a cluster of controllers, and the whole thing is to make sure of the integrity of the bits. We think of reliability in terms of five 9s, but for them it’s closer to ten 9s.”
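
A rough back-of-the-envelope reading of those “nines,” treating each one as a per-bit survival probability over some fixed interval (a simplifying assumption; storage vendors define such targets in different ways) and scaling it to a one-terabyte array:

```python
# Illustrative only: treat "N nines" as a per-bit probability of staying intact
# over some fixed interval, and scale it to roughly 1 TB (8e12 bits) of data.
BITS = 8e12

for nines in (5, 10):
    per_bit_failure_prob = 10.0 ** -nines
    print(f"{nines} nines -> ~{BITS * per_bit_failure_prob:,.0f} expected bad bits per TB")
```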

Assurances of reliability in mission-critical markets are as important as they are in safety-critical markets, and that affects every part of the supply chain, from design through manufacturing.

“In the last six to eight weeks I’ve spoken to six customers about this subject,” said Hitesh Patel, director of product marketing for the Design Group at Synopsys. “The trends we see are that design sizes are increasing and the number of scenarios is increasing, so you need to test in different modes, such as idle or operational. At older nodes, static analysis was sufficient. Now you need dynamic analysis, but the analysis results are only as good as the vectors you create. We’re seeing users trying to get vectors right out of emulation to use in voltage drop analysis. But if you have a 100 million-instance design, some tools take seven to eight days to run. You may not have time to fix all of the issues. The more you can do early in the process, the higher the likelihood that it will get done.”

That’s one of the reasons there has been such a big push into “shift left,” where more is done earlier in the design cycle. It’s also one of the reasons why all of the major EDA vendors are seeing steady growth in emulation, prototyping, and other tools that can link the front end of design more tightly to the back end.

Bigger trends
Reliability also can be viewed on a macro level. Consolidation is being driven by rising costs of design, as well as the need to bring different skill sets into the design process. It’s uncertain whether that will have a direct bearing on reliability, but it certainly could provide the right level of resources for extensive verification and debug of designs if combined companies decide to put their resources there. So far, that has not been determined.

New approaches to packaging are another unknown. As Moore’s Law becomes more expensive to follow, many companies have begun developing chips based on fan-outs and 2.5D architectures. While the multi-chip module approach has been around since the 1990s, putting the pieces together using interposers and high-bandwidth memories is new. How those designs perform over time is unknown.

“This question can be answered in theory, but not in practice,” said Mike Gianfagna, vice president of marketing at eSilicon. “Based on that theory, a silicon interposer is stable and it’s mostly passive. Issues such as metal migration and warpage are well understood, and the interposer does not add unreliability. But it’s not any less certain with advanced nodes, where you have tunneling effects, and gates are a certain number of atoms wide. The bigger issue is what happens to the speed of any of these chips over 10 years. The effects will become more pronounced at higher temperatures. That’s becoming more of an issue to contend with.”
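
One standard way to reason about what ten years at elevated temperature does to a part is an Arrhenius acceleration model. The sketch below uses an assumed activation energy and assumed temperatures, typical of textbook examples rather than any particular process:

```python
import math

K_BOLTZMANN_EV = 8.617e-5   # Boltzmann constant, eV/K

def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor between a use temperature and a stress temperature."""
    t_use, t_stress = t_use_c + 273.15, t_stress_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# Assumed example: 0.7 eV activation energy, 55C use versus 125C stress testing.
af = arrhenius_af(ea_ev=0.7, t_use_c=55.0, t_stress_c=125.0)
print(f"acceleration factor ~{af:.0f}x")                      # roughly 78x here
print(f"10 years at 55C ~ {10 * 365 * 24 / af:.0f} hours at 125C")
```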

Nick Heaton, distinguished engineer at Cadence, agrees. “The big question is how these designs handle progressive degradation. In automotive, we’re seeing a lot more functional safety tooling for tolerance of single or multiple failures. There’s a long way to go in this space, though. What does 28nm look like over five years? What we can do is maximize coverage at all levels. But use cases are still the real problem. You cover that with all the permutations you think you can get away with.”

Heaton noted that some of the teams developing advanced SoCs are made up of hundreds of engineers. But he said that even with those large teams, there are still limited resources. “They have to be smart about what they’re testing. They run a certain number of low-level tests for hardware, for software, for hardware accelerators, for the operating system. That’s where we are at the moment.”

So how far do companies extend technology? That may be a much more interesting question as SoCs and new technologies begin getting adopted into safety- and mission-critical markets and have some history behind them.

“The big question is always how hard you’re pushing the technology,” said Drew Wingard, CTO of Sonics. “The closer to the edge, the less reliable it will be. We are on the edge of some very interesting tradeoffs over packaging complexity, known good die, and economics, meaning who’s to blame. What is the pain-to-gain ratio? The gain normally has to be very high, but experience can help change that.”

Whether it adds good metrics for reliability remains to be seen. At this point, however, there are too many unknowns to draw conclusions about what goes wrong, what really causes it to go wrong, and who’s responsible when it does.



4 comments

olivier lauzeral says:

Great article Ed.
You are emphasizing some key problems that are affecting what may be the fastest-growing semiconductor markets: automotive and IoT.
You are correct: Safety is a combination of risk probability and the impact of the failure on the system/application. Reliability mostly assesses failure risk, but usually doesn’t account for the impact. In complex systems where many components (HW and SW) are interacting with each other, it is important to analyze the impact of a bit failure, or in other words to study the propagation of this failure all the way to a system crash or silent corruption of important data.
Actually, in the vast majority of cases the failure will disappear; it will be naturally absorbed by the system and not create any observable issue. We call this probability of propagation to the system “derating” (architectural derating, application derating); others use AVF (Architectural Vulnerability Factor). The best way to calculate these key factors? Emulation, or smart simulations that can reduce the prohibitive time of fault injection (several days are mentioned in your article) to a more manageable duration. There are methods and a few existing EDA tools for that. There won’t be any cost-effective mitigation unless there’s an accurate, quantified assessment of the situation. And the best solution to meet real customer needs (in terms of failure rate) will be a combination of SW, HW design, and process technology sensitivity. One needs to address these three dimensions for optimum results (performance and cost).
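
A minimal sketch of that statistical sampling idea, using a toy stand-in for the design rather than a real RTL model (the split between “live” and masked state bits is an invented assumption): flip one state bit per trial and count how often the observable output actually changes.

```python
import random

def toy_dut(state_bits):
    """Toy stand-in for a design: only the 8 low-order state bits are
    architecturally visible in the output; the rest model masked logic."""
    return sum(state_bits[:8])

def estimate_propagation(n_state_bits=64, n_trials=20_000, seed=0):
    """Statistical fault injection: one random bit flip per trial, then check
    whether the fault propagates to an observable output difference."""
    rng = random.Random(seed)
    propagated = 0
    for _ in range(n_trials):
        state = [rng.randint(0, 1) for _ in range(n_state_bits)]
        golden = toy_dut(state)
        state[rng.randrange(n_state_bits)] ^= 1   # inject a single-event upset
        if toy_dut(state) != golden:
            propagated += 1                       # fault reached the output
    return propagated / n_trials

# With 8 "live" bits out of 64, the estimate should land near 8/64 = 0.125.
print(f"estimated propagation probability (AVF-style): {estimate_propagation():.3f}")
```
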
Oh, and by the way, the measured device failure rate due to soft errors (based on natural radiation) sometimes dominates the hard error rate (which your article mostly covers) by 1000x, without AVF (which mostly affects SER). So ageing, electromigration and HCI are not the only threats!
For more on this, see: http://thesofterrorexperts.blogspot.com/

Ed, I think you’re onto something here and I agree with most of your statements but I have a few points you and your survey interviewees failed to address.

You mention that reliability is becoming harder to measure and more difficult to achieve. I disagree. While a lot of reliability tests focus on discrete devices, complex devices offer greater flexibility to test as black boxes – dynamic input, at much faster speeds, with a desired yet known output. Changes in timing, power amplification, impedances and currents/voltages seem to make instrumentation and measurement easier under test. Keep in mind that accelerated tests can use plug-and-play digital testers on large sample sizes (and wafers) rather than manually controlled DMMs. There are several guidelines and standards available that address this: VITA 51.2, SAE 6338, AEC Q100 and others like ISO 26262, which have been updated to reflect the need to address the reliability of semiconductors. My quick retort was that Xilinx and producers of AEC-grade components do it all the time. DFMEAs are one tool in a toolshed, just like DfR’s Sherlock tool is for board-level designs.

While the definitions seem murky, they truly aren’t. I’ve performed too many failure analyses not to know the difference between a “walking-wounded” device that has failed due to a performance parameter or ESD and one that has failed catastrophically. Their electrical signatures are very different, as are the physical manifestations of different failure mechanisms. I conduct most of DfR’s IC Wearout testing to a critical performance parameter – one that affects the expected outcome of a system. DfR Solutions has been engaged with the aerospace, space and automotive markets for the last several years, helping to define both test and predictive methodologies to predict the lifetime of complex ICs. We’ve been able to apply, at times create, and verify physics-of-failure models for semiconductors, interconnects, bond wires and PCBs. If designed right, force majeure can be mitigated at the box or system level, and ESD/EMI/transients can be designed out with protection devices. Reliability needs to be addressed at each hierarchy of system design.

Security follows the same vein as reliability and should be assessed starting at the early design stages. I could talk about this for days with my many security roles and CISSP training. However the simple points to make are that ASICs, security masks, IP protection, and supply chain management have made chips very robust in some segments. This becomes a tradeoff of cost and time to market insertion. If the IP is easy to duplicate, then a $100M mask out doesn’t make sense.

Longevity can be assessed, established and modeled through a proper set of accelerated tests. While foundries don’t often support customer-driven applications and environments, some do, and some non-foundries do it better. It becomes a function of being able to parametrically fail a device under controlled conditions with repeatability. New technologies always bring uncertainties. That’s the risk-reward proposition with new technologies – will the Hyperloop be successful?

Self-test technologies have been around for years. Most programmable devices that I’ve seen have instituted some form of logic test or code verification upon boot. Triple module redundancy (TMR), error correction (ECC, EDAC) and voter systems also prevent hangups in software and hardware logic.
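
For readers who have not seen it, the voting core of TMR is tiny; here is an illustrative bitwise two-of-three majority vote (not tied to any particular product):

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)

# One replica corrupted by a single-bit upset; the voter still returns the correct value.
good = 0b1011_0010
corrupted = good ^ 0b0000_1000
assert tmr_vote(good, corrupted, good) == good
print("voter output:", bin(tmr_vote(good, corrupted, good)))
```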

Neifert’s note on IoT security seems to conflate operational security and product-level security. I’ve been in that market for years (before it was called IoT), and the recent news postings have shown that haste causes poor engineering decisions. Insecure baby monitors and IP cameras that were rushed to market to keep up with the IoT hype shouldn’t be the exemplar comparison here.

I agree with Zalewski that using a functional simulation test bench with accelerated test techniques (i.e., ovens and chillers) as early in the design as possible will increase the reliability of a whole chip design. I’ve tested microcontrollers, CPUs, GPUs, Flash, FPGAs, Ethernet, optical devices … for component manufacturers and OEMs.

SSDs aren’t as reliable as platter drives unless you have a limited duty cycle or such high redundancy behind the FTL that the user won’t know their device has lost 30% of its bits.

Seldom should you test at one operational state. You should always test the operational modes in a device, because they’re there and they’ll be used in its application environment. Not only is this applicable to reliability, but also to bug identification.

Finally, don’t be too quick to say high-temperature environments are the harbinger of failure; I’ve seen 28nm nodes fail sooner at room temperature and closer to freezing than at high temperatures. New packages and interposers have models, and FEA can be performed to see the interdependencies of the materials.

There are fewer unknowns than one would think.

Kev says:

“Longevity is guesswork” – actually it’s more a mix of statistical, electromigration and mechanical fatigue modeling. However, most people prefer to skip the work and guess. For digital design guys who think in 1s and 0s, it’s outside their skill set.

kpc says:

“What’s clear about all of these markets is that reliability is now a selling point as well as a requirement.” You are right, and this is important. Reliability assurance has traditionally been a necessary cost of doing business, and the tendency for management is to minimize that cost. As a result, reliability always gets short-changed. Reliability should be treated as a value-add investment. It is a profit center – one that can promote the brand and allow a product to sell successfully at a higher price point. What is needed is a way to measure return on investment (ROI) on reliability in the same way as any other investment. Can you figure out how to do that?
