Maybe, but metrics are murky for new designs and new technology, and there are more unknowns than ever.
Reliability is emerging as a key metric in the semiconductor industry, alongside power, performance and cost, but it also is becoming harder to measure and increasingly difficult to achieve.
Most large semiconductor companies look at reliability in connection with consumer devices that last several years before they are replaced, but a big push into automotive, medical and industrial electronics using advanced SoC designs has raised the bar. In those markets, chips need to function for 10 to 15 years, rather than the two- to five-year lifespan of many consumer devices. Moreover, chipmakers selling into those markets are being held to much tougher standards than in the past, even if their chips are not designed for safety-critical systems, because everything in a car is now connected.
What’s clear about all of these markets is that reliability is now a selling point as well as a requirement. A J.D. Power study released earlier this month said reliability concerns are one of the reasons car buyers choose a brand or avoid certain models, with 55% of new-vehicle buyers citing reliability as a top reason for buying a car versus 51% in a 2015 study. The reverse is true, as well: 17% said they are now avoiding certain models, compared with 14% last year.
But reliability in chips isn’t so easy to measure for a variety of reasons:
• Definitions are murky. Reliability is a measure of functionality over time. If a chip completely stops working, it is deemed unreliable. But in many cases performance degrades over a period of years, whether that’s due to electromigration, layer upon layer of software patches, memory bit failures, or a host of other issues that can crop up. Reliability in these cases may be more subjective than definitive.
• Security is never guaranteed. Security is a new measure of reliability, and the longer a device lasts, the more likely it can be breached. The knowledge base of hackers continues to expand, and tools and capabilities improve. If a five-year-old device is hacked using technology that was not invented when the device was created, should it be considered unreliable?
• Longevity is guesswork. Most advanced semiconductors are reliable enough for devices that last two years. Some of them could last 50 years. But the reality is that no one knows what problems will erupt with 16/14nm chips, or whether they’re better than 10nm chips, because dynamic power density is a new challenge. Even 28nm has only recently become a mainstream node for development. And as new technologies such as finFETs, gate-all-around FETs, new materials, new interconnects such as through-silicon vias, and new processes are introduced, there are always uncertainties.
• Use cases matter. As semiconductors become more complex, it becomes more difficult to find all of the bugs. Some of those may be use-case dependent, which makes the definition of reliability not only unique for each device, but unique for each user. With more components competing for memory, multiple voltage islands, blocks being turned on and off, and much more software and firmware, the usable lifespan of a chip looks more like a bell curve than a fixed number.
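The "functionality over time" framing in the first bullet can be made concrete with the standard exponential failure model used in reliability engineering. A minimal sketch, with an entirely illustrative failure rate expressed in FIT (failures per billion device-hours) — the model ignores infant mortality and wear-out, which is exactly why long-lifetime predictions remain guesswork:

```python
import math

def survival_probability(fit: float, years: float) -> float:
    """Probability a part still works after `years`, assuming a
    constant failure rate given in FIT (failures per 1e9
    device-hours) -- the simple exponential model, which ignores
    infant mortality and late-life wear-out."""
    lam = fit / 1e9           # failures per device-hour
    hours = years * 365 * 24
    return math.exp(-lam * hours)

# Illustrative numbers only: a hypothetical 100 FIT part over a
# consumer lifetime vs. an automotive lifetime.
print(f"2 years:  {survival_probability(100, 2):.4%}")
print(f"15 years: {survival_probability(100, 15):.4%}")
```

The gap between the two numbers is modest under this model; the real uncertainty is that wear-out mechanisms such as electromigration make the constant-rate assumption break down precisely in the 10-to-15-year window automotive buyers care about.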
None of this is relevant in a set-top box or a smartphone. But it is relevant in an industrial control system, a data center storage farm, or a car. In fact, there is an entirely new chapter in the ISO 26262 specification that applies functional safety concepts for semiconductors.
“Before this the ISO 26262 spec was focused on electrical boxes,” said Kurt Shuler, vice president of marketing at Arteris. “Now semiconductors are a key part of the specification. The thinking is that for functional reliability, it’s less risky to add that into hardware. And if you implement safety reliability in software, the software changes later in the flow so version control becomes more difficult.”
This is reflected by the resurgence of self-test, an older technology that the automotive market is adopting to gauge whether chips are functioning correctly.
“We’re seeing a huge push by these companies into power-on self-test,” said Joe Sawicki, vice president and general manager of Mentor Graphics’ Design-To-Silicon Division. “This is a whole second life for this technology. When the chip starts up, the system can determine if it is alive or dead. That doesn’t make the chip more reliable. For that we’re also seeing a big push for built-in self-test (BiST) in logic. Everyone uses memory BiST, but logic BiST is coming into its own.”
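The memory BiST Sawicki mentions is, at its core, a hardware state machine that marches deterministic read/write patterns through an array and flags any mismatch. A software model of one common pattern, a March C- pass, can illustrate the idea — the `StuckAtOne` fault model and the bit-level simplifications here are assumptions for illustration, not how silicon BiST engines are built:

```python
def march_c_minus(mem) -> bool:
    """Software model of a March C- pass over a 1-bit-wide memory.
    Real memory BiST is a hardware state machine; this just shows
    the read/write element sequence. Returns True if every read
    matched the expected value."""
    n = len(mem)
    ok = True
    for i in range(n):                 # up:   write 0
        mem[i] = 0
    for i in range(n):                 # up:   read 0, write 1
        ok &= mem[i] == 0
        mem[i] = 1
    for i in range(n):                 # up:   read 1, write 0
        ok &= mem[i] == 1
        mem[i] = 0
    for i in reversed(range(n)):       # down: read 0, write 1
        ok &= mem[i] == 0
        mem[i] = 1
    for i in reversed(range(n)):       # down: read 1, write 0
        ok &= mem[i] == 1
        mem[i] = 0
    for i in range(n):                 # up:   read 0
        ok &= mem[i] == 0
    return bool(ok)

class StuckAtOne(list):
    """Hypothetical fault model: bit `bad` always reads back 1."""
    def __init__(self, size, bad):
        super().__init__([0] * size)
        self.bad = bad
    def __getitem__(self, i):
        return 1 if i == self.bad else super().__getitem__(i)
```

A healthy array (`march_c_minus([0] * 64)`) passes, while an array with an injected stuck-at-1 bit fails on the first read phase — which is the whole point of running the pattern at power-on.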
Build it right
Testing a chip while it’s in use is one thing. Building it right in the first place and making sure it’s fully functional is quite another. The new wrinkle in this is making sure it’s functional for specific market segments.
“We’re getting a lot more questions regarding reliability and safety considerations,” said Bill Neifert, director of models technology at ARM. “So how do you do fault injection for safety, reliability and security? That’s a common topic, and engineers are looking for complete visibility so they can inject and monitor those faults.”
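Fault injection of the kind Neifert describes is typically done against an RTL simulator or emulator, flipping bits in registers and watching whether safety mechanisms catch them. A toy harness — the triple-modular-redundancy scheme, function names, and Python setting are all assumptions for illustration — shows the inject-and-monitor loop in miniature:

```python
import random

def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant results (TMR voter)."""
    return (a & b) | (a & c) | (b & c)

def run_with_fault_injection(compute, x, flip_bit=None):
    """Run `compute` on three redundant 'cores'; optionally flip
    one bit in one copy to model a transient fault. Hypothetical
    harness -- real fault injection drives simulation or emulation,
    not Python callables. Returns (voted result, fault detected)."""
    results = [compute(x) for _ in range(3)]
    if flip_bit is not None:
        victim = random.randrange(3)
        results[victim] ^= 1 << flip_bit
    voted = majority_vote(*results)
    detected = len(set(results)) > 1   # any disagreement flags a fault
    return voted, detected
```

With no fault injected, the three copies agree; with a single-bit flip, the fault is both detected (the copies disagree) and masked (the vote still yields the correct answer) — the visibility Neifert says engineers are asking for.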
Neifert noted that the need for reliability extends into other markets, as well, that might not be so obvious. “With the IoT, security and accuracy are critical. The last thing you want to do is turn out a device that’s a hacker’s dream. As we’ve seen with some of the recent IoT device security issues in the popular press, an insecure IoT device can make your entire home network insecure.”
Increasing complexity doesn’t help matters, and no matter what process technology is being used, chips everywhere are getting more complex.
“Modern chips are hardware and software masterpieces, but putting all the elements and functions into one device makes the quality assurance process a real challenge,” said Zibi Zalewski, hardware division general manager for Aldec. “The scope of testing is very wide and requires integrated verification solutions to actually test the chip at SoC level. Even submodules of the SoC are big ‘chips’ already. Testing in separation will detect and resolve module-level issues, but an integrated verification platform introduced early in the process will increase reliability of the whole chip and help to manage interfacing dependencies. Since the new hardware for ASIC projects is practically driven by software running on the chip, it becomes a must to start the verification of firmware as early as possible in the chip design process. Simulation and co-simulation is no longer enough or simply too slow to handle the software aspects. UVM methodologies are helping with that process, but it is necessary to run thousands of hardware-dependent software tests to fulfill the testing. Integrated testing covering software and hardware elements of the new chip increases the quality and shortens the overall quality process.”
Still, quality over time is an unbounded problem with no proof points other than what happened in past designs. Predicting the reliability of designs that are more complex, or which use new technology, is a best-guess scenario, especially when there are so many variables.
“There are three aspects to reliability,” said Anupam Bakshi, CEO of Agnisys. “One is that the design needs to be correct by construction, with automation of the ‘specification-to-design’ process. Second, the tests need to be portable and scalable so that the designs can be verified and the silicon can be validated to be correct. Third is the reliability of the foundries. Assuring reliability gets harder as we move to advanced nodes due to the newness of the process.”
Reliability also means different things in different markets. In the data center, the shift from tape to spinning disk was considered an improvement in reliability. The shift from spinning disks to solid-state drives is less certain.
“They’ve been reticent to use SSD because there are only so many flips between ones and zeroes before it’s dead,” said Arteris’ Shuler. “Now you have multi-terabyte NAND connected to a cluster of controllers, and the whole thing is to make sure of the integrity of the bits. We think of reliability in terms of five 9s, but for them it’s closer to ten 9s.”
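The jump from "five 9s" to "ten 9s" that Shuler describes sounds incremental but is a factor of 100,000 in allowed failure. A quick way to see it, treating the 9s as availability (the translation to annual downtime is an assumption for illustration; storage vendors often count 9s of data durability instead):

```python
def downtime_per_year(nines: int) -> float:
    """Seconds of allowed downtime per year for a given number of
    nines of availability (e.g. 5 -> 99.999% available)."""
    seconds_per_year = 365 * 24 * 3600
    return seconds_per_year * 10 ** -nines

print(f"five 9s: {downtime_per_year(5):.2f} s/year")   # about 5 minutes
print(f"ten 9s:  {downtime_per_year(10):.5f} s/year")  # a few milliseconds
```

Five 9s permits roughly five minutes of outage per year; ten 9s permits about three milliseconds — which is why data-center buyers scrutinize NAND endurance and controller-level bit integrity so closely.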
Assurances of reliability in mission-critical markets are as important as they are in safety-critical markets, and they affect every part of the supply chain, from design through manufacturing.
“In the last six to eight weeks I’ve spoken to six customers about this subject,” said Hitesh Patel, director of product marketing for the Design Group at Synopsys. “The trends we see are that design sizes are increasing and the number of scenarios is increasing, so you need to test in different modes, such as idle or operational. At older nodes, static analysis was sufficient. Now you need dynamic analysis, but the analysis results are only as good as the vectors you create. We’re seeing users trying to get vectors right out of emulation to use in voltage drop analysis. But if you have a 100 million-instance design, some tools take seven to eight days to run. You may not have time to fix all of the issues. The more you can do early in the process, the higher the likelihood that it will get done.”
That’s one of the reasons there has been such a big push into “shift left,” where more is done earlier in the design cycle. It’s also one of the reasons why all of the major EDA vendors are seeing steady growth in emulation, prototyping, and other tools that can link the front end of design more tightly to the back end.
Reliability also can be viewed on a macro level. Consolidation is being driven by rising costs of design, as well as the need to bring different skill sets into the design process. It’s uncertain whether that will have a direct bearing on reliability, but it certainly could provide the right level of resources for extensive verification and debug of designs if combined companies decide to put their resources there. So far, that has not been determined.
New approaches to packaging are another unknown. As Moore’s Law becomes more expensive to follow, many companies have begun developing chips based on fan-outs and 2.5D architectures. While the multi-chip module approach has been around since the 1990s, putting the pieces together using interposers and high-bandwidth memories is new. How those designs perform over time is unknown.
“This question can be answered in theory, but not in practice,” said Mike Gianfagna, vice president of marketing at eSilicon. “Based on that theory, a silicon interposer is stable and it’s mostly passive. Ideas such as metal migration and warpage are well understood, and the interposer does not add unreliability. But it’s not any less certain with advanced nodes, where you have tunneling effects, and gates are a certain number of atoms wide. The bigger issue is what happens to the speed of any of these chips over 10 years. The effects will become more pronounced at higher temperatures. That’s becoming more of the issue to contend with.”
Nick Heaton, distinguished engineer at Cadence, agrees. “The big question is how these designs handle progressive degradation. In automotive, we’re seeing a lot more functional safety tooling for tolerance of single or multiple failures. There’s a long way to go in this space, though. What does 28nm look like over five years? What we can do is maximize coverage at all levels. But use cases are still the real problem. You cover that with all the permutations you think you can get away with.”
Heaton noted that some of the teams developing advanced SoCs are comprised of hundreds of engineers. But he said that even with those large teams, there are still limited resources. “They have to be smart about what they’re testing. They run a certain number of low-level tests for hardware, for software, for hardware accelerators, for the operating system. That’s where we are at the moment.”
So how far do companies extend technology? That may be a much more interesting question as SoCs and new technologies begin getting adopted into safety- and mission-critical markets and have some history behind them.
“The big question is always how hard you’re pushing the technology,” said Drew Wingard, CTO of Sonics. “The closer to the edge, the less reliable it will be. We are on the edge of some very interesting tradeoffs over packaging complexity, known good die, and economics, meaning who’s to blame. What is the pain-to-gain ratio? The gain normally has to be very high, but experience can help change that.”
Whether that experience will yield good metrics for reliability remains to be seen. At this point, however, there are too many unknowns to draw conclusions about what goes wrong, what really causes it to go wrong, and who’s responsible when it does.