Different Ways To Improve Chip Reliability

Push toward zero defects requires more and different kinds of test in new places.


A push toward greater reliability in safety- and mission-critical applications is prompting some innovative approaches in semiconductor design, manufacturing, and post-production analysis of chip behavior.

While quality over time has come under intensive scrutiny in automotive, where German carmakers require chips to last 18 years with zero defects, it isn’t the only market demanding extended reliability. Smartphone makers now demand that chips work for a minimum of four years, versus the previous two-year lifecycle. And in some industrial and IoT applications, where replacing sensors is difficult, some chips need to last 20 years or more.

What is new is that some of these chips are being developed at leading-edge nodes, sometimes with heterogeneous compute elements. Automotive logic, for example, is being designed at 7nm, despite the fact that there is no history of using advanced-node chips under high-stress conditions involving heat, cold, and almost continuous vibration. And in high-performance chips, where at least some circuitry is always on, simulations now must include the impact of various operating modes to show how circuits age, how systems function over time, how accurate sensors will remain, and how often they need to be recalibrated.

This has given rise to a number of different approaches to improving reliability that go well beyond just the typical test and inspection. Among them:

• IP with built-in test.
• In-circuit/on-chip monitoring.
• Machine learning to spot patterns in data.
• More testing in different places.

Changes in IP
Commercial IP vendors are adding test capabilities into their products in order to meet demands for increased reliability in automotive, industrial, and medical applications.

Integration of IP always has been an issue, and that typically has been handled through better characterization and an emphasis on IP that has been proven in silicon. But as IP is added into devices used in safety-critical applications, being able to test it throughout the lifecycle of the chip or system is becoming more important.

“Customers want test capabilities built into their IP,” said Suk Lee, senior director of TSMC’s Design Infrastructure Management Division. “That’s happening even with high-speed analog. There is a loop-back mode for self-test. The IP industry is participating more heavily in this, and we’re starting to pay more attention to it.”
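
A loop-back self-test of the kind Lee mentions can be pictured with a small behavioral model. The sketch below is only an illustration, not any vendor's implementation: it assumes a PRBS7 pattern (polynomial x^7 + x^6 + 1) as the test stimulus, and the names `prbs7`, `loopback_self_test`, and the `channel` callable are hypothetical stand-ins for the transmit path, the loop-back connection, and the receive path.

```python
def prbs7(seed=0x7F, nbits=127):
    """Generate a PRBS7 bit sequence (x^7 + x^6 + 1) from a 7-bit LFSR."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero or the LFSR never leaves zero")
    bits = []
    for _ in range(nbits):
        # Feedback is the XOR of the two highest bits of the 7-bit state.
        newbit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | newbit) & 0x7F
        bits.append(newbit)
    return bits


def loopback_self_test(channel, nbits=127):
    """Send a known pattern through `channel` and count mismatched bits.

    `channel` is a hypothetical callable that models the looped-back path:
    it takes the transmitted bits and returns the received bits.
    """
    sent = prbs7(nbits=nbits)
    received = list(channel(sent))
    if len(received) != len(sent):
        return len(sent)  # treat a dropped or extra bit as a total failure
    return sum(s != r for s, r in zip(sent, received))


# A healthy path returns the bits unchanged; a stuck-at-0 path does not.
healthy_errors = loopback_self_test(lambda bits: bits)
stuck_errors = loopback_self_test(lambda bits: [0] * len(bits))
```

A zero error count passes the block. Because the pattern is generated and checked in the same place, the test can be rerun in the field without external test equipment, which is the point of building it into the IP.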

One of the key drivers is automotive reliability, where failures — and especially recalls — are expensive both in terms of dollars and brand reputation. A 2018 report by AlixPartners noted that a faulty General Motors ignition switch cost the company $4.1 billion, while replacing defective airbags cost Takata $1 billion (and automotive OEMs even more money). Costs have been rising since then, and so has the risk assigned to the automakers and their supply chain as more safety features are added into vehicles.

“There are a couple of key trends,” said Ron Press, technology enablement director at Mentor, a Siemens Business. “One is plug-and-play, where you embed test technology into different systems. That could be small IP blocks, or it could be processor IP. Whatever is done with test, you plug it into the system. So you may have test controls embedded into it. The IJTAG standard says, ‘This is how to interface to a system, and this is how to tell it what to do.’ As everything gets more sophisticated, more test is embedded in it. So with the IJTAG (IEEE 1687) interface, you have on-chip clock controllers inside. There is a high-level reset that is control independent, so if you spot patterns, those are made independent and now built-in self-test can go check the results.”

The second trend involves AI. “As design gets more complex, there is more tile-based design,” Press said. “But for each block there is not necessarily logic at the top, and those blocks abut other blocks, so you don’t have the ability to add more DFT logic. So now you have to make a self-test decision about when to run that, and when to interrupt it. In automotive, you can apply specific patterns and do a test between blocks, and once it’s connected, you can test for any patterns. Basically you’ve changed the level of abstraction where you bring it up a level for the user but also go down a level to test for more subtle defects.”

In-circuit/on-chip monitoring
Finding those more subtle defects involves understanding how the various components in a chip or package interact with each other and how they’re supposed to behave. In effect, it requires a baseline for what is considered normal behavior and what is considered an acceptable distribution for a chip’s functionality beyond that baseline.
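
As a rough illustration of a baseline plus an acceptable distribution, the sketch below fits a mean and standard deviation to known-good monitor readings and flags later readings that fall too far from them. Everything specific here is an assumption made for the example: the readings are imagined as ring-oscillator frequencies in MHz, the 3-sigma band is an arbitrary choice, and `fit_baseline` and `flag_readings` are hypothetical names rather than any product's API.

```python
from statistics import mean, stdev


def fit_baseline(reference):
    """Return (center, sigma) for a set of known-good reference readings."""
    return mean(reference), stdev(reference)


def flag_readings(readings, baseline, max_sigma=3.0):
    """Return True for each reading outside the accepted band around the baseline."""
    center, sigma = baseline
    return [abs(r - center) > max_sigma * sigma for r in readings]


# Hypothetical reference readings (MHz) taken when the part is known to be good.
baseline = fit_baseline([100, 101, 99, 100, 102, 98])

# Later readings from the field: only the last one drifts outside the band.
flags = flag_readings([100.5, 104.0, 105.0], baseline)
```

Real monitoring would track many parameters per chip and account for temperature and voltage, but the structure is the same: define normal first, then measure distance from it.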

“The bigger names in the industry want visibility into their chips,” said Noam Brousard, vice president of product at proteanTecs. “Of course, it’s about how they degrade in the field, but it’s not just that. Making sure you have the data granularity that comes with on-chip monitoring ensures parametric coverage. This increases quality by an order of magnitude, weeding out false-positive/false-negative outliers, systematic issues, latent defects and so much more. And if something does happen, now they can trace it back to the source and essence of the problem. You can test everything to the bone, but when something goes wrong you want to be able to call someone. Right now there’s a lot of finger pointing going on. With Agent-embedded chips, they all speak the same language and you can follow the problem back to the source, whether that’s in the design of the chip, the production of the chip, or the production of the board.”

The ultimate goal is to self-correct problems discovered in the field, but that kind of capability is still in early research.

“The first part is to analyze the design using machine learning algorithms and embed Agents that continuously create new data on the chip’s profiling, health and performance,” Brousard said. “Then that data, both from the design analysis and from the Agent readouts, is uploaded to the analytics platform and inferred. You can add or drop different scenarios into the lifecycle of the chip and connect the dots yourself, or use a more structured approach for investigating issues, using machine learning data analytics to learn things that weren’t visible before, but for which we now have the data. The data analytics is used to shine a light on even the most obscure parts of your chips, systems and products throughout their lifecycle.”

Identifying patterns
If irregularities are found in chip designs and manufacturing processes, that data needs to be looped back into designs so the next generation of chips can be fixed. But this comes down to more than just understanding distributions of data. It also requires an understanding of which data is most important for a particular market or application.

“Patterns require a blend of machine learning and domain expertise,” said Kevin Robinson, director of customer success at yieldHUB. “It’s very easy to get lost with all of this data. Before you start machine learning you need to consider why you are using it in the first place. What do you want to get out of that data? That allows you, once you get results, to prioritize actions. It’s much more straightforward that way.”

Straightforward, yes, but not easy. “In more complex system companies, they need a vision of which is the key data for a particular product line,” Robinson said. “So they need to learn the different relationships between data. This is a different focus than what they did with data in the past. One of the problems is that as companies get larger, it makes it difficult to be dynamic. There are large companies that can do it, but it requires empowered divisions. You see this in some of the largest companies, where there are almost autonomous units with competing products even though it’s the same parent company.”

More tests in different places
One of the keys is doing more testing, but that also requires a standard model against which tests can be run. This is the whole idea behind a digital twin, but it’s becoming less well defined as machine learning systems begin to optimize for different environments and use cases.

“Right now you’re separating different components in a system,” said Jeff Phillips, head of automotive marketing at National Instruments. “Let’s take an autonomy platform. In one scenario, you would try to isolate just the hardware and at least validate that the sensors and the input are capturing the right data. You can isolate the software from that and do a physical measurement, electronic test and validation. But then, when you plug in the software, we’re seeing most companies are relying on a set of preconfigured scenarios they’re testing against, and then using simulation to broaden that to the whole scenario set.”

This works to a point, but it doesn’t necessarily pull in all the corner cases, and results are more approximate than clear-cut.

“The challenge becomes the difference between deterministic testing versus probabilistic testing,” Phillips said. “In deterministic testing, you know the answer. You can verify it’s correct. In probabilistic, you’re verifying that’s probably the right answer. The algorithms that run the autonomy platform in a car are generally made by machine learning. They’re not human-written, and they’re not easily dissectible or readable for a software team to go in and code check. So you can get the right output from a scenario in the lab, but one minor variation in all of these variables that you’ve made static in the lab can create trouble in the real world. That’s where accidents happen, and we can’t test in the environment the cars will live in because it’s dangerous.”
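
The distinction Phillips draws can be made concrete with a toy example. In the sketch below, `should_brake` is a hypothetical stand-in for a learned controller, and the thresholds, input ranges, and 5% sensor error are all invented for illustration. The deterministic check compares fixed inputs against known answers. The probabilistic check runs randomized scenarios and, if none fail, reports only an upper bound on the failure rate (the standard 95% bound for zero failures in n independent trials, 1 - 0.05^(1/n)).

```python
import random


def should_brake(distance_m, speed_mps):
    """Toy stand-in for a learned controller: brake when time-to-collision < 2 s."""
    return distance_m / speed_mps < 2.0


def deterministic_check():
    """Fixed inputs with known answers: the result is simply right or wrong."""
    return should_brake(10.0, 10.0) and not should_brake(100.0, 10.0)


def probabilistic_check(trials=10_000, seed=0):
    """Run randomized scenarios; return (failures, 95% upper bound or None)."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        speed = rng.uniform(1.0, 40.0)        # m/s, assumed range
        distance = rng.uniform(1.0, 200.0)    # m, assumed range
        measured = distance * rng.uniform(0.95, 1.05)  # assumed 5% sensor error
        # Assumed safety requirement: braking is mandatory under 1 s to collision.
        must_brake = distance / speed < 1.0
        if must_brake and not should_brake(measured, speed):
            failures += 1
    # With zero observed failures, the true rate is below this bound with
    # 95% confidence; it is never shown to be zero.
    bound = 1.0 - 0.05 ** (1.0 / trials) if failures == 0 else None
    return failures, bound
```

Even a clean run of 10,000 scenarios only supports a claim that the failure rate is below roughly 3 in 10,000 for the scenario distribution that was sampled, which is why variables held static in the lab remain a concern.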

He’s not alone in seeing that. “In automotive, the biggest issue is reliability,” said John Hoffman, CyberOptics. “You want to make sure there are zero escapes. One customer we have manufactured 25 million circuit boards and found 5 defects. That’s a huge problem on the algorithm side.”

One of the major problems CyberOptics and others have encountered involves solder balls. “There is a ton of variety of solder balls,” said Hoffman. “The variability in the surface is enormous, so you have to deal with non-uniformity. With the fabs, the process is dialed in so there is a lot less variability. With OSATs, this is a tougher problem to solve.”

It is important to understand where problems creep in, though, and good data coupled with machine learning can help with that.

“Let’s assume that we analyze images and now we have a whole bunch of parametric data that we have extracted,” said Sam Jonaidi, vice president of automotive solutions at Optimal Plus. “For example, let’s say a scratch is indeterministic. You don’t know where the scratches are going to occur. That could happen anywhere on the panel. So if I can find the location of the scratch, the characteristics of scratch, how thick is it? How wide is it? How long is it? What shape it is in? I can feed that into my machine learning algorithms. And now I can operate on the previous step, I can go back and say, ‘Okay, what are the primary factors that could cause this?’ So again, we use machine learning for that process, because the parameters are so large that it cannot be done otherwise.”

Advanced node designs
It doesn’t help that the most critical logic in a vehicle is being developed at the most advanced process nodes.

“More than 50% of the failures at 0km come from electronics, which includes boards, modules, and semiconductors,” said Oreste Donzella, executive vice president and chief marketing officer at KLA. “The automotive industry has been a slow adopter of advanced technologies due to reliability concerns. Carmakers historically want to be two or three nodes behind smartphone or server industries. But now there is no choice. If the automotive industry wants to make progress in autonomous driving and connectivity, it needs to adopt more and more advanced nodes. We already see plans to introduce 7nm semiconductor technology in cars, which will potentially increase the risk of reliability failures.”

KLA advocates collecting all data during the manufacturing process and using machine learning to find patterns when correlating that data to electrical testing.

“A massive amount of data today is generated in the wafer fabs and assembly lines,” Donzella said. “This piece of information is used for process control and line monitoring, but not for screening. When combined with other electrical and parametric testing results through machine learning methodologies, in-line data can make the fab inspection strategy more intelligent and relevant. KLA calls this methodology I-PAT, or in-line parts average testing. We are now at early validation phase at few semiconductor wafer fabs and the first results are promising.”
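
The article does not describe how I-PAT works internally, so the sketch below shows only conventional parts average testing on a single electrical parameter, the screening idea the name builds on. It assumes the commonly used robust statistics (median as the center, interquartile range divided by 1.35 as sigma) and a 6-sigma band. The function names and the sample values are hypothetical.

```python
from statistics import median, quantiles


def pat_limits(values, k=6.0):
    """Return (lower, upper) screening limits from robust statistics."""
    q1, _, q3 = quantiles(values, n=4)
    center = median(values)
    robust_sigma = (q3 - q1) / 1.35  # IQR-based estimate, insensitive to outliers
    return center - k * robust_sigma, center + k * robust_sigma


def screen(parts, k=6.0):
    """Return the IDs of parts whose measurement falls outside the PAT limits."""
    lower, upper = pat_limits(list(parts.values()), k)
    return sorted(pid for pid, value in parts.items() if not lower <= value <= upper)


# Hypothetical measurements for one parameter across nine dies.
measurements = {
    "d1": 1.00, "d2": 1.01, "d3": 0.99, "d4": 1.02, "d5": 0.98,
    "d6": 1.00, "d7": 1.01, "d8": 0.99, "d9": 1.30,
}
outliers = screen(measurements)  # "d9" is within spec limits here but far from its peers
```

The point of this style of screening is that a part can pass its datasheet limits and still be rejected for behaving unlike its neighbors. Bringing in-line inspection data into that decision, as Donzella describes, extends the same idea beyond electrical test results.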

How much of this data ultimately is shared with carmakers and chipmakers remains to be seen. Fabs traditionally have been very possessive of data that can make a real difference with reliability. “However, the reliability problem is so critical in the automotive industry that I believe players will be more collaborative to share information under a strict protocol,” Donzella said. “So far, I-PAT has been very well received by the entire automotive ecosystem, including OEMs and Tier 1s.”

New approaches ensuring sufficient coverage increasingly go well beyond the old methods of testing once during manufacturing and binning the parts that don’t measure up to spec. In a variety of markets, reliability increasingly is a marriage of continuous testing, more data analytics and domain expertise to prioritize the most important data.

The semiconductor industry has no choice but to improve reliability of what it designs and manufactures, because liability is now being apportioned across the supply chain, particularly in safety-critical markets and at the most advanced nodes. Nothing in electronics is perfect, but the goal is to get design and manufacturing as close to perfect as possible, and that is spawning some innovative approaches at every step in the design through manufacturing flow.

—Susan Rambo contributed to this report.


Jerry Cohn says:

That is why many companies have increased their use of low alpha solder bumps to eliminate soft errors caused by alpha particle emissions in their ICs.
