Making Chips To Last Their Expected Lifetimes

Lifecycles can vary greatly for different markets, and by application within those markets.


Chips are supposed to last their lifetime, but that expectation varies greatly depending upon the end market, whether the device is used for safety- or mission-critical applications, and even whether it can be easily replaced or remotely fixed.

It also depends on how those chips are used, whether they are an essential part of a complex system, and whether the cost of continual monitoring and feedback can be amortized across the price of the overall system. It’s not a simple equation, and there are no simple answers, particularly at advanced nodes where extra circuitry can increase power consumption and reduce performance.

“To be able to predict reliability and performance changes by having a baseline understanding of chips and being able to extrapolate from that to accurately predict when things are going to start failing is the Holy Grail,” said Steve Pateras, senior director of marketing for test products in Synopsys’ Design Group. “You need to be able to do performance and security optimization and ultimately maintain reliability on an ongoing basis. And if a car is going to start failing, you want to get off the road.”

In the past, reliability often was considered part of the manufacturing process. Chips would be baked in an oven for a period of time or subjected to carefully monitored vibration to determine when a device would fail. While different scenarios could be simulated prior to manufacturing, chips could still fail, but the defectivity rate would decrease depending upon how tightly various processes were controlled and how much margin could be added to provide some type of failover. This is evident today with ECC memory.

The challenge going forward, particularly at advanced nodes, is to leverage existing circuitry and limit the amount of redundancy, which can impact power and performance. This means failover needs to be more targeted to critical functions, and it requires much more understanding of what is happening on a chip or in a package at any point in time.

“With a chip in the field for an extended period of time, then scientifically we know what the physics of failure behind that are,” said Noam Brousard, system vice president at proteanTecs. “How do we monitor and measure it over time and deduce the implications of these results? A straightforward way is to continuously test how the chip is behaving and if its performance is degrading over time. This falls short on multiple fronts. When monitoring the functional reaction to degradation we might only notice the effects of the degradation once they actually cause a failure, or we won’t know the actual physical cause for the degradation. And maybe we won’t be able to use this knowledge to predict when in the future the failure will happen. Another approach is to use telemetry, based on what we call Agents, embedded in the silicon. Those are always monitoring the physical parameters of the silicon at the transistor level, mapping their degradation over time. You can plug that data into well-known formulas for hot carrier injection or NBTI, which cause aging in chips, and run ML algorithms that monitor the degradation rate and from there you can calculate and predict time to failure ahead of its actual occurrence.”
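
The approach Brousard describes, feeding monitored degradation into an aging model and extrapolating to a failure point, can be sketched in a few lines. The example below is a minimal illustration, not any vendor's method. It assumes a power-law drift model (shift = A * t^n), a form commonly used as an empirical fit for bias temperature instability, and it uses synthetic readings and a hypothetical 30 mV failure limit. The function names `fit_power_law` and `hours_to_limit` are made up for this sketch.

```python
import math

def fit_power_law(hours, shifts_mv):
    """Least-squares fit of log(shift) = log(A) + n * log(t).

    Returns (A, n) for the assumed model shift = A * t**n.
    """
    xs = [math.log(t) for t in hours]
    ys = [math.log(s) for s in shifts_mv]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    n = sxy / sxx
    a = math.exp(y_mean - n * x_mean)
    return a, n

def hours_to_limit(a, n, limit_mv):
    """Invert shift = A * t**n to find when the shift reaches limit_mv."""
    return (limit_mv / a) ** (1.0 / n)

# Synthetic monitor readings (assumption): threshold-voltage shift in mV
# sampled at several power-on hour counts, generated from A=2.0, n=0.2.
hours = [100, 500, 1000, 5000, 10000]
shifts = [2.0 * t ** 0.2 for t in hours]

a, n = fit_power_law(hours, shifts)
ttf = hours_to_limit(a, n, limit_mv=30.0)  # hypothetical failure limit
print(f"A={a:.3f}, n={n:.3f}, predicted hours to limit={ttf:.0f}")
```

Real monitor data is noisy and depends on temperature, voltage, and workload, so a production model would need to account for those stress conditions rather than time alone.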

Synopsys’ Pateras agrees. “Part of this is not just figuring out what’s going to happen,” he said. “It’s providing information so that, in some cases, the infrastructure is performing corrections. You need to have redundancy, and critical processors have to be dual-lockstep. You also need to be monitoring this kind of activity.”

Fig. 1: Industry estimates on expected lifetimes of chips. Source: Industry estimates/Semiconductor Engineering


Data management
A key aspect of lifecycle management is being able to collect and move data freely up and down the design-through-manufacturing chain. In the past, much of this data has been kept in silos. While that proved an efficient way to manage individual process steps, the growing complexity of chips, packages and systems makes sharing that data useful for ensuring reliability across all of those devices.

“Sometimes, if you know how an etch tool or a lithography tool behaved, that could influence what additional tests you want to do,” said John Kibarian, president and CEO of PDF Solutions. “Tests can benefit from knowing what happened in the processing tools upstream. Customers can make more informed decisions about what to test, when to test, and whether to put additional stressing on certain parts because of a particular etcher or processing capability. So the value of sharing all that data can make every individual system more effective. It’s not all about more testing. It’s about better testing and smarter testing.”

That requires a much more integrated approach, but there is a potentially significant payback. “There are cases where it doesn’t really make sense to do a final test if you’re going to do a system-level test, anyway,” said Doug Lefever, president and CEO of Advantest America. “And then there is migration of things upstream into wafer sort and everything in between. It would be myopic for us to think about just traditional ATE insertions. You need a more holistic view, including things like virtual insertions, where you could take data from the fab or a previous test and combine that with an RMA or in situ things in the field, and use that data to do a software-based insertion with no hardware at all. We’re looking at that right now.”

The data trail can be extended to the end user, as well. “You can combine the data all the way back to production to track down a problem in a car based on certain characteristics,” said Uzi Baruch, vice president and general manager of the automotive business unit at OptimalPlus, which is a division of National Instruments. “This goes beyond test data. You can take measurements on the motor itself to figure out the root cause of dimensional data and you can compare measurements.”

But it’s also important to filter out data that is not essential to determining whether a device is functioning properly, where the anomalies are, and how that device is performing over time. As devices become more complex, particularly at advanced nodes where there are multiple power rails and voltages, all of this becomes much more challenging.

“Sometimes you aren’t even aware of the problem,” said Baruch. “The data from a car has to be connected back to the production line in case problems do show up. But you also have to screen out stuff such as weather data, where there is no direct correlation to how the system is functioning. If you’re doing root-cause analysis, you need data that is relevant.”
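
One simple way to picture the screening step Baruch describes is to rank candidate signals by how strongly they track a health metric and drop the ones that do not. The sketch below is only an illustration under stated assumptions: the signal names, the values, and the 0.5 cutoff are all hypothetical, and Pearson correlation is a stand-in for whatever relevance test a real analytics platform would use. A weak linear correlation does not prove a signal is irrelevant, so this is a first-pass filter at best.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def relevant_signals(signals, health, min_abs_r=0.5):
    """Keep signals whose |correlation| with the health metric meets the cutoff."""
    return sorted(
        name for name, values in signals.items()
        if abs(pearson(values, health)) >= min_abs_r
    )

# Hypothetical field data: a normalized health score that degrades over time,
# plus three candidate signals recorded at the same sample points.
health = [1.00, 0.98, 0.95, 0.91, 0.86, 0.80]
signals = {
    "junction_temp_c": [70, 72, 75, 79, 84, 90],
    "supply_droop_mv": [10, 11, 13, 16, 20, 25],
    "outside_humidity_pct": [40, 60, 40, 60, 40, 60],
}

kept = relevant_signals(signals, health)
print(kept)
```

With this made-up data, the two on-chip signals track the health score closely and are kept, while the weather-style signal falls below the cutoff and is screened out.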

Disaggregation issues
This potentially becomes more difficult as various functions are unbundled from an SoC. The rising cost of developing chips at the most advanced nodes, coupled with the decreasing power/performance improvements at each node, is forcing a number of chipmakers to offload functions that don’t scale well, or which aren’t critical, into separate chips in a package.

“We can create heterogeneous integration that allows us to put chips into a system and put in completely different type of chips — whether it’s analog, mixed signal, digital, or even sensors — all on the same platform for this overall integration conversion that we see today,” said Ingu Yin Chang, senior vice president of ASE, in a recent presentation. “There are various types of integration, whether that’s silicon photonics for onboard optical solutions, and power integration, where you try to achieve certain power for high-performance computing. But die partitioning can improve yields, the ability to integrate high-bandwidth memory, and to use various IP to reduce the power.”

All of those factors can impact the life expectancy of a system. Partitioning can mean less heat, less circuit aging, a reduction in electromigration and other physical effects, and potentially a longer lifetime for dielectrics and other films. But there still are some kinks to work out of the supply chain.

“The challenging part is that memory stack is provided by one customer, the SoC is provided by a different customer, and then the packaging house provides the silicon interposer and packages everything together,” said Alan Liao, director of product marketing at FormFactor. “With only one chip, you can tell if it’s good or bad. But when a packaging house takes one die from Company A and one die from Company B and integrates them together, if that whole package fails, then who takes the responsibility? So you need to get the necessary data to work out the business model. We test memory, SoCs, and the interposer, and once they are put together we test everything again. Usually that happens at the test house. They collect all the data and analyze it.”

This is easier said than done in complex packages. The force that probes apply during test needs to be precise.

“If the force isn’t enough, you may not get a stable contact,” said Liao. “If it’s too high, then all the bumps are going to collapse. But there are just too many bumps to test at the same time, so we’re looking at advanced MEMS processing in our fab to produce probes for those features. It’s challenging, but so far from a probe card perspective it seems to be okay. But if we keep pushing in that direction, probably new materials and new processes will be needed.”

Microbumps are increasingly problematic. The idea behind microbumps is that more connections can be made to a chip, which in turn allows more signals to be routed between different die. But bumps also are suffering from some of the same problems as other types of scaling.

“The advantage of microbumps is that with very fine resolution, you can start processing on chip A and run it to chip B, and the logic available from the chip above may be closer than if you were to route a signal on a single chip,” said Marc Swinnen, director of product marketing at Ansys. “This is where we’re heading with complete 3D design. The first iteration of this is bumping, where in principle you can break up a single die and effectively double the reticle size, or potentially triple it if you put bumps on both sides of a chip. But reliability is a huge challenge. There is mechanical stress and warpage, and there is a potential for thermal mismatch.”

Shifting left, right and center
One big change that will be necessary to manage the entire lifecycle of chips is a free flow of data across different parts of the flow and out to the field, with a full loop back all the way to the architects of the system.

“It’s not just about when chips are in the field,” said Synopsys’ Pateras. “It starts from design, through manufacturing, through tests, bring-up, and ultimately in field. If you start amassing information and understanding early on, that can be used in the later stages. So as you get to the traceability, you can do correlations and baselines for prediction if you have the design and manufacturing data analyzed. We’re proposing an approach that has two basic components. One is visibility into the device. So it’s monitors and sensors intelligently embedded into the chips. And then these monitors and sensors will provide rich data about what’s going on, and we can extract that data. And all of these different phases of life cycles — design, simulation, manufacturing, yield ramp, test, bring-up optimization, and then in the field — provide rich data that we can analyze and ultimately feed back throughout the life cycle stages.”

The increasing adoption of AI/ML in manufacturing can help significantly in predicting how devices will age, but its reach across multiple elements on a board or in a package is limited by the amount of good data available.

“Machine learning is really helpful for the surface defect detection,” said Tim Skunes, vice president of technology and business development at CyberOptics. “And we are looking at various machine learning algorithms for our MRS (multi-reflection suppression) technology. We design very complex imaging systems, and in any given imaging system there are probably 1,000 different variables that can affect the final imaging performance. So you’ve got design for manufacturability, and how you partition the tolerances becomes extremely critical.”

Virtually all major equipment and tools vendors are working with some level of machine learning these days, and looking at how to extend that even further.

“Machine learning surprisingly enters into both our MEMS fab processing as well as our design process,” said FormFactor’s Liao. “So each of those probe cards is custom made to a particular application or chip. Being able to optimize that probe card is the key to getting high yield on the wafer, and we apply a large customer database for a type of application or design. They can utilize our library with machine learning and find that in the past there were 10 designs that matched this. Now that is automatically done behind the scenes using our design tool.”

The emphasis on reliability, traceability, and predictability is growing across multiple markets. While the automotive industry was the catalyst for all of this, it has spread to other markets where chipmakers are looking to fully leverage their investment in chip design. Having a chip fail prematurely in the field has economic consequences, either through a recall or a replacement.

Aligning all the necessary pieces across all process steps is a mammoth challenge, and one that will require much tighter integration of a complex and global supply chain. If that can be achieved, there are huge efficiencies to be gained on the business side, significant advances possible on the technology side, and many new opportunities for companies that enable this shift.



Tanj Bennett says:

It is not just the length of time, but the time under power. A vehicle’s power-on lifetime may be 5,000 hours in 10 years, while a cellphone may reach that in 10 months. The vehicle environment may include a hot engine but the phone may be lying out in full sunlight, or playing video for hours while flat on cloth with no heat path. So, product stress factors can be counter-intuitive.

SpeedRazor says:

A chip on NASA’s Voyager 1 failed after more than 40 years; its feedback could serve as a measure point.
