Reliability Over Time And Space

Challenges in the march toward known good systems.


The demand for known good die is well understood as multi-chip packages are used in safety-critical and mission-critical applications, but that alone isn't sufficient. As chips are swapped in and out of packages to customize them for specific applications, it is the entire module that needs to be verified, simulated, tested, and analyzed.

This is more complicated than it sounds for several reasons. First, the expected lifetime of these devices is increasing. Most of the leading-edge chips used in the past were either in servers, where the life expectancy averaged four to five years, or in smart phones, where the life expectancy was two years. The expected lifetime of chips in both of those sectors has increased. In the case of automotive applications, it could be as long as 18 years.

At the same time, feature sizes inside a chip have decreased. The result of decades of device scaling is more noise from power, electromagnetic interference, and a handful of proximity effects caused by more switching with less insulation. Resistance in thinner wires increases heat, and thinner dielectrics provide less insulation against all of the above.

To circumvent these and other issues, including the inability to deliver enough power to turn on all circuits at once, chipmakers have begun checker-boarding logic, with some blocks on, some off. That has a dual benefit of minimizing physical effects and reducing circuit degradation. But it also makes it harder to understand aging patterns across a device and across a system.
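The checker-boarding idea can be sketched in a few lines. The code below is an illustrative model, not any chipmaker's actual power-management scheme: it rotates which logic blocks are powered during each scheduling interval so that accumulated on-time, used here as a rough proxy for stress and aging, stays even across the die. The block names and interval counts are hypothetical.

```python
from itertools import cycle

def checkerboard_schedule(blocks, active_per_interval, intervals):
    """Rotate which logic blocks are powered each interval so that
    accumulated on-time (a rough proxy for aging) stays balanced."""
    on_time = {b: 0 for b in blocks}
    rotation = cycle(blocks)
    schedule = []
    for _ in range(intervals):
        # Power only a subset of blocks this interval; the rest stay dark.
        active = [next(rotation) for _ in range(active_per_interval)]
        for b in active:
            on_time[b] += 1
        schedule.append(active)
    return schedule, on_time

blocks = ["B0", "B1", "B2", "B3"]
schedule, on_time = checkerboard_schedule(blocks, active_per_interval=2,
                                          intervals=6)
# With 4 blocks, 2 active at a time, over 6 intervals,
# every block accumulates the same on-time (3 intervals each).
```

Even in this toy form, the trade-off the article describes is visible: on-time per block is easy to balance, but predicting how the whole device ages becomes harder once activity is deliberately scattered across it.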

Second, the part of a multi-chip device that fails first typically is the bond between multiple die, or between the package and the board. In the past, this was largely due to warpage and the packaging process itself, which put strain on the solder balls. Increasingly, the physical effects that impact a single chip are moving out to the board, as well. Distances between components on a board matter as much as they do on a single chip or in a multi-chip package, and density is increasing for components that previously were kept well separated, often for good reasons.

There are more layers in boards, and much more data moving everywhere. That means more silicon is on than off, and all of this is now functioning as part of a system of systems, where the impact of overheating in one area may affect something completely unrelated from a system architecture standpoint.

Third, not all of the components in these devices are being characterized on an apples-to-apples basis. It's impossible for an IP provider — regardless of whether the IP is hard or soft, or whether it's internally or externally developed — to predict what will be in the proximity of that IP block. While the specs may look similar, the blocks may be susceptible to different physical effects that never were considered in the design process. They also may be used for much longer periods of time, in different markets, than what they were originally designed to do.

Tools and various technologies are available today to monitor chip behavior and degradation over time, but the real challenge is to automatically identify and fix these problems. This is well beyond the capabilities of any single company. It will require standards from within the chip industry, as well as cooperation among different groups across various industries. And it will require more redundancy, more firmware, and more security to guard that firmware — all of which are essential to ensuring these devices function properly for as long as they are expected to do so.
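The "monitor, then act" loop can be sketched simply. The example below is a hypothetical illustration, not a real monitoring product's API: it compares in-field readings from an on-die monitor (here, a made-up delay-margin figure in millivolts) against time-zero baselines and flags any block whose drift exceeds a limit, at which point firmware could swap in a redundant block or derate the flagged one.

```python
def check_degradation(baseline_mv, reading_mv, drift_limit_mv=50):
    """Compare an on-die monitor reading against its time-zero
    baseline; flag the block if drift exceeds the limit.
    All values and the 50 mV limit are illustrative."""
    drift = baseline_mv - reading_mv
    return drift > drift_limit_mv

# Hypothetical per-block margin readings at test time and in the field.
baseline = {"cpu0": 420, "cpu1": 418, "npu": 430}
current  = {"cpu0": 401, "cpu1": 415, "npu": 362}

flagged = [blk for blk in baseline
           if check_degradation(baseline[blk], current[blk])]
# The "npu" block has drifted 68 mV, past the 50 mV limit, so it is
# flagged; a controller could then map in a spare or reduce its load.
```

The hard part the article points to is everything around this loop: agreeing on what the baselines and limits mean across vendors, and securing the firmware that acts on the result.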

We are at the point today where at least we can identify the problems. The next step is to fix them, and that may be more difficult than usual, given all of the various technologies still being developed. This is no longer about the next rev of a smart phone or computer processor or memory. It’s about adding AI and ML into nearly everything as markets demand customized solutions. The concerns may be very different in a car than in an industrial robot or a 5G base station, but failures in any of them can cause significant problems.
