Logic Chip, Heal Thyself

Will self-healing chips put longevity and safety into automotive-bound electronics?


If a single fault can kill a logic chip, that doesn’t bode well for the longevity of complex multi-chip systems. Obsolescence in chips is not just an industry ploy to sell more chips. It is a fact of physics that chips don’t last more than a few years, especially if they are overheated or hit with higher voltage than they can stand.

The testing industry does a great job finding defects during manufacturing and predicting what will fail. And the capability of testing will only improve. Memory has ways to deal with faults that can keep a memory chip running effectively. But when will the logic portions of a chip be able to heal themselves?

There are several ways to “heal” chips. One is to literally reflow the metal in wires, which is currently in research at universities and some companies. The second is to use margin built into the chip with extra circuits, which is the approach used for error-correcting code (ECC) memory. Margin can range from fully redundant chips or circuits to generic transistors. The third is to override defective hardware with a software update, which is the approach that Apple took in 2010 with an antenna glitch in the iPhone 4.

R&D sci-fi — biological-style healing
The biological type of self-healing goes beyond the comfort level of industry to talk about. “Self-healing in chips is sometimes used to describe phase-change memory. That’s still in infancy compared to mature DRAM/NAND memory technologies. ‘Self-healing’ is also used in the academic world to describe chips that can truly survive catastrophic damage by leveraging suitable redundancies in their architecture. From what we know, it’s still an active research topic in the academic world but hasn’t got any commercial traction so far,” said Subodh Kulkarni, president and CEO, CyberOptics.

It is still too early for a commercialized biologic-type healing. “We’re not seeing a lot of that,” said Doug Elder, vice president and general manager of the semiconductor business unit at OptimalPlus. “We do see bits and pieces of it in the early testing phases and prototype phasing, and in R&D — customers collecting data — but we haven’t seen them in volume, per se. And if we have, we’re not made aware of the fact that they were self-healing chips and they had different test characteristics.”

Chips with wires that automatically regenerate material or rewire around damage, as humans can after an injury, have made some strides in R&D. Getting metals to ‘heal’ by regenerating usually means applying some type of heat. Carbon nanotube microcapsules that break and repair components when needed are another concept being tested. Rewiring around damage may still be the most practical application. Both solutions and hybrids thereof are being tried, and governments obviously have an interest. Among the solutions:

  • Applying heat. When NASA and KAIST (Korea Advanced Institute of Science and Technology) researched making a chip-sized spacecraft that can survive a fast trip (20 years) to the nearest star, they looked at gate-all-around nanowire transistors that could “self-heal” by applying current/heat to an extra contact on the gate of a transistor that gets zapped by cosmic radiation. Radiation degrades silicon dioxide layers with defects, making transistors eventually leak current. Powering down the craft every few years and applying heat could heal these defects, according to an IEEE Spectrum article.
  • Bandaging with nanotubes. The idea to use carbon nanotubes stuffed in polymer microcapsules has been around for a while. The microcapsules are triggered to break and electrospray the wounded area with the conductive carbon nanotubes to fill in a defect such as a crack in a wire. The University of Illinois at Urbana-Champaign worked on self-healing circuits.
  • Controlling errant nanotubes. Nanotubes themselves can be defective and cause problems. They aren’t always conductive, which adds noise to a digital logic circuit but does harm to an analog circuit. According to IEEE Spectrum, research from MIT, funded by DARPA, added a layer of carbon nanotubes on top of already manufactured logic chips and added RRAM to that layer, so each transistor had the layer and RRAM. When a non-conductive nanotube appeared, the RRAM became like a lightning rod to safely redirect current from a short circuit. The transistor healed itself.
  • Rerouting in ASICs. The California Institute of Technology (Caltech) High-Speed Integrated Circuits laboratory, with DARPA funding, put an ASIC on tiny power amplifiers in 2012 for millimeter-wave frequencies and used the ASIC and sensors to watch for broken pathways. The ASIC would then figure out how to reroute most efficiently. After shooting the chips with a laser to take out transistors, the team reported the ASIC quickly figured out the best reroute around destroyed transistors. The ASIC was trained to watch for process variation and transistor mismatch, load impedance mismatch, and partial and total transistor failure, according to the abstract of the article “Integrated Self-Healing for mm-Wave Power Amplifiers,” published by IEEE.
  • Rerouting in memory. Phase change memory (PCM) changes its architecture or phase when electrons are applied. Eventually the memory cells develop voids after many phase changes, which makes them ineffective. Yale School of Engineering found a way to self-heal the voids that appear in PCM over time. By enclosing the memory cells in metal, electrons have an alternate pathway when a void appears, and that keeps the cell functioning.

So far, none of these approaches has shown up in production chips. “My conversations with folks have been, ‘This is interesting.’ The technology is something that people are looking into. It’s still fairly expensive to get into the more commercial applications and the automotive market,” said Elder. “People are thinking about it, but it’s not mainstream yet. We may see it sometime in the future, but again, nobody’s really leaning into it, at least from my perspective, on test floors and in process steps, because it’s still too expensive.”

The technology may be too expensive for a while yet, and this is why margining options are exploding.

Building in margin
Providing a failover option is, at least today, the most viable option for integrated circuits. Failover means automatic switching to a redundant or standby system so that a device continues to operate after a block or transistor or memory bit fails. In the past this approach rarely gained traction as a design approach, except for memory, due to several main reasons:

  • In consumer devices such as smart phones, devices typically were replaced every couple of years, so latent defects rarely caused problems. This has changed as advanced node chips are used in cars for extended periods of time. Even major smart phone OEMs now are demanding that chips last four years rather than two.
  • For mission-critical servers, chips were subject to a battery of tests, such as baking them in ovens. That stopped being feasible at the most advanced nodes, where subjecting chips to a battery of “kitchen” tests can destroy them due to higher density and thinner dielectrics. Extra transistors are an alternative, and the added cost is generally absorbed in the cost of a server.
  • Most chips in the past were planar, so leads were exposed and easily attached to a tester. But with advanced packaging, which is necessary to achieve power/performance benefits that used to be associated with scaling, those leads are no longer exposed. And once chips are placed in a package, they no longer can be inspected for defects.

For these reasons and others, such as increasingly strict automotive and industrial reliability standards, there is far more interest in redundancy in all digital circuitry. What’s changing is that instead of making everything redundant, chipmakers are looking at exactly what needs to be backed up, whether generic transistors can be developed for multiple functions, and whether failover needs to happen on the same chip or whether it can be done on another chip in a package.

“We’ve seen this happening on a number of different levels,” said Lee Harrison, marketing manager for automotive test at Mentor, a Siemens Business. “On the hardware test side, there is massive crossover between AI and automotive. Auto OEMs are developing AI systems, and they run a lot of that in system test because there are arrays of processing cores. The goal is to keep the whole thing running and make sure the hardware is correct. What’s different now, though, is the increasing demand on usage with self-driving vehicles. The number of hours these vehicles are expected to be used is increasing significantly. We used to base these tests on 10,000 hours of usage over 10 years. That’s changing. On top of that, an AI chip may have 1,000 or more identical processing cores, and because that’s the brain of a car you need to implement repair for logic.”

The idea is to have spare cores — on average, 10 or more — that can be turned on when needed throughout the lifetime of a chip. That has an added benefit, as well. If one or two cores don’t work after manufacturing, they can be replaced by other cores in the device, which can dramatically improve yield.

“This is incremental soft repair,” said Harrison. “If a core continually fails in a car, then every time you start the car it shifts to another core. This is one of the benefits of centralization of a lot of the processing.”
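The incremental soft repair Harrison describes can be sketched as a start-up routine that tests each core and swaps in a spare when one fails. The following is a minimal illustration in Python, not any vendor's implementation. The names CoreMap, report_failure and power_on_self_test are assumptions made for the example, and the is_healthy callback stands in for a real logic self-test.

```python
class CoreMap:
    """Maps logical cores onto physical cores, holding some in reserve."""

    def __init__(self, active, spares):
        # Logical core i starts on physical core i; the rest are spares.
        self.mapping = list(range(active))
        self.spares = list(range(active, active + spares))
        self.retired = set()

    def report_failure(self, logical):
        """Retire the physical core behind `logical` and swap in a spare.

        Returns False when no spare is left, meaning the part needs service.
        """
        self.retired.add(self.mapping[logical])
        if not self.spares:
            return False
        self.mapping[logical] = self.spares.pop(0)
        return True


def power_on_self_test(core_map, is_healthy):
    """Run at each start-up: test every mapped core and remap any that fail."""
    for logical in range(len(core_map.mapping)):
        # Keep remapping until this logical core lands on a healthy core.
        while not is_healthy(core_map.mapping[logical]):
            if not core_map.report_failure(logical):
                return False
    return True
```

In this sketch, a chip built with four active cores and two spares keeps starting normally through two core failures. A third failure leaves no spare, the routine returns False, and the unit would need service.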

Even when something does go wrong, though, it’s not always instant or total. Some of this can be spotted through better test coverage during manufacturing, but some of it can only be identified by comparing extremely small differences across data from different tests.

“You may need two tests or more to detect an issue because what you’re looking for is the delta for shifts in output,” said Carl Moore, yield management specialist at yieldHUB. “You can’t just lump all of the data together and figure this out. In an RF power amp, for example, if the current goes up typically the power goes up. But they may not go up at the same rate. With 5G chips, we’re seeing multiple quadrants on a chip and multiple repeated blocks. You can compare results on those blocks, but you really need to look at the specific current in a specific mode and the delta between them. This represents a whole new area of analyzing data to define those deltas. You have to look at data much more carefully and find subtleties in that data.”
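The kind of delta analysis Moore describes can be illustrated with a small Python sketch. It compares how much a measurement shifts between two tests on each repeated block, then flags blocks whose shift departs from the typical one. The function name, the dictionary layout and the use of the median as the "typical" shift are assumptions for illustration, not yieldHUB's method.

```python
from statistics import median


def delta_outliers(test_a, test_b, tolerance):
    """Flag blocks whose shift between two tests departs from the typical shift.

    test_a, test_b: dicts mapping block name -> measurement (e.g. current in mA)
    tolerance: allowed deviation from the median shift, in the same units
    """
    # Shift of each block between the two tests.
    deltas = {name: test_b[name] - test_a[name] for name in test_a}
    # The median shift across repeated blocks serves as the reference.
    typical = median(deltas.values())
    return {name: d for name, d in deltas.items() if abs(d - typical) > tolerance}
```

A block can pass both tests against fixed limits and still stand out here, because only its change relative to its neighbors is examined.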

Planning for failure
A key factor in self-healing chips involves anticipating failure and modeling it.

“One approach is to characterize a chip at elevated stress conditions, such as higher voltage, pressure and humidity and measurements taken in a package, and fit that into the lifetime curve,” said Jon Holt, worldwide fab applications solutions manager at PDF Solutions. “The challenge is that, depending on the application, the wearout time may be different. You can’t foresee that until you collect all of the data.”

And even then, none of this may be obvious in a complex architecture without an understanding of where a specific problem might show up. Making that happen requires both more effective testing at various steps in the manufacturing process and in-circuit analytics to monitor a device as it is being used.

“Critical parameters associated with advanced ICs (threshold voltages, drive currents, interconnect resistances, capacitor leakages, etc.) will degrade with time, before reaching the tipping point of failure,” said Evelyn Landman, CTO of proteanTecs. “On-chip monitoring allows users to detect real-time degradation in the deployed devices, and shifts the weight from relying purely on accelerated lifetime tests to in-field failure prediction. This presents a new realm of reliability science, a basis for time-to-failure modeling, based on physics-of-failure mechanisms. By continuously monitoring the system’s critical parameters, users are alerted on faults in advance so they can take corrective action. This becomes a must-have capability as device designs, materials and manufacturing processes become more complex. Service providers can now control maintenance costs by preempting failures, estimate and extend system life, deploy maintenance and repair resources proactively, increase component quality and reduce operational costs.”
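As a rough illustration of time-to-failure estimation from monitored parameters, the Python sketch below fits a straight line to periodic readings and extrapolates to a failure limit. The linear-drift assumption, the function name and the units are all hypothetical. Real wearout mechanisms are often non-linear, and this is not a description of proteanTecs’ models.

```python
def hours_until_limit(hours, readings, limit):
    """Fit a straight line to monitor readings and extrapolate to `limit`.

    Returns the estimated hours remaining after the last reading, or None
    if the fitted line does not reach the limit in the future.
    """
    n = len(hours)
    mean_h = sum(hours) / n
    mean_r = sum(readings) / n
    # Ordinary least-squares slope and intercept.
    sxx = sum((h - mean_h) ** 2 for h in hours)
    sxy = sum((h - mean_h) * (r - mean_r) for h, r in zip(hours, readings))
    slope = sxy / sxx
    intercept = mean_r - slope * mean_h
    if slope == 0:
        return None  # no measurable drift
    crossing = (limit - intercept) / slope
    remaining = crossing - hours[-1]
    return remaining if remaining > 0 else None
```

A monitoring service could rerun such an estimate each time new readings arrive and raise an alert once the remaining margin drops below a maintenance window.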

It’s not just about predicting failures, though. “It’s also about accelerating root-cause analysis,” Landman said. “Nowadays, 90% of RMAs are NPF (no problem found) because you need to recreate the problem and that’s very difficult to do. With continuous deep data monitoring, you don’t need to recreate the problem because users can see the issue on the spot. The RMA already comes back with an indication of the problem and source. When a system fails in the field, it’s now possible to understand whether the degradation was caused by aging, voltage instability, a power supply issue, overstressed clock DCO, or anything else. Armed with that information, you can go back to the test data and see if there is a commonality with other chips. Then you can feed back the insights to production to prevent systematic issues, to HTOL (high-temperature operating life) to improve the reliability testing, or to the design to make any necessary adjustments.”

This is the whole idea behind parametric testing, which has become a key part of 5G testing because much of the chip cannot be accessed for testing during manufacturing.

“In the specification document of a device, like an RF front-end module or a beamformer, you know that device will have a guaranteed specification around certain performance characteristics,” said David Hall, chief marketer at National Instruments. “You know the output power and the modulation quality, and for a lot of standards-based measurements for a 5G device it will have metrics and performance specifications that are specific to the standards body-type definition. So, for example, the 3GPP will specify for a 5G device that the modulation quality must be better than X% or the emissions into an adjacent channel must be better than X dB. If you’re building a device like a power amplifier, you might specify that the emissions in a harmonic channel may only be X number of dB. All of these are specific measurements.”

Any variation from that set of specs is observable, and should be able to trigger adjustments in a system, whether that is failover for transistors or cores or recalibration of sensors or other analog devices to account for drift. But it’s much more difficult in custom designs that are not tied to such a clear spec.

“There are more variables than in the past,” said Raanan Gewirtzman, chief business officer at proteanTecs. “In the past, you fixed a problem and if the data changed then you could change the data. Now, with learning software, that cannot be sufficiently tested. You can go through something the way it’s supposed to be used, but there is more variability. The system keeps morphing, and it’s different from one computer to another. That plays very strongly to continuous monitoring. You continue to monitor parametrics and do it periodically so that you know in advance when something has changed and whether it has shortened the end of life.”

Evolution of self-healing circuits
The idea of self-healing chips began with error-correcting memory, which dates back to 1958 when IBM introduced parity control into its mainframes. If one memory bit failed, no data would be lost and an extra bit that was built into a device could replace the defective bit. Such a failure can be caused by a latent defect, or it can be caused by a stray alpha particle that can “flip” a bit. There also are bit-flipping cyber attacks, which add a new twist to this issue.
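How extra bits let a memory recover from a single flipped bit can be shown with a textbook Hamming(7,4) code. This Python sketch illustrates the general principle only. It is not a claim about the scheme IBM used in 1958 or about any particular ECC memory product, and the function names are made up for the example.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit and return (data_bits, error_position).

    error_position is 0 when no error was detected, else 1..7.
    """
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if pos:
        c[pos - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], pos
```

Three check bits protect four data bits here. Flipping any one of the seven stored bits still returns the original data, along with the position that was repaired.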

In the past, single-event upsets caused by these radioactive particles were more theoretical than real. But as density in memory and other circuitry has increased, so have the chances of something going wrong. This can be due to an alpha particle, but it also can be due to process variation during manufacturing, or impurities in gases used in production or even the silicon wafer. At 10/7/5nm, this can cause significant reliability problems.

Yield is another factor. When Sony introduced its PlayStation 3, it utilized a Cell microprocessor developed with IBM and Toshiba that included 8 co-processors. At the time, industry sources said only 7 of those co-processing elements were required, but that 8 were included to ensure sufficient yield at the latest process node.
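The yield benefit of a spare element can be estimated with a simple binomial model. The Python sketch below assumes every element is good independently with the same probability and ignores defects elsewhere on the die. Both are simplifications, and the per-element figure in the note that follows is purely hypothetical, not Cell data.

```python
from math import comb


def yield_with_spares(p_good, total, required):
    """Probability that at least `required` of `total` identical blocks work,
    assuming each block is good independently with probability `p_good`."""
    return sum(
        comb(total, k) * p_good**k * (1 - p_good) ** (total - k)
        for k in range(required, total + 1)
    )
```

With a hypothetical 90% chance that any one element is good, requiring all 8 of 8 gives a yield of about 43%, while requiring only 7 of 8 gives about 81%.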

There are multiple ways to avoid problems in the field. The first step is finding out the cause of a problem so that appropriate action can be taken. But just making everything redundant doesn’t work in most applications, due to cost, weight and power considerations. So in the short term, the goal is to get much more granular about what portions of a device cannot fail and figure out ways of preventing immediate problems.

The longer-term approach is truly self-healing logic circuits, and so far that cost is still too high for most applications. But like everything else in semiconductors, economies of scale will drive down those costs, allowing these technologies to become more widely implemented.

“Having a chip be able to fix itself real time is still too expensive to implement in a commercially available product,” OptimalPlus’ Elder said. “It’s easier just to do, I hate to say it, a recall.” The FRU (field replaceable unit) in automotive, which a mechanic uses to replace the failing unit, will be around for a while longer. “The cost of the FRU is such that it still doesn’t make sense to make it able to fix itself. People are playing with it. But it’s more in the memory space and image sensor space, as the cost per pixel is becoming much more expensive than it was historically.”

