Redefining Device Failures

Why traditional approaches no longer apply at the most advanced nodes


Can a 5nm or 3nm chip really perform to spec over a couple of decades? The answer is yes, but not with traditional approaches to designing, manufacturing or testing those chips.

At the next few process nodes, the workarounds and solutions that have been developed since 45nm don’t necessarily apply. In the early finFET processes, for example, the new transistor structure sharply reduced leakage current. Those transistors are becoming leaky again, which costs power and increases the amount of heat that needs to be dissipated. At 3nm, it will be necessary to replace finFETs entirely with some version of gate-all-around (GAA) FETs. And even that may only last for another node before the industry moves to nanotube FETs or some other exotic approach.

GAA FETs are difficult to manufacture, but dig down into the metal stack and things get even harder. The lower-level metal contacts require new materials, none of which has proven perfect or easy to work with. At 5nm and 3nm, critical dimensions are so small, and insulation so thin, that the old approach of just adding margin to compensate for variations in manufacturing processes doesn’t work anymore.

That’s compounded by the fact that shrinking features no longer provides the same power and performance benefits that it did at earlier nodes. Architectural changes are required to achieve adequate ROI. Resistance, capacitance and inductance are all problematic at these sizes. In fact, entirely new chip architectures are required to increase bandwidth and reduce latency, particularly in data-rich applications such as AI/ML/DL.

Taken as a whole, these issues are forcing significant changes in other less visible segments of the industry. As anyone who has ever screamed at a malfunctioning piece of hardware can attest, nothing lasts forever. There are ways to improve quality over time, but proving that up front will be a lot harder at 5/3/2nm because the standard kitchen approaches of baking or freezing a chip don’t work anymore. Aging under extreme stress from heat, cold, vibration and constant use will take its toll on anything.

However, it won’t take its toll on everything equally. So the future may be less about making sure that nothing breaks, which is always a good goal, and more about what to do when something does break. Regardless of the cause of a failure — and that could be anything from a stray alpha particle to a latent manufacturing defect to a nanoparticle that somehow crept into wafer polishing — the key is to identify what’s happening and to be able to prioritize how other circuitry gets used, depending upon the criticality of what’s broken.

To make this work, however, margin will have to be reduced across the board, from design to manufacturing. Design teams need to build in the logic equivalent of ECC in memory while minimizing the impact on overall power and performance. And on the manufacturing side, processes need to be much more precise than in the past.
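For readers less familiar with what ECC does in memory, the sketch below shows one textbook scheme, a Hamming(7,4) code, which stores three parity bits alongside four data bits so that any single flipped bit can be located and corrected. It is only an illustration of the principle, not a description of what any particular design team or memory vendor uses, and the function names are made up for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit.

    Returns (data_bits, error_position), where error_position is the
    1-based position that was corrected, or 0 if no error was detected.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary number
    if error_position:
        c[error_position - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], error_position


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single-bit upset at position 5
    print(hamming74_decode(word))
```

The cost is visible even in this toy: three extra bits for every four bits protected, plus the XOR logic to check them. That overhead is the power and performance impact that an equivalent scheme for logic would need to keep small.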

Fabs have been so focused on getting working chips out the door at the lowest cost possible that what gets printed on a die may be significantly different from what’s in the original GDSII handoff to the fab. More time needs to be spent on deposition, etch, cleaning, polishing, metrology, inspection, and dozens of other manufacturing steps in order to tighten processes and reduce variation. And more attention needs to be paid to in-circuit monitoring and data analysis, from initial burn-in all the way to end of life for a particular device.

Having a device fail over and continue working isn’t ideal, but it’s a lot better than having a single point of failure that brings down a system. And the only way to get there — short of complete system redundancy — is to shift the industry mindset. Failures will happen. But those failures can be predicted through accurate in-network sensors, and corrective action taken based upon a number of pre-determined scenarios. And at each new process node, given the cost of designs and the end markets in which they will be used, this may be less of an option and more of a requirement.
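To make the idea of pre-determined scenarios a little more concrete, here is a minimal sketch of how a monitor reading might be mapped to a response based on how critical the affected block is. Everything in it is hypothetical: the `Block` fields, the use of remaining timing margin as the health metric, the threshold values and the action names are assumptions chosen for illustration, not taken from any real monitoring product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Criticality(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Block:
    name: str                 # hypothetical block label
    criticality: Criticality  # how much the system depends on this block
    has_spare: bool           # whether redundant circuitry exists


def choose_action(block, margin_pct, warn_pct=10.0, fail_pct=2.0):
    """Map a monitor's remaining timing margin (percent) to a planned response.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if margin_pct > warn_pct:
        return "continue"
    if margin_pct > fail_pct:
        # Degrading but still functional: protect critical blocks early.
        if block.criticality == Criticality.HIGH and block.has_spare:
            return "switch_to_spare"
        return "derate"
    # Failure is predicted or has already happened.
    if block.has_spare:
        return "switch_to_spare"
    if block.criticality == Criticality.HIGH:
        return "safe_shutdown"
    return "disable_block"


if __name__ == "__main__":
    brake_ctrl = Block("brake_ctrl", Criticality.HIGH, has_spare=True)
    infotainment = Block("infotainment", Criticality.LOW, has_spare=False)
    print(choose_action(brake_ctrl, 6.0))
    print(choose_action(infotainment, 1.0))
```

The point of the sketch is the shape of the decision rather than the numbers: the same sensor reading leads to different actions depending on what is degrading and whether anything can take over for it.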
