The Quest For Perfection

There’s no such thing as zero defects, particularly when you don’t know for sure what needs to be tested.


Demands by automakers for zero defects over 15 years are absurd, particularly when it comes to the 10/7nm AI systems that will be the brains of autonomous and assisted driving, as well as of any mobile electronic device.

There are several reasons for this. To begin with, no one has ever used a 10/7nm device under extreme conditions for any length of time. Chips developed at these nodes are just starting to be manufactured, and field trials are showing all sorts of issues ranging from noise to increased process variation. While these issues can be designed around, how much guard-banding is required remains a research topic. How fast, for example, do complex materials break down under the hood of a car operating in the desert sun? Even dashboards crack over time in places like Arizona and Egypt, while they last indefinitely in more moderate climates.
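Guard-banding amounts to derating a nominal spec until variation and aging margins are covered. A minimal sketch of that arithmetic follows; every number here (the 2,000MHz nominal, the sigma, the aging rate) is hypothetical, chosen only to illustrate the tradeoff, not drawn from any real process:

```python
# Illustrative guard-banding calculation (all numbers are hypothetical).
# A nominal spec is derated for process variation and for projected
# aging under a harsh thermal profile; what remains is the guard band.

def guard_banded_spec(nominal_mhz: float,
                      sigma_mhz: float,
                      n_sigma: float,
                      aging_loss_pct_per_year: float,
                      lifetime_years: float) -> float:
    """Return a derated frequency spec in MHz."""
    # Margin for process variation: back off n-sigma from nominal.
    variation_margin = n_sigma * sigma_mhz
    # Margin for aging: assume (simplistically) linear degradation.
    aging_margin = nominal_mhz * (aging_loss_pct_per_year / 100.0) * lifetime_years
    return nominal_mhz - variation_margin - aging_margin

# Hypothetical part: 2000 MHz nominal, 30 MHz sigma, 6-sigma margin,
# 0.5%/year aging, 15-year automotive lifetime.
spec = guard_banded_spec(2000.0, 30.0, 6.0, 0.5, 15.0)
print(f"Derated spec: {spec:.0f} MHz")  # prints "Derated spec: 1670 MHz"
```

The open research question the article points to is precisely the inputs: nobody yet knows what the real-world aging rate or variation sigma is for these nodes over 15 years, so the margins themselves are guesses.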

One of the biggest costs in data centers is chilling racks of densely-packed blade servers and storage devices. The more those servers are utilized, the higher the costs. If they get too hot, they fail over to other server racks that are operating within acceptable thermal parameters. That’s not an option in a car, a drone, a robot, or some industrial operations. And no amount of simulation is going to show all of the possible variables, such as a certain order of events that can cause battery fires or unexpected signal interruptions. Even big companies with very deep pockets make mistakes.

Second, wafer and die test, as well as various types of physical inspection, have proven very effective at picking up physical and functional anomalies. But as devices become exponentially more complex due to increasing density and more heterogeneous components scattered around a die—or multiple die in a package—confidence levels will begin to soften. The alternative is to extend test and inspection times to increase coverage, but that can throw the entire manufacturing process into low gear.

The majority of the quality control operations in chip manufacturing have been operating on the assumption that a few percentage points of coverage don’t matter in a chip that will be changed out in two or three years, and where an occasional reboot is acceptable. That’s not a viable option in a car or an industrial robot’s guidance system. So even if these devices can be simulated to withstand high temperatures, that’s only one part of the system. There is no way to test everything up-front all at once, which puts the onus on design flexibility.

Over-the-air updates are considered essential in many of these markets. But over-the-air updates that address imminent failures because something has been missed are not going to happen at 65mph where there is limited cell coverage in a car with no steering wheel.

The alternative is adding more redundancy into systems. But carmakers are backing away from redundancy because it’s expensive, relying instead on other systems to perform limited tasks in case of a failure. It’s not clear if those systems are being tested and inspected with the same coverage and failure scenarios as the main drive-train or guidance systems. In industrial settings, robot failures can cause havoc. Tesla CEO Elon Musk blamed production glitches on faulty robots.

Third, particularly for AI (and ML/DL) systems, it’s not clear exactly what will need to be tested to ensure proper operation, and the more machines get involved in training other machines, the worse this problem can become. AI inferencing fits into an acceptable distribution, but understanding how faults affect those distributions is a question researchers are only now beginning to study. The fact that these systems seem to work well enough is a major step forward, because AI has been around since before Moore’s Law. Until a decade ago, computer scientists were still writing algorithms by hand.

Much has changed since then. AI systems have been proven to work, and they are in use in many places. But how do we measure defects in systems where results are distributions, and where each system is expected to evolve differently from the others? At this point this appears to be more art than science. And are aberrations additive if systems train other systems?
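One way such questions might be approached—a sketch only, not an established industry metric—is to compare a fielded system’s output distribution against the distribution observed at sign-off, for example with KL divergence. The class frequencies and threshold below are entirely hypothetical:

```python
import math

# Illustrative sketch (not a production metric): quantify how far a
# deployed model's output distribution has drifted from a reference
# distribution using KL divergence. All numbers are hypothetical.

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

reference = [0.70, 0.20, 0.10]  # hypothetical class frequencies at sign-off
field     = [0.55, 0.30, 0.15]  # hypothetical frequencies after deployment

drift = kl_divergence(field, reference)
THRESHOLD = 0.03  # hypothetical acceptance threshold
print(f"KL drift = {drift:.4f}, flagged = {drift > THRESHOLD}")
```

Even this toy version exposes the hard part the article raises: if each system is expected to evolve differently, choosing the reference distribution and the threshold is itself the unsolved problem, and it is unclear whether drift compounds when systems train other systems.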

These are all daunting questions, and the challenges ahead for test, inspection and design for test and inspection, are moving into areas where the semiconductor industry has never ventured before. There are so many unknowns that it’s impossible to identify all of the possible scenarios and corner cases.

Only time will tell if the industry gets this right, and right now it’s too early to even tell how much time will be needed to make that call.
