Test Is Becoming A Horizontal Process

Just putting a chip on an ATE machine is no longer sufficient for many applications.

Semiconductor test, once a discrete part of a well-orchestrated series of manufacturing steps, is looking more like a process that extends from the early concept stage in design to the end of life of whatever system that chip ultimately is used in.

This has important ramifications for safety-critical markets in general, and the semiconductor industry in particular. Both worlds have been increasingly struggling to achieve sufficient reliability ever since the automotive industry decided to swap internal combustion engines for electric motors, and to increasingly replace human drivers with supercomputers.

This transition has turned out to be something of a mind-bender for metal-benders. Nothing lasts forever, and no semiconductor process at 7nm or 5nm is perfect. It’s impossible to get enough purity in materials, consistently ideal mixtures of gases in etch chambers, and flawless handling, probing, cleaning and packaging during any of the various steps required to build a semiconductor.

That some chips can last a couple of decades has been proven. But whether a 5nm chip can do its job without a hitch under extreme conditions for 18 years isn’t proven outside of a simulator. Moreover, the length of time is arbitrary, set by German auto makers as a way of ensuring that cars based on semiconductors and other electronic content are as safe and reliable as those built with mechanical parts.

It’s not clear if that 18-year bar is realistic, either. A car developed today will not be able to handle all of the communication and navigation changes expected of a vehicle produced in 2038, likely making it obsolete long before then. In addition, electronics will age in unpredictable ways under extreme conditions. Any standard for the minimum level of performance after a given period of time needs to be set in the context of an entire system, where components that work today may be mismatched in very significant ways after a decade or so of service.

What makes much more sense is building the best chips possible, testing them for what are known parameters of functionality, and then monitoring them closely throughout the life of a system — and, most importantly, in the context of that system. That applies to a car, or a drone, a robot, and even a server inside a large data center.

Test no longer can be viewed as a discrete step done in isolation. It needs to be seen as a series of ongoing checks to monitor how chips behave at any single point in time and what the trend lines look like for that behavior over time for an entire system or subsystem. And when a part or a system’s behavior is erratic or sub-standard, it needs to be replaced, whether that’s after two years in a robo-taxi or 15 years for a vehicle that sits in a garage or a server that is rarely used.
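As a rough illustration of checking both a single point in time and the trend line, the Python sketch below classifies a part from a series of readings. Everything in it is an assumption made for the example: the readings stand in for some hypothetical in-circuit monitor, `floor` is a made-up minimum acceptable value, `horizon` is an arbitrary look-ahead window, and the straight-line fit is a placeholder for whatever aging model a real system would use.

```python
from statistics import mean


def linear_trend(times, values):
    """Least-squares slope and intercept for values measured at times."""
    t_bar, v_bar = mean(times), mean(values)
    denom = sum((t - t_bar) ** 2 for t in times)
    slope = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values)) / denom
    return slope, v_bar - slope * t_bar


def assess(times, values, floor, horizon):
    """Classify a part from its latest reading and its fitted trend.

    "replace": the latest reading is already below the (hypothetical) floor.
    "watch":   the fitted line drops below the floor within `horizon`
               time units after the last reading.
    "ok":      neither of the above.
    """
    if values[-1] < floor:
        return "replace"
    slope, intercept = linear_trend(times, values)
    projected = slope * (times[-1] + horizon) + intercept
    if projected < floor:
        return "watch"
    return "ok"
```

The point of the two-stage check is that a part can pass today's test and still be on a trajectory that warrants attention. A heavily used part may reach "watch" in two years, while a lightly used one may stay "ok" for 15.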

Moreover, whatever tests need to be run have to be understood at the conceptual stage of the design, so that monitors can be inserted into circuits in the right places, and so that leads for testing can be placed on the outside of complex chips and packages to ensure they can be accessed easily and quickly.

The problem with reliability isn’t the chips. It’s the mindset of the people building the systems. Carmakers and other manufacturers need to come to grips with the fact that they are now moving into an entirely different industry and no longer can assess their products the same way they did in the past. At the same time, given the new capabilities of continuous monitoring and AI’s ability to correlate strings of data, the electronics industry needs to begin looking at test as an ongoing way of ensuring reliability across the lifetime of a system, confirming that entire systems behave as expected and that in-circuit monitors and other tests provide fair warning of anything that can interrupt what is deemed normal behavior.



3 comments

Jeff Lawton says:

The airplane industry has extensive experience with safety in the context of standards like RTCA DO-178C, and that one in particular has quite a bit to say about hardware issues like single point of failure in addition to software testing. However, trying to apply a standard like that in the context of video cameras and LIDAR sensors and AI-based algorithms requires looking at many of its ramifications with a fresh set of eyes. But it is especially “unhelpful” to hear people making broad-brush statements like “the self-driving car industry is IMMUNE to software testing standards because the software the car’s computers are going to be executing HASN’T BEEN COMPLETELY WRITTEN until the vehicle is operating on the highway”. Nobody is “immune” to safety requirements, and anyone who thinks that kind of logic applies to THEM will quickly discover otherwise once they’re facing a lawsuit for massive financial damages over loss of life caused by a defective self-driving automotive product!

Ed Sperling says:

Interesting point. Liability issues for most of the chip industry are brand new, and they’re almost non-existent for software. But now we have regular software updates coming into cars. Patches upon patches can cause problems, and this becomes a huge liability issue in its own right. While the idea of modular software has been around for a while, which allows you to protect some of the core components, no one has ever applied that to AI systems that adapt to road or environmental conditions or driving habits. What works for one vehicle may cause problems in another after those systems begin to optimize, and at this point it’s almost impossible to go back and search through those algorithms to figure out what went wrong.

Jeff Lawton says:

Actually versions of DO-178 have been around since 1991 – 29 years? It’s quite understandable why an industry as nimble as autonomous vehicle software development would not be eager to be burdened with the likes of DO-178C. After all, there are hardly even any object-oriented languages that qualify as meeting the requirements except Ada (and most savvy developers trying to get through qualification stay within the restrictions of the SPARK subset). At the most stringent level of DO-178C, Level A (failure of the program likely to cause loss of life), the testing requirement is known as MC/DC, and comparing the tools required to test at that level to the commonly known “static analysis” tool would be a lot like comparing a Space Shuttle to a Model T. A lot of past projects have successfully used a graphical tool called SCADE (there are others) to auto-generate C language code, and I believe the generated code has been validated as MC/DC “safe” code. But in the world of “proven safe” you can’t just go piling Python on Ruby on Perl on Java on C++ on whatever and hope for a miracle the way most projects that incorporate “open source” modules do these days! There’s even an acronym for this, it’s called “SOUP” (for software of unknown pedigree)! Heck, even the software tools themselves need to go through validation. And STILL faulty software manages to “slip through” once in a while. But no, “software safety certification” has been around for a while. The problem is mostly that the expense and delay of certification are hard for this nascent industry to even begin to accept.
