Ensuring safety-critical systems continue to function is complex but necessary.
With the electrification of automobiles, it’s not enough to test the new electronics thoroughly at the end of the manufacturing process. Safety standards now require that tests be performed live, in the field, with contingency plans should a test fail.
“We see clear demand from the automotive semiconductor supply chain for design functionality specifically aimed at in-system monitoring,” said Dennis Ciplickas, vice president of advanced solutions at PDF Solutions. That monitoring can involve both structural test and performance monitoring.
Safety mechanisms must include a plan for running tests during operation. While start-up and shut-down are important events for triggering tests, some must run while the vehicle is in operation. Exactly what runs when isn’t explicitly specified in any standard, but such a plan must be part of the safety mechanisms that become part of the certification process.
Why testing and monitoring are needed
Lives are at stake every time a vehicle hits the road. The ISO 26262 standard lays out general requirements for how safety must be assured, and, unlike most electronic systems, that includes regular chip testing for the life of the vehicle. “If you look across the spectrum, we have structural testing, which we’ve been doing for years,” said Lee Harrison, automotive IC test solutions manager at Siemens EDA. “More recently, we’ve moved into doing that in-system.”
For some systems in a vehicle, this is essential. “The ISO 26262 standard requires that functional safety of electrical and/or electronic systems in serial production road vehicles is regularly checked with self-test throughout the car lifecycle,” said Fabio Pizza, automotive business development manager at Advantest.
There are three main reasons for this testing:
Exactly how a device will perform or age at the lowest levels is a function of the variation that can occur between devices. Different devices will operate differently, and test and reliability statistics may or may not accurately reflect a specific device.
The era of individually owned cars, parked for long periods and otherwise commuting, running errands, and going out some nights, is changing. “We’re moving towards 24/7 cars that are being used for ride-sharing, and that’s going to totally change the wear-out paradigm,” said Rathert.
What kinds of tests to run
ISO 26262 doesn’t name specific tests that must be run. Instead, it speaks more generally to “safety mechanisms” that are up to the development teams, tier-1 suppliers, and OEMs to agree on. That leaves them the flexibility to use whichever tools they deem most effective and to incorporate new ideas as they emerge. “[There’s] logic test and memory test, but it could be other types of standard logic, watchdogs, sensors, or monitors,” said Robert Ruiz, product marketing director, functional safety practitioner, digital design group at Synopsys.
The two main mechanisms for safety checks during operation are testing and monitoring. Testing reflects the same tests used in the manufacturing flow, with some possible modifications or additions. It looks for structural issues. Monitoring, on the other hand, keeps an eye on performance, looking for any anomalies or other indications that safety might be at risk in the near future.
Both tests and monitors can communicate their results to the cloud. But typically, test failures would result in immediate action within the car without the assistance of the cloud. Monitoring data, meanwhile, is likely to be transmitted to the cloud for analytics and tracking. While the car will have many other dedicated sensor components sending data to the cloud, those are watching specific higher-level automotive parameters. By contrast, in-chip monitors watch the circuits. Both are important, but they may be handled and managed differently.
“Monitoring allows for computational synergy between the vehicle and the cloud, enabling both real-time decisions and data analysis on the cloud,” said Gal Carmel, general manager of automotive at proteanTecs. “Our system evaluates the electronics’ integrity in the edge and transmits relevant data accordingly to the cloud for predictive, as well as prescriptive, diagnostics. This not only extends the system’s lifespan but also allows for fast root-cause analysis and OTA debugging.”
Fig. 1: A monitoring architecture involving both in-system reaction and cloud analytics. Source: Siemens
Not all monitors are alike. “We have two types of monitors,” explained Harrison. “We have passive monitors, which we use just to collect data. And then we have reactive monitors where, if we do detect something unexpected, then something like a bus sentry can shut down the bus completely.”
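The distinction Harrison draws can be sketched in a few lines of Python. This is only an illustration: the class names, the thresholds, and the `Bus` stand-in are hypothetical and do not describe any vendor's monitor IP. The passive monitor only records data, while the reactive monitor also triggers an action the first time a reading falls outside its allowed range.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PassiveMonitor:
    """Collects readings for later analysis; it never acts on them."""
    name: str
    samples: List[float] = field(default_factory=list)

    def observe(self, value: float) -> None:
        self.samples.append(value)


@dataclass
class ReactiveMonitor:
    """Collects readings and fires a callback on the first out-of-range value."""
    name: str
    low: float
    high: float
    on_violation: Callable[[str, float], None]
    samples: List[float] = field(default_factory=list)
    tripped: bool = False

    def observe(self, value: float) -> None:
        self.samples.append(value)
        if not self.tripped and not (self.low <= value <= self.high):
            self.tripped = True
            self.on_violation(self.name, value)


class Bus:
    """Hypothetical stand-in for a bus that a sentry can disable."""

    def __init__(self) -> None:
        self.enabled = True

    def shut_down(self, source: str, value: float) -> None:
        self.enabled = False


bus = Bus()
temperature = PassiveMonitor("die_temperature")
sentry = ReactiveMonitor("bus_latency", low=0.0, high=5.0,
                         on_violation=bus.shut_down)
temperature.observe(71.5)   # recorded only
sentry.observe(7.5)         # out of range, so the bus is shut down
```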
When to run tests
There are three events or times when testing needs to be performed: key-on, in-system during operation, and key-off.
“There’s power-on, when you can start testing to see if the system’s okay,” said Giri Podichetty, product marketing director, digital design group at Synopsys. “During operation, we can do periodic testing, going to check the actual status of the devices. Finally, we’ve got power-off, and then we’ve got more time to do some more [testing].”
Cars already perform some tests when the engine is started. “When you see those dashboard lights at the very beginning, that’s when a lot of initial tests occur,” said Ruiz. While additional tests may take some time, there isn’t much time available, since the driver will expect to begin driving after a few seconds. So for chip tests themselves, the time available might be something like 200 ms.
Much more time is available when the car shuts down. Theoretically, developers talk about having infinite time then. While that’s clearly not literally true – someone could restart the car in a few seconds – the timing budget for key-off tests is on the order of seconds. “When you turn off the car, you can probably wait another 10 seconds,” said Ruiz. “Of course, seconds is a huge amount of time in the semiconductor world.”
A broader set of tests can be run during this key-off phase. “It’s the difference between running a basic checkerboard algorithm on your memories for key-on or a fully featured stress test on your memories for key-off,” noted Harrison.
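For readers unfamiliar with the checkerboard algorithm Harrison mentions, the following Python sketch shows the basic idea on a software model of a memory. It is a simplified illustration, not production MBiST: the memory is a plain list of words, and the `FaultyMemory` class with its stuck-at-0 bit is a hypothetical fault model added for the demonstration. Alternating addresses receive alternating bit patterns, everything is read back, and the pass is then repeated with the patterns swapped. Note that this test overwrites the memory contents.

```python
def checkerboard_test(memory, word_bits=8):
    """Return the first failing address, or None if the memory passes.

    Destructive: the original contents of `memory` are overwritten.
    """
    mask = (1 << word_bits) - 1
    pattern = int("01" * (word_bits // 2), 2)   # 0x55 for 8-bit words
    for phase in (0, 1):                        # second phase swaps the patterns
        for addr in range(len(memory)):
            even = (addr + phase) % 2 == 0
            memory[addr] = pattern if even else pattern ^ mask
        for addr in range(len(memory)):
            even = (addr + phase) % 2 == 0
            expected = pattern if even else pattern ^ mask
            if memory[addr] != expected:
                return addr
    return None


class FaultyMemory(list):
    """Hypothetical memory model with one bit of one word stuck at 0."""

    def __init__(self, size, bad_addr, stuck_bit):
        super().__init__([0] * size)
        self.bad_addr = bad_addr
        self.clear_mask = ~(1 << stuck_bit)

    def __setitem__(self, addr, value):
        if addr == self.bad_addr:
            value &= self.clear_mask            # the stuck bit never stores a 1
        super().__setitem__(addr, value)


healthy_result = checkerboard_test([0] * 16)
faulty_result = checkerboard_test(FaultyMemory(16, bad_addr=5, stuck_bit=0))
```

A fuller key-off stress test would typically add more patterns and address orders, which is why it needs the larger time budget.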
The circuits tested at key-on may or may not need further testing in-system. This is where the ASIL rating matters. “For things like your infotainment system, you’re fine with your key-on test,” said Harrison. “But when you start to look at the advanced ABS systems, for example, then this becomes more important.”
With robo-taxis and other similar vehicles, moreover, there will be little chance for key-on and key-off tests. “The key-on may happen only once every 10 hours,” explained Harrison. “So you need to be able to run those online tests to make sure everything is still safe.”
The challenge, then, is how to implement tests on circuits even while those circuits are operating. The content and timing of tests depend partly on the safety level. The worst case, of course, is ASIL-D, the most stringent rating, which applies to the most safety-critical parts of the vehicle. Critical tests will need to be performed at regular intervals, regardless of what the vehicle is doing.
Monitoring isn’t so closely tied to the operational state of the vehicle. “Deep data monitoring allows for 24/7 availability, regardless of the key-on or key-off state of the vehicle,” said proteanTecs’ Carmel. “It operates online and non-intrusively, without disruption to functional operation – as opposed to BIST, for example. At key-on, it identifies and alerts on reliability degradation and safety threats, enabling prescriptive maintenance. At key-off, during pre-scheduled maintenance times, OEMs can physically connect and debug issues to maintain service availability and extend operational lifetime.”
Handling both tests and operation at the same time
Most tests are likely to be disruptive if not managed properly. It’s not necessarily possible to schedule tests, say, while the car is waiting at a stop light, because there’s no guarantee of when that will happen or how long such a stop will last. So arrangements must be made for tests that take place regardless of what the vehicle is doing at the time.
One way of handling this is through redundancy. Instead of a single core operating with its own memory, that core and memory can be replicated. Control can be handed back and forth between them such that, when one core is being tested, the other will be handling the driving duties. Then, it can be reversed to ensure that both sets remain in working order.
This is similar to — but different from — dual-core lock-step operation. In that case, both cores are always operating identically, with the goal of identifying any disagreements between the two as an indicator that there’s a problem. For testing, however, the two sets will not be in lock-step. Instead, they’ll be tag-teaming so that both testing and operation can proceed seamlessly.
This redundancy provides operational margin for testing. “Some architect may decide that, ‘I have four processors, so in a round-robin fashion, I can take one offline and test it. I still have three running and am still able to function pretty well,’” said Ted Chua, product marketing director in the Tensilica IP group at Cadence.
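The round-robin arrangement Chua describes can be sketched as a simple schedule. The function name and the slot-based model are assumptions made for illustration. In each test slot exactly one core is offline for testing and the others carry the workload.

```python
def round_robin_schedule(num_cores, num_slots):
    """Return a list of (core_under_test, active_cores) pairs, one per slot."""
    schedule = []
    for slot in range(num_slots):
        under_test = slot % num_cores
        active = [core for core in range(num_cores) if core != under_test]
        schedule.append((under_test, active))
    return schedule


# Four processors: one is tested per slot while three keep running.
for under_test, active in round_robin_schedule(num_cores=4, num_slots=4):
    print(f"testing core {under_test}, running on cores {active}")
```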
The need for redundancy isn’t limited to functional circuits. It’s also necessary to test the test circuitry. “The testing also needs to have redundancy,” noted Chua.
Memory can be partially tested between accesses — so-called “transparent” testing. “We do a non-destructive memory test in a time-slicing form so as not to take up too much time,” said Synopsys’ Podichetty. More expansive memory built-in self-test (MBiST) tests can be performed if the memory is taken offline.
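One way to picture a non-destructive, time-sliced memory test is shown below. This is a deliberately simplified sketch and should not be read as how any particular transparent-test IP works: the function name, the slice size, and the invert-and-restore check are all assumptions. Each slice tests a few words by writing the inverse of the current content, reading it back, and restoring the original value, so the memory holds the same data after the slice as before it.

```python
def transparent_test_slice(memory, start, count, word_bits=8):
    """Test `count` words beginning at `start`, preserving their contents.

    Returns (next_start, failing_address); failing_address is None on a pass.
    """
    mask = (1 << word_bits) - 1
    size = len(memory)
    for offset in range(count):
        addr = (start + offset) % size
        original = memory[addr]
        inverted = original ^ mask
        memory[addr] = inverted                 # flip every bit of the word
        flipped_ok = memory[addr] == inverted
        memory[addr] = original                 # restore the application data
        restored_ok = memory[addr] == original
        if not (flipped_ok and restored_ok):
            return (addr + 1) % size, addr
    return (start + count) % size, None


# Walk a 16-word memory in four slices of four words each.
memory = list(range(16))
snapshot = list(memory)
position = 0
for _ in range(4):
    position, failure = transparent_test_slice(memory, position, count=4)
```

Between slices the memory is available to the application, which is the point of the time-slicing Podichetty describes.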
For logic tests on SoCs, the testing capability largely makes use of the design-for-test (DFT) infrastructure that’s already on the chip for the purposes of manufacturing test. Some circuit modifications may be necessary to add more channels for better visibility, but such changes often are intended to reduce test time, helping to compensate for the extra silicon area required.
In-system logic testing typically involves logic built-in self-tests, or LBiST. This includes the use of a seed vector that is run through the tests, the results of which are combined into a signature. Verification of that signature constitutes a pass/fail signal for that part of the system. How big that LBiST domain is can depend on the time available for testing.
Fig. 2: An example LBIST block, shared with embedded deterministic test (EDT). Source: Siemens
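The seed-and-signature flow can be illustrated with a small software model. In this sketch a 16-bit linear-feedback shift register (LFSR) expands the seed into pseudo-random patterns, and a matching shift register compacts the circuit responses into a signature that is compared against a known-good value. The two "circuits" are toy functions invented for the example, the fault is a hypothetical stuck-at-0 on one output bit, and real LBiST drives scan chains rather than calling functions.

```python
MASK16 = 0xFFFF


def _feedback(state):
    # Taps for the polynomial x^16 + x^14 + x^13 + x^11 + 1.
    return (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1


def lfsr_patterns(seed, count):
    """Yield `count` pseudo-random 16-bit patterns, starting with the seed."""
    state = seed & MASK16
    if state == 0:
        raise ValueError("an LFSR seed must be nonzero")
    for _ in range(count):
        yield state
        state = (state >> 1) | (_feedback(state) << 15)


def misr_update(signature, response):
    """Shift the signature register and fold in one circuit response."""
    shifted = (signature >> 1) | (_feedback(signature) << 15)
    return (shifted ^ response) & MASK16


def run_lbist(circuit, seed, count):
    """Apply `count` patterns to `circuit` and return the final signature."""
    signature = 0
    for pattern in lfsr_patterns(seed, count):
        signature = misr_update(signature, circuit(pattern) & MASK16)
    return signature


def good_circuit(x):
    """Toy logic block used as the fault-free reference (hypothetical)."""
    return (x ^ (x << 1)) & MASK16


def faulty_circuit(x):
    """The same block with output bit 3 stuck at 0 (hypothetical fault)."""
    return good_circuit(x) & ~0x0008


golden = run_lbist(good_circuit, seed=0xACE1, count=8)
passed = run_lbist(faulty_circuit, seed=0xACE1, count=8) == golden
```

Here `passed` ends up false, because the stuck bit changes the signature. The `count` argument stands in for the size of the test: more patterns give better coverage but take longer, which is the tradeoff that determines how large an LBiST domain can be for a given time budget.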
“One could have a tight loop that tests something very quickly as you’re driving down the road, versus an LBiST wrapped around a larger patch of the circuitry, which would be done at key-on or key-off,” noted Rob Knoth, product management director, digital and signoff group at Cadence.
Software also can be invoked to run tests on hardware. “This can include a portion intended to check that the device booting sequence and self-test work properly, and to detect possible issues in the interaction with the application circuitry (sensors, interfaces, cameras, buses, etc.),” said Advantest’s Pizza. That testing may be intrusive, however, and its timing must be carefully planned.
How long testing takes
Overall, the timing for the three phases of test, including the in-system intervals, will be set by the functional safety manager for each application. “Most of this comes from the functional safety requirements in terms of how much time we’ve got for each one of those tests,” said Podichetty.
Adds Harrison: “We do a lot of work with customers to minimize the time that we’re using for in-system tests. Because of the duration of some of the tests and the function of some of the SoCs, it does make sense to divide that test up into chunks. So you may split it into 10 pieces and run a piece every couple hundred milliseconds.”
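The chunking Harrison describes amounts to dividing a long pattern set into pieces and launching one piece per interval. A minimal sketch follows; the pattern count of 1,003 and the 200 ms period are made-up numbers chosen only to show the arithmetic.

```python
def split_into_chunks(num_patterns, num_chunks):
    """Divide pattern indices 0..num_patterns-1 into near-equal ranges."""
    base, extra = divmod(num_patterns, num_chunks)
    chunks = []
    start = 0
    for index in range(num_chunks):
        size = base + (1 if index < extra else 0)   # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks


def chunk_start_times_ms(num_chunks, period_ms):
    """Start time of each chunk when one chunk runs per period."""
    return [index * period_ms for index in range(num_chunks)]


chunks = split_into_chunks(num_patterns=1003, num_chunks=10)
starts = chunk_start_times_ms(num_chunks=10, period_ms=200)
```

With these numbers the first three chunks hold 101 patterns each, the rest hold 100, and the last chunk starts 1,800 ms after the first.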
Of course, test time can become a competitive metric between companies offering test IP. “We’ve been able to bring the runtime of BiST test down by a factor of 10 or 20, which really helps to achieve a complete test within a single time interval, as opposed to having to split that test up,” said Harrison.
Efforts to reduce manufacturing test time through better compression also can accrue to in-system tests. “Your manufacturing test time is much shorter, but since we’re duplicating and using that same structure for LBiST, all of a sudden, your LBiST runs much faster in the field,” said Knoth.
When to test also can be a dynamic consideration rather than a tight schedule. “If you are in the middle of emergency braking, you don’t go into a test for even a split second,” noted Chua. Holding off is not so hard when a scheduled test comes due during an emergency, because the test simply isn’t started. The harder case is an emergency that arises while a test is already running. “But what if you are in the middle of testing and then you come to this emergency or dangerous situation? You care about safety, that’s why you go do tests, but you don’t want to do the testing and then jeopardize safety.”
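Both situations Chua raises, a test that comes due during an emergency and an emergency that begins partway through a test, can be modeled in a small simulation. Everything here is an assumption made for illustration: the millisecond clock, the emergency windows, the per-pattern duration, and the policy of rerunning an aborted chunk from its start. A real safety manager would apply whatever policy the safety plan specifies.

```python
def in_emergency(t_ms, emergencies):
    """True if time t_ms falls inside any (start, end) emergency window."""
    return any(start <= t_ms < end for start, end in emergencies)


def run_chunked_test(chunk_sizes, period_ms, pattern_ms, emergencies, max_periods):
    """Simulate one test chunk per period, yielding to emergencies.

    Returns (log, finished), where log holds one entry per period used.
    """
    pending = list(chunk_sizes)
    log = []
    for period in range(max_periods):
        if not pending:
            break
        t_ms = period * period_ms
        if in_emergency(t_ms, emergencies):
            log.append("deferred")              # do not start a test at all
            continue
        aborted = False
        for _ in range(pending[0]):
            if in_emergency(t_ms, emergencies):
                aborted = True                  # hand the hardware back at once
                break
            t_ms += pattern_ms                  # one pattern applied
        if aborted:
            log.append("aborted")               # chunk stays pending for a rerun
        else:
            pending.pop(0)
            log.append("completed")
    return log, not pending


log, finished = run_chunked_test(
    chunk_sizes=[10, 10, 10], period_ms=200, pattern_ms=5,
    emergencies=[(220, 450)], max_periods=10)
```

In this run the second chunk is aborted when the emergency begins at 220 ms, the next period is skipped because the emergency is still active, and the remaining chunks complete afterward.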
Responding to failed tests
The test timing must also take into account the kinds of responses that might be needed if a test fails. If tests pass, then operation continues as before. But if a test fails, then the safety plan has to account for the response. Regardless of what the trigger is, response means moving the car into a safe state. Exactly what that state is becomes part of the safety plan.
If a fault is detected, there is a time budget allowed for moving the vehicle into that safe state. “The spec says that if there’s a fault, there’s a certain amount of time to detect it, a certain amount of time to record it, and then to do something about it,” said Ruiz. The time between checking for one fault and then another fault is referred to as the fault-tolerant time interval (FTTI).
The FTTI starts when a test failure is detected and includes any response necessary to put the system into a safe state before returning to detect the next fault. At this level, it’s assumed that such a test failure won’t be catastrophic, but will allow for some ability to remediate the situation.
Fig. 3: The FTTI includes three intervals for detecting a fault, reacting to the fault if detected, and then entering a safe state before the next test is performed. Source: Cadence
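Following the three intervals named in the figure caption, a timing budget can be checked with simple arithmetic: the time to detect the fault, the time to react, and the time to enter the safe state must together fit within the FTTI. The function below is a sketch of that check, and every number in it is hypothetical rather than taken from any standard or product.

```python
def ftti_margin_ms(detect_ms, react_ms, safe_state_ms, ftti_ms):
    """Return (fits, margin_ms) for a detect/react/safe-state budget."""
    total = detect_ms + react_ms + safe_state_ms
    return total <= ftti_ms, ftti_ms - total


# Hypothetical module with a 50 ms FTTI.
fits, margin = ftti_margin_ms(detect_ms=20, react_ms=5, safe_state_ms=15, ftti_ms=50)

# The same module with a slower test that needs 40 ms to detect the fault.
slow_fits, slow_margin = ftti_margin_ms(detect_ms=40, react_ms=5, safe_state_ms=15, ftti_ms=50)
```

In the first case the three intervals add up to 40 ms and leave 10 ms of margin. In the second they add up to 60 ms and overrun the budget by 10 ms, which is the kind of result that pushes teams to shorten or split their tests.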
It’s similar to what happens when the “check engine” light comes on. “You get a warning light on your dashboard that your braking ECU is out,” explained Rathert. “It’s good news that you got a warning light and that it didn’t fail while you needed it. But you still have a bad part in your car, and you have a trip to the service center in your future.”
Knoth put it a different way. “So you detected that ‘core two’ was bad. Now I’m going to tell the software that’s running my system, ‘Don’t use core two anymore. You get to use only cores one, three and four.’ The system registers that and goes into the new safe state. Maybe it doesn’t let you use full ADAS anymore, maybe it’s going to use just lane avoidance or something like that.”
There is no one mandated FTTI for the entire vehicle. “Depending on what the device application is, there could be a different FTTI for different modules,” said Chua.
There has to be a central point of control within the chip to manage all of the safety- and test-related events. This may be referred to as the “safety island” or the “safety manager.” The former term is sometimes confused with another notion that identifies safety islands as segregated areas of the physical layout.
The safety island is a processor subsystem tasked with managing the various safety mechanisms in the vehicle. The schedule that the tests must follow becomes part of the test and safety architecture and is established early on. But the execution of that plan – scheduling tests and handling results – is done in real time by the safety manager. “That safety manager can look for red flags, and then it can quickly put the device into a safe state,” said Harrison. It may operate at the chip or subsystem level.
Fig. 4: A safety island connected to an automotive IC. Source: Siemens
Test and monitoring are only part of ensuring safe operation
While there is a focus on in-system test as a new component of a safety plan, it’s really the last backstop of a process that starts early in design and manufacturing.
EDA tools must be aware of functional safety. Tests and safety circuits often don’t fit into what optimization tools might otherwise expect, so the tools must make allowances under the guidance of the safety plan. “You don’t want to do a modification to make the vehicle safer and then have a subsequent optimization look at that and say, ‘Well, I know how to save a bunch of area here. I’ll get rid of this redundancy,’” said Knoth.
Having EDA tools and IP pre-certified — particularly test IP like design-for-test (DFT) — can smooth the certification process for users of those tools and IP. “It really helps with the adoption of these technologies if we already have the ISO certification,” noted Harrison.
Manufacturing inspection, and the long-loop feedback from observed operation back to manufacturing, is another important way to improve safety over time. The higher the quality coming out of the fab and packaging operations, the less likely it is that in-system tests will fail. Even aging can be mitigated if physical characteristics can be correlated to accelerated aging. Everything will still age, but chips that age faster can be eliminated from the supply chain.
“We [the inspection industry] are trying to improve yield and let fewer things make it to test as the final line of defense,” said Rathert. “And the test guys are focusing on how they can improve their test game. As we peel this onion and look at this problem, number one is stopping defects from happening. Number two is escapes, and a portion of those escapes are test coverage things and a portion of what we call latent defects.”
The latter do not reveal themselves until later, possibly after exposure to certain environmental conditions. By definition, they won’t be caught at manufacturing test.
“There are two ways we deal with this,” Rathert said. “One is process control, where we stop making defects. Number two is screening – look for wafers or individual dies that look out of whack and that we don’t want to go into the supply chain. And then, on to wafer level probe, and then singulation and final test. We’re trying to insert all these opportunities as far upstream as we can to stop those bad dies from getting out, so that your built-in self-test always comes out clean.”
Tim Skunes, vice president of research and development at CyberOptics, agreed. “For safety-critical applications such as automotive, where having zero defects is critical, it’s important that effective inspection and metrology processes for SMT/electronics assembly and wafer-level and advanced packaging are implemented to control processes and yields,” he said. “So these processes for 100% inspection need to be in place well ahead of reaching the vehicle.”
End-to-end, from planning through certification
How this testing will be implemented depends on a strategy that must be put into place in the early architectural phases of the development effort. Involving a safety team at this early stage helps to ensure the test plan will meet the needs of the safety plan.
There are also numerous stakeholders to involve. They include chip-development, system-development, and software teams. Such coordination is necessary not only to ensure that all considerations and scenarios have been accounted for, but also that there is no overlap or redundancy between tests performed by different teams. Given a set of tests, who performs which tests when can be decided by all involved early on.
“Whether it’s manufacturing test or in-field test, it requires a much more multi-disciplinary, big-tent approach to decide what you are testing and how you are testing it,” noted Knoth.
Conclusion
Certification happens as with most safety-critical systems. You put together a plan, under the watchful eye of safety experts, and then you demonstrate that you’ve achieved the stated goals at the end. There is no official organization that places its imprimatur on the vehicle after reviewing how all of the testing, monitoring, and other safety mechanisms are handled. At the end of the day, it’s the OEM that does the certification. The expectation is that the OEM is motivated to have a safe vehicle so that its reputation isn’t tarnished and recalls aren’t necessary. Ruiz referred to this as “market-based safety.”
All in all, in-system test joins other functional-safety and security considerations as new requirements for these rolling systems of systems. As with any design, there will be tradeoffs within and between these considerations as auto makers compete to provide their customers with the best and safest experience.
Knoth summed up the benefits of a holistic approach to in-system testing: “The teams that are the most successful in the market take advantage of all of the inputs from different stakeholders to come up with elegant solutions that allow them to have lower overhead, higher-quality test, and achieve a better end product than if they stayed in their own silos and hocked problems over the wall and said, ‘Well, let the test guys figure this out.’”