How do you manage when a subsystem needs to reboot in an autonomous vehicle?
When we in the semiconductor world think about safety, we think about ISO 26262, FMEDA and safety mechanisms like redundancy, ECC and lock-step operation. Once we have that covered, any other aspect of safety is somebody else’s problem, right? Sadly no, for us at least. As we push towards higher levels of autonomy, SAE Level 3 and above, we’re integrating more functionality into our SoCs, much of it involved in complex decision-making. Problems will happen in these complex systems, whether through transient faults or other causes, and not all of them can be corrected on the fly by the safety mechanisms I just mentioned. Sometimes you have to reboot, the same way you reboot your phone, computer, or even your state-of-the-art TV when they misbehave.
But a “reboot” in an autonomous or even semi-autonomous car is a much more serious enterprise. When there’s no driver, or the nominal driver isn’t paying attention, you can’t just shut a system down and restart it; all kinds of bad things might happen before it recovers. It becomes essential to isolate the faulty parts of the system while depending on the rest to continue in an emergency mode, meanwhile rebooting those isolated components to correct the problem or waiting until the human driver takes over. If there’s no human driver, or the driver is not expected to be actively engaged, as in Level 3 automation, then the vehicle can limp to a safe state, such as pulling to a stop at the curb.
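To make that pattern concrete, here is a minimal sketch in C of the isolate-then-reboot flow. The platform hooks (fence_subsystem, reset_subsystem, enter_emergency_mode) and state names are hypothetical placeholders for what a real SoC would implement in hardware or firmware:

```c
#include <stdio.h>

/* Per-subsystem recovery states; names are illustrative. */
typedef enum {
    SUBSYS_NOMINAL,     /* participating in normal operation */
    SUBSYS_ISOLATED,    /* fenced off; the rest of the system runs in emergency mode */
    SUBSYS_REBOOTING    /* under reset and self-test before reintegration */
} subsys_state_t;

/* Hypothetical platform hooks, stubbed here so the sketch is self-contained. */
static void fence_subsystem(int id)     { printf("fencing subsystem %d\n", id); }
static void reset_subsystem(int id)     { printf("resetting subsystem %d\n", id); }
static void enter_emergency_mode(void)  { printf("entering emergency mode\n"); }

/* On a detected fault: isolate the block so it cannot corrupt the rest of
 * the system, degrade to emergency operation, then reboot the block. */
static subsys_state_t handle_fault(int subsys_id)
{
    fence_subsystem(subsys_id);   /* step 1: isolate the faulty block */
    enter_emergency_mode();       /* step 2: keep the vehicle safe meanwhile */
    reset_subsystem(subsys_id);   /* step 3: reboot to clear the fault */
    return SUBSYS_REBOOTING;
}

int main(void)
{
    subsys_state_t s = handle_fault(3);  /* e.g. a radar pre-processing block */
    return (s == SUBSYS_REBOOTING) ? 0 : 1;
}
```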
How emergency operation should function is of course a decision for the car (or possibly Tier 1) architects, but the SoC architecture must support their needs in detecting faults, isolating functional blocks and supporting reboot of those blocks. The detection mechanisms will look familiar, voting systems for example, but that familiarity can be slightly misleading. Each follows some form of M-out-of-N redundancy (MooN), with roots in industrial automation, and the most common options can look slightly strange from a logic triple-modular-redundancy (TMR) point of view. In 1oo2D, one common option, at least one of two inputs must be functioning correctly, as determined by diagnostics (the D in 1oo2D), in order to continue safe operation (or possibly switch to emergency operation). In 2oo2D, both inputs must be functioning correctly; otherwise the system switches to emergency operation. 2oo3 is TMR, but again voting on signals of correctness of the inputs, not simply on logic levels.
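As a rough illustration, here is a minimal C sketch of these voters operating on per-channel health verdicts rather than raw logic levels. The structure and names are my own, not drawn from any standard’s text:

```c
#include <stdbool.h>
#include <stdio.h>

/* Each redundant channel carries its own diagnostic verdict --
 * the "D" in 1oo2D / 2oo2D. */
typedef struct {
    bool healthy;   /* channel output passed its diagnostics */
} channel_t;

/* 1oo2D: safe operation can continue if at least one of two
 * diagnosed channels is healthy. */
static bool vote_1oo2d(channel_t a, channel_t b)
{
    return a.healthy || b.healthy;
}

/* 2oo2D: both diagnosed channels must be healthy; otherwise
 * trigger a switch to emergency operation. */
static bool vote_2oo2d(channel_t a, channel_t b)
{
    return a.healthy && b.healthy;
}

/* 2oo3: majority vote over three channels -- TMR, but on
 * correctness indications rather than on logic levels. */
static bool vote_2oo3(channel_t a, channel_t b, channel_t c)
{
    return ((int)a.healthy + (int)b.healthy + (int)c.healthy) >= 2;
}

int main(void)
{
    channel_t ok = { true }, bad = { false };
    printf("1oo2D: %d  2oo2D: %d  2oo3: %d\n",
           vote_1oo2d(ok, bad),       /* 1: one healthy channel suffices */
           vote_2oo2d(ok, bad),       /* 0: switch to emergency operation */
           vote_2oo3(ok, ok, bad));   /* 1: majority healthy */
    return 0;
}
```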
Where does this need arise? Anywhere there’s a complex, probably AI-supported SoC in the car. Even in today’s cars with ADAS, that’s a lot of places: video, radar and ultrasonic sensors, for example. As levels of autonomy increase there will be even more sensors, with significant complexity within each sensor’s digital logic to “make sense” of the real-world analog inputs. Remember that the AI used for recognition in these systems is inherently probabilistic; amazing progress has been made in this direction, but the method is intrinsically non-deterministic, and we know it can be fooled in real-world environments, particularly thanks to unrecognized biases and incompleteness in training. How can we train a system to react appropriately and safely in every possible situation it might see? One way to compensate for the probabilistic nature of these systems is to build in redundancy between parallel AI subsystems trained on different datasets. Within these subsystems there is still a need for redundancy and voting, and possibly for emergency fallback action depending on the situation.
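A sketch of that cross-check between two independently trained recognition subsystems might look like the following; the structures and threshold are illustrative assumptions, not any product’s interface:

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified output of one AI recognition subsystem; fields are illustrative. */
typedef struct {
    int    object_class;   /* e.g. 0 = none, 1 = pedestrian, 2 = vehicle */
    double confidence;     /* 0.0 .. 1.0 */
} detection_t;

/* Trust a detection only when both independently trained subsystems agree
 * on the class and each reports confidence above a threshold; on
 * disagreement, escalate to the configured emergency fallback. */
static bool detections_agree(detection_t a, detection_t b, double min_conf)
{
    return a.object_class == b.object_class &&
           a.confidence >= min_conf &&
           b.confidence >= min_conf;
}

int main(void)
{
    detection_t primary = { 1, 0.94 };   /* pedestrian, high confidence */
    detection_t shadow  = { 1, 0.88 };   /* trained on a different dataset */

    if (detections_agree(primary, shadow, 0.80))
        printf("detection accepted\n");
    else
        printf("disagreement: trigger fallback action\n");
    return 0;
}
```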
Eventually, each of these subsystems must communicate with a central fusion and decision-making processor. Taken together, that’s many potential points of failure, and therefore a lot of semiconductor devices that will need to support fail-safe operation. If just one radar sensor fails, the central brain may decide, based on the redundancy strategy for that sensor, to switch to emergency operation, in which it might first warn nearby cars to give it a wider berth, then reduce speed and move carefully to a safe location where it can stop.
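One simple way to picture that sequence is as an ordered escalation toward a safe stop. The states below are a hypothetical sketch of the behavior just described, not a prescribed policy:

```c
#include <stdio.h>

/* Degraded-mode escalation toward a safe stop; names are illustrative. */
typedef enum {
    MODE_NOMINAL = 0,
    MODE_WARN_NEARBY,    /* e.g. broadcast a warning so other cars give a wider berth */
    MODE_REDUCE_SPEED,
    MODE_PULL_OVER       /* move carefully to a safe location and stop */
} vehicle_mode_t;

/* Advance one step toward a safe stop each time the fault persists. */
static vehicle_mode_t escalate(vehicle_mode_t current)
{
    return (current < MODE_PULL_OVER) ? (vehicle_mode_t)(current + 1)
                                      : MODE_PULL_OVER;
}

int main(void)
{
    vehicle_mode_t mode = MODE_NOMINAL;
    while (mode != MODE_PULL_OVER) {     /* the fault never clears in this demo */
        mode = escalate(mode);
        printf("vehicle mode -> %d\n", mode);
    }
    return 0;
}
```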
What’s the relevance of the SoC interconnect in helping manage this level of safety? Because all the subsystems in an SoC are connected through a network-on-chip (or possibly more than one NoC), the interconnect becomes central to managing this new level of safety: connecting to MooN-checking logic, observing and checking the data exchanged by subsystems, and isolating subsystems when they experience failures and are required to reboot. In the example above, one SoC may support local capture and recognition based on video, still-image and radar streams, an obvious case where fail-operational behavior requires subsystem redundancy, intelligent failure detection, and selective recovery where needed.
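Here is a sketch of what subsystem isolation at the NoC might look like from software. The register names and the fence/drain handshake are hypothetical, standing in for the vendor-specific controls a real interconnect IP exposes:

```c
#include <stdint.h>

/* Hypothetical per-port NoC fence registers (memory-mapped in a real SoC). */
typedef struct {
    volatile uint32_t fence_req;   /* write 1: stop accepting new transactions */
    volatile uint32_t fence_ack;   /* reads 1 once in-flight traffic has drained */
} noc_port_regs_t;

/* Quiesce and fence one NoC port so the attached block can safely be reset. */
static void noc_isolate_port(noc_port_regs_t *port)
{
    port->fence_req = 1;           /* request isolation */
    while (port->fence_ack == 0) {
        /* spin until outstanding transactions drain;
         * a real driver would bound this wait with a timeout */
    }
}

/* Release the fence after the block has rebooted and passed self-test. */
static void noc_release_port(noc_port_regs_t *port)
{
    port->fence_req = 0;           /* reconnect the rebooted block */
}

int main(void)
{
    /* Simulated register block; on silicon this would be a fixed MMIO address. */
    noc_port_regs_t radar_port = { 0, 1 };   /* ack already high for the demo */
    noc_isolate_port(&radar_port);
    noc_release_port(&radar_port);
    return 0;
}
```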
The new ISO/PAS 21448 standard, defining Safety of the Intended Functionality (SOTIF), goes beyond the more familiar ISO 26262 to cover areas where these system-level considerations become important. The abstract of the new standard states, “it is intended to be applied to intended functionality where proper situational awareness is critical to safety, and where that situational awareness is derived from complex sensors and processing algorithms; especially emergency intervention systems (e.g. emergency braking systems) and Advanced Driver Assistance Systems (ADAS) …” We should expect these new higher-level requirements to place increasing demands on SoC design for automotive applications.