Safety Islands In Safety-Critical Hardware

Creating a reliable place to manage critical functions when a design contains a mix of ASILs.


Safety and security have certain aspects in common, so it shouldn't be surprising that some ideas evolving in one domain find echoes in the other. In hardware design, a significant trend has been to push security-critical functions into a hardware root-of-trust (HRoT) core, following a philosophy of putting all (or most) of those functions in one basket and watching that basket very carefully. A somewhat similar principle applies for safety islands in safety-critical designs, in this case a core that will continue to function safely under all possible circumstances. The objective is the same: a reliable center for managing critical behavior, though from there the implementation details diverge.


Fig. 1: Source: Arm and Arteris IP, Arm TechCon 2019

Most of us are familiar with ISO 26262 and the expectations that specification drives in hardware design. Since the point of compliance is to ensure safety in the SoC, why isn't everything in the safety island? That's harder than you might think when the design must meet ASIL D, the highest integrity level required for safety-critical functions. Most safety-aware IPs today are designed to support ASIL A or B, good enough for ADAS features but not good enough for critical safety functions such as automatic braking, self-steering, or other autonomous functions. Increasing the integrity levels for these IP subsystems often doesn't make economic sense for IP providers or internal IP design teams, since they must support a wide range of markets and use models that don't demand that level of investment (and cost to the integrator).

Similarly, some IP subsystems may have no safety support at all. How do you make a neural-net processor safety-compliant? In principle you could use some of the standard failure-mitigation tricks, but where would you put them? Standard FMEA and FTA approaches to failure modeling don't have clear applicability to nondeterministic neural nets, short of doubling the size and power consumption of the net by mitigating everywhere. And that would be of questionable value when we don't yet know how to deal with the bigger problems of probabilistic detection and pixel-level spoofing.

Instead, teams designing to an ASIL D system requirement take a different approach, accepting that designs will contain a mix of ASIL support levels. Pulling this off requires modularity in safety management at the integration level. For example, it should be possible to isolate any given IP or subsystem for in-use testing using LBIST or MBIST (logic or memory built-in self-test). Resources such as memory must be independent between domains wherever possible, so that a failure in one cannot corrupt another. And QoS and isolation guarantees are needed so that a misbehaving domain cannot lock up the whole system.
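To make the modularity idea concrete, here is a minimal C sketch of what an integration-level hook table for per-domain isolation and in-use testing might look like. All of the names here (the domain struct, the isolate/run_test/restore hooks) are illustrative assumptions, not any vendor's actual API.

```c
#include <stdbool.h>

typedef enum { TEST_LBIST, TEST_MBIST } test_kind_t;

typedef struct {
    const char *name;                   /* for the island's event log */
    bool (*isolate)(void);              /* fence traffic at the NoC socket */
    bool (*run_test)(test_kind_t kind); /* LBIST for logic, MBIST for memory */
    bool (*restore)(void);              /* reconnect and resume normal use */
} safety_domain_t;

/* In-use test of one domain: isolate first so test patterns cannot
   corrupt the rest of the system, then test, then restore. */
bool domain_in_use_test(const safety_domain_t *d, test_kind_t kind)
{
    if (!d->isolate())
        return false;              /* cannot test safely; report upward */
    bool pass = d->run_test(kind);
    return d->restore() && pass;
}
```

The point of the table is exactly the modularity the paragraph describes: the safety manager doesn't need to know how each IP tests itself, only that it can be fenced off, exercised, and reconnected through a uniform interface.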

Which all makes sense, but who's in charge? What machinery is going to recognize problems, force domain testing, and/or flag to the larger system that the driver needs to grab the steering wheel and pull over? That's the function of the safety island. It is designed to fault-manage and control the rest of the system, to enable recovery of complex domains within the chip, and to signal failures to external systems.
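In firmware terms, the island's job reduces to a supervision loop: consume fault events, attempt recovery where recovery makes sense, and escalate when it doesn't. A minimal sketch, assuming hypothetical fault_queue_pop, recover_domain, and signal_external helpers:

```c
#include <stdbool.h>

typedef enum { FAULT_TRANSIENT, FAULT_PERMANENT } fault_class_t;
typedef struct { int domain; fault_class_t cls; } fault_event_t;

extern bool fault_queue_pop(fault_event_t *ev); /* fed by fault interrupts */
extern bool recover_domain(int domain);         /* isolate, reset, retest */
extern void signal_external(int domain);        /* e.g. "driver, take over" */

void safety_island_main_loop(void)
{
    fault_event_t ev;
    for (;;) {
        if (!fault_queue_pop(&ev))
            continue;
        /* Transient faults may clear after a domain reset and retest;
           permanent faults, or a failed recovery, escalate off-chip. */
        if (ev.cls == FAULT_PERMANENT || !recover_domain(ev.domain))
            signal_external(ev.domain);
    }
}
```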

To do this, the safety island must first have maximum freedom from interference from the rest of the system, so it needs dedicated compute, memory, and I/O resources, along with its own clock nets and power grid. To the greatest extent possible it should also have low complexity, so that failure modes internal to the island can be well understood and well mitigated. Overall, it must continue to work even if the rest of the functionality on the SoC falls apart.
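One consequence is that the island should verify its own independence before it trusts itself to supervise anything else. A sketch of a boot-time self-check, with invented register addresses and bit layouts standing in for whatever a real island would provide:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status registers for the island's dedicated resources. */
#define ISLAND_CLK_STATUS (*(volatile uint32_t *)0x40001000u)
#define ISLAND_PWR_STATUS (*(volatile uint32_t *)0x40001004u)
#define CLK_LOCKED        (1u << 0) /* island's own PLL is locked */
#define PWR_ISLAND_OK     (1u << 0) /* island's own rail within limits */

bool island_self_check(void)
{
    /* Because the island runs from its own clock and power grid,
       these checks stay meaningful even if main SoC domains misbehave. */
    return (ISLAND_CLK_STATUS & CLK_LOCKED) &&
           (ISLAND_PWR_STATUS & PWR_ISLAND_OK);
}
```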

Dream Chip Technology has implemented an ADAS reference design SoC and complete system, now in production through derivatives in several applications. Their safety island is a good example of what I have described here: a dual-core lockstep Arm Cortex-R52 processor with dedicated tightly coupled memory and a dedicated interrupt controller to monitor faults from the rest of the SoC. A private bus connects to private non-volatile memory and a dedicated watchdog timer; a separate bus connects to the main system NoC.
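The private watchdog is a good illustration of the discipline involved: it is kicked only from the island's main loop, so a hang in the island itself becomes a detectable fault rather than a silent one. A sketch of the pattern, with an invented register map that is not Dream Chip's or Arm's:

```c
#include <stdint.h>

/* Invented register map for the island's private watchdog. */
#define WDOG_LOAD (*(volatile uint32_t *)0x40002000u) /* countdown reload */
#define WDOG_KICK (*(volatile uint32_t *)0x40002004u) /* write key to pet */
#define WDOG_KEY  0x5AFE0001u                         /* illustrative key */

void watchdog_init(uint32_t timeout_ticks)
{
    WDOG_LOAD = timeout_ticks; /* expiry asserts the island's fault output */
}

/* Called exactly once per iteration of the island's main loop, never
   from an interrupt handler, so a hung loop is guaranteed to time out. */
void watchdog_kick(void)
{
    WDOG_KICK = WDOG_KEY;
}
```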

In a system with heterogeneous ASIL support, the system NoC takes on a lot of the responsibility for assuring safety at the system level. Arteris IP provides this primarily through three functional safety mechanisms: timeout checking, isolation, and end-to-end ECC protection. Timeout checking can help detect transient faults, where a timeout prompts a request to resend but also flags an interrupt to the safety island.
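In C at the pseudocode level, the timeout behavior might look like the sketch below, where every helper name is an assumption for illustration. The important property is that the retry and the interrupt to the island happen together, so even a transient fault that clears on resend is still logged.

```c
#include <stdbool.h>
#include <stdint.h>

extern uint32_t now_ticks(void);              /* free-running timebase */
extern bool response_arrived(uint32_t txn_id);
extern void resend(uint32_t txn_id);          /* retry the transaction */
extern void raise_island_irq(uint32_t txn_id);/* log with the island */

/* Returns true once the transaction has completed. */
bool check_timeout(uint32_t txn_id, uint32_t issued_at, uint32_t deadline)
{
    if (response_arrived(txn_id))
        return true;
    if ((uint32_t)(now_ticks() - issued_at) > deadline) {
        resend(txn_id);           /* transient faults often clear on retry */
        raise_island_irq(txn_id); /* but the island still sees the event */
    }
    return false;
}
```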

When an error occurs in an IP subsystem, one response might be to isolate the block and run LBIST or MBIST as appropriate on that subsystem (which the island may also choose to do on some regular schedule). The safety controller for the NoC can manage this isolation directly by disconnecting power to the socket connected to that IP. Again, a BIST failure will flag an interrupt to the safety island.
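That sequence reads naturally as code. A sketch, again with hypothetical helper names: fence the socket, cycle its power, run BIST, and reconnect on a pass or interrupt the island on a failure.

```c
#include <stdbool.h>

extern void noc_socket_isolate(int socket_id);    /* fence new traffic */
extern void noc_socket_connect(int socket_id);    /* rejoin the system */
extern void socket_power(int socket_id, bool on); /* gate the socket's rail */
extern bool run_bist(int socket_id);              /* LBIST/MBIST as appropriate */
extern void raise_island_irq(int socket_id);      /* report to the island */

void handle_subsystem_error(int socket_id)
{
    noc_socket_isolate(socket_id);     /* stop any corruption spreading */
    socket_power(socket_id, false);    /* NoC safety controller cuts power */
    socket_power(socket_id, true);     /* power back up for testing */
    if (run_bist(socket_id))
        noc_socket_connect(socket_id); /* healthy again, reconnect */
    else
        raise_island_irq(socket_id);   /* BIST failure flags the island */
}
```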


Fig. 2: The NoC interconnect provides a safety-critical function by isolating IP subsystems for periodic checking and to prevent system-level corruption. Source: Arm and Arteris IP, Arm TechCon 2019

For communications between IP subsystems and the NoC(s), ECC provides essential checking and correction, especially for IPs with limited or no safety support. Once again, detected errors will flag interrupts to the safety island.
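To show the principle, here is a toy single-error-correct, double-error-detect (SECDED) code over a 4-bit value: a Hamming(7,4) code plus an overall parity bit. Real NoC ECC uses much wider codes, but the correct/detect/flag behavior is the same.

```c
#include <stdint.h>
#include <stdio.h>

/* Encode 4 data bits into Hamming(7,4) plus an overall parity bit. */
static uint8_t secded_encode(uint8_t d)
{
    uint8_t b3 = (d >> 0) & 1, b5 = (d >> 1) & 1,
            b6 = (d >> 2) & 1, b7 = (d >> 3) & 1;
    uint8_t p1 = b3 ^ b5 ^ b7; /* covers codeword positions 1,3,5,7 */
    uint8_t p2 = b3 ^ b6 ^ b7; /* covers codeword positions 2,3,6,7 */
    uint8_t p4 = b5 ^ b6 ^ b7; /* covers codeword positions 4,5,6,7 */
    uint8_t w = (p1 << 0) | (p2 << 1) | (b3 << 2) | (p4 << 3)
              | (b5 << 4) | (b6 << 5) | (b7 << 6);
    uint8_t all = 0;           /* overall parity enables double detect */
    for (int i = 0; i < 7; i++) all ^= (w >> i) & 1;
    return w | (all << 7);
}

/* Returns 0 = clean, 1 = corrected single error, 2 = uncorrectable. */
static int secded_decode(uint8_t w, uint8_t *d)
{
    uint8_t s = 0, all = 0;
    for (int pos = 1; pos <= 7; pos++)
        if ((w >> (pos - 1)) & 1)
            s ^= pos;          /* syndrome = XOR of set-bit positions */
    for (int i = 0; i < 8; i++) all ^= (w >> i) & 1;
    int status = 0;
    if (s != 0 && all != 0) {  /* single error: syndrome locates it */
        w ^= 1u << (s - 1);
        status = 1;
    } else if (s != 0 && all == 0) {
        return 2;              /* double error: do not trust the data */
    }
    *d = ((w >> 2) & 1) | (((w >> 4) & 1) << 1)
       | (((w >> 5) & 1) << 2) | (((w >> 6) & 1) << 3);
    return status;
}

int main(void)
{
    uint8_t w = secded_encode(0xB);
    w ^= 1u << 4;              /* inject a single-bit fault in flight */
    uint8_t d;
    printf("status=%d data=0x%X\n", secded_decode(w, &d), d);
    return 0;
}
```

A single flipped bit is corrected in place and reported (status 1); two flipped bits return status 2, the uncorrectable case that would raise an interrupt to the island.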

How the safety island processes this error input from the system is of course a function of how the semiconductor vendor, Tier 1, or OEM chooses to manage each type of information. What they can be assured of is that no error or timeout will go unnoticed, and that corruption in other parts of the SoC will not compromise the functioning of the island or its ability to log and pass on that information for higher-level decision-making and response.
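One plausible way to picture that flexibility is a policy table programmed differently by each party in the chain. The sketch below (all names hypothetical) logs every event unconditionally, matching the guarantee above, and then dispatches whatever action was configured for that error type.

```c
typedef enum { ERR_TIMEOUT, ERR_ECC_CORRECTED, ERR_ECC_FATAL,
               ERR_BIST_FAIL, ERR_KIND_COUNT } err_kind_t;

typedef enum { ACT_LOG_ONLY, ACT_RETEST_DOMAIN, ACT_ESCALATE } action_t;

/* Configured per program; an OEM might escalate what a vendor only logs. */
static const action_t policy[ERR_KIND_COUNT] = {
    [ERR_TIMEOUT]       = ACT_RETEST_DOMAIN, /* repeated timeouts: check it */
    [ERR_ECC_CORRECTED] = ACT_LOG_ONLY,      /* already fixed in flight */
    [ERR_ECC_FATAL]     = ACT_ESCALATE,
    [ERR_BIST_FAIL]     = ACT_ESCALATE,
};

extern void log_event(err_kind_t kind, int domain);
extern void retest_domain(int domain);
extern void escalate(int domain);

void island_dispatch(err_kind_t kind, int domain)
{
    log_event(kind, domain);       /* nothing goes unnoticed */
    switch (policy[kind]) {
    case ACT_RETEST_DOMAIN: retest_domain(domain); break;
    case ACT_ESCALATE:      escalate(domain);      break;
    default:                break; /* already logged */
    }
}
```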


