Functional Safety For Fail-Operational Systems

Rethinking classic concepts is required for autonomous vehicles to be affordable.


Functional safety issues have long been an important part of product development wherever machine operations that are potentially dangerous for humans are carried out unattended. However, in terms of electrical and electronic systems, the need has been limited to a few industries such as medical technology and aerospace. Apart from that, the functional safety concepts were only used for niche products.

This changed fundamentally at the end of the last century when, due to increasing miniaturization, more and more electronic control units were used in everyday devices. In 1999, for example, the first version of IEC 61508 was published. It describes basic, cross-industry concepts of functional safety, which all manufacturers of safety-critical products have since had to comply with as the state of the art. It was followed by a large number of industry-specific standards that retain the concepts of IEC 61508 and refine them for their respective industries. The best known of these is certainly ISO 26262, which deals with the functional safety of electrical and electronic systems in the automotive industry.

In the last 10 years in particular, the automotive industry has been seen as an innovation driver for many developments, and it has also provided some innovations in the area of functional safety. In the future, it will be a main driver of the shift from fail-safe toward fail-operational systems. Fail-operational systems themselves are not new, of course; they have been in use for years. For example, the flight control surfaces of an airplane must continue to function even if the electronics fail. This is often achieved through redundancy, either classic redundancy or functional redundancy (diversity), in order to avoid common cause failures.

For a long time, the fail-safe approach was adequate in the car: on a failure the electronics go to a safe state, which as a rule means “off,” and the driver can still keep control of the vehicle by purely mechanical or hydraulic means. With the advent of self-driving cars, however, this option is no longer available. The only “brain” that controls the car is electronic and can only interact electronically with the rest of the vehicle. Another complication is that classic redundancy is also problematic. When the car drives autonomously, almost all functions are safety-critical. Designing each of them redundantly would increase weight and costs so much that most consumers would no longer be able to afford such a vehicle.

There are several ways to counter this problem. One is to design all components with such high intrinsic reliability that the risk of a fault that directly results in a failure is low enough to be socially acceptable. Today this is the case with some microcontrollers from well-known manufacturers, but it is generally difficult, especially for specialized components that are used in only one vehicle model.

Another concept is that of partial redundancy. This is the closest to classic redundancy. The idea here is that in an emergency the full range of functions of the individual components does not have to be available, only their basic functions. For example, three functionally different components could be made safe with just one redundant component. However, the individual components must not fail at the same time, because the redundancy is available for only one of them at a time.
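The limitation of partial redundancy can be illustrated with a minimal sketch. The class and function names and the component names are hypothetical and chosen only for illustration; the sketch assumes a single shared backup that can provide the basic function of exactly one failed component at a time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SharedBackup:
    """Hypothetical backup unit offering the basic function of one primary at a time."""
    covering: Optional[str] = None  # name of the primary currently being replaced

    def take_over(self, component: str) -> bool:
        """Return True if the backup covers `component`, False if it is already busy."""
        if self.covering is None:
            self.covering = component
        return self.covering == component


def system_state(failed: List[str], backup: SharedBackup) -> str:
    """Classify the system as nominal, degraded (basic functions only), or uncovered."""
    for name in failed:
        if not backup.take_over(name):
            return "uncovered"
    return "degraded" if failed else "nominal"


print(system_state([], SharedBackup()))
print(system_state(["steering"], SharedBackup()))
print(system_state(["steering", "braking"], SharedBackup()))
```

A single failure leaves the system degraded but operational, while a second simultaneous failure is not covered. That is why the probability of coincident failures has to be part of the safety argument.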

The third concept is predictive health monitoring, which has long been used in reliability engineering. Here, environmental and stress parameters that can increase the fault rate are recorded, and a response is triggered if the resulting fault rate rises above what the standard tolerates. This solution would require very few additional parts, because the internal stress parameters are often already recorded for purely functional reasons. Only a few sensors have to be installed to measure the environmental conditions. The main problem is how to make this approach reliable. It is difficult to determine the statistical probability with which the prediction actually applies to the individual case without maintaining large safety margins. This creates the risk of being too careful and constantly exchanging components, which in turn leads to a cost problem.
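As a rough illustration of how a recorded stress parameter could be turned into a fault-rate check, the sketch below scales a base fault rate with temperature. The Arrhenius acceleration model is an assumption made here, since the article does not name a specific model. The activation energy, reference temperature, safety margin, and all rates (in FIT, failures per 10^9 hours) are placeholder values, not figures from any standard.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(temp_c: float, ref_temp_c: float = 55.0,
                           activation_energy_ev: float = 0.7) -> float:
    """Acceleration factor of the fault rate at temp_c relative to ref_temp_c."""
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    return math.exp(activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / ref_k - 1.0 / temp_k))


def estimated_fault_rate_fit(base_rate_fit: float, temp_c: float, margin: float = 2.0) -> float:
    """Base rate scaled by the temperature acceleration and a safety margin (assumed)."""
    return base_rate_fit * arrhenius_acceleration(temp_c) * margin


def needs_response(base_rate_fit: float, temp_c: float, tolerable_fit: float,
                   margin: float = 2.0) -> bool:
    """True if the estimated fault rate exceeds the tolerable rate."""
    return estimated_fault_rate_fit(base_rate_fit, temp_c, margin) > tolerable_fit


# Illustrative values only: 10 FIT base rate, 100 FIT tolerable rate.
print(needs_response(base_rate_fit=10.0, temp_c=55.0, tolerable_fit=100.0))
print(needs_response(base_rate_fit=10.0, temp_c=125.0, tolerable_fit=100.0))
```

The `margin` parameter makes the cost trade-off visible: the larger the margin needed to trust the prediction for an individual part, the earlier a response such as a component exchange is triggered.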

The last method presented here is damage detection. In this case it is assumed that a defective component only needs to remain usable for a very short time, such as the few seconds until the vehicle can be brought to a stop at the roadside. This approach makes use of the fact that there is almost always physical damage before problems arise in the electrical signal. For example, a solder joint can lose 80% of its contact area before this changes the resistance enough to be detected in the signal with a sufficient signal-to-noise ratio. With other methods, such as thermal measurements or surface waves, it is possible to detect the damage much earlier. This means you can react before a fault occurs. The challenge in implementing this concept is to state the probability with which the damage will be found, which is necessary for specifying a fault rate.
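Why the detection probability matters for the fault rate can be shown with a simple calculation. The sketch below rests on simplifying assumptions made here, not on a method from the article: every undetected damage event eventually becomes a fault, and every detected one is handled in time. The function names and the numbers are purely illustrative.

```python
import math


def residual_fault_rate_fit(damage_rate_fit: float, p_detect: float) -> float:
    """Rate of damage events that escape detection and are assumed to become faults."""
    if not 0.0 <= p_detect <= 1.0:
        raise ValueError("p_detect must be between 0 and 1")
    return damage_rate_fit * (1.0 - p_detect)


def required_detection_probability(damage_rate_fit: float, target_fit: float) -> float:
    """Minimum detection probability needed to stay at or below target_fit."""
    if damage_rate_fit <= target_fit:
        return 0.0  # target is met even without any detection
    return 1.0 - target_fit / damage_rate_fit


# Illustrative values only: 100 FIT damage rate, 1 FIT target.
print(residual_fault_rate_fit(100.0, 0.99))
print(required_detection_probability(100.0, 1.0))
assert math.isclose(residual_fault_rate_fit(100.0, 0.99), 1.0)
```

Under these assumptions the residual fault rate is only as trustworthy as the stated detection probability, so any uncertainty in that probability carries over directly into the fault rate that can be claimed.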

The concepts described in this article are only a selection of the possible approaches to making a system fail-operational while keeping costs and material consumption within limits and without compromising on safety. They also cover only the hardware aspects.
